Davide Chicco, Alessandro Fabris, Giuseppe Jurman
Università di Milano-Bicocca, Milan, Italy, and University of Toronto, Toronto, Canada.
Max Planck Institute for Security and Privacy, Bochum, Germany.
BioData Min. 2025 Jan 9;18(1):1. doi: 10.1186/s13040-024-00412-x.
Biomedical datasets are the mainstays of computational biology and health informatics projects, and can be found on multiple online data platforms or obtained from wet-lab biologists and physicians. The quality and trustworthiness of these datasets, however, can sometimes be poor, in turn producing flawed results that can harm patients and data subjects. To address this problem, policy-makers, researchers, and consortia have proposed diverse regulations, guidelines, and scores to assess the quality and increase the reliability of datasets. Although generally useful, these instruments are often incomplete or impractical. The guidelines of Datasheets for Datasets, in particular, are too numerous; the requirements of the Kaggle Dataset Usability Score focus on non-scientific requisites (for example, including a cover image); and the European Union Artificial Intelligence Act (EU AI Act) sets forth sparse and general data governance requirements, which we tailored to datasets for biomedical AI. Against this backdrop, we introduce our new Venus score to assess the data quality and trustworthiness of biomedical datasets. Our score ranges from 0 to 10 and consists of ten questions that anyone developing a bioinformatics, medical informatics, or cheminformatics dataset should answer before release. In this study, we first describe the EU AI Act, Datasheets for Datasets, and the Kaggle Dataset Usability Score, presenting their requirements and their drawbacks. To do so, we reverse-engineer the weights of the influential Kaggle Score for the first time and report them here. We then distill the most important data governance requirements into ten questions tailored to the biomedical domain; together, these questions constitute the Venus score. We apply the Venus score to twelve datasets from multiple subdomains, including electronic health records, medical imaging, microarray and bulk RNA-seq gene expression, cheminformatics, physiologic electrogram signals, and medical text. Analyzing the results, we surface fine-grained strengths and weaknesses of popular datasets, as well as aggregate trends. Most notably, we find a widespread tendency to gloss over sources of data inaccuracy and noise, which may hinder the reliable exploitation of the data and, consequently, of the research results derived from them. Overall, our results confirm the applicability and utility of the Venus score for assessing the trustworthiness of biomedical data.