Roche Innovation Center Basel, F. Hoffmann-La Roche AG, CH-4070 Basel, Switzerland.
Bioinformatics. 2016 Mar 15;32(6):918-25. doi: 10.1093/bioinformatics/btv644. Epub 2015 Nov 10.
The increasing diversity of data available to the biomedical scientist holds promise for better understanding of diseases and discovery of new treatments for patients. To provide a complete picture of a biomedical question, data from many different origins need to be combined into a unified representation. During this data integration process, inevitable errors and ambiguities present in the initial sources compromise the quality of the resulting data warehouse and greatly diminish the scientific value of its content. Expensive and time-consuming manual curation is then required to improve the quality of the information. However, it becomes increasingly difficult to dedicate and optimize resources for data integration projects, as available repositories are growing in both size and number every day.
We present a new generic methodology to identify the problematic records that cause what we describe as 'data hairball' structures. The approach is graph-based and relies on two metrics traditionally used in the social sciences: graph density and betweenness centrality. We evaluate and discuss these measures and show their relevance for flexible, optimized and automated data curation and linkage. The methodology focuses on information coherence and correctness to improve the scientific meaningfulness of data integration endeavors, such as knowledge bases and large data warehouses.
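As an illustration of the two metrics named above, the following minimal sketch computes graph density and betweenness centrality on a toy record graph. It is not the authors' implementation: the edge list, the record names (`a1`, `X`, and so on) and the helper functions `build_graph`, `density` and `betweenness` are all hypothetical, and the betweenness routine uses Brandes' algorithm for unweighted, undirected graphs as one standard way to obtain the measure. The toy graph assumes two tightly linked groups of records joined only through a single ambiguous record `X`.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Density of a simple undirected graph: 2m / (n * (n - 1))."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                        # nodes in order of non-decreasing distance from s
        preds = {v: [] for v in adj}      # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}       # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                      # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while order:                      # accumulate dependencies, farthest nodes first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair of endpoints was counted twice in an undirected graph.
    return {v: c / 2 for v, c in bc.items()}


# Hypothetical toy data: two triangles of linked records bridged by record "X".
edges = [
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
    ("a1", "X"), ("X", "b1"),
]
graph = build_graph(edges)
scores = betweenness(graph)
suspect = max(scores, key=scores.get)

print(f"density = {density(graph):.3f}")
print(f"highest betweenness: {suspect} ({scores[suspect]:.1f})")
```

In this toy case every shortest path between the two groups runs through `X`, so it receives the highest betweenness score and would be the first candidate for review. How such scores are thresholded or combined with density to prioritize curation is not specified in this abstract, so that step is left out of the sketch.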
Supplementary data are available at Bioinformatics online.