Quality controls in integrative approaches to detect errors and inconsistencies in biological databases.

Author information

Ghisalberti Giorgio, Masseroli Marco, Tettamanti Luca

Affiliation

Electronics and Information Department, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy.

Publication information

J Integr Bioinform. 2010 Mar 25;7(3):454. doi: 10.2390/biecoll-jib-2010-119.

Abstract

Numerous biomolecular data are available, but they are scattered across many databases and only some of them are curated by experts. Most available data are computationally derived and include errors and inconsistencies. Using the available data effectively to derive new knowledge therefore requires data integration and quality improvement. Many approaches to data integration have been proposed. Data warehousing seems to be the most adequate when comprehensive analysis of integrated data is required. This also makes it the most suitable approach for implementing comprehensive quality controls on integrated data. We previously developed GFINDer (http://www.bioinformatics.polimi.it/GFINDer/), a web system that supports scientists in effectively using available information. It allows comprehensive statistical analysis and mining of functional and phenotypic annotations of gene lists, such as those identified by high-throughput biomolecular experiments. The GFINDer backend consists of a multi-organism genomic and proteomic data warehouse (GPDW). Within the GPDW, several controlled terminologies and ontologies, which describe gene and gene product related biomolecular processes, functions and phenotypes, are imported and integrated, together with their associations with genes and proteins of several organisms. To make it easier to keep the GPDW up to date, and to ensure the best possible quality of the data integrated in subsequent updates of the data warehouse, we developed several automatic procedures. Within them, we implemented numerous data quality control techniques to test the integrated data for a variety of possible errors and inconsistencies. Among other features, the implemented controls check data structure and completeness, ontological data consistency, ID format and evolution, unexpected data quantification values, and consistency of data from single and multiple sources.
We used the implemented controls to analyze the quality of data available from several different biological databases and integrated in the GFINDer data warehouse. By doing so, we identified a variety of types of errors and inconsistencies in these data; this enables us to ensure good quality of the data in the GFINDer data warehouse. We reported all identified data errors and inconsistencies to the curators of the original databases from which the data were retrieved, who in most cases corrected them in subsequent updates of the original databases. This contributed to improving the quality of the data that the original databases make available to the whole scientific community.
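
As a rough illustration of three of the kinds of controls listed in the abstract (ID format, ontological consistency, and consistency across sources), the following Python sketch may help. It is not taken from GFINDer or the GPDW: the function names, the choice of a Gene Ontology style ID pattern, and the sample data are all assumptions made for the example.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple

# Assumed ID format for illustration: Gene Ontology style, "GO:" plus seven digits.
GO_ID = re.compile(r"GO:\d{7}")


def malformed_ids(ids: Iterable[str]) -> List[str]:
    """ID format control: return the IDs that do not match the expected pattern."""
    return [i for i in ids if not GO_ID.fullmatch(i)]


def dangling_parents(parents: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Ontology consistency control: return (term, parent) pairs
    whose parent is not itself a known term."""
    return sorted(
        (term, p) for term, ps in parents.items() for p in ps if p not in parents
    )


def conflicting_values(source_a: Dict[str, str], source_b: Dict[str, str]) -> List[str]:
    """Multi-source consistency control: return keys present in both
    sources whose values disagree."""
    return sorted(k for k in source_a.keys() & source_b.keys() if source_a[k] != source_b[k])


if __name__ == "__main__":
    # Hypothetical sample data, not real database content.
    print(malformed_ids(["GO:0008150", "GO:815", "go:0008150"]))
    print(dangling_parents({"GO:0000001": {"GO:0000002"}, "GO:0000002": set()}))
    print(conflicting_values({"g1": "TP53"}, {"g1": "TP53", "g2": "BRCA1"}))
```

In a data warehouse setting, checks like these would more plausibly run as queries over the integrated tables after each import; the in-memory version here only conveys the logic of each control.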
