A statistical approach to identify, monitor, and manage incomplete curated data sets.

Affiliation

The Institute of Neuroscience, University of Oregon, Eugene, OR, USA.

Publication information

BMC Bioinformatics. 2018 Apr 2;19(1):110. doi: 10.1186/s12859-018-2121-6.

DOI:10.1186/s12859-018-2121-6

PMID:29609549

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5879614/

Abstract

BACKGROUND

Many biological knowledge bases gather data through expert curation of published literature. High data volume, selective partial curation, delays in access, and publication of data prior to the ability to curate it can result in incomplete curation of published data. Knowing which data sets are incomplete and how incomplete they are remains a challenge. Awareness that a data set may be incomplete is important for proper interpretation and for avoiding flawed hypothesis generation, and it can justify further exploration of published literature for additional relevant data. Computational methods to assess data set completeness are needed. One such method is presented here.

RESULTS

In this work, a multivariate linear regression model was used to identify genes in the Zebrafish Information Network (ZFIN) Database with incompletely curated gene expression data sets. Starting with 36,655 gene records from ZFIN, data aggregation, cleansing, and filtering reduced the set to 9870 gene records suitable for training and testing the model to predict the number of expression experiments per gene. Feature engineering and selection identified the following predictive variables: the number of journal publications; the number of journal publications already attributed for gene expression annotation; the percent of journal publications already attributed for expression data; the gene symbol; and the number of transgenic constructs associated with each gene. Twenty-five percent of the gene records (2483 genes) were used to train the model, and the remaining 7387 genes were used to test it. Of the 7387 tested genes, 122 had residuals below the model's lower 95% confidence interval and 165 had residuals above its upper 95% confidence interval; both groups were identified as missing expression annotations. The model had precision of 0.97 and recall of 0.71 at the negative 95% confidence interval and precision of 0.76 and recall of 0.73 at the positive 95% confidence interval.
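
The workflow described above (fit a linear model on a quarter of the records, then flag test records whose residuals fall outside a 95% band, then score the flags with precision and recall) can be illustrated with a minimal Python sketch. This is not the authors' code. It runs on synthetic data, uses only two hypothetical predictors, and approximates the 95% band as ±1.96 times the residual standard deviation of the training set, which is an assumption. All function and variable names are illustrative.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply a fitted coefficient vector to a feature matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def flag_outliers(residuals, sigma, z=1.96):
    """Boolean masks for residuals below / above an assumed +-z*sigma band."""
    return residuals < -z * sigma, residuals > z * sigma

def precision_recall(flagged, truth):
    """Precision and recall of boolean flags against boolean ground truth."""
    tp = np.sum(flagged & truth)
    precision = tp / flagged.sum() if flagged.sum() else float("nan")
    recall = tp / truth.sum() if truth.sum() else float("nan")
    return precision, recall

def run_demo(seed=0, n=2000):
    """Synthetic stand-in for gene records; every value here is made up."""
    rng = np.random.default_rng(seed)
    pubs = rng.integers(1, 200, size=n).astype(float)       # publications per gene
    expr_pubs = np.floor(pubs * rng.uniform(0, 1, size=n))  # publications curated for expression
    X = np.column_stack([pubs, expr_pubs])
    y = 2.0 + 0.1 * pubs + 1.5 * expr_pubs + rng.normal(0, 2.0, size=n)

    # 25% of records train the model; the rest are held out for testing.
    n_train = n // 4
    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:].copy()

    # Simulate under-curated genes by removing experiments from 5% of test records.
    truth = rng.random(len(y_test)) < 0.05
    y_test[truth] -= 50.0

    coef = fit_linear(X_train, y_train)
    train_resid = y_train - predict(coef, X_train)
    sigma = np.std(train_resid, ddof=X.shape[1] + 1)

    low, _high = flag_outliers(y_test - predict(coef, X_test), sigma)
    return precision_recall(low, truth)
```

Calling `run_demo()` returns the precision and recall of the lower-band flags against the simulated under-curated records. On real data the ground truth would have to come from manual review of the flagged genes, and the predictor set would need to match the one listed in the abstract.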

CONCLUSIONS

This method can be used to identify data sets that are incompletely curated, as demonstrated using the gene expression data set from ZFIN. This information can help both database resources and data consumers gauge when it may be useful to look further for published data to augment the existing expertly curated information.
