Automated detection of records in biological sequence databases that are inconsistent with the literature.

Authors

Bouadjenek Mohamed Reda, Verspoor Karin, Zobel Justin

Affiliation

Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.

Publication information

J Biomed Inform. 2017 Jul;71:229-240. doi: 10.1016/j.jbi.2017.06.015. Epub 2017 Jun 15.

DOI:10.1016/j.jbi.2017.06.015
PMID:28624643
Abstract

We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with literature they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data, and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
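
The consistency check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a plain TF-IDF cosine similarity stands in for the paper's IR-based quality indicators, and a fixed threshold stands in for the trained anomaly detector. The record identifiers, the sample texts, the `flag_records` helper, and the threshold value of 0.1 are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Smoothed IDF, so terms found in every document keep a nonzero weight.
        vectors.append({
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_records(pairs, threshold=0.1):
    """Label each (record_id, record_text, article_text) triple.

    Assumption: a record whose text shares little vocabulary with its
    cited article is 'suspicious'; the threshold is arbitrary.
    """
    texts = []
    for _, record_text, article_text in pairs:
        texts.extend([record_text, article_text])
    vecs = tfidf_vectors(texts)
    results = {}
    for i, (record_id, _, _) in enumerate(pairs):
        score = cosine(vecs[2 * i], vecs[2 * i + 1])
        label = "suspicious" if score < threshold else "confident"
        results[record_id] = (label, score)
    return results


def precision_recall(predicted, actual):
    """Precision and recall of a set of flagged IDs against known faulty IDs."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical toy data: REC_B cites an article unrelated to its description.
pairs = [
    ("REC_A",
     "Homo sapiens hemoglobin beta gene complete cds",
     "We sequenced the human hemoglobin beta gene and report the complete coding sequence."),
    ("REC_B",
     "Zea mays chloroplast ribosomal protein gene",
     "A survey of hospital readmission rates among elderly cardiac patients."),
]
for record_id, (label, score) in flag_records(pairs).items():
    print(record_id, label, round(score, 3))
```

On the reported figures: with fewer than 0.25% of records known to be faulty, flagging records at random would yield a precision below 0.25%. A precision of 10% is therefore at least 0.10 / 0.0025 = 40 times that random baseline, which helps explain why the authors describe modest absolute numbers as promising on such imbalanced data.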
