Suppr超能文献

从文献中提取数据提交声明:一种自动跟踪研究结果的方法。

Extraction of data deposition statements from the literature: a method for automatically tracking research results.

机构信息

National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA.

出版信息

Bioinformatics. 2011 Dec 1;27(23):3306-12. doi: 10.1093/bioinformatics/btr573. Epub 2011 Oct 13.

Abstract

MOTIVATION

Research in the biomedical domain can have a major impact through open sharing of the data produced. For this reason, it is important to be able to identify instances of data production and deposition for potential re-use. Herein, we report on the automatic identification of data deposition statements in research articles.

RESULTS

We apply machine learning algorithms to sentences extracted from full-text articles in PubMed Central in order to automatically determine whether a given article contains a data deposition statement, and retrieve the specific statements. With an Support Vector Machine classifier using conditional random field determined deposition features, articles containing deposition statements are correctly identified with 81% F-measure. An error analysis shows that almost half of the articles classified as containing a deposition statement by our method but not by the gold standard do indeed contain a deposition statement. In addition, our system was used to process articles in PubMed Central, predicting that a total of 52 932 articles report data deposition, many of which are not currently included in the Secondary Source Identifier [si] field for MEDLINE citations.

AVAILABILITY

All annotated datasets described in this study are freely available from the NLM/NCBI website at http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Neveol/DepositionDataSets.zip

CONTACT

aurelie.neveol@nih.gov; john.wilbur@nih.gov; zhiyong.lu@nih.gov

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

摘要

动机

通过开放共享所产生的数据,生物医学领域的研究可以产生重大影响。出于这个原因,能够识别数据产生和存储的实例以便潜在地重用是很重要的。在此,我们报告了在研究文章中自动识别数据存储声明的方法。

结果

我们应用机器学习算法来处理从 PubMed Central 中的全文文章中提取的句子,以便自动确定给定的文章是否包含数据存储声明,并检索特定的声明。使用基于支持向量机的分类器和条件随机场确定的存储特征,包含存储声明的文章的正确识别率为 81%的 F 度量。错误分析表明,我们的方法分类为包含存储声明的文章中,有近一半实际上确实包含存储声明,但这些文章未被黄金标准所识别。此外,我们的系统还用于处理 PubMed Central 中的文章,预测共有 52932 篇文章报告了数据存储,其中许多文章目前并未包含在 MEDLINE 引用的二级来源标识符[si]字段中。

可用性

本研究中描述的所有注释数据集均可从 NLM/NCBI 网站免费获得,网址为 http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Neveol/DepositionDataSets.zip。

联系方式

aurelie.neveol@nih.gov; john.wilbur@nih.gov; zhiyong.lu@nih.gov

补充信息

补充数据可在 Bioinformatics 在线获取。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f0aa/3223368/6782d4473778/btr573f1.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验