Increasing metadata coverage of SRA BioSample entries using deep learning-based named entity recognition.

Author affiliations

Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA.

Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA.

Publication information

Database (Oxford). 2021 Apr 29;2021. doi: 10.1093/database/baab021.

Abstract

High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE.
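The abstract does not say how the short-phrase classifier is applied to the longer TITLE attribute. The sketch below shows one plausible approach, assuming TITLE is split into word n-grams, each n-gram is scored by the phrase classifier, and the best-scoring phrase is kept per category. The names `toy_classifier`, `TOY_LEXICON`, `ngrams` and `predict_from_title`, the n-gram length and the threshold are all hypothetical. The lookup table merely stands in for the trained recurrent network and is not the authors' implementation.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical stand-in for the trained recurrent network. A real model would
# return a probability for each of the 11 metadata categories; this toy
# lexicon is invented purely for illustration.
TOY_LEXICON: Dict[str, str] = {
    "homo sapiens": "Genus/Species",
    "mus musculus": "Genus/Species",
    "breast cancer": "Condition/Disease",
    "c57bl/6": "Strain",
}


def toy_classifier(phrase: str) -> Dict[str, float]:
    """Return {category: score} for a short phrase (empty dict if unknown)."""
    category = TOY_LEXICON.get(phrase.lower())
    return {category: 1.0} if category else {}


def ngrams(tokens: List[str], max_len: int) -> Iterator[str]:
    """Yield every contiguous word n-gram of length 1..max_len."""
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def predict_from_title(
    title: str,
    classify: Callable[[str], Dict[str, float]] = toy_classifier,
    max_len: int = 3,        # assumed maximum phrase length
    threshold: float = 0.5,  # assumed minimum score to accept a prediction
) -> Dict[str, Tuple[str, float]]:
    """Keep the highest-scoring phrase per category found in a TITLE string."""
    best: Dict[str, Tuple[str, float]] = {}
    for phrase in ngrams(title.split(), max_len):
        for category, score in classify(phrase).items():
            if score >= threshold and score > best.get(category, ("", 0.0))[1]:
                best[category] = (phrase, score)
    return best


if __name__ == "__main__":
    print(predict_from_title("RNA-seq of Mus musculus C57BL/6 liver"))
```

Categories for which no phrase reaches the threshold are simply absent from the result, which loosely mirrors the "lack of predictions" the abstract reports for some categories.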

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9cf/8083811/8729ce7dfe6b/baab021f1.jpg
