Identifying the missing proteins in human proteome by biological language model.

Authors

Dong Qiwen, Wang Kai, Liu Xuan

Affiliations

Institute for Data Science and Engineering, East China Normal University, Shanghai, 200062, People's Republic of China.

Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, 518055, People's Republic of China.

Publication information

BMC Syst Biol. 2016 Dec 23;10(Suppl 4):113. doi: 10.1186/s12918-016-0352-6.

Abstract

BACKGROUND

With the rapid development of high-throughput sequencing technology, proteomics research has become a prominent field in the post-genomic era. Identifying all natively encoded protein sequences is necessary for further function and pathway analysis. Toward that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks of the Human Proteome Project. Given the difficulty of detecting these missing proteins with wet-lab approaches, here we use a bioinformatics method to pre-filter the missing proteins.

RESULTS

Because biological sequences are analogous to natural language, n-gram models from the field of natural language processing were used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the "uncertain" category of the neXtProt database. The n-gram model identifies 102 proteins that have a high probability of being native human proteins. We perform a detailed analysis of the predicted structure and function of these missing proteins and also compare the high-probability proteins with other mass spectrometry datasets. The evaluation shows that the results reported here are in good agreement with those obtained from other well-established databases.
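
The abstract does not state the n-gram order, the smoothing scheme, or the training corpus, so the following is only a minimal sketch of how an n-gram language model could score a protein sequence, not the authors' implementation. It assumes a residue-level trigram model with add-one smoothing; the names `train_ngram` and `avg_log_prob` and the toy sequences are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter

# The 20 standard amino acids; used as the vocabulary size for smoothing.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_ngram(sequences, n=3):
    """Count n-grams and their (n-1)-residue contexts over reference sequences."""
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def avg_log_prob(seq, ngrams, contexts, n=3):
    """Mean log P(residue | previous n-1 residues), with add-one smoothing.

    Returns -inf when the sequence is shorter than n residues.
    """
    vocab = len(AMINO_ACIDS)
    total, count = 0.0, 0
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        # Add-one (Laplace) smoothing is an assumption made for this sketch.
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab)
        total += math.log(p)
        count += 1
    return total / count if count else float("-inf")


# Toy reference set standing in for confidently observed human proteins
# (hypothetical sequences, not taken from neXtProt).
reference = ["MKTAYIAKQR", "MKTAYLAKQR", "MKTAYIAKQK"]
ngrams, contexts = train_ngram(reference, n=3)

# A higher score means the candidate looks more like the reference set.
for candidate in ["MKTAYIAKQR", "WWWWWWWWWW"]:
    print(candidate, avg_log_prob(candidate, ngrams, contexts, n=3))
```

In a filter of this kind, candidates scoring above some threshold would be kept as likely native proteins. How the threshold is chosen is not described in the abstract and would have to come from the full paper.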

CONCLUSION

The analysis shows that 102 proteins may be native gene-coded proteins, and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be96/5259966/d3a658bb5ca9/12918_2016_352_Fig1_HTML.jpg
