Identifying the missing proteins in human proteome by biological language model.

Author information

Dong Qiwen, Wang Kai, Liu Xuan

Affiliations

Institute for Data Science and Engineering, East China Normal University, Shanghai, 200062, People's Republic of China.

Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, 518055, People's Republic of China.

Publication information

BMC Syst Biol. 2016 Dec 23;10(Suppl 4):113. doi: 10.1186/s12918-016-0352-6.

DOI:10.1186/s12918-016-0352-6
PMID:28155671
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5259966/
Abstract

BACKGROUND

With the rapid development of high-throughput sequencing technology, proteomics has become a prominent field in the post-genomic era. Identifying all natively encoded protein sequences is necessary for further function and pathway analysis. Toward that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks of the Human Proteome Project. Given the difficulty of detecting these missing proteins with wet-lab approaches, here we use a bioinformatics method to pre-filter the missing proteins.

RESULTS

Because biological sequences are analogous to natural language, n-gram models from the field of natural language processing were used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the "uncertain" category of the neXtProt database. The n-gram model identifies 102 of these proteins as having a high probability of being native human proteins. We perform a detailed analysis of the predicted structure and function of these missing proteins and also compare the high-probability proteins with other mass spectrometry datasets. The evaluation shows that the results reported here are in good agreement with those from other well-established databases.
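
The idea of scoring candidate sequences with an n-gram language model can be sketched as follows. The abstract does not state the n-gram order, the smoothing scheme, or the decision threshold, so everything here is an assumption made for illustration: a bigram model, add-one (Laplace) smoothing, a length-normalized log-probability score, and made-up toy sequences. The function names `train_ngram` and `avg_log_prob` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# The 20 standard amino acids; used as the vocabulary size for smoothing.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_ngram(sequences, n=2):
    """Count n-grams and their (n-1)-length contexts over training sequences."""
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def avg_log_prob(seq, ngrams, contexts, n=2):
    """Mean log-probability per n-gram of seq, with add-one smoothing."""
    vocab = len(AMINO_ACIDS)
    total, count = 0.0, 0
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        # P(last residue | context), smoothed so unseen n-grams are not zero.
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab)
        total += math.log(p)
        count += 1
    return total / count if count else float("-inf")


# Toy "known protein" training set (invented sequences, not real data).
known = ["MKTAYIAKQR", "MKTLYIAKQR", "MKTAYLAKQK"]
ngrams, contexts = train_ngram(known)

# Toy candidates: one resembles the training set, one does not.
candidates = {"cand_A": "MKTAYIAKQK", "cand_B": "WWCCHHWWCC"}
scores = {name: avg_log_prob(seq, ngrams, contexts)
          for name, seq in candidates.items()}

# Rank candidates from most to least protein-like under the model.
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:.3f}")
```

In a setting like the one described, the model would presumably be trained on confidently annotated human proteins, and candidates scoring above a chosen cutoff would be kept for further study. How that cutoff is chosen is not covered by this sketch.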

CONCLUSION

The analysis shows that 102 proteins may be native gene-coded proteins, and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be96/5259966/d3a658bb5ca9/12918_2016_352_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be96/5259966/7c26c214b2f8/12918_2016_352_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be96/5259966/871e429a7800/12918_2016_352_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be96/5259966/1980d6ff208c/12918_2016_352_Fig4_HTML.jpg