Dong Qiwen, Wang Kai, Liu Xuan
Institute for Data Science and Engineering, East China Normal University, Shanghai, 200062, People's Republic of China.
Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, 518055, People's Republic of China.
BMC Syst Biol. 2016 Dec 23;10(Suppl 4):113. doi: 10.1186/s12918-016-0352-6.
With the rapid development of high-throughput sequencing technology, proteomics has become an active field of research in the post-genomic era. Identifying all natively encoded protein sequences is a prerequisite for further function and pathway analysis. Toward that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks of the Human Proteome Project. Given the difficulty of detecting these missing proteins with wet-lab approaches, here we use a bioinformatics method to pre-filter the missing proteins.
Because biological sequences are analogous to natural language, n-gram models from the field of Natural Language Processing were used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the "uncertain" category of the neXtProt database. The n-gram model identified 102 proteins that have a high probability of being native human proteins. We perform a detailed analysis of the predicted structure and function of these missing proteins and also compare the high-probability proteins with other mass spectrometry datasets. The evaluation shows that the results reported here are in good agreement with those obtained from other well-established databases.
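To make the idea of scoring a protein sequence with an n-gram model concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the n-gram order, the smoothing scheme, the training corpus, or the decision threshold, so the bigram order (n=2), the add-one smoothing, the toy training sequences, and the function names `train_ngram` and `avg_log_prob` are all assumptions made for illustration.

```python
import math
from collections import Counter

# The 20 standard amino acids; assumed vocabulary for smoothing.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_ngram(sequences, n=2):
    """Count n-grams and their (n-1)-residue contexts in the training set."""
    ngram_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def avg_log_prob(seq, ngram_counts, context_counts, n=2):
    """Average log-probability per n-gram, using add-one smoothing.

    A higher (less negative) value means the sequence looks more like
    the sequences the model was trained on.
    """
    steps = len(seq) - n + 1
    if steps <= 0:
        raise ValueError("sequence is shorter than n")
    vocab_size = len(AMINO_ACIDS)
    total = 0.0
    for i in range(steps):
        gram = seq[i:i + n]
        prob = (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + vocab_size)
        total += math.log(prob)
    return total / steps


# Toy training data (hypothetical sequences, not taken from neXtProt).
known_proteins = ["MKTAYIAKQR", "MKTLLVAKQR"]
ngrams, contexts = train_ngram(known_proteins)

# Score a candidate that resembles the training data and one that does not.
score_similar = avg_log_prob("MKTAYIAKQR", ngrams, contexts)
score_dissimilar = avg_log_prob("WWWWWWWWWW", ngrams, contexts)
print(score_similar > score_dissimilar)
```

In a filtering setting, candidates whose score exceeds some chosen cutoff would be kept as likely native proteins; how that cutoff was chosen in the study is not described in the abstract.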
The analysis shows that 102 proteins may be native gene-coding proteins, and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods.