IEEE/ACM Trans Comput Biol Bioinform. 2021 Sep-Oct;18(5):1996-2007. doi: 10.1109/TCBB.2020.2966633. Epub 2021 Oct 8.
Next-generation sequencing techniques make it possible to generate sequenced proteins and to identify the biological families and functions of these proteins. However, compared with identified proteins, uncharacterized proteins still make up a notable percentage of all proteins in bioinformatics research. Traditional family classification methods typically extract N-gram features from sequences while ignoring motif information and the affinity between motifs and adjacent amino acids. Previous clustering-based algorithms have typically relied on domain knowledge to define protein features and on extensive data samples to annotate protein families. In this paper, we apply CNN-based amino acid representation learning to a limited set of characterized proteins and examine how well it annotates protein families when amino acid location information is taken into account. Additionally, we evaluate our approach on all reviewed protein sequences with family annotations retrieved from the UniProt database. Finally, we verify our model on unreviewed protein records, which are typically ignored by other methods.
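To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of the two kinds of input it mentions: order-free N-gram counts, and a position-preserving integer encoding of the kind that could be fed to a CNN embedding layer. The function names, the choice of n = 3, the zero-padding scheme, and the toy sequence are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import Counter

# The 20 standard amino acids; index 0 is reserved for padding and unknown residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def ngram_counts(sequence, n=3):
    """Count overlapping n-grams; where each n-gram occurs is discarded."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def encode_with_positions(sequence, max_len):
    """Map residues to integer ids in sequence order, truncating or zero-padding to max_len."""
    ids = [AA_TO_INDEX.get(aa, 0) for aa in sequence[:max_len]]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    seq = "MKVLAAGMKV"  # hypothetical toy sequence, not taken from the paper
    print(ngram_counts(seq).most_common(2))
    print(encode_with_positions(seq, max_len=12))
```

In the first representation the two occurrences of "MKV" collapse into a single count, whereas the second keeps each residue at its original index, which is the location information a convolutional model can exploit.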