David Kyle T, Halanych Kenneth M
Department of Biological Sciences, Auburn University, Auburn, AL, USA.
Center for Marine Sciences, University of North Carolina Wilmington, NC, USA.
Genome Biol Evol. 2023 May 22;15(5). doi: 10.1093/gbe/evad084.
Interpreting protein function from sequence data is a fundamental goal of bioinformatics. However, our current understanding of protein diversity is bottlenecked because most proteins have been functionally validated only in model organisms, which limits our understanding of how function varies with gene sequence diversity. Thus, the accuracy of inferences in clades without model representatives is questionable. Unsupervised learning may help to ameliorate this bias by identifying highly complex patterns and structure in large datasets without external labels. Here we present DeepSeqProt, an unsupervised deep learning program for exploring large protein sequence datasets. DeepSeqProt is a clustering tool capable of distinguishing between broad classes of proteins while learning the local and global structure of functional space. It can learn salient biological features from unaligned, unannotated sequences, and it is more likely than other clustering methods to capture complete protein families and statistically significant shared ontologies within proteomes. We hope this framework will prove useful to researchers and provide a preliminary step toward further developing unsupervised deep learning in molecular biology.