Unsupervised Deep Learning Can Identify Protein Functional Groups from Unaligned Sequences.

Authors

David Kyle T, Halanych Kenneth M

Affiliations

Department of Biological Sciences, Auburn University, Auburn, AL, USA.

Center for Marine Sciences, University of North Carolina Wilmington, NC, USA.

Publication information

Genome Biol Evol. 2023 May 22;15(5). doi: 10.1093/gbe/evad084.

DOI:10.1093/gbe/evad084
PMID:37217837
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10231473/
Abstract

Interpreting protein function from sequence data is a fundamental goal of bioinformatics. However, our current understanding of protein diversity is bottlenecked by the fact that most proteins have only been functionally validated in model organisms, limiting our understanding of how function varies with gene sequence diversity. Thus, accuracy of inferences in clades without model representatives is questionable. Unsupervised learning may help to ameliorate this bias by identifying highly complex patterns and structure from large datasets without external labels. Here we present DeepSeqProt, an unsupervised deep learning program for exploring large protein sequence datasets. DeepSeqProt is a clustering tool capable of distinguishing between broad classes of proteins while learning local and global structure of functional space. DeepSeqProt is capable of learning salient biological features from unaligned, unannotated sequences. DeepSeqProt is more likely to capture complete protein families and statistically significant shared ontologies within proteomes than other clustering methods. We hope this framework will prove of use to researchers and provide a preliminary step in further developing unsupervised deep learning in molecular biology.
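
The idea of grouping unaligned, unannotated sequences without labels can be illustrated with a small sketch. This is not DeepSeqProt's architecture, which the abstract does not describe: here a fixed k-mer frequency vector stands in for a learned encoder and a plain k-means loop stands in for the clustering step. All function names (`kmer_vector`, `kmeans`, `cluster_sequences`) and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(seq, k=2):
    """Map a sequence to normalized k-mer frequencies (no alignment needed)."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [vectors[0]]
    while len(centroids) < n_clusters:
        # Pick the vector farthest from its nearest existing centroid.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(far)
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(n_clusters), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_sequences(seqs, n_clusters=2, k=2):
    """Cluster raw sequences of differing lengths without any labels."""
    return kmeans([kmer_vector(s, k) for s in seqs], n_clusters)


if __name__ == "__main__":
    # Toy input (made up): two composition-distinct groups, unequal lengths.
    toy = ["AAAAGAAAAGAAAA", "AAAGAAAAAGAA", "WWYWWYWWYWW", "WYWWWYWWYW"]
    print(cluster_sequences(toy, n_clusters=2))
```

A fixed k-mer representation only captures local composition. The abstract's claim is that a learned representation can recover richer functional structure, so this sketch should be read as a baseline illustration of the problem setup rather than a stand-in for the published tool.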

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a97/10231473/3bc5aee34d5f/evad084f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a97/10231473/26fa8cad3849/evad084f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7a97/10231473/591f07286ad8/evad084f2.jpg
