Durham Jesse, Zhang Jing, Schaeffer Richard D, Cong Qian
Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States.
Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States.
Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae740.
Due to the breakthrough in protein structure prediction by AlphaFold, the scientific community has access to 200 million predicted protein structures with near-atomic accuracy from the AlphaFold protein structure DataBase (AFDB), covering nearly the entire protein universe. Segmenting these models into domains and classifying the domains into an evolutionary hierarchy hold tremendous potential for revealing essential insights into protein function.
We introduce DPAM-AI, a Domain Parser for AlphaFold Models based on Artificial Intelligence. DPAM-AI uses a convolutional neural network trained on previously classified domains in the Evolutionary Classification Of protein Domains (ECOD) database. It integrates inter-residue distances, predicted aligned errors, and sequence and structural alignments to previously classified domains, detected via sequence (HHsuite) and structural (Dali) similarity searches. In rigorous tests on several benchmark sets, DPAM-AI outperformed its predecessor, DPAM, and two other recently published domain parsers, Merizo and Chainsaw. We applied DPAM-AI to representative AFDB models for proteins classified in Pfam and obtained representative 3D structures for 18 487 (89%) of the 20 795 Pfam families. The remaining families either (i) belong to viral proteins that were excluded from AFDB or (ii) do not adopt globular 3D structures. Our structure-aware domain delineation revealed that a considerable fraction (15%) of Pfam domains contain multiple structural and evolutionary units, and it refined the boundaries of over half of the Pfam domains.
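As a rough illustration of how pairwise inputs of the kind listed above (inter-residue distances, predicted aligned errors, and alignment coverage) might be stacked into a single array for a convolutional network, the following minimal sketch may help. It is not the DPAM-AI implementation: the function names `mask_from_hits` and `pairwise_features`, the three-channel layout, the use of C-alpha coordinates, the symmetrization of the predicted aligned error matrix, and the binary encoding of alignment hits are all assumptions made for illustration.

```python
import numpy as np


def mask_from_hits(n_res, hits):
    """Build an (n_res, n_res) mask that is 1 where residues i and j fall in the same hit.

    `hits` is a hypothetical list of (start, end) residue ranges, 0-based and
    half-open, standing in for regions aligned to a known domain.
    """
    mask = np.zeros((n_res, n_res), dtype=float)
    for start, end in hits:
        mask[start:end, start:end] = 1.0
    return mask


def pairwise_features(ca_coords, pae, hit_mask):
    """Stack distance, PAE, and alignment-coverage maps into an (L, L, 3) array."""
    ca_coords = np.asarray(ca_coords, dtype=float)
    pae = np.asarray(pae, dtype=float)
    hit_mask = np.asarray(hit_mask, dtype=float)
    n_res = ca_coords.shape[0]
    if pae.shape != (n_res, n_res) or hit_mask.shape != (n_res, n_res):
        raise ValueError("pae and hit_mask must both have shape (L, L)")

    # Euclidean distance between every pair of C-alpha atoms, via broadcasting.
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # PAE matrices are generally asymmetric; averaging with the transpose is an
    # assumption made here to obtain one value per residue pair.
    pae_sym = 0.5 * (pae + pae.T)

    return np.stack([dist, pae_sym, hit_mask], axis=-1)


if __name__ == "__main__":
    coords = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 10.0]]
    pae = [[0.0, 2.0, 20.0], [4.0, 0.0, 18.0], [22.0, 16.0, 0.0]]
    features = pairwise_features(coords, pae, mask_from_hits(3, [(0, 2)]))
    print(features.shape)
```

In this toy input, the first two residues are close in space, have low predicted aligned error with respect to each other, and share an alignment hit, which is the kind of agreement between channels that a domain parser could use to group residues into one domain.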
Pfam domains and the corresponding DPAM-AI domains are available at http://prodata.swmed.edu/DPAM-pfam/. Our code is deposited at https://github.com/Jsauce5p/DPAM/tree/dpam_ai, and updates will be released through https://github.com/CongLabCode/DPAM.