Functional Site Discovery From Incomplete Training Data: A Case Study With Nucleic Acid-Binding Proteins

Authors

Wang Wenchuan, Langlois Robert, Langlois Marina, Genchev Georgi Z, Wang Xiaolei, Lu Hui

Affiliations

SJTU-Yale Joint Center for Biostatistics and Data Science, Department of Bioinformatics and Biostatistics, College of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China.

Department of Bioengineering and Department of Computer Science, University of Illinois at Chicago, Chicago, IL, United States.

Publication information

Front Genet. 2019 Aug 30;10:729. doi: 10.3389/fgene.2019.00729. eCollection 2019.

DOI: 10.3389/fgene.2019.00729

PMID: 31543893

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6729729/

Abstract

Function annotation efforts provide a foundation for our understanding of cellular processes and the functioning of the living cell. This motivates high-throughput computational methods to characterize new protein members of a particular function. Research work has focused on discriminative machine-learning methods, which promise to make efficient predictions of protein function. However, available function annotation exists predominantly for individual proteins rather than for residues, of which only a subset is necessary for the conveyance of a particular function. This limits discriminative approaches to predicting functions for which there is sufficient residue-level annotation, e.g., identification of DNA-binding proteins, or for which an excellent global representation can be divined. Complete understanding of the various functions of proteins requires discovery and functional annotation at the residue level. Herein, we cast this problem into the setting of multiple-instance learning, which only requires knowledge of the protein's function yet identifies functionally relevant residues and need not rely on homology. We developed a new multiple-instance learning algorithm derived from AdaBoost and benchmarked this algorithm on two well-studied protein function prediction tasks: annotating proteins that bind DNA and RNA. This algorithm outperforms certain previous approaches in annotating protein function while identifying functionally relevant residues involved in binding both DNA and RNA, and on one protein-DNA benchmark, it achieves near-perfect classification.
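The multiple-instance setting described in the abstract can be illustrated with a minimal sketch: each protein is a "bag" of residues (instances), only the protein-level label is known, and a residue-level scorer is aggregated into a protein-level prediction. The sketch below is not the authors' AdaBoost-derived algorithm, which the abstract does not specify. It assumes a hypothetical logistic residue scorer and a noisy-OR aggregation rule, a common choice in multiple-instance learning; the names `instance_prob`, `bag_prob`, and `top_residues` are illustrative only.

```python
import math
from typing import List, Sequence


def instance_prob(features: Sequence[float],
                  weights: Sequence[float],
                  bias: float) -> float:
    """Hypothetical residue-level scorer: logistic of a linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def bag_prob(instance_probs: Sequence[float]) -> float:
    """Noisy-OR aggregation (an assumption, not taken from the paper):
    a protein is positive if at least one residue is positive."""
    all_negative = 1.0
    for p in instance_probs:
        all_negative *= (1.0 - p)
    return 1.0 - all_negative


def top_residues(instance_probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring residues, i.e. candidate functional sites."""
    order = sorted(range(len(instance_probs)),
                   key=lambda i: instance_probs[i],
                   reverse=True)
    return order[:k]


# Toy protein with three residues, each described by two made-up features.
residues = [[0.0, 0.0], [2.0, 1.0], [-1.0, 0.5]]
weights, bias = [1.0, 1.0], 0.0
probs = [instance_prob(r, weights, bias) for r in residues]
protein_score = bag_prob(probs)       # protein-level prediction
candidate_sites = top_residues(probs, 1)  # residue-level explanation
```

Under this framing, training needs only protein-level labels, while the per-residue probabilities give a ranking of residues that may be functionally relevant.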

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/abe2/6729729/3a5d4c3dc810/fgene-10-00729-g001.jpg
