Denger Andreas, Helms Volkhard
Center for Bioinformatics, Saarland University, Saarbrücken, Germany.
PLoS One. 2024 Dec 19;19(12):e0315330. doi: 10.1371/journal.pone.0315330. eCollection 2024.
Membrane transporters are responsible for moving a wide variety of molecules across biological membranes, making them integral to key biological pathways in all organisms. Identifying all membrane transporters within a (meta-)proteome, along with their specific substrates, provides important information for various research fields, including biotechnology, pharmacology, and metabolomics. Protein datasets are frequently annotated with thousands of molecular functions that form complex networks, often with partial or full redundancy and hierarchical relationships. This complexity, along with the low sample count for more specific functions, makes them unsuitable as classes for supervised learning methods, meaning that the creation of an optimal subset of annotations is required. However, selection of this subset requires extensive manual effort, along with knowledge about the biology behind the respective functions. Here, we present an automated pipeline to address this problem. Unlike previous approaches for reducing redundancy in Gene Ontology (GO) datasets, we employ machine learning to identify a subset of functional annotations in a training dataset. Classes in the resulting predictive model meet four essential criteria: sufficient sample size for training predictive models, minimal redundancy, strong class separability, and relevance to substrate transport. Furthermore, we implemented a pipeline for creating training datasets of transmembrane transporters that cover a wide range of organisms, including plants, bacteria, mammals, and single-cell eukaryotes. For a dataset containing 98.1% of transporters from S. cerevisiae, the pipeline automatically reduced the number of functional annotations from 287 to 11 GO terms that could be classified with a median pairwise F1 score of 0.87±0.16. For a meta-organism dataset containing 96% of all transport proteins from S. cerevisiae, A. thaliana, E. coli and human, the number of classes was reduced from 695 to 49, with a median F1 score of 0.92±0.10 between pairs of GO terms. When the percentage of covered proteins was lowered to 67%, the pipeline found a subset of 30 GO terms with a median F1 score of 0.95±0.06.
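To make two of the four class criteria concrete (sufficient sample size and minimal redundancy), the following is a minimal sketch and not the authors' pipeline, which uses machine learning and also scores class separability with pairwise F1. It assumes annotations are given as a mapping from a term to the set of proteins carrying it, measures redundancy with Jaccard overlap, and selects terms greedily. The function names, the thresholds `min_samples` and `max_overlap`, and the toy term and protein identifiers are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of protein identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_terms(annotations, min_samples=3, max_overlap=0.5):
    """Greedy sketch: keep well-populated terms that overlap little
    with the terms already kept. Thresholds are illustrative assumptions."""
    # Criterion: enough annotated proteins to train a classifier.
    candidates = {t: p for t, p in annotations.items() if len(p) >= min_samples}
    kept = []
    # Visit larger terms first; break ties by term name for determinism.
    for term in sorted(candidates, key=lambda t: (-len(candidates[t]), t)):
        # Criterion: low redundancy with every term kept so far.
        if all(jaccard(candidates[term], candidates[k]) <= max_overlap for k in kept):
            kept.append(term)
    return kept


def coverage(annotations, terms):
    """Fraction of all annotated proteins covered by the selected terms."""
    all_proteins = set().union(*annotations.values())
    covered = set().union(*(annotations[t] for t in terms))
    return len(covered) / len(all_proteins) if all_proteins else 0.0


# Toy, made-up data: placeholder term names and protein identifiers.
annotations = {
    "term_A": {"P1", "P2", "P3", "P4"},
    "term_B": {"P1", "P2", "P3"},  # largely redundant with term_A
    "term_C": {"P5", "P6", "P7"},
    "term_D": {"P8"},              # too few samples
}

selected = select_terms(annotations)
print(selected, coverage(annotations, selected))
```

On the toy data, `term_B` is dropped because it overlaps heavily with `term_A`, and `term_D` is dropped for having too few proteins. Tightening either threshold trades the number of retained classes against the share of proteins covered, which mirrors the coverage trade-off reported in the abstract.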