Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, British Columbia, Canada.
Department of Microbiology & Immunology, University of British Columbia, 2552-2350 Health Sciences Mall, Vancouver, British Columbia, Canada.
PLoS Comput Biol. 2020 Oct 1;16(10):e1008174. doi: 10.1371/journal.pcbi.1008174. eCollection 2020 Oct.
Metabolic inference from genomic sequence information is a necessary step in determining the capacity of cells to make a living in the world at different levels of biological organization. A common method for determining the metabolic potential encoded in genomes is to map conceptually translated open reading frames onto a database containing known product descriptions. Such gene-centric methods are limited in their capacity to predict pathway presence or absence and do not support standardized rule sets for automated and reproducible research. Pathway-centric methods based on defined rule sets or machine learning algorithms provide an adjunct or alternative inference method that supports hypothesis generation and testing of metabolic relationships within and between cells. Here, we present mlLGPR, multi-label based on logistic regression for pathway prediction, a software package that uses supervised multi-label classification and rich pathway features to infer metabolic networks in organismal and multi-organismal datasets. We evaluated mlLGPR performance using a corpus of 12 experimental datasets manifesting diverse multi-label properties, including manually curated organismal genomes, synthetic microbial communities and low-complexity microbial communities. The resulting performance metrics equaled or exceeded previous reports for organismal genomes and identified specific challenges associated with feature engineering and training data for community-level metabolic inference.
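The general idea of multi-label pathway prediction with logistic regression can be illustrated with a minimal sketch. It is not the mlLGPR implementation: it assumes a simple binary-relevance setup (one independent L2-regularised logistic regression per pathway), and the feature matrix, pathway labels, function names and hyperparameters are all hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_binary(X, y, lr=0.5, epochs=500, l2=0.01):
    """Fit one L2-regularised logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        # Average gradient plus L2 penalty on the weights (not the bias).
        w = [wj - lr * (gwj / n + l2 * wj) for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def train_multilabel(X, Y):
    # Binary relevance (an assumption of this sketch): one classifier per pathway.
    n_labels = len(Y[0])
    return [train_binary(X, [row[k] for row in Y]) for k in range(n_labels)]

def predict_multilabel(models, x, threshold=0.5):
    # A pathway is called present when its predicted probability reaches the threshold.
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold)
        for w, b in models
    ]

# Hypothetical toy data: rows are genomes, columns are enzyme-presence features.
X = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
# Hypothetical labels: column 0 is "pathway A", column 1 is "pathway B".
Y = [
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
]

models = train_multilabel(X, Y)
predictions = [predict_multilabel(models, x) for x in X]
print(predictions)
```

Each genome can carry any subset of pathways, which is what makes the task multi-label rather than multi-class. On this linearly separable toy set the fitted models should reproduce the training labels; real datasets would need far richer features and held-out evaluation.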