Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg 412 96, Sweden.
Department of Life Sciences, Chalmers University of Technology, Gothenburg 412 96, Sweden.
Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae050.
Proteomic profiles reflect the functional readout of the physiological state of an organism. An increased understanding of what controls and defines protein abundances is of high scientific interest. Saccharomyces cerevisiae is a well-studied model organism, and a large amount of structured knowledge on yeast systems biology is available in databases such as the Saccharomyces Genome Database and in highly curated genome-scale metabolic models such as Yeast8. These datasets, the result of decades of experiments, are rich in information and adhere to semantically meaningful ontologies.
By representing this knowledge in an expressive Datalog database, we generated data descriptors using relational learning that, when combined with supervised machine learning, enable us to predict protein abundances in an explainable manner. We learnt predictive relationships between protein abundances, function and phenotype, such as α-amino acid accumulations and deviations in chronological lifespan. We further demonstrate the power of this methodology on the proteins His4 and Ilv2, connecting qualitative biological concepts to quantified abundances.
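As a rough illustration of the idea above, the following minimal Python sketch turns Datalog-style facts into binary descriptors and relates each descriptor to a measured abundance. It is not the authors' pipeline: the fact schema (`perturbed_gene`, `annotated_with`, `has_phenotype`), the sample and gene names, and all abundance values are invented for illustration, and a simple per-descriptor difference in means stands in for the supervised model.

```python
# Toy illustration only: every fact, name, and abundance value here is invented.
from statistics import mean

# Datalog-style ground facts, stored as (predicate, subject, object) triples.
FACTS = {
    ("perturbed_gene", "sample_a", "gene_x"),
    ("perturbed_gene", "sample_b", "gene_y"),
    ("perturbed_gene", "sample_c", "gene_z"),
    ("perturbed_gene", "sample_d", "gene_w"),
    ("annotated_with", "gene_x", "amino_acid_biosynthesis"),
    ("annotated_with", "gene_y", "amino_acid_biosynthesis"),
    ("annotated_with", "gene_z", "ribosome_biogenesis"),
    ("has_phenotype", "gene_x", "decreased_lifespan"),
    ("has_phenotype", "gene_w", "decreased_lifespan"),
}

# Each descriptor is a conjunctive query of the form
#   descriptor(S) :- perturbed_gene(S, G), predicate(G, value).
DESCRIPTORS = [
    ("annotated_with", "amino_acid_biosynthesis"),
    ("has_phenotype", "decreased_lifespan"),
]


def holds(sample, predicate, value):
    """Return True if the query body is satisfied for some gene G."""
    genes = {o for (p, s, o) in FACTS if p == "perturbed_gene" and s == sample}
    return any((predicate, g, value) in FACTS for g in genes)


def propositionalize(samples):
    """Map each sample to a binary feature vector, one entry per descriptor."""
    return {s: [int(holds(s, p, v)) for (p, v) in DESCRIPTORS] for s in samples}


def descriptor_effect(features, target, index):
    """Mean target where the descriptor holds minus mean where it does not."""
    on = [target[s] for s, row in features.items() if row[index] == 1]
    off = [target[s] for s, row in features.items() if row[index] == 0]
    return mean(on) - mean(off)


# Invented relative abundances of one hypothetical protein per sample.
ABUNDANCE = {"sample_a": 2.0, "sample_b": 1.8, "sample_c": 1.0, "sample_d": 1.2}

features = propositionalize(sorted(ABUNDANCE))
for i, (predicate, value) in enumerate(DESCRIPTORS):
    effect = descriptor_effect(features, ABUNDANCE, i)
    print(f"{predicate}({value}): mean difference {effect:+.2f}")
```

Because each feature corresponds to a readable logical query, any model fitted on these features can be interpreted in terms of the underlying biological concepts, which is what makes this style of prediction explainable.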
All data and processing scripts are available in the following GitHub repository: https://github.com/DanielBrunnsaker/ProtPredict.