Pedersen Bjørn P, Ifrim Georgiana, Liboriussen Poul, Axelsen Kristian B, Palmgren Michael G, Nissen Poul, Wiuf Carsten, Pedersen Christian N S
Centre for Membrane Pumps in Cells and Disease - PUMPKIN, Danish National Research Foundation, Aarhus C, Denmark; Department of Molecular Biology, Aarhus University, Aarhus C, Denmark.
INSIGHT Centre for Data Analytics, University College Dublin, Dublin, Ireland.
PLoS One. 2014 Jan 20;9(1):e85139. doi: 10.1371/journal.pone.0085139. eCollection 2014.
Structured Logistic Regression (SLR) is a recently developed machine learning method first proposed for text categorization. The current availability of extensive protein sequence databases calls for an automated method to classify sequences reliably, and SLR seems well suited to this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps that transport essential cations, was selected as a test case that would generate important biological information and provide a proof of concept for applying SLR to a large-scale bioinformatics problem.
Using SLR, we built classifiers that identify P-type ATPases and automatically assign each to one of 11 pre-defined classes. The SLR classifiers are compared with a Hidden Markov Model approach and shown to be highly accurate and scalable. We analysed 9.3 million sequences in UniProtKB, which represent the bulk of currently known sequences, and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps across organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases.
Using the SLR-based classification tool, we are able to run a large-scale study of P-type ATPases. This study provides a proof of concept for the application of SLR to a bioinformatics problem, and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
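The general idea of classifying protein sequences with a logistic model over subsequence features can be illustrated with a minimal sketch. This is not the SLR algorithm or the authors' implementation: it assumes fixed-length 3-mers as features, a single binary decision (pump versus non-pump) rather than 11 classes, and plain stochastic gradient descent. The function names and the toy sequences are hypothetical and made up for illustration; only the DKTG motif in the positive examples reflects the conserved phosphorylation-site motif of P-type ATPases.

```python
import math
from collections import Counter


def kmer_counts(seq, k=3):
    """Count the overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(features, weights, bias):
    """Linear score: bias plus the weighted k-mer counts."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in features.items())


def train_logistic(samples, labels, k=3, lr=0.1, epochs=200):
    """Fit binary logistic regression on k-mer counts by stochastic gradient descent."""
    weights, bias = {}, 0.0
    features = [kmer_counts(s, k) for s in samples]
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = y - sigmoid(score(x, weights, bias))
            bias += lr * err
            for f, v in x.items():
                weights[f] = weights.get(f, 0.0) + lr * err * v
    return weights, bias


def predict_proba(seq, weights, bias, k=3):
    """Estimated probability that a sequence belongs to the positive class."""
    return sigmoid(score(kmer_counts(seq, k), weights, bias))


# Toy training data (hypothetical): label 1 contains the DKTG motif, label 0 does not.
train_seqs = ["MADKTGTLTA", "GGDKTGTLTQ", "LLDKTGTVVA",
              "MAAAGGLLVA", "GGSSPPLLQQ", "LLVVAAGGSS"]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(train_seqs, train_labels)
print(predict_proba("PPDKTGTLHH", weights, bias))  # unseen sequence with the motif
print(predict_proba("AAGGLLSSPP", weights, bias))  # unseen sequence without it
```

A multi-class variant could train one such binary model per class and pick the highest-scoring class, although whether the published classifiers are organised that way is not stated here.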