Heider Dominik, Verheyen Jens, Hoffmann Daniel
Department of Bioinformatics, Center of Medical Biotechnology, University of Duisburg-Essen, Universitaetsstr. 2, 45117 Essen, Germany.
BMC Res Notes. 2011 Mar 31;4:94. doi: 10.1186/1756-0500-4-94.
Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins, or to predict protein functional classes. Because deletions and insertions are frequent in biological sequences, a major limitation of current methods is their inability to handle varying sequence lengths.
We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for normalizing the sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to classify sequences into "positive" and "negative" samples. Statistical tests showed that linear interpolation outperformed the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme improved prediction accuracy by up to 14%.
We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and is thus a promising alternative to existing methods, especially for protein sequences of variable length.
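The length normalization described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each residue is first mapped to a numeric descriptor (the abstract does not say which encoding was used; Kyte-Doolittle hydropathy is chosen here purely for illustration) and that the resulting numeric profile is resampled to a fixed number of points by linear interpolation. The names `HYDROPATHY`, `encode`, and `normalize_length` are hypothetical.

```python
from typing import List

# Assumption: Kyte-Doolittle hydropathy as the numeric descriptor.
# The abstract does not specify the encoding; this is illustrative only.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence: str) -> List[float]:
    """Map each amino acid letter to its numeric descriptor value."""
    return [HYDROPATHY[residue] for residue in sequence.upper()]


def normalize_length(values: List[float], target_len: int) -> List[float]:
    """Resample a numeric profile to target_len points by linear interpolation.

    The first and last values are preserved; interior points are taken at
    evenly spaced positions along the original profile.
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot interpolate an empty profile")
    if target_len < 2:
        raise ValueError("target_len must be at least 2")
    if n == 1:
        # A single value carries no slope, so repeat it.
        return [values[0]] * target_len

    result = []
    for j in range(target_len):
        # Position of output point j on the original index axis [0, n - 1].
        pos = j * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        result.append(values[lo] * (1 - frac) + values[hi] * frac)
    return result


# Two sequences of different length end up as vectors of equal length.
short_profile = normalize_length(encode("MKVLA"), 8)
long_profile = normalize_length(encode("MKVLAAGIVGLW"), 8)
```

After this step every sequence is represented by a vector of the same length, so the vectors could be passed as feature rows to a standard classifier such as a random forest, which is the classifier named in the abstract.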