Tan Aik Choon, Gilbert David, Deville Yves
Bioinformatics Research Centre, Department of Computing Science, University of Glasgow, 17 Lilybank Gardens, Glasgow, G12 8QQ, Scotland, United Kingdom.
Genome Inform. 2003;14:206-17.
Protein structure classification is an important step in understanding the associations between sequence and structure, as well as possible functional and evolutionary relationships. Recent structural genomics initiatives and other high-throughput experiments have populated the biological databases at a rapid pace. The volume of structural data has made traditional methods, such as manual inspection of protein structures, infeasible. Machine learning has been widely and successfully applied in bioinformatics. This work proposes a novel ensemble machine learning method that improves the coverage of classifiers on multi-class imbalanced sample sets by integrating knowledge induced from different base classifiers, and we illustrate the idea by classifying multi-class SCOP protein fold data. We compare our approach with PART and show that our method improves the sensitivity of the classifier in protein fold classification. Furthermore, we extend the method to learning over multiple data types while preserving the independence of their corresponding data sources, and show that the new approach performs at least as well as the traditional technique over a single joined data source. These experimental results are encouraging, and the approach can be applied to other bioinformatics problems similarly characterised by multi-class imbalanced data sets held in multiple data sources.
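The notions of coverage and per-class sensitivity used above can be illustrated with a minimal sketch. It is not the method proposed in this work: it assumes a hypothetical interface in which each base classifier returns a fold label or `None` when none of its rules fires, and it combines the base classifiers by a plain majority vote. The function names, the toy rules, and the toy data are all illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical interface: a base classifier maps a feature vector to a
# fold label, or returns None when none of its rules covers the sample.
Classifier = Callable[[Sequence[float]], Optional[str]]


def ensemble_predict(classifiers: Sequence[Classifier],
                     x: Sequence[float]) -> Optional[str]:
    """Majority vote over the base classifiers that do not abstain."""
    votes = [clf(x) for clf in classifiers]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None  # no base classifier covers this sample
    return Counter(votes).most_common(1)[0][0]


def coverage(predict: Classifier, samples: List[Sequence[float]]) -> float:
    """Fraction of samples that receive any prediction at all."""
    return sum(predict(x) is not None for x in samples) / len(samples)


def per_class_sensitivity(predict: Classifier,
                          samples: List[Sequence[float]],
                          labels: List[str]) -> Dict[str, float]:
    """Sensitivity (recall) per class: correct predictions / class size."""
    totals = Counter(labels)
    hits: Counter = Counter()
    for x, y in zip(samples, labels):
        if predict(x) == y:
            hits[y] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Toy rule-based classifiers (illustrative thresholds, not learned rules).
def clf_a(x: Sequence[float]) -> Optional[str]:
    return "alpha" if x[0] > 0.7 else None


def clf_b(x: Sequence[float]) -> Optional[str]:
    return "beta" if x[0] < 0.2 else None


def ensemble(x: Sequence[float]) -> Optional[str]:
    return ensemble_predict([clf_a, clf_b], x)


# Imbalanced toy data: three "alpha" samples, two "beta" samples.
samples = [[0.9], [0.8], [0.75], [0.1], [0.5]]
labels = ["alpha", "alpha", "alpha", "beta", "beta"]

print("coverage of clf_a:", coverage(clf_a, samples))
print("coverage of ensemble:", coverage(ensemble, samples))
print("sensitivity of ensemble:", per_class_sensitivity(ensemble, samples, labels))
```

In this toy setting the ensemble covers samples that neither base classifier covers alone, and the minority class gains sensitivity that a single classifier tuned to the majority class would miss. Reporting sensitivity per class, not overall accuracy, keeps the minority classes visible.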