Hastings Janna, Glauer Martin, Memariani Adel, Neuhaus Fabian, Mossakowski Till
Department of Computer Science, Otto-von-Guericke University of Magdeburg, Magdeburg, Germany.
J Cheminform. 2021 Mar 16;13(1):23. doi: 10.1186/s13321-021-00500-8.
Chemical data is increasingly openly available in databases such as PubChem, which contains approximately 110 million compound entries as of February 2021. With the availability of data at such scale, the burden has shifted to organisation, analysis and interpretation. Chemical ontologies provide structured classifications of chemical entities that can be used for navigation and filtering of the large chemical space. ChEBI is a prominent example of a chemical ontology, widely used in life science contexts. However, ChEBI is manually maintained and as such cannot easily scale to the full scope of public chemical data. There is a need for tools that are able to automatically classify chemical data into chemical ontologies, which can be framed as a hierarchical multi-class classification problem. In this paper we evaluate machine learning approaches for this task, comparing different learning frameworks including logistic regression, decision trees and long short-term memory artificial neural networks, and different encoding approaches for the chemical structures, including cheminformatics fingerprints and character-based encoding from chemical line notation representations. We find that classical learning approaches such as logistic regression perform well with sets of relatively specific, disjoint chemical classes, while the neural network is able to handle larger sets of overlapping classes but needs more examples per class to learn from, and is not able to make a class prediction for every molecule. Future work will explore hybrid and ensemble approaches, as well as alternative network architectures including neuro-symbolic approaches.
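The two ingredients named in the abstract, character-based encoding of a chemical line notation and classification into a class hierarchy, can be illustrated with a minimal sketch. It is not the authors' implementation. The toy hierarchy, the class names, and the helper functions `with_ancestors`, `build_vocab`, `encode` and `target_vector` are all assumptions made for illustration. Splitting SMILES strings into single characters is also a simplification, since it treats a two-letter atom symbol such as Cl as two tokens.

```python
from typing import Dict, List, Set

# Toy is_a hierarchy (child -> parents). Illustrative names only,
# not actual ChEBI identifiers or the classes used in the paper.
PARENTS: Dict[str, List[str]] = {
    "primary alcohol": ["alcohol"],
    "alcohol": ["organic compound"],
    "carboxylic acid": ["organic compound"],
    "organic compound": [],
}


def with_ancestors(label: str) -> Set[str]:
    """Return the label together with every class that subsumes it."""
    seen: Set[str] = set()
    stack = [label]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def build_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map each character to a positive integer; 0 is reserved for padding."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(smiles: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Character-level encoding, truncated or right-padded with 0."""
    ids = [vocab[ch] for ch in smiles[:length]]
    return ids + [0] * (length - len(ids))


def target_vector(label: str, classes: List[str]) -> List[int]:
    """Multi-hot target: 1 for the label and for each of its ancestors."""
    active = with_ancestors(label)
    return [1 if c in active else 0 for c in classes]


smiles = ["CCO", "CC(=O)O"]  # ethanol, acetic acid
vocab = build_vocab(smiles)
classes = sorted(PARENTS)

ethanol_input = encode("CCO", vocab, 8)
ethanol_target = target_vector("primary alcohol", classes)
```

Fixed-length integer sequences of this kind are a typical input for a sequence model such as an LSTM, while the multi-hot target lets a single molecule belong to several overlapping classes at once. A fingerprint-based pipeline would replace `encode` with a bit vector computed by a cheminformatics toolkit.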