Department of Biology, Carleton University, Ottawa, Canada.
BMC Bioinformatics. 2012 Jan 6;13:3. doi: 10.1186/1471-2105-13-3.
The advent of high-throughput experimentation in biochemistry has led to the generation of vast amounts of chemical data, necessitating the development of novel analysis, characterization, and cataloguing techniques and tools. Recently, a movement to publicly release such data has advanced biochemical structure-activity relationship research while posing new challenges, the biggest being the curation, annotation, and classification of this information to facilitate useful biochemical pattern analysis. Unfortunately, the human resources currently employed by the organizations supporting these efforts (e.g., ChEBI) are expanding linearly, while new useful scientific information is being released in a seemingly exponential fashion. Compounding this, existing chemical classification and annotation systems are not amenable to automated classification, formal and transparent axiomatization of chemical class definitions, facile class redefinition, or novel class integration, which further limits chemical ontology growth by necessitating human involvement in curation. Clearly, there is a need to automate this process, especially for novel chemical entities of biological interest.
To address this, we present a formal framework, built on Semantic Web technologies, for the automatic design of a chemical ontology that can be used for automated classification of novel entities. We demonstrate the automatic self-assembly of a structure-based chemical ontology from 60 MeSH and 40 ChEBI chemical classes. This ontology is then used to classify 200 compounds with an accuracy of 92.7%. We extend these structure-based classes with molecular feature information and demonstrate the utility of our framework for classification of functionally relevant chemicals. Finally, we discuss an iterative approach that we envision for future biochemical ontology development.
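The idea of a structure-based ontology that assembles itself and then classifies new compounds can be illustrated with a minimal sketch. It is not the framework described here, which relies on Semantic Web technologies and formal logic. It is a toy set-based analogue that assumes each class is defined by a set of required structural features, that one class subsumes another when its required features are a proper subset of the other's, and that a compound is represented by the set of features it contains. All class names, feature names, and function names below are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical class definitions: each class is the set of structural
# features a compound must contain. Names are illustrative only.
CLASS_DEFINITIONS: Dict[str, FrozenSet[str]] = {
    "organic_hydroxy_compound": frozenset({"hydroxyl"}),
    "carboxylic_acid": frozenset({"carboxyl"}),
    "hydroxy_acid": frozenset({"hydroxyl", "carboxyl"}),
    "amino_acid": frozenset({"amine", "carboxyl"}),
}


def build_hierarchy(definitions: Dict[str, FrozenSet[str]]) -> Dict[str, Set[str]]:
    """Infer the direct superclasses of every class from its definition.

    Class B subsumes class A when B's required features are a proper
    subset of A's: anything satisfying A necessarily satisfies B.
    """
    parents: Dict[str, Set[str]] = {}
    for name, features in definitions.items():
        ancestors = {
            other for other, other_features in definitions.items()
            if other_features < features
        }
        # Keep only direct parents: drop an ancestor if a more specific
        # ancestor sits between it and the current class.
        parents[name] = {
            a for a in ancestors
            if not any(definitions[a] < definitions[b] for b in ancestors)
        }
    return parents


def classify(compound_features: Set[str],
             definitions: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return the most specific classes whose definitions the compound meets."""
    matches = {
        name for name, features in definitions.items()
        if features <= compound_features
    }
    # Discard any matched class that is subsumed by another matched class.
    return sorted(
        name for name in matches
        if not any(definitions[name] < definitions[other] for other in matches)
    )


if __name__ == "__main__":
    hierarchy = build_hierarchy(CLASS_DEFINITIONS)
    for cls in sorted(hierarchy):
        print(cls, "->", sorted(hierarchy[cls]))
    # Feature sets written by hand for illustration, not computed from structures.
    print(classify({"hydroxyl", "carboxyl"}, CLASS_DEFINITIONS))
    print(classify({"hydroxyl", "carboxyl", "amine"}, CLASS_DEFINITIONS))
```

Because the hierarchy is derived from the definitions rather than written by hand, redefining a class or adding a new one only requires editing `CLASS_DEFINITIONS` and rebuilding, which is the property the abstract calls facile class redefinition and novel class integration.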
We conclude that the proposed methodology can ease the burden of chemical data annotators and dramatically increase their productivity. We anticipate that the use of formal logic in our proposed framework will make chemical classification criteria more transparent to humans and machines alike and will thus facilitate predictive and integrative bioactivity model development.