Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow 119121, Russia.
J Chem Inf Model. 2021 Apr 26;61(4):1683-1690. doi: 10.1021/acs.jcim.0c01054. Epub 2021 Mar 16.
The growing body of experimental data on chemical objects includes properties of small molecules, results of studies of their interactions with human and animal proteins, and methods for synthesizing organic compounds (OCs). These data can be used to identify the names of OCs automatically, including all possible synonyms, together with relevant data on molecular properties and biological activity. Using the different synonymous names of a chemical compound lets researchers retrieve more complete data on its properties from publications. Enriching the data on compound names with information about possible metabolites can help estimate the biological effects of parent compounds and their metabolites more thoroughly. Automated extraction of the names of parent compounds and their metabolites from texts is therefore an important task. In our study, we aimed to develop a method for extracting the named entities (NEs) of parent compounds and their metabolites from abstracts of scientific publications. We extracted the NEs of chemical compounds using the conditional random fields algorithm. We developed a set of rules that identify the NEs of parent compounds and their metabolites in the texts. We evaluated the possibility of extracting the names of potential metabolites based on the cosine similarity between the string representing the name of a parent compound and all other chemical NEs found in the text. Additionally, we used conditional random fields to extract the names of parent compounds and their metabolites from the texts, based on a manually labeled corpus. Our computational experiments showed that combining the rules with cosine similarity could increase the accuracy of metabolite name recognition compared with both the rule-based algorithm alone and the machine-learning algorithm (conditional random fields).
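The cosine-similarity step can be illustrated with a minimal sketch. The abstract does not say how compound names are turned into vectors, so the character trigram counts used here are an assumption, as are the function names, the 0.3 threshold, and the example compound names; none of them are taken from the study.

```python
from collections import Counter
from math import sqrt


def char_ngrams(name, n=3):
    """Count character n-grams of a lowercased name (assumed representation)."""
    s = name.lower()
    if len(s) < n:
        # Names shorter than n are kept as a single token.
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_similarity(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two names."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def rank_candidates(parent, entities, threshold=0.3, n=3):
    """Return (name, score) pairs at or above the threshold, best first.

    The threshold is a hypothetical value chosen for illustration.
    Entities identical to the parent name (ignoring case) are skipped.
    """
    scored = [
        (e, cosine_similarity(parent, e, n))
        for e in entities
        if e.lower() != parent.lower()
    ]
    kept = [(e, s) for e, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    found = ["morphine-3-glucuronide", "aspirin", "Morphine"]
    for name, score in rank_candidates("morphine", found):
        print(f"{name}\t{score:.3f}")
```

In this sketch, a chemical NE that shares a long substring with the parent name scores high and is kept as a candidate metabolite, while an unrelated name scores zero. String similarity alone cannot confirm a metabolic relationship, which is consistent with the abstract's use of it in combination with rules.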