Institute of Polish Language, Polish Academy of Sciences, Cracow, Poland.
Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland.
Sci Rep. 2018 May 15;8(1):7598. doi: 10.1038/s41598-018-25440-6.
Computerized linguistic analyses have proven immensely valuable for comparing and searching large text collections ("corpora"), including those deposited on the Internet. Indeed, it would be hard nowadays to imagine browsing the Web without, for instance, search algorithms that extract the most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry through characteristic "chemical words" that go beyond traditional functional groups and instead capture the common structural fragments that molecules share. Using these words, it is possible to quantify the diversity of chemical collections and databases in new ways and to define molecular "keywords" by which such collections are best characterized and annotated.
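As a rough illustration of the keyword idea in the abstract, the sketch below treats each chemical collection as a "document" whose "words" are structural fragments, and ranks the fragments by a standard TF-IDF score. This is an assumption made for illustration only: the abstract does not state which weighting scheme the paper uses. The function name `tf_idf_keywords`, the fragment labels, and the toy collections are hypothetical.

```python
import math
from collections import Counter


def tf_idf_keywords(collections, top_n=2):
    """Rank fragment "words" in each collection by TF-IDF.

    collections: dict mapping a collection name to a list of molecules,
    where each molecule is a list of fragment labels (hypothetical format).
    """
    term_counts = {}
    doc_freq = Counter()  # number of collections containing each fragment
    for name, molecules in collections.items():
        counts = Counter(frag for mol in molecules for frag in mol)
        term_counts[name] = counts
        doc_freq.update(counts.keys())

    n_docs = len(collections)
    keywords = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        # Term frequency times inverse document frequency; a fragment
        # present in every collection scores zero.
        scores = {
            frag: (count / total) * math.log(n_docs / doc_freq[frag])
            for frag, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda f: (-scores[f], f))
        keywords[name] = ranked[:top_n]
    return keywords


# Made-up toy data, not taken from the paper.
toy = {
    "collection_a": [["benzene", "amide"], ["benzene", "amide", "piperidine"]],
    "collection_b": [["benzene", "ester"], ["benzene", "ester", "furan"]],
}
keywords = tf_idf_keywords(toy, top_n=1)
```

In this toy case the shared fragment ("benzene") scores zero because it appears in both collections, so the fragments specific to each collection rise to the top. That is the sense in which a "keyword" characterizes one collection relative to the others.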