Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding.

Authors

Huang Liyuan, Ling Chen

Affiliation

Toyota Research Institute of North America, 1555 Woodridge Avenue, Ann Arbor, Michigan 48105, United States.

Publication information

ACS Omega. 2019 Oct 31;4(20):18510-18519. doi: 10.1021/acsomega.9b02060. eCollection 2019 Nov 12.

Abstract

In recent years, data-driven methods and artificial intelligence have been widely used in the chemoinformatics and materials informatics domains, where success depends critically on the availability of large quantities of high-quality training data. A potential way to break this bottleneck is to leverage the chemical literature, such as papers and patents, as a data resource alternative to high-throughput experiments and simulations. Compared to other domains where natural language processing techniques have established successes, the chemical literature contains a large proportion of multiword phrases, which create additional challenges for accurate identification and representation. Here, we introduce an approach suited to the chemistry domain that identifies multiword chemical terms and trains word representations at the phrase level. Through a series of specially designed experiments, we demonstrate that our method effectively and accurately identifies multiword chemical terms from 119,166 chemical patents and preserves the semantic meaning of chemical phrases more robustly and precisely than the conventional approach, which represents the constituent single words first and combines them afterward. Because the accurate representation of chemical terms is the first and essential step in providing learning features for downstream natural language processing tasks, our results pave the way for utilizing the large volume of chemical literature in future data-driven studies.
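
The idea of phrase-level preprocessing can be illustrated with a minimal sketch. The abstract does not say how the authors identify multiword terms, so the sketch below assumes a list of terms is already available and only shows the merging step: each known multiword term is joined into a single token, so that a word-embedding model trained on the output would learn one vector per phrase instead of composing vectors of the constituent words. The function name `merge_phrases`, the `TERMS` list, and the example sentence are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple


def merge_phrases(
    tokens: List[str],
    phrases: Iterable[Tuple[str, ...]],
    sep: str = "_",
) -> List[str]:
    """Greedily replace the longest known multiword term at each position
    with a single joined token (illustrative only, not the paper's method)."""
    phrase_set: Set[Tuple[str, ...]] = {tuple(p) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first so that longer terms win.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(tokens[i:i + n])
            if window in phrase_set:
                merged.append(sep.join(window))
                i += n
                break
        else:
            # No multiword term starts here; keep the single word as-is.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical term list and sentence, for illustration only.
TERMS = [
    ("sodium", "chloride"),
    ("lithium", "ion"),
    ("lithium", "ion", "battery"),
]
sentence = "the lithium ion battery electrolyte contains sodium chloride".split()
print(merge_phrases(sentence, TERMS))
# ['the', 'lithium_ion_battery', 'electrolyte', 'contains', 'sodium_chloride']
```

The merged token stream could then be passed to any word-embedding trainer. In the conventional approach the abstract contrasts against, the sentence would instead be left as single words and a phrase vector would be built afterward from the vectors of its parts.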


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7720/6854573/4e8bf82525c7/ao9b02060_0004.jpg
