Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding.

Authors

Huang Liyuan, Ling Chen

Affiliation

Toyota Research Institute of North America, 1555 Woodridge Avenue, Ann Arbor, Michigan 48105, United States.

Publication information

ACS Omega. 2019 Oct 31;4(20):18510-18519. doi: 10.1021/acsomega.9b02060. eCollection 2019 Nov 12.

DOI:10.1021/acsomega.9b02060

PMID:31737809

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6854573/

Abstract

In recent years, data-driven methods and artificial intelligence have been widely used in cheminformatics and materials informatics, where success depends critically on the availability of large quantities of high-quality training data. One potential way to break this bottleneck is to leverage the chemical literature, such as papers and patents, as a data resource alternative to high-throughput experiments and simulation. Compared with other domains where natural language processing techniques have had established success, the chemical literature contains a large proportion of multiword phrases, which create additional challenges for accurate identification and representation. Here, we introduce an approach suited to the chemistry domain that identifies multiword chemical terms and trains word representations at the phrase level. Through a series of specially designed experiments, we demonstrate that our method for identifying and representing multiword terms effectively and accurately identifies multiword chemical terms in 119,166 chemical patents and preserves the semantic meaning of chemical phrases more robustly and precisely than the conventional approach, which represents the constituent single words first and combines them afterward. Because the accurate representation of chemical terms is the first and essential step in providing learning features for downstream natural language processing tasks, our results pave the way for using the large volume of chemical literature in future data-driven studies.
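
The idea of phrase-level preprocessing described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not say how multiword terms are detected, so the sketch assumes a lexicon of known multiword terms is already available and only shows the merging step, in which each matched phrase becomes a single token before any embedding model is trained. The function name `merge_multiword_terms`, the `LEXICON` entries, and the example sentence are all hypothetical.

```python
from typing import List, Set, Tuple


def merge_multiword_terms(
    tokens: List[str],
    lexicon: Set[Tuple[str, ...]],
    joiner: str = "_",
) -> List[str]:
    """Greedily replace the longest lexicon match at each position with one token.

    Matching is case-insensitive; merged tokens are emitted in lowercase,
    while unmatched tokens are passed through unchanged.
    """
    max_len = max((len(term) for term in lexicon), default=1)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, so a three-word term wins
        # over a shorter lexicon entry that starts at the same position.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in lexicon:
                merged.append(joiner.join(candidate))
                i += n
                break
        else:
            # No multiword match starts here: keep the single word as-is.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical lexicon and sentence, for illustration only.
LEXICON = {
    ("sodium", "chloride"),
    ("lithium", "iron"),
    ("lithium", "iron", "phosphate"),
}
sentence = "The cathode contains lithium iron phosphate and sodium chloride".split()
merged_tokens = merge_multiword_terms(sentence, LEXICON)
print(merged_tokens)
```

Feeding token streams like `merged_tokens` to a word-embedding trainer would give each phrase its own vector. The conventional approach the abstract contrasts with instead embeds the constituent words separately and combines their vectors afterward.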
