Identify novel elements of knowledge with word embedding.

Affiliations

School of Economics and Management, Harbin Institute of Technology (Shenzhen), Shenzhen, China.

World Intellectual Property Organization, Geneva, Switzerland.

Publication information

PLoS One. 2023 Jun 20;18(6):e0284567. doi: 10.1371/journal.pone.0284567. eCollection 2023.

DOI:10.1371/journal.pone.0284567

PMID:37339138

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10281565/

Abstract

As novelty is a core value in science, a reliable approach to measuring the novelty of scientific documents is critical. Previous novelty measures however had a few limitations. First, the majority of previous measures are based on recombinant novelty concept, attempting to identify a novel combination of knowledge elements, but insufficient effort has been made to identify a novel element itself (element novelty). Second, most previous measures are not validated, and it is unclear what aspect of newness is measured. Third, some of the previous measures can be computed only in certain scientific fields for technical constraints. This study thus aims to provide a validated and field-universal approach to computing element novelty. We drew on machine learning to develop a word embedding model, which allows us to extract semantic information from text data. Our validation analyses suggest that our word embedding model does convey semantic information. Based on the trained word embedding, we quantified the element novelty of a document by measuring its distance from the rest of the document universe. We then carried out a questionnaire survey to obtain self-reported novelty scores from 800 scientists. We found that our element novelty measure is significantly correlated with self-reported novelty in terms of discovering and identifying new phenomena, substances, molecules, etc. and that this correlation is observed across different scientific fields.
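The idea of scoring a document by its distance from the rest of the document universe can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how document vectors are built or which distance is used, so the averaging of word vectors, the cosine distance, the nearest-neighbour aggregation with parameter `k`, and all function names and toy vectors below are assumptions made for illustration only.

```python
import math


def mean_vector(tokens, word_vectors):
    """Average the embeddings of the tokens that have a vector (assumed pooling)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token in this document has an embedding")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def element_novelty(doc_vectors, k=1):
    """Score each document by its mean distance to its k nearest other documents.

    The choice of a k-nearest-neighbour average is hypothetical; other
    aggregations (e.g. a percentile of all distances) would fit the same idea.
    """
    scores = []
    for i, v in enumerate(doc_vectors):
        dists = sorted(
            cosine_distance(v, w) for j, w in enumerate(doc_vectors) if j != i
        )
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


# Toy two-dimensional "embedding"; a real model would be trained on a corpus.
word_vectors = {
    "protein": [1.0, 0.0],
    "enzyme": [0.9, 0.1],
    "graphene": [0.0, 1.0],
}
docs = [
    ["protein", "enzyme"],
    ["enzyme", "protein", "protein"],
    ["graphene"],
]
doc_vectors = [mean_vector(d, word_vectors) for d in docs]
scores = element_novelty(doc_vectors, k=1)
most_novel = max(range(len(docs)), key=scores.__getitem__)
print(most_novel, scores)
```

In this toy corpus the third document, which uses a term far from everything else in the embedding space, receives the highest score. Validating such a score would still require an external benchmark, such as the self-reported novelty ratings the study collected.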

Figure 1 (pone.0284567.g001): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/239c/10281565/b3d8c4345499/pone.0284567.g001.jpg

Similar articles

Measuring novelty in science with word embedding.

PLoS One. 2021 Jul 2;16(7):e0254034. doi: 10.1371/journal.pone.0254034. eCollection 2021.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes.

J Med Internet Res. 2017 Nov 6;19(11):e380. doi: 10.2196/jmir.8344.

A Topic Recognition Method of News Text Based on Word Embedding Enhancement.

Comput Intell Neurosci. 2022 Feb 16;2022:4582480. doi: 10.1155/2022/4582480. eCollection 2022.

Use of word and graph embedding to measure semantic relatedness between Unified Medical Language System concepts.

J Am Med Inform Assoc. 2020 Oct 1;27(10):1538-1546. doi: 10.1093/jamia/ocaa136.

Multi-Ontology Refined Embeddings (MORE): A hybrid multi-ontology and corpus-based semantic representation model for biomedical concepts.

J Biomed Inform. 2020 Nov;111:103581. doi: 10.1016/j.jbi.2020.103581. Epub 2020 Oct 1.

Macromolecular crowding: chemistry and physics meet biology (Ascona, Switzerland, 10-14 June 2012).

Phys Biol. 2013 Aug;10(4):040301. doi: 10.1088/1478-3975/10/4/040301. Epub 2013 Aug 2.

Defect Severity Identification for a Catenary System Based on Deep Semantic Learning.

Sensors (Basel). 2022 Dec 16;22(24):9922. doi: 10.3390/s22249922.

The future of Cochrane Neonatal.

Early Hum Dev. 2020 Nov;150:105191. doi: 10.1016/j.earlhumdev.2020.105191. Epub 2020 Sep 12.

Cited by

Optimizing quarantine in pandemic control: a multi-stage SEIQR modeling approach to COVID-19 transmission dynamics.

BMC Infect Dis. 2025 Jul 1;25(1):877. doi: 10.1186/s12879-025-11253-2.

Enhancing chemical synthesis research with NLP: Word embeddings for chemical reagent identification-A case study on nano-FeCu.

iScience. 2024 Aug 29;27(10):110780. doi: 10.1016/j.isci.2024.110780. eCollection 2024 Oct 18.

Leveraging GPT-4 for identifying cancer phenotypes in electronic health records: a performance comparison between GPT-4, GPT-3.5-turbo, Flan-T5, Llama-3-8B, and spaCy's rule-based and machine learning-based methods.

JAMIA Open. 2024 Jul 3;7(3):ooae060. doi: 10.1093/jamiaopen/ooae060. eCollection 2024 Oct.
