School of Software, Hallym University, Chuncheon, South Korea.
Bio-IT Research Center, Hallym University, Chuncheon, South Korea.
Biomed Eng Online. 2018 Nov 6;17(Suppl 2):155. doi: 10.1186/s12938-018-0583-4.
Representing words is one of the most important steps in machine learning-based natural language processing. The commonly used one-hot representation produces very large vectors and assumes that the features making up the vector are independent of each other. Word embedding, by contrast, is known to be highly effective at estimating the similarity between words because it captures word meaning well. In this study, we use this ability to clarify the correlations between various terms in biomedical texts. Specifically, we use word embedding to find new biomarkers and microorganisms related to specific diseases.
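The contrast between the two representations can be illustrated with a minimal sketch. The three-word vocabulary and the two-dimensional dense vectors below are hypothetical values chosen only for illustration; they are not taken from the study.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vocabulary (assumption, not from the study).
vocab = ["gastritis", "ulcer", "asthma"]

# One-hot: one dimension per word, so the vector length equals the vocabulary size.
one_hot = np.eye(len(vocab))

# Dense embeddings: made-up 2-dimensional vectors for illustration only.
dense = {
    "gastritis": np.array([0.9, 0.1]),
    "ulcer": np.array([0.8, 0.3]),
    "asthma": np.array([0.1, 0.9]),
}

# Any two distinct one-hot vectors are orthogonal, so their similarity is always 0.
one_hot_sim = cosine(one_hot[0], one_hot[1])

# Dense vectors can place related terms closer together than unrelated ones.
related_sim = cosine(dense["gastritis"], dense["ulcer"])
unrelated_sim = cosine(dense["gastritis"], dense["asthma"])
```

Because one-hot vectors for different words never share a non-zero dimension, they cannot express that two terms are related, whereas dense vectors can.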
In this study, we analyze the correlations between disease-marker pairs and disease-microorganism pairs. First, we construct a corpus likely to be related to them by extracting titles and abstracts from biomedical texts on the PubMed site. Second, we represent disease, marker, and microorganism terms as word embeddings using Canonical Correlation Analysis (CCA). CCA is a statistical method that performs very well at reducing vector dimensionality. Finally, we estimate the relationships within disease-marker pairs and disease-microorganism pairs by measuring the similarity of their embeddings.
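The three steps above can be sketched roughly as follows. This is not the authors' implementation: it assumes a common simplification in which CCA between a word and its context words reduces to a truncated SVD of the count-normalized co-occurrence matrix, and it assumes cosine similarity as the similarity measure. The tokenized toy corpus, the window size, and the function names `cooccurrence`, `cca_embeddings`, and `rank_pairs` are all hypothetical.

```python
import numpy as np

def cooccurrence(docs, window=2):
    """Count how often each word appears within `window` tokens of another."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for d in docs:
        for i, w in enumerate(d):
            lo, hi = max(0, i - window), min(len(d), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    C[index[w], index[d[j]]] += 1
    return vocab, C

def cca_embeddings(C, dim):
    """CCA-style embeddings: SVD of the co-occurrence matrix scaled by
    the inverse square roots of the word and context counts (assumed variant)."""
    row = C.sum(axis=1)
    col = C.sum(axis=0)
    row[row == 0] = 1.0  # avoid division by zero for unseen words
    col[col == 0] = 1.0
    scaled = C / np.sqrt(np.outer(row, col))
    U, _, _ = np.linalg.svd(scaled, full_matrices=False)
    return U[:, :dim]

def rank_pairs(vocab, emb, left_terms, right_terms):
    """Rank (left, right) term pairs by cosine similarity, highest first."""
    index = {w: i for i, w in enumerate(vocab)}
    scored = []
    for a in left_terms:
        for b in right_terms:
            u, v = emb[index[a]], emb[index[b]]
            denom = np.linalg.norm(u) * np.linalg.norm(v) + 1e-12
            scored.append((a, b, float(np.dot(u, v) / denom)))
    return sorted(scored, key=lambda t: t[2], reverse=True)

# Hypothetical tokenized titles standing in for PubMed titles and abstracts.
docs = [
    ["gastritis", "patients", "infected", "with", "helicobacter"],
    ["helicobacter", "infection", "in", "gastritis"],
    ["asthma", "patients", "with", "elevated", "ige"],
    ["ige", "levels", "in", "asthma"],
]

vocab, C = cooccurrence(docs, window=2)
emb = cca_embeddings(C, dim=3)
ranked = rank_pairs(vocab, emb, ["gastritis", "asthma"], ["helicobacter", "ige"])
```

On a real corpus the co-occurrence matrix would be large and sparse, so a sparse truncated SVD would be a more practical choice than the dense `np.linalg.svd` used here.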
In the experiment, we checked the correlations derived through word embedding against Google Scholar search results. Of the 20 most highly correlated disease-marker pairs, about 85% have been studied extensively according to Google Scholar. Conversely, for 85% of the 20 pairs with the lowest correlation, we could not find any study examining the relationship between the disease and the marker. The trend was similar for disease-microorganism pairs.
The correlations between diseases and markers, and between diseases and microorganisms, calculated through word embedding reflect actual research trends. If a pair has a high word-embedding correlation but few published studies, additional research on that pair can be proposed.