Parwez Md Aslam, Fazil Mohd, Arif Muhammad, Nafis Md Tabrez, Auwul Md Rabiul
Department of Computer Science & Engineering, Jamia Hamdard, New Delhi, India.
University of Limerick, Limerick, Ireland.
Comput Intell Neurosci. 2023 Feb 15;2023:2989791. doi: 10.1155/2023/2989791. eCollection 2023.
Due to the increasing use of information technologies by biomedical experts, researchers, public health agencies, and healthcare professionals, the volume of scientific literature, clinical notes, and other structured and unstructured text resources is growing rapidly, and these resources are stored in various data sources such as PubMed. These massive text resources can be leveraged to extract valuable knowledge and insights using machine learning techniques. Recent neural network-based classification models, which take numeric vectors of the training data as input, have gained popularity. The better the input vectors, the more accurate the classification. Word representations are learned as the distribution of words in an embedding space, wherein each word has its own vector and words that are semantically similar based on their contexts appear close to each other. However, such distributional word representations are incapable of encapsulating relational semantics between distant words. In the biomedical domain, extracting relational words that associate distant entities, generally the subject and object of a sentence, is a well-studied problem. Our goal is to capture the relational semantic information between distant words from a large corpus to learn enhanced word representations, and to employ the learned representations for natural language processing tasks such as text classification. In this article, we propose an application of biomedical relation triplets to learn word representations by incorporating relational semantic information into the distributional representation of words. In other words, the proposed approach aims to capture both the distributional and the relational contexts of words to learn their numeric vectors from a text corpus. We also propose an application of the learned word representations to text classification.
The proposed approach is evaluated over multiple benchmark datasets, and the efficacy of the learned word representations is tested in terms of and tasks. Our proposed approach provides better performance in comparison to the state-of-the-art GloVe model. Furthermore, we have applied the learned word representations to classify biomedical texts using four neural network-based classification models, and the classification accuracy further confirms the effectiveness of the learned word representations by our proposed approach.
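The idea of combining distributional and relational contexts can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a GloVe-style windowed co-occurrence count, and the function names, the `relation_weight` parameter, and the toy sentence and triplet are all hypothetical, chosen only to show how a relation triplet could link words that lie outside each other's context window.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Symmetric windowed co-occurrence counts, weighted by 1/distance."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right; symmetry is added explicitly below.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                context = tokens[j]
                weight = 1.0 / (j - i)
                counts[(word, context)] += weight
                counts[(context, word)] += weight
    return counts


def add_relational_context(counts, triplets, relation_weight=1.0):
    """Hypothetical augmentation: members of a (subject, relation, object)
    triplet count as each other's context regardless of their distance."""
    augmented = defaultdict(float, counts)
    for subj, rel, obj in triplets:
        for a, b in ((subj, rel), (rel, obj), (subj, obj)):
            augmented[(a, b)] += relation_weight
            augmented[(b, a)] += relation_weight
    return augmented


# Toy data (made up for illustration only).
sentences = [["aspirin", "given", "daily", "for", "weeks",
              "reduces", "chronic", "joint", "inflammation"]]
triplets = [("aspirin", "reduces", "inflammation")]

base = cooccurrence_counts(sentences, window=2)
augmented = add_relational_context(base, triplets)
```

In this toy case "aspirin" and "inflammation" never share a window, so the purely distributional counts contain no entry for the pair, while the augmented counts do. An embedding model trained on the augmented counts would therefore see the relational link as well as the local context.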