Koutsomitropoulos Dimitrios A, Andriopoulos Andreas D
Department of Computer Engineering and Informatics, School of Engineering, University of Patras, Patras, Greece.
Neural Comput Appl. 2022;34(2):937-950. doi: 10.1007/s00521-021-06053-z. Epub 2021 May 11.
The special nature, volume and breadth of biomedical literature pose barriers for automated classification methods. On the other hand, manual indexing is time-consuming, costly and error-prone. We argue that current word embedding algorithms can be efficiently used to support the task of biomedical text classification, even in a multilabel setting with many distinct labels. The ontology representation of Medical Subject Headings provides machine-readable labels and specifies the dimensionality of the problem space. Both deep and shallow network approaches are implemented. Predictions are determined by the similarity between features extracted from contextualized representations of abstracts and headings. The addition of a separate classifier for transfer learning is also proposed and evaluated. Large datasets of biomedical citations are harvested for their metadata and used for training and testing. These automated approaches are still far from entirely substituting human experts, yet they can be useful as a mechanism for validation and recommendation. Dataset balancing, distributed processing and training parallelization on GPUs all play an important part in the effectiveness and performance of the proposed methods.
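The similarity-based prediction step described in the abstract can be illustrated with a minimal sketch. It assumes that abstracts and headings have already been mapped to fixed-length vectors, and it uses cosine similarity with a top-k cutoff; the abstract names neither the similarity measure nor the selection rule, so both are assumptions. The function names, the three heading names and the 3-dimensional vectors are hypothetical toy values, not data or code from the paper.

```python
import numpy as np


def cosine_similarity_matrix(doc_vecs, label_vecs):
    """Return cosine similarity between every document and every label vector."""
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    label_unit = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    return doc_unit @ label_unit.T  # shape: (n_docs, n_labels)


def predict_labels(doc_vecs, label_vecs, label_names, top_k=2):
    """Assign each document its top_k most similar labels (assumed rule)."""
    sims = cosine_similarity_matrix(doc_vecs, label_vecs)
    top = np.argsort(-sims, axis=1)[:, :top_k]  # most similar first
    return [[label_names[j] for j in row] for row in top]


# Toy embeddings, made up for illustration only.
label_names = ["Neoplasms", "Diabetes Mellitus", "Hypertension"]
label_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
doc_vecs = np.array([
    [0.9, 0.1, 0.3],  # hypothetical abstract 1
    [0.0, 0.2, 0.8],  # hypothetical abstract 2
])

predictions = predict_labels(doc_vecs, label_vecs, label_names, top_k=2)
print(predictions)
```

In a multilabel setting with many headings, a similarity threshold could replace the fixed top-k cutoff; which of the two suits the task is a design choice this sketch does not settle.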