Integrating machine learning with linguistic features: A universal method for extraction and normalization of temporal expressions in Chinese texts.

Author information

Wang Shunli, Li Rui, Wu Huayi

Affiliations

State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China; Hubei Luojia Laboratory, Wuhan, China; Collaborative Innovation Center of Geospatial Technology, Wuhan, China.

Publication information

Comput Methods Programs Biomed. 2023 May;233:107474. doi: 10.1016/j.cmpb.2023.107474. Epub 2023 Mar 11.

DOI:10.1016/j.cmpb.2023.107474

PMID:36931017

Abstract

BACKGROUND AND OBJECTIVE

With the rapid development of information dissemination technology, the amount of event information contained in massive text collections now far exceeds what humans can grasp intuitively, making it hard to follow the progress of events in chronological order. Temporal information runs through the beginning, progression, and end of events, and plays an important role in many natural language processing applications, such as information extraction, question answering, and text summarization. Accurately extracting temporal information from Chinese texts and automatically mapping temporal expressions in natural language onto the time axis are crucial to understanding how events develop and change.

METHODS

To achieve these objectives, this study proposes a method integrating machine learning with linguistic features (IMLLF) for extraction and normalization of temporal expressions in Chinese texts. Linguistic features are constructed by analyzing the expression rules of temporal information and are combined with machine learning to map the natural language form of time onto a one-dimensional timeline. The web text dataset we built is divided into five parts for five-fold cross-validation, to compare the effect of different combinations of linguistic features and of different methods. On the open medical dialog dataset, using the model trained on the web text dataset, 200 disease descriptions are randomly selected each time for three rounds of experiments.
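To make the normalization step concrete, the following is a minimal Python sketch of mapping a few Chinese temporal expressions onto a calendar date relative to a reference date. It is an illustration only, not the IMLLF method: the abstract does not list the linguistic features or the learning model, so the `RELATIVE_DAYS` table, the two regular expressions, and the `normalize` function are all hypothetical names chosen for this example.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Hypothetical lookup of relative-day words (assumption, not from the paper).
RELATIVE_DAYS = {"前天": -2, "昨天": -1, "今天": 0, "明天": 1, "后天": 2}

# Absolute dates such as "2023年3月11日" and offsets such as "3天前" (3 days ago).
ABSOLUTE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})[日号]")
DAYS_AGO = re.compile(r"(\d+)天前")


def normalize(expr: str, reference: date) -> Optional[date]:
    """Map a temporal expression to a date, or return None if unrecognized."""
    match = ABSOLUTE.fullmatch(expr)
    if match:
        year, month, day = map(int, match.groups())
        return date(year, month, day)
    if expr in RELATIVE_DAYS:
        return reference + timedelta(days=RELATIVE_DAYS[expr])
    match = DAYS_AGO.fullmatch(expr)
    if match:
        return reference - timedelta(days=int(match.group(1)))
    return None


if __name__ == "__main__":
    ref = date(2023, 3, 11)  # reference date, e.g. when the text was written
    for text in ["2023年3月11日", "昨天", "3天前"]:
        print(text, normalize(text, ref))
```

Relative expressions only become points on the timeline once a reference date is fixed, which is why `normalize` takes `reference` as an explicit argument in this sketch.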

RESULTS

Multi-feature fusion achieves an F1 of 95.2%, outperforming both single-feature and two-feature combinations. The experiments show that the proposed IMLLF method recognizes temporal information in Chinese more accurately than classical methods, with an F1-score above 95% on both the web text dataset and the medical conversation dataset. For the normalization of temporal expressions, the accuracy of the IMLLF method is higher than 93%.
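For readers unfamiliar with the metric, the sketch below shows how precision, recall, and F1 could be computed for extracted temporal expressions. It assumes exact-match scoring over `(start, end)` character spans; the abstract does not say which matching criterion was used, and the function name `span_f1` and the sample spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of one expression


def span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span matching (an assumption)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {(0, 2), (5, 9), (12, 14)}       # annotated expressions (made up)
    predicted = {(0, 2), (5, 9), (20, 22)}  # system output (made up)
    print(span_f1(gold, predicted))
```

F1 is the harmonic mean of precision and recall, so a high score requires the system both to find most annotated expressions and to avoid spurious ones.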

CONCLUSIONS

IMLLF achieves better results in extracting and normalizing temporal expressions on both the web text dataset and the medical conversation dataset, which verifies the universality of IMLLF for identifying and quantifying temporal information. The IMLLF method can accurately map temporal information onto the time axis, making it easy for doctors to see at a glance what happened to the patient and when, which helps them make better medical decisions.
