Wang Shunli, Li Rui, Wu Huayi
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China; Hubei Luojia Laboratory, Wuhan, China; Collaborative Innovation Center of Geospatial Technology, Wuhan, China.
Comput Methods Programs Biomed. 2023 May;233:107474. doi: 10.1016/j.cmpb.2023.107474. Epub 2023 Mar 11.
With the rapid development of information dissemination technology, the amount of event information contained in massive text collections now far exceeds what humans can grasp intuitively, making it hard to follow the progress of events in chronological order. Temporal information runs through the entire course of an event, from its beginning through its progression to its end, and plays an important role in many natural language processing applications, such as information extraction, question answering, and text summarization. Accurately extracting temporal information from Chinese texts and automatically mapping temporal expressions in natural language onto a time axis are crucial to understanding how events develop and change.
This study proposes a method integrating machine learning with linguistic features (IMLLF) for extracting and normalizing temporal expressions in Chinese texts. Linguistic features are constructed by analyzing the rules that govern how temporal information is expressed, and are combined with machine learning to map the natural-language form of time onto a one-dimensional timeline. The web text dataset we built is divided into five parts for five-fold cross-validation, to compare the effects of different combinations of linguistic features and of different methods. On the open medical dialog dataset, the model trained on the web text dataset is evaluated in three rounds of experiments, each on 200 randomly selected disease descriptions.
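To make the idea of mapping temporal expressions onto a timeline concrete, the following is a minimal rule-based sketch in Python. It is not the IMLLF method: the abstract does not specify the linguistic features or the learning model, so the lexicon `RELATIVE_DAYS`, the three patterns, and the function `normalize` are hypothetical names chosen for illustration, and only a handful of day-level expressions are covered.

```python
import re
from datetime import date, timedelta

# Hypothetical lexicon of relative-day words (assumption, not from the paper).
RELATIVE_DAYS = {"前天": -2, "昨天": -1, "今天": 0, "明天": 1, "后天": 2}

# Illustrative patterns only; a real system would need far broader coverage.
ABSOLUTE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")  # e.g. 2023年3月11日
DAYS_AGO = re.compile(r"(\d+)天前")                        # e.g. 3天前 (3 days ago)
RELATIVE = re.compile("|".join(RELATIVE_DAYS))             # e.g. 昨天 (yesterday)


def normalize(text, reference):
    """Return (expression, ISO date) pairs in order of appearance in text.

    `reference` is the date against which relative expressions are resolved,
    for example the date on which the text was written.
    """
    found = []
    for m in ABSOLUTE.finditer(text):
        year, month, day = map(int, m.groups())
        found.append((m.start(), m.group(), date(year, month, day)))
    for m in DAYS_AGO.finditer(text):
        offset = timedelta(days=int(m.group(1)))
        found.append((m.start(), m.group(), reference - offset))
    for m in RELATIVE.finditer(text):
        offset = timedelta(days=RELATIVE_DAYS[m.group()])
        found.append((m.start(), m.group(), reference + offset))
    found.sort(key=lambda item: item[0])  # order by position in the text
    return [(expr, day.isoformat()) for _, expr, day in found]


if __name__ == "__main__":
    sentence = "患者3天前发热,昨天咳嗽加重,2023年3月11日就诊"
    for expression, iso in normalize(sentence, date(2023, 3, 12)):
        print(expression, iso)
```

Each expression is resolved against an explicit reference date, so the same sentence yields different points on the timeline depending on when it was written. This is the reason normalization is treated as a separate step from extraction.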
With multi-feature fusion the F1-score is 95.2%, which is better than the single-feature and double-feature combinations. The experiments show that the proposed IMLLF method recognizes temporal information in Chinese more accurately than classical methods, with an F1-score of over 95% on both the web text dataset and the medical conversation dataset. For the normalization of time expressions, the accuracy of the IMLLF method is higher than 93%.
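For readers unfamiliar with the metric, the sketch below shows how precision, recall, and F1 can be computed for extracted temporal expressions. It assumes exact-match scoring over (start, end) character spans; the abstract does not state the matching criterion used in the study, and the function name `span_f1` and the sample spans are hypothetical.

```python
def span_f1(gold, predicted):
    """Compute precision, recall and F1 over exact-match spans.

    Both arguments are iterables of (start, end) character offsets.
    Exact matching is an assumption; partial-overlap scoring is also common.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up spans for illustration, not data from the study.
    gold_spans = [(2, 5), (8, 10), (15, 25)]
    predicted_spans = [(2, 5), (8, 10), (30, 32), (40, 42)]
    print(span_f1(gold_spans, predicted_spans))
```

F1 is the harmonic mean of precision and recall, so a high score such as the one reported requires both few spurious extractions and few missed expressions.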
IMLLF achieves good results in extracting and normalizing time expressions on both the web text dataset and the medical conversation dataset, which supports its generality for identifying and quantifying temporal information. The IMLLF method can accurately map temporal information onto the time axis, which lets doctors see at a glance what happened to a patient and when, and helps them make better medical decisions.