

Deep Phenotyping of Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives With a Sequence Motif Discovery Tool: Algorithm Development and Validation.

Author information

Li Shicheng, Deng Lizong, Zhang Xu, Chen Luming, Yang Tao, Qi Yifan, Jiang Taijiao

Affiliations

Institute of Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China.

Suzhou Institute of Systems Medicine, Suzhou, China.

Publication information

J Med Internet Res. 2022 Jun 3;24(6):e37213. doi: 10.2196/37213.

Abstract

BACKGROUND

Phenotype information in electronic health records (EHRs) is mainly recorded in unstructured free text, which cannot be used directly for clinical research. EHR-based deep-phenotyping methods can structure this information with high fidelity, which has made them a focus of medical informatics. However, developing a deep-phenotyping method for non-English EHRs, such as Chinese EHRs, is challenging. Although China has abundant EHR resources, the fine-grained annotation data needed to develop deep-phenotyping methods are limited, so any such method must be developed in a low-resource scenario.

OBJECTIVE

In this study, we aimed to develop a deep-phenotyping method with good generalization ability for Chinese EHRs based on limited fine-grained annotation data.

METHODS

The core of the methodology was to identify linguistic patterns of phenotype descriptions in Chinese EHRs with a sequence motif discovery tool and then to perform deep phenotyping by recognizing those patterns in free text. Specifically, 1000 Chinese EHRs were manually annotated based on a fine-grained information model, PhenoSSU (Semantic Structured Unit of Phenotypes). The annotated data set was randomly divided into a training set (n=700, 70%) and a testing set (n=300, 30%). Linguistic patterns were mined in three steps. First, free text in the training set was encoded as single-letter sequences (P: phenotype, A: attribute). Second, a biological sequence analysis tool, MEME (Multiple EM for Motif Elicitation), was used to identify motifs in the single-letter sequences. Finally, the identified motifs were reduced to a series of regular expressions representing the linguistic patterns of PhenoSSU instances in Chinese EHRs. Based on the discovered linguistic patterns, we developed a deep-phenotyping method for Chinese EHRs that combines a deep learning-based method for named entity recognition with a pattern recognition-based method for attribute prediction.
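The encoding and pattern-matching steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the letters P and A come from the abstract, but the letter O for all other tokens, the pattern `A*PA*`, the helper names, and the example tokens are assumptions made for illustration. The six regular expressions derived in the study are not listed in the abstract, and the MEME motif discovery step is omitted here.

```python
import re

# Labels "P" (phenotype) and "A" (attribute) follow the abstract.
# "O" (any other token) is an assumption made for this sketch.

def encode(tokens):
    """Map (text, label) tokens to a single-letter sequence, e.g. 'APOPA'."""
    return "".join(label for _, label in tokens)

# Hypothetical pattern: one phenotype with any adjacent attributes.
# It stands in for the study's regular expressions, which are not given.
PATTERN = re.compile(r"A*PA*")

def group_attributes(tokens):
    """Return one (phenotype, [attributes]) pair per pattern match."""
    sequence = encode(tokens)
    results = []
    for match in PATTERN.finditer(sequence):
        # Each letter corresponds to one token, so match offsets index tokens.
        span = tokens[match.start():match.end()]
        phenotype = next(text for text, label in span if label == "P")
        attributes = [text for text, label in span if label == "A"]
        results.append((phenotype, attributes))
    return results

# Made-up example: "no fever, cough for 3 days".
tokens = [("无", "A"), ("发热", "P"), ("，", "O"), ("咳嗽", "P"), ("3天", "A")]
print(encode(tokens))
print(group_attributes(tokens))
```

Because the match is greedy, an attribute sitting between two phenotypes attaches to the first one in this sketch. A real rule set would need more specific patterns to resolve such cases.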

RESULTS

In total, 51 statistically significant sequence motifs were mined from the 700 Chinese EHRs in the training set and combined into six regular expressions. These six regular expressions could be learned from a mean of 134 (SD 9.7) annotated EHRs in the training set. The deep-phenotyping algorithm for Chinese EHRs recognized PhenoSSU instances with an overall accuracy of 0.844 on the testing set. For the subtask of entity recognition, the algorithm achieved an F1 score of 0.898 with the BERT-BiLSTM-CRF model (Bidirectional Encoder Representations from Transformers, bidirectional long short-term memory, and conditional random field); for the subtask of attribute prediction, the algorithm achieved a weighted accuracy of 0.940 with the linguistic pattern-based method.
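For readers unfamiliar with the two subtask metrics, the following Python sketch shows how they are commonly computed. The F1 score is the standard harmonic mean of precision and recall. The abstract does not define "weighted accuracy", so this sketch assumes it means per-attribute accuracy weighted by the number of instances of each attribute. The function names and counts are illustrative only and are not the study's data.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_accuracy(per_attribute):
    """Average per-attribute accuracy, weighted by instance count.

    per_attribute maps an attribute name to (n_correct, n_total).
    The weighting scheme is an assumption, not taken from the paper.
    """
    total = sum(n_total for _, n_total in per_attribute.values())
    return sum(
        (n_total / total) * (n_correct / n_total)
        for n_correct, n_total in per_attribute.values()
    )

# Illustrative counts only.
counts = {"attribute_a": (9, 10), "attribute_b": (20, 30)}
print(f1_score(0.5, 1.0))
print(weighted_accuracy(counts))
```

Under this weighting, the result equals the total number of correct predictions divided by the total number of instances, so frequent attributes dominate the score.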

CONCLUSIONS

We developed a simple but effective strategy for deep phenotyping of Chinese EHRs with limited fine-grained annotation data. Our work will promote the secondary use of Chinese EHRs and may inform similar efforts in other non-English-speaking countries.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7460/9206202/0ec57f10a891/jmir_v24i6e37213_fig1.jpg
