Building a comprehensive syntactic and semantic corpus of Chinese clinical texts.

Authors

He Bin, Dong Bin, Guan Yi, Yang Jinfeng, Jiang Zhipeng, Yu Qiubin, Cheng Jianyi, Qu Chunyan

Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.

Ricoh Software Research Center (Beijing), Beijing, China.

Publication information

J Biomed Inform. 2017 May;69:203-217. doi: 10.1016/j.jbi.2017.04.006. Epub 2017 Apr 9.

DOI: 10.1016/j.jbi.2017.04.006
PMID: 28404537

Abstract

OBJECTIVE

To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts with corresponding annotation guidelines and methods as well as to develop tools trained on the annotated corpus, which supplies baselines for research on Chinese texts in the clinical domain.

MATERIALS AND METHODS

An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, by using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on our annotated corpus.
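
The abstract does not state which agreement measure was used for IAA. As a minimal sketch, assuming Cohen's kappa for token-level labels such as POS tags and exact-match F1 for entity spans (both common choices, neither confirmed by the source), agreement between two annotators could be computed as follows. The function names, tag labels, and span tuples are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (e.g. POS tags)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Fraction of tokens on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def span_f1(spans_a, spans_b):
    """Exact-match F1 between two collections of (start, end, type) spans."""
    set_a, set_b = set(spans_a), set(spans_b)
    if not set_a and not set_b:
        return 1.0  # nothing annotated by either side counts as agreement
    matched = len(set_a & set_b)
    # Equivalent to 2PR / (P + R) with P = matched/|B| and R = matched/|A|.
    return 2 * matched / (len(set_a) + len(set_b))


# Illustrative data only; the tags and spans are not taken from the corpus.
pos_a = ["NN", "VV", "NN", "AD"]
pos_b = ["NN", "VV", "NN", "NN"]
entities_a = {(0, 2, "problem"), (5, 7, "test")}
entities_b = {(0, 2, "problem"), (5, 8, "test")}

print(cohen_kappa(pos_a, pos_b))
print(span_f1(entities_a, entities_b))
```

Kappa corrects raw agreement for chance, which matters when one label dominates. Span-level F1 is often preferred for entities because there is no fixed set of items on which chance agreement can be defined.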

RESULTS

The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2,612 full parse trees, while the semantic corpus includes 992 documents annotated with 39,511 entities, their assertions, and 7,693 relations. IAA evaluation shows that this comprehensive corpus is of good quality, and the system modules are effective.

DISCUSSION

The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, this corpus has a number of limitations. Some additional types of clinical text should be introduced to improve corpus coverage and active learning methods should be utilized to promote annotation efficiency.

CONCLUSIONS

In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus and its NLP modules were constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain.
