Syed Shorabuddin, Angel Adam Jackson, Syeda Hafsa Bareen, Jennings Carole Franc, VanScoy Joseph, Syed Mahanazuddin, Greer Melody, Bhattacharyya Sudeepa, Al-Shukri Shaymaa, Zozus Meredith, Prior Fred, Tharian Benjamin
Department of Biomedical Informatics, University of Arkansas for Medical Sciences, U.S.A.
Department of Internal Medicine, Washington University, U.S.A.
Biomed Eng Syst Technol Int Jt Conf BIOSTEC Revis Sel Pap. 2022 Feb;2022:162-169. doi: 10.5220/0010876100003123.
Colonoscopy plays a critical role in screening for colorectal carcinoma (CC). Unfortunately, the data related to this procedure are stored in disparate documents: colonoscopy, pathology, and radiology reports. The lack of integrated, standardized documentation impedes accurate reporting of quality metrics as well as clinical and translational research. Natural language processing (NLP) has been used as an alternative to manual data abstraction. The performance of machine learning (ML)-based NLP solutions depends heavily on the accuracy of annotated corpora. Large annotated corpora are scarce because of data privacy laws and the cost and effort required to produce them. In addition, the manual annotation process is error-prone, making the lack of quality annotated corpora the largest bottleneck in deploying ML solutions. The objective of this study is to identify clinical entities critical to colonoscopy quality and to build a high-quality annotated corpus using domain-specific taxonomies and standardized annotation guidelines. The annotated corpus can be used to train ML models for a variety of downstream tasks.