Cleaning and Harmonizing Medical Image Data for Reliable AI: Lessons Learned from Longitudinal Oral Cancer Natural History Study Data.

Author information

Xue Zhiyun, Oguguo Tochi, Yu Kelly J, Chen Tseng-Cheng, Hua Chun-Hung, Kang Chung Jan, Chien Chih-Yen, Tsai Ming-Hsui, Wang Cheng-Ping, Chaturvedi Anil K, Antani Sameer

Affiliations

National Library of Medicine, National Institutes of Health, Maryland, USA.

National Cancer Institute, National Institutes of Health, Maryland, USA.

Publication information

Proc SPIE Int Soc Opt Eng. 2024 Feb;12931. doi: 10.1117/12.3005875. Epub 2024 Apr 2.

Abstract

For deep learning, large and sufficiently diverse data are crucial, and the quality of those data is equally important. In real-world applications, however, raw source data very commonly contain incorrect, noisy, inconsistent, improperly formatted, and sometimes missing elements, particularly when the datasets are large and sourced from many sites. In this paper, we present our work on preparing image data for the development of AI-driven approaches to studying various aspects of the natural history of oral cancer. We focus on two aspects: 1) cleaning the image data; and 2) extracting the annotation information. Data cleaning includes removing duplicates, identifying missing data, correcting errors, standardizing data sets, and removing sensitive personal information, with the goal of combining data sourced from different study sites. These steps are often collectively referred to as data harmonization. Annotation information extraction includes identifying crucial or valuable text, manually entered by clinical providers, that relates to the image paths/names, and standardizing the label text. Both are important for successful deep learning algorithm development and data analysis. Specifically, we provide details on the data under consideration, describe the challenges and issues we observed that motivated our work, and present the specific approaches and methods we used to clean and standardize the image data and extract labeling information. Further, we discuss ways to increase the efficiency of the process and the lessons learned. Research ideas on automating the process with ML-driven techniques are also presented and discussed. Our intent in reporting and discussing this work in detail is to provide insights into automating or, at a minimum, increasing the efficiency of these critical yet often under-reported processes.
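The abstract names several cleaning steps without showing what they might look like in practice. The following is a minimal Python sketch of three of them (finding exact duplicates, identifying missing files, and standardizing free-text labels). It is not the authors' method: the function names, the choice of SHA-256 content hashing, the file extensions, and the label format are all assumptions made for illustration. Exact hashing only catches byte-identical copies; resized or re-encoded duplicates would need a different technique.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root: Path, extensions=(".jpg", ".jpeg", ".png")):
    """Group image files under root whose content is byte-identical.

    The extension list is an assumption; adjust it to the data at hand.
    """
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            groups[file_digest(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in groups.values() if len(paths) > 1]


def find_missing(expected_names, root: Path):
    """Return expected file names that are not present anywhere under root."""
    present = {p.name for p in root.rglob("*") if p.is_file()}
    return sorted(set(expected_names) - present)


def normalize_label(text: str) -> str:
    """Trim, lowercase, and collapse whitespace, underscores, and hyphens.

    A hypothetical normalization rule for manually entered label text.
    """
    text = text.strip().lower()
    return re.sub(r"[\s_\-]+", " ", text)
```

For example, under these assumptions `normalize_label("  Buccal_Mucosa - LEFT ")` yields `"buccal mucosa left"`, so spelling variants of the same label entered at different sites collapse to one form. Duplicate groups returned by `find_exact_duplicates` would still need manual review before deletion, since the same image may legitimately appear under two records.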


