Department of Computer Science and Engineering, JSS Science and Technology University, Mysuru, Karnataka, India.
Department of Information Science and Engineering, JSS Science and Technology University, Mysuru, Karnataka, India.
JCO Clin Cancer Inform. 2022 Sep;6:e2200036. doi: 10.1200/CCI.22.00036.
The rapid growth and use of electronic health records (EHRs) and the expanding medical literature have created substantial opportunities to automate the extraction of relevant clinical information for concise and effective clinical decision support. However, processing such information has traditionally depended on labor-intensive manual work that is prone to human error arising from fatigue, oversight, and interobserver variability. Hence, this study aims to process EHRs and perform multilevel, multiclass classification by extracting the dominant characteristic features that are sufficient to detect and differentiate various types of breast lesions.
This study considers unstructured EHRs on breast lesions obtained through fine-needle aspiration cytology. The raw text was normalized into a structured tabular form and converted to scores by sentiment analysis, which helps determine the overall polarity, or class label, of each EHR. Two supervised machine learning approaches, random forest and a feed-forward neural network trained with the Levenberg-Marquardt training function, were used to classify the collected EHR data set of 2,879 records, split 80:20 into training and testing sets.
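The scoring and splitting steps described above can be illustrated with a minimal sketch. It is not the study's implementation: the lexicon terms and weights, the function names `polarity_score` and `split_80_20`, and the choice to round the 80% cut down are all assumptions made for illustration, and the classifiers themselves are omitted.

```python
import random

# Hypothetical polarity lexicon: the terms and weights are illustrative
# assumptions, not the feature set used in the study.
LEXICON = {
    "malignant": -2.0,
    "atypical": -1.0,
    "pleomorphic": -1.0,
    "benign": 1.0,
    "cohesive": 0.5,
}


def polarity_score(report: str) -> float:
    """Sum the lexicon weights of the tokens found in a free-text report."""
    tokens = report.lower().replace(",", " ").replace(".", " ").split()
    return sum(LEXICON.get(tok, 0.0) for tok in tokens)


def split_80_20(records, seed=0):
    """Shuffle a copy of the records and split it 80:20 (cut rounded down)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


# Record identifiers stand in for the 2,879 EHRs.
train, test = split_80_20(range(2879))
print(len(train), len(test))  # 2303 576
```

Under these assumptions, an 80:20 split of 2,879 records yields 2,303 training and 576 testing records. A report's score is the sum of the weights of the lexicon terms it contains, so its sign gives the overall polarity.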
Random forest and feed-forward neural network classifiers gave the best performance with an accuracy of 99.36%, an overall receiver operating characteristic-area under the curve of 99.2%, a correlation with ground truth of 98.3%, and a histopathologic correlation of 98.6%.
Natural language processing has considerable potential to automate the extraction of clinical features from breast lesions. The proposed multilevel, multiclass classification approach classifies 13 types of breast lesions, carrying 20 distinct labels, into five classes that help a physician or oncologist decide the type of treatment a patient should receive.