Yale School of Medicine, New Haven, CT, USA; VA Connecticut Healthcare System, West Haven, CT, USA.
J Biomed Inform. 2024 Jun;154:104654. doi: 10.1016/j.jbi.2024.104654. Epub 2024 May 11.
We evaluated methods for preparing electronic health record data to reduce bias before applying artificial intelligence (AI).
We created methods for transforming raw data into a data framework for applying machine learning and natural language processing techniques for predicting falls and fractures. Strategies such as inclusion and reporting for multiple races, mixed data sources such as outpatient, inpatient, structured codes, and unstructured notes, and addressing missingness were applied to raw data to promote a reduction in bias. The raw data was carefully curated using validated definitions to create data variables such as age, race, gender, and healthcare utilization. For the formation of these variables, clinical, statistical, and data expertise were used. The research team included a variety of experts with diverse professional and demographic backgrounds to include diverse perspectives.
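As a minimal sketch of how mixing data sources can address missingness, the example below fills a demographic variable from a second source when the first is missing, then compares missingness before and after. The field names (`outpatient_race`, `inpatient_race`), the placeholder tokens treated as missing, and the toy records are all assumptions made for illustration. They are not the validated definitions used in the study.

```python
# Placeholder strings treated as missing (an assumption, not from the study).
MISSING_TOKENS = {"", "unknown", "declined", "n/a"}


def is_missing(value):
    """Return True if the value is None or a placeholder string."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def coalesce(*values):
    """Return the first non-missing value across sources, or None."""
    for value in values:
        if not is_missing(value):
            return value
    return None


def missing_rate(values):
    """Return the fraction of values that are missing."""
    values = list(values)
    return sum(is_missing(v) for v in values) / len(values)


# Toy records; the field names are hypothetical.
patients = [
    {"outpatient_race": "Unknown", "inpatient_race": "Black"},
    {"outpatient_race": "White", "inpatient_race": None},
    {"outpatient_race": None, "inpatient_race": ""},
    {"outpatient_race": "", "inpatient_race": "Asian"},
]

before = missing_rate(p["outpatient_race"] for p in patients)
merged = [coalesce(p["outpatient_race"], p["inpatient_race"]) for p in patients]
after = missing_rate(merged)
print(f"missing before: {before:.2f}, after: {after:.2f}")
```

In this toy example, missingness for the variable falls from 0.75 to 0.25 once the second source is consulted. Records that remain missing in every source are kept as `None` rather than imputed, so they can still be reported.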
For the prediction of falls, information extracted from radiology reports was converted to a matrix for applying machine learning. The processing of the data resulted in an input of 5,377,673 reports to the machine learning algorithm, out of which 45,304 were flagged as positive and 5,332,369 as negative for falls. Processed data resulted in lower missingness and a better representation of race and diagnosis codes. For fractures, specialized algorithms extracted snippets of text around the keyword "femoral" from dual-energy X-ray absorptiometry (DXA) scan reports to identify femoral neck T-scores, which are important for predicting fracture risk. The natural language processing algorithms yielded 98% accuracy and a 2% error rate. The methods to prepare data for input to artificial intelligence processes are reproducible and can be applied to other studies.
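The keyword-based snippet extraction described above can be illustrated with a short sketch. It assumes a fixed character window around "femoral" and a simple regular expression for T-scores. The window size, the pattern, the function names, and the sample report text are hypothetical and are not the study's algorithms.

```python
import re

# Assumed T-score pattern, e.g. "T-score: -2.6" or "T score of -1.8".
T_SCORE = re.compile(
    r"T[-\s]?score\s*(?:of|is|:|=)?\s*([+-]?\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def keyword_snippets(text, keyword="femoral", window=60):
    """Return (before, after) text windows around each keyword occurrence."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        before = text[max(0, match.start() - window):match.start()]
        after = text[match.end():match.end() + window]
        snippets.append((before, after))
    return snippets


def femoral_neck_t_scores(text, window=60):
    """Return the first T-score that follows each 'femoral neck' mention."""
    scores = []
    for _, after in keyword_snippets(text, "femoral", window):
        # Keep only mentions where "neck" directly follows the keyword.
        if not after.lstrip().lower().startswith("neck"):
            continue
        found = T_SCORE.search(after)
        if found:
            scores.append(float(found.group(1)))
    return scores


# Invented report text for illustration only.
report = (
    "Lumbar spine T-score: -1.1. "
    "Left femoral neck BMD 0.650 g/cm2, T-score: -2.6. "
    "Total hip T-score -1.9."
)
print(femoral_neck_t_scores(report))
```

Restricting the search to text that follows the keyword keeps the lumbar spine value out of the result in this example. Real reports vary in layout, so a pattern like this would need validation against manually reviewed notes before use.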
The life cycle of data from raw to analytic form includes data governance, cleaning, management, and analysis. When applying artificial intelligence methods, input data must be prepared optimally to reduce algorithmic bias, as biased output is harmful. Building AI-ready data frameworks that improve efficiency can contribute to transparency and reproducibility. The roadmap for the application of AI involves applying specialized techniques to input data, some of which are suggested here. This study highlights data curation aspects to be considered when preparing data for the application of artificial intelligence to reduce bias.