Smith Leslie A, Cahill James A, Graim Kiley
Department of Computer & Information Science & Engineering, University of Florida, 432 Newell Dr, Gainesville, 32611, FL, USA.
Environmental Engineering Sciences Department, University of Florida, 432 Newell Dr, Gainesville, 32611, FL, USA.
Res Sq. 2023 Jul 27:rs.3.rs-3168446. doi: 10.21203/rs.3.rs-3168446/v1.
Gold-standard genomic datasets severely underrepresent non-European populations, leading to inequities and a limited understanding of human disease [1-8]. Therapeutics and outcomes remain hidden because we lack the insights that analysis of ancestry-unbiased genomic data would provide. To address this significant gap, we present PhyloFrame, the first machine learning method for equitable genomic precision medicine. PhyloFrame corrects for ancestral bias by integrating big-data tissue-specific functional interaction networks, global population variation data, and disease-relevant transcriptomic data. Application of PhyloFrame to breast, thyroid, and uterine cancers shows marked improvements in predictive power across all ancestries, less model overfitting, and a higher likelihood of identifying known cancer-related genes. In particular, the ability to provide accurate predictions for underrepresented groups is substantially increased. These results demonstrate how AI can mitigate ancestral bias in training data and contribute to equitable representation in medical research.