Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations.

Authors

Bonet David, Levin May, Montserrat Daniel Mas, Ioannidis Alexander G

Affiliations

Stanford University, Stanford, CA, US.

Universitat Politècnica de Catalunya, Barcelona, Spain.

Publication information

bioRxiv. 2023 Oct 17:2023.10.12.561949. doi: 10.1101/2023.10.12.561949.

Abstract

Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
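
The abstract does not spell out how the population-conditional re-sampling works, so the following is only a minimal sketch of one plausible reading: oversampling each ancestry group with replacement until every group matches the size of the largest one. The function name `resample_by_population`, the toy genotype matrix, and the group labels are all hypothetical and are not taken from the paper; the balanced set would then be passed to a learner such as a gradient boosting model.

```python
import random
from collections import Counter, defaultdict


def resample_by_population(samples, populations, seed=0):
    """Oversample each population group (with replacement) up to the
    size of the largest group. Hypothetical helper, not the authors' code."""
    rng = random.Random(seed)

    # Collect the row indices that belong to each population label.
    groups = defaultdict(list)
    for idx, pop in enumerate(populations):
        groups[pop].append(idx)

    target = max(len(idxs) for idxs in groups.values())

    chosen = []
    for idxs in groups.values():
        chosen.extend(idxs)  # keep every original sample once
        # Draw the shortfall at random from the same group.
        chosen.extend(rng.choices(idxs, k=target - len(idxs)))

    rng.shuffle(chosen)
    return [samples[i] for i in chosen], [populations[i] for i in chosen]


# Toy SNP dosages (0, 1, or 2 copies of the minor allele); assumed data.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 0], [0, 0, 1], [1, 2, 2], [2, 2, 1]]
labels = ["EUR", "EUR", "EUR", "EUR", "AFR", "ASN"]

X_bal, pop_bal = resample_by_population(genotypes, labels)
print(Counter(pop_bal))  # every group now has 4 samples
```

Oversampling of this kind should be applied to the training split only, since duplicated individuals in a held-out set would inflate the evaluation metrics.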

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acc1/10614800/2cf475b305b6/nihpp-2023.10.12.561949v1-f0001.jpg
