A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe.

Affiliations

Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland.

Faculty of Medical Sciences, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland.

Publication information

Int J Mol Sci. 2023 Oct 11;24(20):15095. doi: 10.3390/ijms242015095.

DOI:10.3390/ijms242015095

PMID:37894775

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10606184/

Abstract

Data obtained by massively parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data have the potential to distinguish samples from different populations, especially adjacent populations of common origin. Machine learning (ML) techniques seem especially well suited to analyzing the large datasets obtained using MPS. The Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, while being relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective was to evaluate empirically whether the genetic provenance of genetically similar individuals of Slavic descent can be discerned, with the overarching goal of categorizing DNA samples from representatives of different Slavic populations. Raw sequencing data were pre-processed to obtain a 1200-character binary vector. Three classifiers were used: Random Forest, Support Vector Machine (SVM), and XGBoost. The most promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846 to 1.000 for all classes.
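The abstract summarizes classifier quality as an overall accuracy plus one F1-score per class. As a minimal sketch of how those two metrics relate, the following computes both from a list of true and predicted labels. The function name `per_class_f1`, the class names, and the toy labels are assumptions made for illustration only; they are not the study's code, populations, or results.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return (accuracy, {label: F1}) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[t] += 1      # true class t was missed

    f1 = {}
    for c in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1[c] = 2 * tp[c] / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, f1


# Toy labels only (hypothetical class names, not data from the study).
y_true = ["A", "A", "A", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "B", "C", "C", "C"]
accuracy, f1 = per_class_f1(y_true, y_pred)
```

Reporting F1 per class alongside accuracy matters when classes are closely related or unevenly sized, since a single misassigned sample lowers the F1 of both the true class and the predicted class even when overall accuracy stays high.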

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d279/10606184/13c7da461a22/ijms-24-15095-g001.jpg
