A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe.

Affiliations

Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland.

Faculty of Medical Sciences, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland.

Publication information

Int J Mol Sci. 2023 Oct 11;24(20):15095. doi: 10.3390/ijms242015095.

Abstract

Data obtained by massively parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data have the potential to distinguish samples from different populations, especially adjacent populations of common origin. Machine learning (ML) techniques seem especially well suited to analyzing the large datasets that MPS produces. The Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, yet they are relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective was to evaluate empirically whether the genetic provenance of genetically similar individuals of Slavic descent can be discerned, with the broader goal of classifying DNA samples from representatives of different Slavic populations. Raw sequencing data were pre-processed to obtain a binary vector 1200 characters long. Three classifiers were used: Random Forest, Support Vector Machine (SVM), and XGBoost. The most promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846 to 1.000 for all classes.
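As a reading aid for the reported metrics, the following minimal sketch shows how overall accuracy and per-class F1-scores are conventionally computed from true and predicted class labels. It is not the authors' evaluation code. The function names `accuracy` and `per_class_f1` and the labels "A", "B", "C" are hypothetical and were invented for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # class p was predicted wrongly
            fn[t] += 1   # class t was missed
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Hypothetical labels for illustration only (not data from the study).
y_true = ["A", "A", "A", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "B", "C", "C", "C"]

acc = accuracy(y_true, y_pred)
f1 = per_class_f1(y_true, y_pred)
print(acc, f1)
```

Reporting F1 per class alongside accuracy matters when classes are closely related or unevenly sized, because a single misclassified group can be hidden by a high overall accuracy.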

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d279/10606184/13c7da461a22/ijms-24-15095-g001.jpg
