From genetic data to kinship clarity: employing machine learning for detecting incestuous relations.

Author information

Šorgić Dejan, Stefanović Aleksandra, Popović Mladen, Keckarević Dušan

Affiliations

Department of Paternity Identification, Biological and Other Traces, Institute of Forensic Medicine, Niš, Serbia.

Department of English, Faculty of Philosophy, University of Niš, Niš, Serbia.

Publication information

Front Genet. 2025 Jun 2;16:1578581. doi: 10.3389/fgene.2025.1578581. eCollection 2025.

DOI:10.3389/fgene.2025.1578581

PMID:40529809

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12171372/

Abstract

INTRODUCTION

The aim of the study was to develop a predictive model based on STR profiles of mothers and children for the detection of incestuous conception.

METHODS

Based on allele frequency data from the USA and Saudi Arabia, STR profiles were generated and used to simulate offspring profiles corresponding to father-child and brother-sister incest scenarios. Model training and evaluation were performed using the STR profiles of the mother and child. In addition to the baseline model, we examined its performance under a one-step mutation model, as well as its ability to detect incestuous relationships based solely on the child's STR profile. Several machine learning algorithms and neural networks were tested for classification accuracy.
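The simulation step described above can be illustrated with a minimal sketch. It is not the authors' pipeline. The allele frequencies in `FREQS` are invented toy values, not the USA or Saudi Arabian data, and all function names are hypothetical. It also assumes that "father-child" incest means the child's father is also the mother's father, that "brother-sister" means both parents share the same two parents, and that a one-step mutation shifts a transmitted allele by one repeat unit in either direction.

```python
import random

# Toy allele frequencies (hypothetical, NOT the study's data):
# locus -> {repeat count: frequency}
FREQS = {
    "D3S1358": {14: 0.10, 15: 0.30, 16: 0.25, 17: 0.20, 18: 0.15},
    "vWA": {15: 0.10, 16: 0.25, 17: 0.30, 18: 0.20, 19: 0.15},
    "FGA": {20: 0.15, 21: 0.20, 22: 0.25, 23: 0.20, 24: 0.20},
}


def random_genotype(freqs, rng):
    """Draw two alleles per locus independently from the population frequencies."""
    return {
        locus: tuple(rng.choices(list(f), weights=list(f.values()), k=2))
        for locus, f in freqs.items()
    }


def transmit(parent, rng, mutation_rate=0.0):
    """Pass on one randomly chosen allele per locus, with an optional one-step mutation."""
    gamete = {}
    for locus, alleles in parent.items():
        allele = rng.choice(alleles)
        if rng.random() < mutation_rate:
            allele += rng.choice((-1, 1))  # assumed one-step model: +/- one repeat
        gamete[locus] = allele
    return gamete


def child_of(mother, father, rng, mutation_rate=0.0):
    """Combine one maternal and one paternal allele at each locus."""
    maternal = transmit(mother, rng, mutation_rate)
    paternal = transmit(father, rng, mutation_rate)
    return {locus: (maternal[locus], paternal[locus]) for locus in mother}


def simulate_case(scenario, freqs, rng, mutation_rate=0.0):
    """Return a (mother, child) profile pair for the given scenario."""
    if scenario == "normal":
        mother = random_genotype(freqs, rng)
        father = random_genotype(freqs, rng)
    elif scenario == "father_child":
        # Assumed reading: the child's father is also the mother's father.
        father = random_genotype(freqs, rng)
        mother = child_of(random_genotype(freqs, rng), father, rng)
    elif scenario == "brother_sister":
        # Assumed reading: mother and father are full siblings.
        grandmother = random_genotype(freqs, rng)
        grandfather = random_genotype(freqs, rng)
        mother = child_of(grandmother, grandfather, rng)
        father = child_of(grandmother, grandfather, rng)
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return mother, child_of(mother, father, rng, mutation_rate)


def homozygosity(profile):
    """Fraction of loci at which both alleles are identical."""
    return sum(a == b for a, b in profile.values()) / len(profile)


if __name__ == "__main__":
    rng = random.Random(42)
    for scenario in ("normal", "father_child", "brother_sister"):
        children = [simulate_case(scenario, FREQS, rng)[1] for _ in range(2000)]
        mean_h = sum(homozygosity(c) for c in children) / len(children)
        print(f"{scenario}: mean child homozygosity = {mean_h:.3f}")
```

Children of closely related parents are expected to be homozygous at more loci than children of unrelated parents, which is one signal a classifier trained on such mother and child pairs could pick up. The features the authors actually fed to their models are not specified in the abstract.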

RESULTS

The CatBoost algorithm performed best in the binary classification of Normal Paternity vs. Incest Kinship. For the USA population, accuracy was 96.94% with 29 markers and 95% with 21 markers. The same accuracy was obtained under the single-step mutation model, while prediction based exclusively on the child's profile yielded an accuracy of 90.37% in the USA population. For profiles generated from the Saudi Arabian frequencies and from the modified Saudi frequencies, accuracy was 94%.

DISCUSSION

It was established that population structure does not affect the model's accuracy and that it can be applied even in isolated populations.
