RaPID-Query for fast identity by descent search and genealogical analysis.

Affiliations

Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States.

School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States.

Publication information

Bioinformatics. 2023 Jun 1;39(6). doi: 10.1093/bioinformatics/btad312.

DOI:10.1093/bioinformatics/btad312

PMID:37166451

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10244210/

Abstract

MOTIVATION

Due to the rapid growth in the size of genetic databases, genealogical search, a process of inferring familial relatedness by identifying DNA matches, has become a viable approach to help individuals find missing family members or law enforcement agencies locate suspects. A fast and accurate method is needed to search an out-of-database individual against millions of individuals. Most existing approaches offer only all-versus-all within-panel matching. Some prototype algorithms offer one-versus-all queries for an out-of-panel individual, but they do not tolerate errors.

RESULTS

A new method, random projection-based identity-by-descent (IBD) detection (RaPID) query, is introduced to make fast genealogical search possible. RaPID-Query identifies IBD segments between a query haplotype and a panel of haplotypes. By integrating matches over multiple positional Burrows-Wheeler transform (PBWT) indexes, RaPID-Query quickly locates IBD segments above a given cutoff length while allowing mismatched sites. A single query against all UK Biobank autosomal chromosomes was completed within 2.76 seconds on average, with a minimum length of 7 cM and 700 markers. RaPID-Query achieved a 0.016 false negative rate and a 0.012 false positive rate simultaneously on a chromosome 20 sequencing panel having 86 265 sites. This is comparable to the state-of-the-art IBD detection methods TPBWT (out-of-sample) and Hap-IBD. The high-quality IBD segments yielded by RaPID-Query were able to distinguish up to the fourth degree of familial relatedness for a given individual pair, with area under the receiver operating characteristic curve values of at least 97.28%.
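The idea of combining several random projections to tolerate mismatches can be illustrated with a small toy sketch. It is not the RaPID-Query implementation: it assumes that each projection samples one site per fixed-size window and that a window counts as matched when enough projections agree, and it replaces the PBWT indexes with a naive pairwise scan for brevity. All function names, parameters, and thresholds (`project`, `long_matches`, `query`, `window`, `min_hits`, and so on) are hypothetical.

```python
import random

def project(hap, window, picks):
    """Reduce a haplotype to one sampled site per window."""
    return [hap[w * window + p] for w, p in enumerate(picks)]

def long_matches(a, b, min_len):
    """Return [start, end) runs where a and b agree for at least min_len positions."""
    runs, start = [], None
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(a) - start >= min_len:
        runs.append((start, len(a)))
    return runs

def query(query_hap, panel, window=4, min_windows=5, runs=5, min_hits=2, seed=0):
    """Toy one-versus-all search: vote over several random projections.

    A naive scan stands in for the PBWT indexes (assumption for illustration).
    Returns {panel index: [(start_site, end_site), ...]}.
    """
    rng = random.Random(seed)
    n_windows = len(query_hap) // window
    votes = [[0] * n_windows for _ in panel]
    for _ in range(runs):
        # One random site per window defines this projection.
        picks = [rng.randrange(window) for _ in range(n_windows)]
        pq = project(query_hap, window, picks)
        for j, hap in enumerate(panel):
            for s, e in long_matches(pq, project(hap, window, picks), min_windows):
                for w in range(s, e):
                    votes[j][w] += 1
    result = {}
    for j, v in enumerate(votes):
        # Keep windows supported by at least min_hits projections, then merge them.
        supported = [count >= min_hits for count in v]
        segs = long_matches(supported, [True] * n_windows, min_windows)
        if segs:
            result[j] = [(s * window, e * window) for s, e in segs]
    return result

# Hypothetical data: haplotype 0 differs from the query at one site,
# haplotype 1 is the complement of the query and never matches.
query_hap = [0, 1] * 40
near_copy = list(query_hap)
near_copy[37] = 1 - near_copy[37]
unrelated = [1 - allele for allele in query_hap]
result = query(query_hap, [near_copy, unrelated])
print(result)
```

In this toy setup a single mismatched site breaks the match only in the projections that happen to sample that site, so the remaining projections can still support an unbroken segment. The real method would replace the pairwise scan with PBWT-based lookups, which is what makes querying a very large panel fast.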

AVAILABILITY AND IMPLEMENTATION

The RaPID-Query program is available at https://github.com/ucfcbb/RaPID-Query.
