DisVar: an R library for identifying variants associated with diseases using large-scale personal genetic information.

Authors

Chanasongkhram Khunanon, Damkliang Kasikrit, Sangket Unitsa

Affiliations

Division of Biological Science, Faculty of Science, Prince of Songkla University, Hat Yai, Songkhla, Thailand.

Division of Computational Science, Faculty of Science, Prince of Songkla University, Hat Yai, Songkhla, Thailand.

Publication information

PeerJ. 2023 Sep 28;11:e16086. doi: 10.7717/peerj.16086. eCollection 2023.

DOI:10.7717/peerj.16086

PMID:37790633

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10542659/

Abstract

BACKGROUND

Genetic variants may contribute to the development of diseases. Several genetic disease databases are used in medical research and diagnosis, but the web applications used to search these databases for disease-associated variants have limitations. They may not be able to search large sets of genetic variants, their search results may be difficult to interpret, and they may not support variants mapped to the latest reference genome (GRCh38/hg38).

METHODS

In this study, we developed a novel R library called "DisVar" to identify disease-associated genetic variants in large-scale individual genomic data. The library is compatible with variants mapped to the latest reference genome version (GRCh38/hg38). DisVar draws on five databases of disease-associated variants, and more than 100 million variants can be searched simultaneously for associated diseases.
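The kind of lookup described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not DisVar's R interface; the function names, the (chromosome, position, REF, ALT) matching key, and the toy disease index are all assumptions, since the abstract does not say how DisVar matches variants internally.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed matching key: (chromosome, position, reference allele, alternate allele).
VariantKey = Tuple[str, int, str, str]


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[VariantKey]:
    """Yield one key per alternate allele from VCF data lines."""
    for line in lines:
        # Skip blank lines, meta lines (##...) and the header line (#CHROM...).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alts = fields[:5]
        # Normalise "chr1" and "1" to the same chromosome label.
        if chrom.lower().startswith("chr"):
            chrom = chrom[3:]
        # A VCF record may list several comma-separated alternate alleles.
        for alt in alts.split(","):
            yield (chrom, int(pos), ref, alt)


def find_disease_hits(
    variants: Iterable[VariantKey],
    disease_index: Dict[VariantKey, List[str]],
) -> List[Tuple[VariantKey, str]]:
    """Return one (variant, disease) hit per matching index entry."""
    hits = []
    for key in variants:
        for disease in disease_index.get(key, ()):
            hits.append((key, disease))
    return hits


# Hypothetical toy index standing in for the merged disease databases.
disease_index = {
    ("1", 1000, "A", "G"): ["hypothetical disease X"],
    ("2", 2000, "C", "T"): ["hypothetical disease Y", "hypothetical disease Z"],
}

# Hypothetical VCF content; the third record has no match in the index.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t.\t.\t.",
    "2\t2000\t.\tC\tT,G\t.\t.\t.",
    "3\t3000\t.\tG\tA\t.\t.\t.",
]

hits = find_disease_hits(parse_vcf_lines(vcf_lines), disease_index)
for variant, disease in hits:
    print(variant, disease)
```

A dictionary keyed on the variant tuple keeps each lookup at constant expected time, which is one plausible way to scale to millions of input variants. In this sketch one variant can produce several hits when the index lists more than one disease for it.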

RESULTS

The package was evaluated using 24 Variant Call Format (VCF) files (215,054 to 11,346,899 sites each) from the 1000 Genomes Project. A total of 298,227 disease-associated hits were detected across all the VCF files, with a total run time of 63.58 min. The package was also tested on ClinVar's VCF file (2,120,558 variants), where 20,657 disease-associated hits were identified with an estimated elapsed time of 45.98 s.
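To put the reported timings in perspective, the figures above can be turned into simple rates. The short Python sketch below uses only numbers stated in this abstract; the derived quantities (variants per second, hits per 1,000 variants, mean minutes per file) are back-of-the-envelope values that the paper itself does not report, and the helper names are illustrative.

```python
def variants_per_second(n_variants: int, seconds: float) -> float:
    """Average number of input variants processed per second."""
    return n_variants / seconds


def hits_per_thousand(n_hits: int, n_variants: int) -> float:
    """Disease-associated hits per 1,000 input variants."""
    return 1000 * n_hits / n_variants


# ClinVar test: 2,120,558 variants, 20,657 hits, 45.98 s (from the abstract).
clinvar_rate = variants_per_second(2_120_558, 45.98)
clinvar_hits = hits_per_thousand(20_657, 2_120_558)

# 1000 Genomes test: 24 VCF files in 63.58 min in total (from the abstract).
mean_minutes_per_file = 63.58 / 24

print(f"ClinVar throughput: {clinvar_rate:,.0f} variants/s")
print(f"ClinVar hits per 1,000 variants: {clinvar_hits:.2f}")
print(f"1000 Genomes mean time per file: {mean_minutes_per_file:.2f} min")
```

On these figures the ClinVar run works out to roughly 46,000 variants per second. Because a single variant may yield more than one hit, the hits-per-variant ratio should not be read as the fraction of variants associated with disease.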

CONCLUSIONS

DisVar can overcome the limitations of existing tools and is a fast and effective diagnostic and preventive tool that identifies disease-associated variants in large-scale genetic data mapped to the latest reference genome.
