A scalable distributed pipeline for reference-free variants calling.

Authors

Di Rocco Lorenzo, Ferraro Petrillo Umberto

Affiliation

Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy.

Publication

BMC Genomics. 2025 Jun 3;26(Suppl 1):557. doi: 10.1186/s12864-025-11722-7.

DOI:10.1186/s12864-025-11722-7

PMID:40461964

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12131334/

Abstract

BACKGROUND

Precision medicine pipelines typically begin with variant calling, which identifies disease-related mutations to guide treatment selection. Reference-free approaches detect variations between the genetic profiles of distinct individuals by using a De Bruijn graph rather than an alignment against a reference genome. However, the timely analysis of large-scale sequencing data may exceed the capabilities of a single workstation, calling for alternative computational approaches.
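To make the De Bruijn graph idea concrete, the following is a minimal single-machine sketch, not the pipeline described in this article. It assumes error-free reads, ignores reverse complements and coverage, and treats an isolated SNP as a "bubble": two branches that leave one node and rejoin after exactly k steps. The function names (`build_de_bruijn`, `walk`, `find_isolated_snps`) and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer successors."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start, steps):
    """Follow unique successors for `steps` edges; None if the path branches or ends."""
    path = [start]
    node = start
    for _ in range(steps):
        successors = graph.get(node, ())
        if len(successors) != 1:
            return None
        node = next(iter(successors))
        path.append(node)
    return path


def spell(source, path):
    """Spell the sequence read along `source` followed by the nodes in `path`."""
    return source + "".join(node[-1] for node in path)


def find_isolated_snps(graph, k):
    """Return pairs of sequences for bubbles whose two branches rejoin after k edges."""
    snps = []
    for source, successors in graph.items():
        if len(successors) != 2:
            continue
        first, second = sorted(successors)
        path_a = walk(graph, first, k - 1)
        path_b = walk(graph, second, k - 1)
        if path_a and path_b and path_a[-1] == path_b[-1]:
            snps.append((spell(source, path_a), spell(source, path_b)))
    return snps


# Toy input (assumption): two individuals whose reads differ at one position.
reads = ["GATCACTGGA", "GATCAGTGGA"]
k = 4
print(find_isolated_snps(build_de_bruijn(reads, k), k))
```

Each reported pair spells the two alleles with k-1 bases of shared context on either side, so the sequences have length 2k-1 and differ only at the middle position.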

RESULTS

We introduce what is, to our knowledge, the first distributed pipeline for detecting isolated SNPs (Single Nucleotide Polymorphisms), which leverages the computational resources of multiple machines in parallel. The pipeline analyzes large datasets efficiently by using a distributed De Bruijn graph representation. We also introduce a cluster-driven algorithm that partitions the De Bruijn graph across multiple independent machines according to the inner structure of the sequences under analysis, further improving the scalability of the pipeline.
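As a rough illustration of why the partitioning strategy matters, the sketch below shows a generic hash-based partitioning of graph edges, i.e. the kind of standard baseline the abstract contrasts with, and counts how many edges end up crossing partitions. It is not the cluster-driven algorithm of the article, whose details are not given here. The helper names (`hash_partition`, `partition_edges`, `count_cut_edges`) and the use of CRC32 as a stable hash are assumptions made for this example.

```python
import zlib


def hash_partition(node, num_partitions):
    """Assign a node to a partition with a hash that is stable across processes."""
    return zlib.crc32(node.encode("ascii")) % num_partitions


def partition_edges(edges, num_partitions):
    """Place each edge in the partition that owns its source node."""
    partitions = [[] for _ in range(num_partitions)]
    for src, dst in edges:
        partitions[hash_partition(src, num_partitions)].append((src, dst))
    return partitions


def count_cut_edges(edges, num_partitions):
    """Count edges whose endpoints are owned by different partitions."""
    return sum(
        hash_partition(src, num_partitions) != hash_partition(dst, num_partitions)
        for src, dst in edges
    )


# Toy edge list (assumption): consecutive 3-mers of a short sequence.
sequence = "GATCACTGGA"
nodes = [sequence[i:i + 3] for i in range(len(sequence) - 2)]
edges = list(zip(nodes, nodes[1:]))

for n in (1, 2, 4):
    print(n, "partitions ->", count_cut_edges(edges, n), "cut edges of", len(edges))
```

Every cut edge implies communication between machines when a path is traversed, which is the cost a structure-aware partitioning tries to reduce; hashing alone ignores that adjacent nodes tend to be visited together.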

CONCLUSIONS

Experiments on real-world datasets show that our pipeline performs well in terms of efficiency, output quality, and scalability. The results also confirm that adopting a partitioning algorithm specialized for the distributed representation of the De Bruijn graph yields a notable speed-up over standard partitioning techniques.

Summary