Department of Biology and Medical Genetics, Second Faculty of Medicine, Charles University and University Hospital Motol, V Úvalu 84, 15006, Prague, Czech Republic.
Department of Biology and Medical Genetics, First Faculty of Medicine, Charles University and General University Hospital in Prague, Albertov 4, 128 00, Prague, Czech Republic.
BMC Bioinformatics. 2021 Sep 27;22(1):464. doi: 10.1186/s12859-021-04374-3.
Structural variants (SVs) represent an important source of genetic variation. One of the most critical problems in their detection is breakpoint uncertainty, that is, the inability to determine their exact genomic position. Breakpoint uncertainty is characteristic of structural variants detected via short-read sequencing methods, and it complicates subsequent population analyses. A commonly used heuristic strategy mitigates this issue by clustering/merging nearby structural variants of the same type before the data from individual samples are merged.
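The clustering/merging heuristic described above can be illustrated with a minimal sketch. It is not the procedure of any particular tool: the function name `cluster_nearby`, the tuple layout `(sv_type, start, end)`, and the 100 bp threshold `max_dist` are all assumptions made for illustration.

```python
from collections import defaultdict

def cluster_nearby(svs, max_dist=100):
    """Greedily group SVs of the same type whose breakpoints lie close together.

    svs: list of (sv_type, start, end) tuples (hypothetical layout).
    max_dist: assumed tolerance in base pairs for both breakpoints.
    """
    by_type = defaultdict(list)
    for sv in svs:
        by_type[sv[0]].append(sv)

    clusters = []
    for sv_type in sorted(by_type):
        current = []
        # Walk the SVs of one type in genomic order.
        for sv in sorted(by_type[sv_type], key=lambda s: (s[1], s[2])):
            last = current[-1] if current else None
            close = (
                last is not None
                and abs(sv[1] - last[1]) <= max_dist
                and abs(sv[2] - last[2]) <= max_dist
            )
            if close:
                current.append(sv)
            else:
                if current:
                    clusters.append(current)
                current = [sv]
        if current:
            clusters.append(current)
    return clusters

calls = [
    ("DEL", 1000, 2000),
    ("DEL", 1040, 2010),  # same deletion, slightly shifted breakpoints
    ("DUP", 1020, 2005),  # nearby, but a different SV type
    ("DEL", 5000, 6000),
]
clusters = cluster_nearby(calls)
```

In this toy input the two shifted deletion calls end up in one cluster, while the duplication and the distant deletion each stay separate.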
We compared the two most commonly used dissimilarity measures for SV clustering in terms of Mendelian inheritance errors (MIE), kinship prediction, and deviation from Hardy-Weinberg equilibrium. As a new measure of dataset consistency, we analyzed the occurrence of Mendelian-inconsistent SV clusters that can be collapsed into one Mendelian-consistent SV. We also developed a new method based on constrained clustering that explicitly identifies these types of clusters.
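To make the notion of a Mendelian-inconsistent cluster that collapses into a Mendelian-consistent SV concrete, here is a minimal sketch for a single parent-child trio. Genotypes are assumed to be coded as the number of alternate alleles (0, 1, or 2) at an autosomal biallelic site. The collapse rule in `collapse` (summing allele counts, capped at 2) is a hypothetical simplification, not the method developed in the paper.

```python
def transmissible(genotype):
    """Alleles (0 = reference, 1 = alternate) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]

def is_mendelian_consistent(father, mother, child):
    """True if the child genotype can arise from one allele of each parent."""
    return any(
        a + b == child
        for a in transmissible(father)
        for b in transmissible(mother)
    )

def collapse(genotypes):
    """Hypothetical collapse rule: sum alternate-allele counts, capped at 2."""
    return min(2, sum(genotypes))

# Two nearby calls that may be the same SV with shifted breakpoints.
# Each tuple is (father, mother, child).
sv_a = (0, 0, 1)  # seen only in the child: a Mendelian inheritance error
sv_b = (1, 0, 0)  # seen only in the father: consistent on its own

merged = tuple(collapse(pair) for pair in zip(sv_a, sv_b))

print(is_mendelian_consistent(*sv_a))    # False
print(is_mendelian_consistent(*merged))  # True
```

Taken separately, the cluster contains a Mendelian inheritance error. Treated as one variant with genotypes (1, 0, 1), it is consistent with transmission from the father.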
We found that the dissimilarity measure based on the distance between SV breakpoints produces slightly better results than the measure based on SV overlap. This difference is evident in the trivial and corrected clustering strategies, but not in the constrained clustering strategy. However, the constrained clustering strategy provided the best results in all aspects, regardless of the dissimilarity measure used.
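The two families of dissimilarity measures can be sketched as follows. The exact definitions used in the study are not given in this abstract, so both functions are assumptions: `breakpoint_distance` takes the larger of the two breakpoint shifts, and `overlap_dissimilarity` is one minus the reciprocal overlap. SVs are represented as hypothetical `(start, end)` intervals of positive length.

```python
def breakpoint_distance(a, b):
    """Largest shift between corresponding breakpoints, in base pairs (assumed definition)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def overlap_dissimilarity(a, b):
    """One minus reciprocal overlap: 0 for identical intervals, 1 for disjoint ones."""
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    reciprocal = min(shared / (a[1] - a[0]), shared / (b[1] - b[0]))
    return 1.0 - reciprocal

# A large SV pair shifted by 500 bp and a small SV pair shifted by 150 bp.
large_pair = ((1_000, 101_000), (1_500, 101_500))
small_pair = ((1_000, 1_200), (1_150, 1_350))

for name, (a, b) in (("large", large_pair), ("small", small_pair)):
    print(name, breakpoint_distance(a, b), round(overlap_dissimilarity(a, b), 3))
```

Under these assumed definitions the two measures rank the pairs differently: the small pair is closer by breakpoint distance (150 bp versus 500 bp), yet far more dissimilar by overlap (0.75 versus 0.005), because an overlap-based measure scales with SV length.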