Department of Biology, Pennsylvania State University, University Park, PA 16802, USA.
CNRS, CRIStAL, 59655 Villeneuve d'Ascq, France.
Bioinformatics. 2018 Apr 1;34(7):1125-1131. doi: 10.1093/bioinformatics/btx771.
The mammalian Y chromosome is usually under-represented in genome assemblies because of its high repeat content and the low sequencing depth that results from its haploid state. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y chromosome. However, that strategy required key parameters to be set manually, a time-consuming process that led to sub-optimal assemblies.
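To make the idea of abundance-based read selection concrete, here is a minimal Python sketch, not the published implementation. It assumes that, after enrichment, k-mers from the Y occur more often in the read set than k-mers from other chromosomes, so a read is kept when enough of its k-mers reach a chosen abundance. The function names, the `min_fraction` rule and the toy reads are illustrative assumptions; reverse complements and sequencing errors are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def select_reads(reads, counts, k, threshold, min_fraction=0.5):
    """Keep reads in which at least min_fraction of the k-mers
    have abundance >= threshold (hypothetical selection rule)."""
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k
        high = sum(1 for km in read_kmers if counts[km] >= threshold)
        if high / len(read_kmers) >= min_fraction:
            kept.append(read)
    return kept


# Toy data: five copies of an "enriched" read and one background read.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
counts = count_kmers(reads, k=4)
kept = select_reads(reads, counts, k=4, threshold=3)
```

In this sketch `threshold` is supplied by hand, which corresponds to the manual parameter setting described above.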
We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. The algorithm uses prior knowledge in the form of the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternative strategies that filter the reads or the contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection.
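One way such an automatic choice could work is sketched below in Python. This is an illustration under stated assumptions, not RecoverY's actual rule, which the abstract does not specify. The sketch takes "trusted" sequences (for example, Y transcripts or a related species' Y), looks up how abundant their k-mers are in the enriched reads, and uses a low quantile of those abundances as the cutoff. The function name `choose_threshold`, the `quantile` parameter and the toy counts are all hypothetical.

```python
def kmers(seq, k):
    """Return a generator over every substring of length k in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def choose_threshold(read_counts, trusted_seqs, k, quantile=0.05):
    """Pick an abundance cutoff from trusted Y k-mers seen in the reads.

    read_counts maps k-mer -> abundance in the enriched read set.
    The quantile rule is an illustrative assumption.
    """
    # Distinct k-mers from all trusted sequences.
    trusted = {km for seq in trusted_seqs for km in kmers(seq, k)}
    # Abundances of the trusted k-mers that actually occur in the reads.
    abundances = sorted(
        read_counts.get(km, 0) for km in trusted if read_counts.get(km, 0) > 0
    )
    if not abundances:
        raise ValueError("no trusted k-mer occurs in the reads")
    index = int(quantile * (len(abundances) - 1))
    return abundances[index]


# Toy abundances: three k-mers shared with the trusted sequence, one not.
read_counts = {"ACGT": 10, "CGTA": 8, "GTAC": 12, "TTTT": 1}
threshold = choose_threshold(read_counts, ["ACGTAC"], k=4)
```

A low quantile is used here so that most trusted k-mers fall at or above the cutoff; a different summary statistic could be substituted without changing the overall structure.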
Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY.
kmakova@bx.psu.edu or pashadag@cse.psu.edu.
Supplementary data are available at Bioinformatics online.