Wierzbicki Filip, Schwarz Florian, Cannalonga Odontsetseg, Kofler Robert
Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria.
Vienna Graduate School of Population Genetics, Vetmeduni Vienna, Vienna, Austria.
Mol Ecol Resour. 2022 Jan;22(1):102-121. doi: 10.1111/1755-0998.13455. Epub 2021 Aug 28.
In most animals, it is thought that the proliferation of a transposable element (TE) is stopped when the TE jumps into a piRNA cluster. Despite the central importance of piRNA clusters, little is known about their composition and evolutionary dynamics. This is largely because piRNA clusters are notoriously difficult to assemble, as they are frequently composed of highly repetitive DNA. With long reads, we may finally be able to obtain reliable assemblies of piRNA clusters. Unfortunately, it is unclear how to generate and identify the best assemblies, as many assembly strategies exist and standard quality metrics do not account for TEs. To address these problems, we introduce several novel quality metrics that assess: (a) the fraction of completely assembled piRNA clusters, (b) the quality of the assembled clusters and (c) whether an assembly captures the overall TE landscape of an organism (i.e. the abundance and the number of SNPs and internal deletions of all TE families). The requirements for computing these metrics vary, ranging from annotations of piRNA clusters to consensus sequences of TEs and genomic sequencing data. Using these novel metrics, we evaluate the effect of assembly algorithm, polishing, read length, coverage and residual polymorphisms, and identify strategies that yield reliable assemblies of piRNA clusters. Based on an optimized approach, we provide assemblies for the two Drosophila melanogaster strains Canton-S and Pi2. About 80% of known piRNA clusters were assembled in both strains. Finally, we demonstrate the generality of our approach by extending our metrics to humans and Arabidopsis thaliana.
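As an illustration of metric (a), the following minimal Python sketch computes the fraction of completely assembled piRNA clusters. It rests on an assumption the abstract does not state: a cluster is counted as complete when both of its flanking sequences are placed on the same contig. The function name, the input layout and the example data are hypothetical and serve only to make the idea concrete.

```python
from typing import Dict, Optional, Tuple

# Hypothetical input: for each annotated cluster, the contig on which its
# left and right flanking sequences were placed (None if a flank is missing).
FlankHits = Dict[str, Tuple[Optional[str], Optional[str]]]


def fraction_completely_assembled(hits: FlankHits) -> float:
    """Return the share of clusters whose two flanks lie on one contig.

    Assumption (not taken from the abstract): a cluster is treated as
    completely assembled if both flanks are found on the same contig.
    """
    if not hits:
        raise ValueError("no clusters supplied")
    complete = sum(
        1
        for left, right in hits.values()
        if left is not None and left == right
    )
    return complete / len(hits)


# Made-up example data, for illustration only.
example = {
    "cluster_A": ("contig_1", "contig_1"),  # both flanks on one contig
    "cluster_B": ("contig_2", "contig_7"),  # flanks split across contigs
    "cluster_C": ("contig_3", None),        # one flank not found
    "cluster_D": ("contig_4", "contig_4"),  # both flanks on one contig
}

print(fraction_completely_assembled(example))
```

With these made-up inputs, two of the four clusters satisfy the assumed criterion. A stricter variant could additionally require that the sequence between the flanks contains no assembly gaps, which would touch on metric (b).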