Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia, 199034.
Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, California 92093, USA.
Genome Res. 2022 Jun;32(6):1137-1151. doi: 10.1101/gr.276362.121. Epub 2022 May 11.
Recent advances in long-read sequencing have made it possible to address long-standing questions about the architecture and evolution of human centromeres. They have also emphasized the need for centromere annotation (partitioning human centromeres into monomers and higher-order repeats [HORs]). Despite a half-century-long series of semi-manual studies of centromere architecture, a rigorous centromere annotation algorithm is still lacking. Moreover, automated centromere annotation is a prerequisite for studies of genetic diseases associated with centromeres and for evolutionary studies of centromeres across multiple species. Although monomer decomposition (transforming a centromere into a monocentromere written in the monomer alphabet) and HOR decomposition (representing a monocentromere in the alphabet of HORs) are currently viewed as two separate problems, we show that they should be integrated into a single framework in which HOR inference affects monomer inference and vice versa. We thus developed the HORmon algorithm, which integrates monomer and HOR inference and automatically generates human monomers and HORs that are largely consistent with previous semi-manual inference.
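To make the two decompositions named in the abstract concrete, here is a minimal toy sketch in Python. It is not the HORmon algorithm: the 6-bp monomers, the fixed-width windowing, the Hamming-distance matching, and the function names `monomer_decomposition` and `hor_decomposition` are all assumptions made for illustration (real alpha-satellite monomers are roughly 171 bp and need alignment-based matching). It only shows what "writing a centromere in the monomer alphabet" and then "in the alphabet of HORs" means.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def monomer_decomposition(seq, monomers):
    """Translate a nucleotide string into a string over the monomer alphabet.

    Toy assumption: all monomers have the same length k and the sequence
    is cut into consecutive, non-overlapping windows of length k.
    Each window is labeled with the closest monomer by Hamming distance.
    """
    k = len(next(iter(monomers.values())))
    blocks = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    return "".join(
        min(monomers, key=lambda name: hamming(block, monomers[name]))
        for block in blocks
    )


def hor_decomposition(mono, hor):
    """Rewrite a monomer string as a list of (unit, count) pairs.

    Consecutive full copies of the given HOR are collapsed into one
    entry; monomers that are not part of a full copy are kept as
    single-monomer entries.
    """
    out = []
    i = 0
    while i < len(mono):
        if mono.startswith(hor, i):
            if out and out[-1][0] == hor:
                out[-1] = (hor, out[-1][1] + 1)
            else:
                out.append((hor, 1))
            i += len(hor)
        else:
            out.append((mono[i], 1))
            i += 1
    return out


# Hypothetical 6-bp "monomers" and a toy "centromere" made of two full
# ABC copies (one B copy carries a single mismatch) plus a partial copy.
toy_monomers = {"A": "ACGTAC", "B": "TTGACA", "C": "GGCATT"}
toy_centromere = (
    "ACGTAC" "TTGACA" "GGCATT"
    "ACGTAC" "TTGTCA" "GGCATT"
    "ACGTAC" "TTGACA"
)

monocentromere = monomer_decomposition(toy_centromere, toy_monomers)
print(monocentromere)                            # ABCABCAB
print(hor_decomposition(monocentromere, "ABC"))  # [('ABC', 2), ('A', 1), ('B', 1)]
```

In this sketch the monomer set and the HOR are supplied by hand and the two steps run independently. The abstract's point is that this is insufficient in practice: the choice of monomers changes which HORs are found, and vice versa, so the two should be inferred jointly.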