Genovese Loredana M, Geraci Filippo, Corrado Lucia, Mangano Eleonora, D'Aurizio Romina, Bordoni Roberta, Severgnini Marco, Manzini Giovanni, De Bellis Gianluca, D'Alfonso Sandra, Pellegrini Marco
Institute for Informatics and Telematics of CNR, Pisa, Italy.
Department of Health Sciences, University of Eastern Piedmont Amedeo Avogadro, Novara, Italy.
Front Genet. 2018 May 2;9:155. doi: 10.3389/fgene.2018.00155. eCollection 2018.
Polymorphic Tandem Repeats (PTR) are a common form of polymorphism in the human genome. A PTR is a variation, found in an individual or in a population, in the number of repeating units at a Tandem Repeat (TR) locus with respect to the reference genome. Several phenotypic traits and diseases have been found to be strongly associated with, or caused by, specific PTR loci. PTR fall into two main classes: Short Tandem Repeats (STR), whose repeating unit is at most 6 base pairs long, and Variable Number Tandem Repeats (VNTR), whose repeating unit is longer than 6 base pairs. As ever larger populations are screened in high-throughput sequencing projects, it becomes technically feasible and desirable to explore the association between PTR and a wide range of such traits and conditions. To facilitate these studies, we have devised a method for compiling catalogs of PTR from assembled genomes, and we have produced a catalog of PTR for genic regions (exons, introns, UTR and adjacent regions) of the human genome (GRCh38). In the first phase, we applied four different TR discovery software tools to uncover 55,223,485 TR (after duplicate removal) in GRCh38; in the second phase, 373,173 of these were determined to be PTR by comparison with five assembled human genomes. Of these PTR, 263,266 are not included in state-of-the-art PTR catalogs. The new methodology is based mainly on a hierarchical and systematic application of alignment-based sequence comparisons to identify and measure the polymorphism of TR. Whereas previous catalogs focus on STR of small total size, we remove any size restriction, aiming at the more general class of PTR, and we also target fuzzy TR by using specific detection tools. Like previous catalogs of human polymorphic loci, our catalog is oriented toward applications in the discovery of disease-associated loci.
Validation by cross-referencing with existing catalogs on common clinically-relevant loci shows good concordance. Overall, this proposed census of human PTR in genic regions is a shared resource (web accessible), complementary to existing catalogs, facilitating future genome-wide studies involving PTR.
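The two definitions used in the abstract (the STR/VNTR split at a unit size of 6 base pairs, and polymorphism as a change in copy number relative to the reference) can be illustrated with a minimal Python sketch. The names `repeat_class`, `is_polymorphic`, and the `tolerance` parameter are hypothetical and chosen only for illustration. The catalog itself relies on alignment-based sequence comparisons, which this sketch does not attempt to reproduce.

```python
from typing import Iterable

# Threshold taken from the abstract: STR units are at most 6 bp long.
STR_MAX_UNIT = 6


def repeat_class(unit: str) -> str:
    """Classify a tandem repeat by the length of its repeating unit."""
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    return "STR" if len(unit) <= STR_MAX_UNIT else "VNTR"


def is_polymorphic(ref_copies: float,
                   observed_copies: Iterable[float],
                   tolerance: float = 0.0) -> bool:
    """Return True if any compared genome differs from the reference
    copy number by more than `tolerance` (a hypothetical parameter)."""
    return any(abs(c - ref_copies) > tolerance for c in observed_copies)


if __name__ == "__main__":
    # Toy locus: a CAG repeat with 10 copies in the reference and
    # made-up copy numbers in five other assemblies.
    print(repeat_class("CAG"))
    print(is_polymorphic(10, [10, 10, 12, 10, 9]))
```

A simple copy-number comparison of this kind only makes sense once each locus has been located in every genome being compared, which is where the alignment step of the actual method comes in.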