Mouratidis Ioannis, Konnaris Maxwell A, Chantzi Nikol, Chan Candace S Y, Patsakis Michail, Provatas Kimonas, Montgomery Austin, Baltoumas Fotis A, Sha Congzhou M, Mareboina Manvita, Pavlopoulos Georgios A, Chartoumpekis Dionysios V, Georgakopoulos-Soares Ilias
Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA.
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
Genome Res. 2025 Feb 14;35(2):279-295. doi: 10.1101/gr.280070.124.
Despite the exponential increase in sequencing information driven by massively parallel DNA sequencing technologies, universal and succinct genomic fingerprints for each organism are still missing. Identifying the shortest species-specific nucleotide sequences offers insights into species evolution and holds potential practical applications in agriculture, wildlife conservation, and healthcare. We propose a new method for sequence analysis based on nucleic "quasi-primes," the shortest sequences occurring in each of 45,076 organismal reference genomes that are present in one genome and absent from every other examined genome. In the human genome, we find that the genomic loci of nucleic quasi-primes are most enriched for genes associated with brain development and cognitive function. In a single-cell case study focusing on the human primary motor cortex, nucleic quasi-prime genes account for a significantly larger proportion of the variation based on average gene expression. Nonneuronal cell types, including astrocytes, endothelial cells, microglia perivascular-macrophages, oligodendrocytes, and vascular and leptomeningeal cells, exhibit significant activation of quasi-prime-containing gene associations related to cancer, while simultaneously suppressing quasi-prime-containing genes associated with cognitive, mental, and developmental disorders. We also show that human disease-causing variants, eQTLs, mQTLs, and sQTLs are 4.43-fold, 4.34-fold, 4.29-fold, and 4.21-fold enriched at human quasi-prime loci, respectively. These findings indicate that nucleic quasi-primes are genomic loci linked to the evolution of species-specific traits, and in humans, they provide insights into the development of cognitive traits and human diseases, including neurodevelopmental disorders.
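The definition of a quasi-prime given in the abstract (a shortest sequence present in one genome and absent from every other examined genome) can be illustrated with a toy sketch. This is not the authors' pipeline: the function names `kmers` and `quasi_primes`, the `max_k` cutoff, and the in-memory set comparison are assumptions made for illustration only, and the sketch ignores reverse complements and any scaling considerations that 45,076 real genomes would require.

```python
from collections import defaultdict


def kmers(sequence: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def quasi_primes(genomes: dict[str, str], max_k: int = 8) -> dict[str, tuple[int, set[str]]]:
    """For each genome, find the smallest k (up to the hypothetical cutoff
    max_k) at which it has k-mers absent from every other genome.

    Returns {genome name: (k, set of genome-unique k-mers of that length)}.
    Genomes with no unique k-mer up to max_k are omitted from the result.
    """
    result: dict[str, tuple[int, set[str]]] = {}
    for k in range(1, max_k + 1):
        # Map each k-mer to the set of genomes that contain it.
        owners: dict[str, set[str]] = defaultdict(set)
        for name, seq in genomes.items():
            for kmer in kmers(seq, k):
                owners[kmer].add(name)
        for kmer, names in owners.items():
            if len(names) == 1:
                (name,) = names
                # Record only the shortest length at which a genome first
                # has unique k-mers; longer unique k-mers are ignored.
                if name not in result:
                    result[name] = (k, set())
                if result[name][0] == k:
                    result[name][1].add(kmer)
        if len(result) == len(genomes):
            break  # every genome already has its shortest unique k-mers
    return result


# Toy input (made-up sequences, not real genomes).
toy = {"x": "ACGT", "y": "ACGA", "z": "TTGCA"}
toy_result = quasi_primes(toy, max_k=4)
```

In this toy input no single nucleotide is unique to one sequence, so the shortest unique length is 2 for all three. A sequence that is entirely contained in another (for example "ACGT" inside "TACGT") has no unique k-mer at any length and is simply left out of the result.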