Key Laboratory of Zoological Systematics and Evolution and State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.
University of Chinese Academy of Sciences, Beijing 100049, China.
Genome Res. 2019 Apr;29(4):682-696. doi: 10.1101/gr.238733.118. Epub 2019 Mar 12.
The origination of new genes contributes to phenotypic evolution in humans. Two major challenges in the study of new genes are the inference of gene ages and annotation of their protein-coding potential. To tackle these challenges, we created GenTree, an integrated online database that compiles age inferences from three major methods together with functional genomic data for new genes. Genome-wide comparison of the age inference methods revealed that the synteny-based pipeline (SBP) is best suited for recently duplicated genes, whereas the protein-family-based methods are useful for ancient genes. For SBP-dated primate-specific protein-coding genes (PSGs), we performed manual evaluation based on published PSG lists and showed that SBP generated a conservative data set of PSGs by masking less reliable syntenic regions. After assessing the coding potential based on evolutionary constraint and peptide evidence from proteomic data, we curated a list of 254 PSGs with different levels of protein evidence. This list also includes 41 candidate misannotated pseudogenes that encode primate-specific short proteins. Coexpression analysis showed that PSGs are preferentially recruited into organs with rapidly evolving pathways such as spermatogenesis, immune response, mother-fetus interaction, and brain development. For brain development, primate-specific KRAB zinc-finger proteins (KZNFs) are specifically up-regulated in the mid-fetal stage, which may have contributed to the evolution of this critical stage. Altogether, hundreds of PSGs are recruited either to processes under strong selection pressure or to processes supporting an evolving novel organ.