Hoeppner Marc P, Lundquist Andrew, Pirun Mono, Meadows Jennifer R S, Zamani Neda, Johnson Jeremy, Sundström Görel, Cook April, FitzGerald Michael G, Swofford Ross, Mauceli Evan, Moghadam Behrooz Torabi, Greka Anna, Alföldi Jessica, Abouelleil Amr, Aftuck Lynne, Bessette Daniel, Berlin Aaron, Brown Adam, Gearin Gary, Lui Annie, Macdonald J Pendexter, Priest Margaret, Shea Terrance, Turner-Maier Jason, Zimmer Andrew, Lander Eric S, di Palma Federica, Lindblad-Toh Kerstin, Grabherr Manfred G
Science for Life Laboratories, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden.
Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America; Division of Nephrology, Massachusetts General Hospital and Harvard Medical School, Charlestown, Massachusetts, United States of America.
PLoS One. 2014 Mar 13;9(3):e91172. doi: 10.1371/journal.pone.0091172. eCollection 2014.
The domestic dog, Canis familiaris, is a well-established model system for mapping trait and disease loci. While the original draft sequence was of good quality, gaps were abundant, particularly in promoter regions of the genome, which hampered the annotation and study of candidate genes. Here, we present an improved genome build, canFam3.1, which includes 85 Mb of novel sequence and now covers 99.8% of the euchromatic portion of the genome. We also present multiple RNA-Sequencing data sets from 10 different canine tissues to catalog ∼175,000 expressed loci. While about 90% of the coding genes previously annotated by EnsEMBL have measurable expression in at least one sample, the number of transcript isoforms detected by our data expands the EnsEMBL annotations by a factor of four. Syntenic comparison with the human genome revealed an additional ∼3,000 loci that are characterized as protein coding in human and were also expressed in the dog, suggesting that they were previously not annotated in the EnsEMBL canine gene set. In addition to ∼20,700 high-confidence protein coding loci, we found ∼4,600 antisense transcripts overlapping exons of protein coding genes; ∼7,200 intergenic multi-exon transcripts without coding potential, which are likely candidates for long intergenic non-coding RNAs (lincRNAs); and ∼11,000 transcripts that were reported by two different library construction methods but did not fit any of the above categories. Of the lincRNAs, about 6,000 have no annotated orthologs in human or mouse. In a functional analysis, targeting two novel transcripts with shRNA in a mouse kidney cell line altered cell morphology and motility. All in all, we provide a much-improved annotation of the canine genome and suggest regulatory functions for several of the novel non-coding transcripts.