The Heart Institute, Division of Molecular Cardiovascular Biology, Cincinnati Children's Hospital Medical Center.
The Heart Institute, Division of Molecular Cardiovascular Biology, Cincinnati Children's Hospital Medical Center; Department of Pediatrics, University of Cincinnati College of Medicine.
J Vis Exp. 2022 Jul 12;(185). doi: 10.3791/63841.
Next-generation sequencing (NGS) has propelled the field of genomics forward and produced whole genome sequences for numerous animal species and model organisms. However, despite this wealth of sequence information, comprehensive gene annotation efforts have proven challenging, especially for small proteins. Notably, conventional protein annotation methods were designed to intentionally exclude putative proteins encoded by short open reading frames (sORFs) less than 300 nucleotides in length to filter out the exponentially higher number of spurious noncoding sORFs throughout the genome. As a result, hundreds of functional small proteins called microproteins (<100 amino acids in length) have been incorrectly classified as noncoding RNAs or overlooked entirely. Here we provide a detailed protocol to leverage free, publicly available bioinformatic tools to query genomic regions for microprotein-coding potential based on evolutionary conservation. Specifically, we provide step-by-step instructions on how to examine sequence conservation and coding potential using Phylogenetic Codon Substitution Frequencies (PhyloCSF) on the user-friendly University of California Santa Cruz (UCSC) Genome Browser. Additionally, we detail steps to efficiently generate multiple species alignments of identified microprotein sequences to visualize amino acid sequence conservation and recommend resources to analyze microprotein characteristics, including predicted domain structures. These powerful tools can be used to help identify putative microprotein-coding sequences in noncanonical genomic regions or to rule out the presence of a conserved coding sequence with translational potential in a noncoding transcript of interest.
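To make the 300-nucleotide cutoff concrete, the following is a minimal Python sketch that lists ATG-initiated open reading frames shorter than 300 nucleotides on the forward strand of a sequence. It is not PhyloCSF and does not assess conservation or coding potential; it only illustrates the kind of sORF that length-based annotation filters would discard. The function name `find_sorfs`, the `max_nt` and `min_aa` parameters, and the choice to count the stop codon toward ORF length are assumptions made for illustration, not part of the published protocol.

```python
from itertools import product

# Standard genetic code, built in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_sorfs(seq, max_nt=300, min_aa=2):
    """Return forward-strand ATG-to-stop ORFs shorter than max_nt.

    Each hit is (start, end, frame, peptide), with 0-based start and
    exclusive end; the ORF length (end - start) includes the stop codon.
    min_aa is an arbitrary illustrative floor on peptide length.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None  # position of the first in-frame ATG not yet closed
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif CODON_TABLE.get(codon) == "*":
                end = i + 3
                # Translate from the start codon up to (not including) the stop;
                # codons with ambiguous bases are rendered as "X".
                peptide = "".join(
                    CODON_TABLE.get(seq[j:j + 3], "X")
                    for j in range(start, i, 3)
                )
                if end - start < max_nt and len(peptide) >= min_aa:
                    hits.append((start, end, frame, peptide))
                start = None
    return hits


if __name__ == "__main__":
    # Toy sequence: a 21-nt ORF (ATG + 5 x GCT + TAA) with flanking bases.
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    for hit in find_sorfs(toy):
        print(hit)
```

Because any random sequence contains many such short ORFs, a list like this is only a starting set of candidates; the conservation-based evidence described above is what separates plausible microprotein-coding sORFs from spurious ones.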