Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, 5020, Norway.
VIB-UGent Center for Medical Biotechnology, B-9000, Ghent, Belgium.
BMC Biol. 2017 Aug 30;15(1):76. doi: 10.1186/s12915-017-0416-0.
While methods for annotation of genes are increasingly reliable, the exact identification of translation initiation sites remains a challenging problem. Since the N-termini of proteins often contain regulatory and targeting information, developing a robust method for start site identification is crucial. Ribosome profiling reads show distinct patterns of read length distributions around translation initiation sites. These patterns are typically lost in standard ribosome profiling analysis pipelines, when reads from footprints are adjusted to determine the specific codon being translated.
Utilising these signatures in combination with nucleotide sequence information, we build a model capable of predicting translation initiation sites and demonstrate its high accuracy using N-terminal proteomics. Applying this model to prokaryotic translatomes, we re-annotate translation initiation sites and provide evidence of N-terminal truncations and extensions of previously annotated coding sequences. These re-annotations are supported by structural and sequence-based features in addition to N-terminal peptide evidence. Finally, our model identifies 61 novel genes in the Salmonella enterica genome.
Signatures within ribosome profiling read length distributions can be used in combination with nucleotide sequence information to provide accurate genome-wide identification of translation initiation sites.
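As a rough illustration of the idea summarised above, the following Python sketch builds a feature vector for one candidate start site by combining a read-length-by-position count matrix with a one-hot encoding of the surrounding nucleotides. It is not the authors' implementation: the function names, the window size (`flank`), the read length range (`min_len`, `max_len`), the input format (forward-strand reads given as 5' position and length) and the normalisation are all assumptions made for this example.

```python
def length_position_counts(reads, site, flank=10, min_len=20, max_len=40):
    """Count reads by (read length, 5' end offset from `site`).

    reads: iterable of (five_prime_pos, read_length) tuples, forward strand only
           (a hypothetical input format chosen for this sketch).
    Returns one row per read length, each row holding counts for
    offsets -flank..+flank. Read lengths are kept separate rather than
    collapsed onto a single codon position.
    """
    width = 2 * flank + 1
    matrix = [[0] * width for _ in range(max_len - min_len + 1)]
    for pos, length in reads:
        offset = pos - site
        if -flank <= offset <= flank and min_len <= length <= max_len:
            matrix[length - min_len][offset + flank] += 1
    return matrix


def one_hot_sequence(seq):
    """Encode a nucleotide string as a flat 4-per-base one-hot list (A, C, G, T).

    Any other symbol (for example N) becomes four zeros.
    """
    alphabet = "ACGT"
    return [1 if base == letter else 0 for base in seq.upper() for letter in alphabet]


def feature_vector(reads, genome, site, flank=10, min_len=20, max_len=40):
    """Concatenate normalised read-length features with sequence features.

    Assumes site - flank >= 0 and site + flank < len(genome).
    """
    matrix = length_position_counts(reads, site, flank, min_len, max_len)
    total = sum(sum(row) for row in matrix) or 1  # avoid division by zero
    read_features = [count / total for row in matrix for count in row]
    window = genome[site - flank: site + flank + 1]
    return read_features + one_hot_sequence(window)
```

A vector like this, computed for annotated start sites and for negative examples, could then be passed to any standard classifier; the choice of classifier is left open here because the abstract does not specify it.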