Goodall Emily C A, Hodges Freya, Kok Weine, Permana Budi, Cuddihy Thom, Yang Zihao, Kahler Nicole, Shires Kenneth, Pullela Karthik, Torres Von Vergel L, Rooke Jessica L, Delhaye Antoine, Collet Jean-François, Bryant Jack A, Forde Brian M, Hemm Matthew R, Henderson Ian R
Institute for Molecular Bioscience, University of Queensland, Brisbane 4072, Australia.
Environment and Sustainability Institute & Centre for Ecology and Conservation, University of Exeter, Penryn, TR10 9FE, United Kingdom.
Nucleic Acids Res. 2025 Aug 11;53(15). doi: 10.1093/nar/gkaf774.
The advent of high-density mutagenesis and data-mining studies suggests the existence of further coding potential within bacterial genomes. Small or overlapping genes are prevalent across all domains of life, but because they are difficult to detect they are often overlooked in annotation and functional studies. To overcome limitations in existing protein detection methods, we applied a genetics-based approach. We combined transposon insertion sequencing using a dual-selection transposon with a translation reporter to identify translated open reading frames throughout the genome at scale and independently of genome annotation. We applied our method to the well-characterised species Escherichia coli. This method revealed over 200 putative novel protein coding sequences (CDSs). These are mostly short CDSs (<50 amino acids) and include proteins that are highly conserved and that neighbour functionally important genes. Using chromosomal tags, we validated the expression of selected CDSs. We present this method (Protein Identification through Reporter Transposon-Sequencing: PIRT-Seq) as a complement to whole cell proteomics and ribosome trapping for condition-dependent identification of protein CDSs, and as a high-throughput method for testing conditional gene expression. We anticipate this technique will be a starting point for future high-throughput genetics investigations to determine the existence of unannotated genes in multiple bacterial species.