Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden.
Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i222-i231. doi: 10.1093/bioinformatics/btad264.
With advances in long-read transcriptome sequencing, we can now fully sequence transcripts, which greatly improves our ability to study transcription processes. A popular long-read transcriptome sequencing technique is Oxford Nanopore Technologies (ONT), which, through its cost-effective sequencing and high throughput, has the potential to characterize the transcriptome in a cell. However, due to transcript variability and sequencing errors, long cDNA reads need substantial bioinformatic processing to produce a set of isoform predictions. Several genome- and annotation-based methods exist to produce transcript predictions. However, such methods require high-quality genomes and annotations and are limited by the accuracy of long-read splice aligners. In addition, gene families with high heterogeneity may not be well represented by a reference genome and would benefit from reference-free analysis. Reference-free methods to predict transcripts from ONT reads, such as RATTLE, exist, but their sensitivity is not comparable to that of reference-based approaches.
We present isONform, a high-sensitivity algorithm to construct isoforms from ONT cDNA sequencing data. The algorithm is based on iterative bubble popping on gene graphs built from fuzzy seeds from the reads. Using simulated, synthetic, and biological ONT cDNA data, we show that isONform has substantially higher sensitivity than RATTLE, albeit with some loss in precision. On biological data, we show that isONform's predictions are substantially more consistent with those of the annotation-based method StringTie2 than RATTLE's predictions are. We believe isONform can be used both for isoform construction for organisms without well-annotated genomes and as an orthogonal method to verify predictions of reference-based methods.