Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets

Author information

Sergio Alberto Gonzalez, Maximo Rivarola, Andres Ribone, Sergio Lew, Norma Paniego

Affiliations

Instituto de Agrobiotecnología y Biología Molecular (IABIMO), CICVyA, Instituto Nacional de Tecnología Agropecuaria (INTA), Buenos Aires, Argentina.

Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina.

Publication information

Bioinform Biol Insights. 2024 Dec 5;18:11779322241274957. doi: 10.1177/11779322241274957. eCollection 2024.

DOI:10.1177/11779322241274957

PMID:39649541

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11622296/

Abstract

De novo assembly of transcriptomes for species without a reference genome remains a common problem in functional genomics. While methods and algorithms for transcriptome assembly are continually being developed and published, the quality of de novo assemblies built from short reads depends on the complexity of the transcriptome and is limited by several types of errors. One open question is which method is best suited to a given study for obtaining a high-quality de novo assembly. Currently, there are no established protocols for solving the assembly problem that take transcriptome complexity into account. In addition, the accuracy of the quality metrics used to evaluate assemblies remains unclear. In this study, we investigate and discuss how different variables accounting for the complexity of RNA-Seq data influence assembly results independently of the software used. For this purpose, we simulated transcriptomic short-read sequence datasets with varying degrees of complexity from high-quality, full-length predicted transcript models. Subsequently, we conducted de novo assemblies using different assembly programs, and compared and classified the results using both reference-dependent and reference-independent metrics. These metrics were assessed both individually and in combination through multivariate analysis. The degree of alternative splicing and the fragment size of the paired-end reads were identified as the variables with the greatest influence on the assembly results. Moreover, read length and fragment size had different influences on the reconstruction of longer and shorter transcripts. These results underscore the importance of understanding the composition of the transcriptome under study and of making experimental design decisions about read length and fragment size accordingly. In addition, an appropriate choice of assembly software will positively impact the final assembly outcome, affecting the completeness of the represented genes and assembled isoforms and contributing to error reduction.
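As a purely illustrative aside, the sketch below shows how one widely used reference-independent assembly metric, N50, can be computed from contig lengths. The abstract does not state which metrics the authors used, so the choice of N50, the function name `n50`, and the example lengths are all assumptions made for illustration only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    Hypothetical helper for illustration; not taken from the study.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input: no contigs, so report 0


# Made-up contig lengths (in bases), not data from the paper.
contig_lengths = [100, 200, 300, 400, 500]
print(n50(contig_lengths))
```

N50 only summarizes contiguity and says nothing about correctness, which is one reason an evaluation might combine several metrics, as the abstract describes.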
