Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy.
Bioinformatics. 2021 May 1;37(4):464-472. doi: 10.1093/bioinformatics/btaa779.
Recent advances in high-throughput RNA-Seq technologies allow the production of massive datasets. When a study focuses on only a handful of genes, most reads are irrelevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset improves efficiency without compromising the results of the study.
We introduce a novel computational problem, called gene assignment, and propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists of extracting from the sample the reads that were most probably sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark significantly improves the performance of RNA-Seq analysis tools without any impact on the final results.
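The gene assignment problem stated above can be illustrated with a small sketch. The abstract does not describe the algorithm itself, so the following is only one possible alignment-free strategy and should not be read as Shark's method: it assumes that a read is assigned to a gene when a large enough fraction of its k-mers also occurs in that gene. The function names, the parameters `k` and `min_fraction`, and their default values are all hypothetical, and reverse complements and sequencing errors are ignored for brevity.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k of seq (upper-cased)."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(genes, k):
    """Map each k-mer to the set of gene names that contain it."""
    index = defaultdict(set)
    for name, seq in genes.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def assign_reads(reads, genes, k=5, min_fraction=0.6):
    """Assign reads to genes by shared k-mers (illustrative assumption only).

    reads and genes are dicts mapping an identifier to a sequence.
    A read is kept when at least min_fraction of its k-mers occur in
    its best-matching gene; ties are reported as several gene names.
    """
    index = build_index(genes, k)
    assignment = {}
    for read_id, seq in reads.items():
        total = len(seq) - k + 1
        if total <= 0:
            continue  # read shorter than k: no k-mers to compare
        hits = defaultdict(int)
        for km in kmers(seq, k):
            for gene in index.get(km, ()):
                hits[gene] += 1
        if not hits:
            continue  # no k-mer in common with any gene in the panel
        best = max(hits.values())
        if best / total >= min_fraction:
            assignment[read_id] = sorted(g for g, c in hits.items() if c == best)
    return assignment
```

Calling `assign_reads(reads, genes)` returns a dictionary containing only the reads that pass the threshold, which is the filtered subset a downstream analysis tool would then process. Reads that span a novel splice junction share fewer k-mers with the reference gene sequence, which is one way to see why such events make the problem harder.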
The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark.
Supplementary data are available at Bioinformatics online.