Rozadilla Gaston, Clemente Jorgelina Moreiras, McCarthy Christina B
Centro Regional de Estudios Genómicos, Facultad de Ciencias Exactas, Universidad Nacional de La Plata, La Plata, Argentina.
Departamento de Informática y Tecnología, Universidad Nacional del Noroeste de la Provincia de Buenos Aires, Pergamino, Buenos Aires, Argentina.
Bio Protoc. 2020 Jul 20;10(14):e3679. doi: 10.21769/BioProtoc.3679.
Data generated by metagenomic and metatranscriptomic experiments are both voluminous and inherently noisy. When taxonomy-dependent, alignment-based methods are used to classify and label reads, the first step is to run homology searches against sequence databases. To extract as much information as possible from the samples, nucleotide sequences are usually compared against several databases (nucleotide and protein) using local sequence aligners such as BLASTN and BLASTX. However, analyzing and integrating these results can be problematic because the outputs of these searches often disagree, and the inconsistencies can be especially pronounced with RNA-seq data. Moreover, to the best of our knowledge, existing tools do not cross-reference and integrate information from the different homology searches; they report the results of each analysis separately. We developed the HoSeIn workflow to intersect the information from these homology searches and then determine the taxonomic and functional profile of the sample from the integrated information. The workflow is based on the assumption that the sequences belonging to a given taxon comprise (i) sequences that both homology searches assigned to that taxon and (ii) sequences that one homology search assigned to that taxon but that returned no hits in the other.
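The assumption stated at the end of the abstract can be illustrated with a minimal Python sketch. This is not the HoSeIn implementation. The function name `integrate_assignments`, the dictionary inputs (sequence ID mapped to a taxon label, or `None` for no hit), and the example reads are all hypothetical. The abstract does not say how sequences assigned to different taxa by the two searches are handled, so the sketch simply sets them aside as "unresolved", which is an assumption of this example.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


def integrate_assignments(
    blastn: Dict[str, Optional[str]],
    blastx: Dict[str, Optional[str]],
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Intersect per-sequence taxon labels from two homology searches.

    Inputs map sequence ID -> taxon label, or None when the search
    returned no hit (a missing key is also treated as "no hit").
    Returns (sequences grouped by taxon, unresolved sequence IDs).
    """
    by_taxon: Dict[str, Set[str]] = defaultdict(set)
    unresolved: Set[str] = set()

    for seq_id in set(blastn) | set(blastx):
        n_taxon = blastn.get(seq_id)
        x_taxon = blastx.get(seq_id)

        if n_taxon is None and x_taxon is None:
            # No hits in either search: nothing to assign.
            continue
        if n_taxon is not None and x_taxon is not None:
            if n_taxon == x_taxon:
                # Case (i): both searches agree on the taxon.
                by_taxon[n_taxon].add(seq_id)
            else:
                # Assumption of this sketch: conflicting assignments
                # are set aside rather than forced into either taxon.
                unresolved.add(seq_id)
        else:
            # Case (ii): one search assigned a taxon, the other had no hits.
            taxon = n_taxon if n_taxon is not None else x_taxon
            by_taxon[taxon].add(seq_id)

    return dict(by_taxon), unresolved


# Hypothetical example data, for illustration only.
blastn_hits = {"read1": "Bacteria", "read2": "Fungi", "read3": None, "read4": "Bacteria"}
blastx_hits = {"read1": "Bacteria", "read2": None, "read3": "Fungi", "read4": "Fungi", "read5": None}

profile, conflicts = integrate_assignments(blastn_hits, blastx_hits)
```

In this toy input, `read1` falls under case (i), `read2` and `read3` under case (ii), `read4` is set aside because the two searches disagree, and `read5` has no hits in either search.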