Graduate Program in Technology Policy, Yonsei University, 50 Yonsei Ro, Seodaemun Gu, Seoul, 038722, South Korea.
School of Civil and Environmental Engineering, Yonsei University, 50 Yonsei Ro, Seodaemun Gu, Seoul, 038722, South Korea.
BMC Bioinformatics. 2018 Nov 3;19(1):399. doi: 10.1186/s12859-018-2431-8.
Because analyzing a large number of metagenomic sequences demands heavy computing resources and a long time, we examined a small selected part of the metagenomic sequences as a "sample" of the full sequence set, both for a mock community and for 10 existing metagenomics case studies. A mock community of 10 bacterial strains was prepared, and their mixed genomes were sequenced with HiSeq. The BLAST hits against the reference genome of each strain were counted. Each of 176 different small parts selected from these sequences was also searched with BLAST, and the hits were counted for comparison with the search results from the full sequences. We also prepared small parts of the sequences from 10 publicly downloadable research datasets on the MG-RAST service and analyzed these samples with MG-RAST.
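The selection of a small part of the reads can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not state how the parts were chosen, so uniform random selection (reservoir sampling) is an assumption here, and the names `read_fasta` and `reservoir_sample` are hypothetical helpers.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Pick k records uniformly at random in one pass (assumed selection rule).

    The full read set is never held in memory, only the k kept records.
    """
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir
```

With a real file, the call would look like `reservoir_sample(read_fasta(open("reads.fasta")), k=1000)`, where the file name and `k` are placeholders. The resulting subset can then be passed to BLAST or MG-RAST in the same way as the full data.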
Both the BLAST search tests on the mock community and the results from the publicly downloadable MG-RAST studies show that sampling an extremely small part of the sequence data is useful for estimating brief taxonomic information about the original metagenomic sequences. In 9 of the 10 cases, the most annotated classes in the MG-RAST analyses of the partial sample sequences were the same as those in the originals.
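The comparison between a sample and the full data can be sketched as a check on the most annotated taxon and on relative abundances. The functions below are hypothetical helpers, the per-taxon counts are assumed to come from BLAST hits or MG-RAST annotations, and the numbers in the usage note are invented for illustration only.

```python
from typing import Dict, Optional


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-taxon hit counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


def top_taxon(counts: Dict[str, int]) -> Optional[str]:
    """Return the taxon with the most hits, or None for empty input."""
    return max(counts, key=counts.get) if counts else None


def max_abs_difference(full: Dict[str, int], sample: Dict[str, int]) -> float:
    """Largest absolute gap in relative abundance across all taxa.

    A taxon missing from one side is treated as having abundance 0 there.
    """
    full_ra = relative_abundance(full)
    sample_ra = relative_abundance(sample)
    taxa = set(full_ra) | set(sample_ra)
    return max(
        (abs(full_ra.get(t, 0.0) - sample_ra.get(t, 0.0)) for t in taxa),
        default=0.0,
    )
```

For example, with invented counts `full = {"A": 600, "B": 300, "C": 100}` and `sample = {"A": 7, "B": 2, "C": 1}`, `top_taxon(full) == top_taxon(sample)` holds because both return `"A"`, which corresponds to the "same most annotated class" criterion described above.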
When a researcher wants a brief estimate of a metagenome's taxonomic distribution with fewer computing resources and in less time, the researcher can analyze a small selected part of the metagenomic sequences. This approach also makes it possible to build a strategy for monitoring metagenome samples over a wider geographic area and more frequently.