Zhou Tong, Zhao Feng, Xu Kuidong
Laboratory of Marine Organism Taxonomy and Phylogeny, Qingdao Key Laboratory of Marine Biodiversity and Conservation, Institute of Oceanology, Chinese Academy of Sciences, Qingdao 266071, China.
Shandong Province Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao 266071, China.
Microorganisms. 2023 Apr 6;11(4):949. doi: 10.3390/microorganisms11040949.
The integration and reanalysis of big data provide valuable insights for microbiome studies. However, amplicon datasets differ substantially in information scale, which poses a key challenge for data analysis. Reducing batch effects is therefore crucial for integrating large-scale molecular ecology data. An information scale correction (ISC) step, which cuts amplicons of different lengths down to the same sub-region, is essential to achieve this. In this study, we used a hidden Markov model (HMM) method for extraction on 11 different 18S rRNA gene V4 region amplicon datasets comprising 578 samples in total. Amplicon length ranged from 344 bp to 720 bp, depending on the primer position. By comparing the information scale correction of amplicons of varying lengths, we explored the extent to which comparability between samples decreases as amplicon length increases. Our method proved more sensitive than V-Xtractor, the most popular tool for performing ISC. Amplicons of similar scale showed no significant change after ISC, whereas amplicons of larger scale changed significantly. After ISC treatment, the similarity among the datasets improved, especially for long amplicons. We therefore recommend adding an ISC step when integrating big data, as it is crucial for unlocking the full potential of microbial community studies and advancing our knowledge of microbial ecology.
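The idea behind the ISC step, cutting amplicons of different lengths down to the same sub-region, can be illustrated with a minimal sketch. This is not the HMM method used in the study: it is a simplified stand-in that locates the sub-region by exact matching of two flanking anchor motifs. The function name `trim_to_subregion`, the anchor motifs, and the toy reads are all hypothetical and chosen only for illustration.

```python
from typing import Optional


def trim_to_subregion(seq: str, left_anchor: str, right_anchor: str) -> Optional[str]:
    """Return the part of seq lying between the two anchors (anchors excluded).

    Returns None when either anchor is missing, so reads that do not
    span the shared sub-region can be dropped before comparison.
    Exact string matching is an assumption made for brevity; it stands in
    for the profile-based (HMM) detection described in the abstract.
    """
    seq = seq.upper()
    start = seq.find(left_anchor)
    if start == -1:
        return None
    start += len(left_anchor)
    end = seq.find(right_anchor, start)
    if end == -1:
        return None
    return seq[start:end]


# Hypothetical anchors and toy reads: a short amplicon and a longer one
# that carries extra flanking sequence from a different primer position.
LEFT, RIGHT = "GGTAA", "TTCCG"
short_read = "GGTAA" + "ACGTACGT" + "TTCCG"
long_read = "CCCATT" + "GGTAA" + "ACGTACGT" + "TTCCG" + "AGGTCA"

trimmed = [trim_to_subregion(r, LEFT, RIGHT) for r in (short_read, long_read)]
print(trimmed)
```

After trimming, both reads reduce to the same sub-region, so they carry the same information scale and can be compared directly. Real amplicons contain mismatches and indels near conserved sites, which is why a profile-based approach such as an HMM is used in practice instead of exact matching.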