Zha Yuguo, Chong Hui, Ning Kang
Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-Imaging, Department of Bioinformatics and Systems Biology, Center of AI Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China.
Front Microbiol. 2021 Apr 7;12:642439. doi: 10.3389/fmicb.2021.642439. eCollection 2021.
A huge quantity of microbiome samples has accumulated, and more are yet to come from niches all around the globe. As data accumulate, there is an urgent need to compare and search microbiome samples among thousands of millions of samples quickly and accurately. However, identifying similar samples, and their likely origins, in such a large pool of samples from around the world is a very difficult computational challenge. Several approaches have been proposed for this challenge, based on distance calculation, unsupervised algorithms, or supervised algorithms. These methods have advantages and disadvantages in different comparison and search settings, and their results also differ drastically. In this review, we systematically compared distance-based, unsupervised, and supervised methods for microbiome sample comparison and search. First, we assessed their accuracy and efficiency, both in theory and in practice. Second, we described the scenarios in which one or more methods are applicable for sample searches. Third, we presented several applications of microbiome sample comparison and search, and offered suggestions on the choice of methods. Finally, we provided several perspectives on the future development of microbiome sample comparison and search, including deep learning technologies for tracking the sources of microbiome samples.