Şener Duygu Dede, Santoni Daniele, Felici Giovanni, Oğul Hasan
Başkent University, Faculty of Engineering, Computer Engineering Department, Ankara, Turkey.
Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Rome, Italy.
J Integr Bioinform. 2018 Oct 26;15(4):20170067. doi: 10.1515/jib-2017-0067.
Finding similarities and differences between metagenomic samples in large repositories is a significant challenge for researchers. In recent years, several studies have proposed content-based retrieval from different perspectives. In this study, we develop a content-based retrieval framework for identifying relevant metagenomic samples. The framework consists of feature extraction, feature selection methods and similarity measures for whole metagenome sequencing samples. The performance of the framework was evaluated on the given samples against a ground truth: a retrieved sample is labeled relevant (a positive sample) if it comes from a patient with the same disease as the query, and irrelevant otherwise. The experimental results show that relevant experiments can be detected using different fingerprinting approaches. We observed that Latent Semantic Analysis (LSA) is a promising fingerprinting approach for representing metagenomic samples and finding relevance among them. Source code and executable files are available at www.baskent.edu.tr/~hogul/WMS_retrieval.rar.
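The pipeline summarized in the abstract (feature extraction, a compact fingerprint, then a similarity measure for ranking) can be illustrated with a minimal sketch. The abstract does not specify the features, so overlapping k-mer counts are assumed here purely for illustration; the LSA step is approximated with a plain truncated SVD from NumPy, and cosine similarity is assumed as the similarity measure. All function names, parameters and toy reads are hypothetical and are not taken from the authors' released code.

```python
from itertools import product

import numpy as np


def kmer_counts(reads, k=3):
    """Count overlapping k-mers over all reads of one sample (assumed feature set)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = np.zeros(len(kmers))
    for read in reads:
        for i in range(len(read) - k + 1):
            j = index.get(read[i:i + k])
            if j is not None:  # skip k-mers containing ambiguous bases such as N
                vec[j] += 1
    return vec


def lsa_fingerprints(X, n_components=2):
    """Project the rows of a sample-by-feature matrix onto its top singular directions."""
    # Convert raw counts to relative frequencies so sequencing depth does not dominate.
    X = X / np.maximum(X.sum(axis=1, keepdims=True), 1)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def rank_by_cosine(fingerprints, query_idx):
    """Return the other sample indices, most similar to the query first."""
    norms = np.linalg.norm(fingerprints, axis=1, keepdims=True)
    F = fingerprints / np.maximum(norms, 1e-12)
    sims = F @ F[query_idx]
    order = np.argsort(-sims)
    return [int(i) for i in order if i != query_idx]


# Toy data: two AT-rich and two GC-rich "samples" (hypothetical reads).
samples = [
    ["ATATATATAT", "TATATTATAA"],
    ["ATTATATATA", "AATATATTAT"],
    ["GCGCGCGGCC", "CGCGGCGCGC"],
    ["GGCGCGCCGC", "CCGCGCGCGG"],
]
X = np.vstack([kmer_counts(reads) for reads in samples])
fingerprints = lsa_fingerprints(X, n_components=2)
ranking = rank_by_cosine(fingerprints, query_idx=0)
print(ranking)
```

With disease labels attached to each sample, a ranking like this could be scored against the ground truth described in the abstract by checking whether the top-ranked samples share the query's label.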