Torbati Mahbaneh Eshaghzadeh, Mitreva Makedonka, Gopalakrishnan Vanathi
Department of Computer Science, University of Pittsburgh, 6135 Sennott Square, 210 S Bouquet St, Pittsburgh, PA 15260-9161, USA.
Department of Medicine, Washington University School of Medicine, 660 S Euclid Ave, St. Louis, MO 63110, USA.
Data (Basel). 2016 Dec;1(3). doi: 10.3390/data1030019. Epub 2016 Dec 13.
Human microbiome data from genomic sequencing technologies are accumulating rapidly, giving us insights into the bacterial taxa that contribute to health and disease. Predictive modeling of such microbiota count data to classify human infection by parasitic worms (helminths) can help in detecting and managing these infections across global populations. Real-world datasets from microbiome experiments are typically sparse: they contain measurements for hundreds of bacterial species, of which only a few are detected in the bio-specimens analyzed. This sparsity means that more observations are needed for accurate predictive modeling, a challenge that has previously been addressed with various feature reduction methods. To our knowledge, integrative methods such as transfer learning have not yet been explored in the microbiome domain as a way to deal with data sparsity by incorporating knowledge from different but related datasets. One way of incorporating this knowledge is to use a meaningful mapping among the features of these datasets. In this paper, we claim that such a mapping exists among the members of each individual cluster, where taxa are grouped based on their phylogenetic dependency and their association with the phenotype. We validate our claim by showing that models incorporating associations in such a grouped feature space result in no performance deterioration for the given classification task.
We test our hypothesis using classification models that detect helminth infection in the microbiota of human fecal samples obtained from Indonesia and Liberia. In our experiments, we first learn binary classifiers for helminth infection detection using Naive Bayes, Support Vector Machine, Multilayer Perceptron, and Random Forest methods. In the next step, we add taxonomic modeling by using the SMART-scan module to group the data, and learn classifiers with the same four methods to test the validity of the resulting groupings. Over 10 runs of 10-fold cross-validation, we observed performance improvements of 6% to 23% in the Area Under the receiver operating characteristic (ROC) Curve (AUC) and 7% to 26% in Balanced Accuracy (Bacc). These results show that using phylogenetic dependency to group our microbiota data yields a noticeable improvement in classification performance for helminth infection detection. The promising results of this feasibility study demonstrate that methods such as SMART-scan can be used in the future for knowledge transfer between different but related microbiome datasets through phylogenetically related functional mapping, enabling novel integrative biomarker discovery.
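As a rough illustration of the two ideas the abstract relies on, the sketch below collapses sparse per-taxon counts into group-level counts and computes the two reported measures, AUC and Balanced Accuracy, for a binary task. It is not SMART-scan and does not reproduce the authors' pipeline: the function names, the placeholder taxon and group names, the taxon-to-group map, and all numbers are assumptions made up for this example, and summing counts within a group is only one simple way to form a grouped feature space.

```python
from collections import defaultdict


def group_counts(sample, taxon_to_group):
    """Sum the counts of all taxa that map to the same group (hypothetical helper)."""
    grouped = defaultdict(int)
    for taxon, count in sample.items():
        # Taxa missing from the map are kept as their own singleton group.
        grouped[taxon_to_group.get(taxon, taxon)] += count
    return dict(grouped)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = infected)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up sparse sample: most taxa have zero or low counts.
sample = {"taxon_A1": 3, "taxon_A2": 0, "taxon_B1": 5, "taxon_C1": 2}
# Made-up mapping from taxa to phylogenetically related groups.
taxon_to_group = {"taxon_A1": "group_A", "taxon_A2": "group_A", "taxon_B1": "group_B"}

grouped = group_counts(sample, taxon_to_group)
# grouped == {"group_A": 3, "group_B": 5, "taxon_C1": 2}

bacc = balanced_accuracy([1, 1, 0, 0], [1, 0, 0, 0])   # 0.75
area = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])         # 0.75
```

Balanced Accuracy averages the per-class recall, so it is less flattering than plain accuracy when infected and uninfected samples are unevenly represented, which is presumably why it is reported alongside AUC.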