LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data.

Author affiliations

Department of Integrative Biology & Centre for Biodiversity Genomics, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada.

Department of Biology, McMaster University, 1280 Main St. West, Hamilton, ON, L8S 4K1, Canada.

Publication information

BMC Bioinformatics. 2022 Mar 31;23(1):110. doi: 10.1186/s12859-022-04631-z.

Abstract

BACKGROUND

Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, picking out the important ASVs in high-throughput sequencing (HTS) datasets is difficult. Noise, algorithmic failures to account for specific distributional properties, and feature interactions can complicate the discovery of ASV biomarkers. In addition, these issues can impact the replicability of various models and elevate false-discovery rates. Contemporary machine learning approaches can be leveraged to address these issues. Ensembles of decision trees are particularly effective at classifying the types of data commonly generated in HTS studies because they remain robust when the number of features in the training data is orders of magnitude larger than the number of samples. In addition, when combined with appropriate model introspection algorithms, machine learning algorithms can also be used to discover and select potential biomarkers. However, the construction of these models could introduce various biases which potentially obfuscate feature discovery.

RESULTS

We developed a decision tree ensemble, LANDMark, which uses oblique and non-linear cuts at each node. In synthetic and toy tests LANDMark consistently ranked as the best classifier and often outperformed the Random Forest classifier. When trained on the full metabarcoding dataset obtained from Canada's Wood Buffalo National Park, LANDMark was able to create highly predictive models and achieved an overall balanced accuracy score of 0.96 ± 0.06. The use of recursive feature elimination did not impact LANDMark's generalization performance and, when trained on data from the BE amplicon, it was able to outperform the Linear Support Vector Machine, Logistic Regression models, and Stochastic Gradient Descent models (p ≤ 0.05). Finally, LANDMark distinguishes itself due to its ability to learn smoother non-linear decision boundaries.
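
To make the idea of an oblique cut concrete, the following is a minimal sketch and not the LANDMark implementation. It contrasts a classic axis-aligned node test (one feature against a threshold) with an oblique test (a linear combination of features against zero) on a small, made-up two-feature dataset, and scores both with balanced accuracy, computed here as the mean of per-class recall. The function names, the toy data, the threshold, and the weights are all assumptions chosen for illustration.

```python
# Illustrative sketch only; names, data, and parameters are hypothetical
# and are not taken from the LANDMark code base.
from collections import defaultdict


def axis_aligned_split(x, feature, threshold):
    """Classic decision-tree test: compare a single feature to a threshold."""
    return 1 if x[feature] > threshold else 0


def oblique_split(x, weights, bias):
    """Oblique test: compare a linear combination of features to zero."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class contributes equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data: a sample is class 1 when its second feature exceeds its first,
# so the true boundary is a diagonal line rather than an axis-parallel one.
samples = [(0.1, 0.9), (0.2, 0.6), (0.4, 0.5), (0.7, 0.8),
           (0.9, 0.1), (0.6, 0.2), (0.5, 0.4), (0.8, 0.7)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# One axis-aligned cut on the second feature (threshold picked by hand).
axis_pred = [axis_aligned_split(x, feature=1, threshold=0.45) for x in samples]

# One oblique cut along the diagonal: -x0 + x1 > 0.
oblique_pred = [oblique_split(x, weights=(-1.0, 1.0), bias=0.0) for x in samples]

axis_score = balanced_accuracy(labels, axis_pred)
oblique_score = balanced_accuracy(labels, oblique_pred)
print(f"axis-aligned: {axis_score:.3f}, oblique: {oblique_score:.3f}")
```

On this toy data no single-feature threshold separates the two classes, whereas a single oblique cut does. An axis-aligned tree would need several stacked splits to approximate the same diagonal boundary, which is one intuition for why node tests that combine features can yield smoother decision boundaries.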

CONCLUSIONS

Our work introduces LANDMark, a meta-classifier which blends the characteristics of several machine learning models into a decision tree and ensemble learning framework. To our knowledge, this is the first study to apply this type of ensemble approach to amplicon sequencing data and we have shown that analyzing these datasets using LANDMark can produce highly predictive and consistent models.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8a95/8969335/3e778cff04fc/12859_2022_4631_Fig1_HTML.jpg
