IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):250-263. doi: 10.1109/TCBB.2018.2858814. Epub 2018 Jul 23.
Optimal Bayesian feature filtering (OBF) is a fast and memory-efficient algorithm that optimally identifies markers with distributional differences between treatment groups under Gaussian models. Here, we study the performance and robustness of OBF for biomarker discovery. Our contributions are twofold: (1) we examine how OBF performs on data that violate its modeling assumptions, and (2) we provide guidelines on how to set input parameters for robust performance. Contribution (1) addresses an important and commonplace problem in computational biology, where it is often impossible to validate an algorithm's core assumptions. To accomplish both tasks, we present a battery of simulations that run OBF with different inputs and challenge each assumption it makes. In particular, we examine the robustness of OBF to incorrect input parameters, incorrectly assumed independence, and imbalanced sample sizes, and we test the Gaussianity assumption by evaluating performance on an extensive family of non-Gaussian distributions. Throughout, we discuss the advantages and disadvantages of different priors and optimization criteria. Finally, we evaluate the utility of OBF in biomarker discovery using acute myeloid leukemia (AML) and colon cancer microarray datasets, and show that OBF successfully identifies well-known biomarkers for these diseases that rank low under the moderated t-test.