Wang Wei, Smith Jack, Hejase Hussein A, Liu Kevin J
1 Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA.
2 Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA.
Algorithms Mol Biol. 2020 Apr 16;15:7. doi: 10.1186/s13015-020-00167-0. eCollection 2020.
Non-parametric and semi-parametric resampling procedures are widely used to perform support estimation in computational biology and bioinformatics. Among the most widely used methods in this class is the standard bootstrap method, which consists of random sampling with replacement. Although the bootstrap and related techniques do not require any particular parametric model for resampling, they do assume that sites are independent and identically distributed (i.i.d.). The i.i.d. assumption can be an over-simplification for many problems in computational biology and bioinformatics. In particular, sequential dependence within biomolecular sequences is often an essential biological feature, arising from biochemical function, evolutionary processes such as recombination, and other factors. To relax the simplifying i.i.d. assumption, we propose a new non-parametric/semi-parametric sequential resampling technique that generalizes "Heads-or-Tails" mirrored inputs, a simple but clever technique due to Landan and Graur. The generalized procedure takes the form of random walks along either aligned or unaligned biomolecular sequences. We refer to our new method as the SERES (or "SEquential RESampling") method. To demonstrate the performance of the new technique, we apply SERES to estimate support for the multiple sequence alignment problem. Using simulated and empirical data, we show that SERES-based support estimation performs comparably to, and typically better than, state-of-the-art methods.
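The contrast between i.i.d. bootstrap resampling and sequential resampling by random walk can be illustrated with a minimal Python sketch. This is not the published SERES procedure: the abstract does not specify how the walk starts, when it reverses, or how long it runs, so the function names, the `reverse_prob` parameter, the uniformly random start, and the reflection at sequence ends are all assumptions made here for illustration only.

```python
import random


def bootstrap_columns(n_sites, rng):
    """Standard bootstrap: draw n_sites column indices independently,
    uniformly at random, with replacement."""
    return [rng.randrange(n_sites) for _ in range(n_sites)]


def random_walk_columns(n_sites, length, reverse_prob, rng):
    """Illustrative sequential resampling: a random walk over column indices.

    Assumptions (NOT taken from the source): the walk starts at a uniformly
    random site and direction, reverses direction with probability
    reverse_prob before each step, and reflects at both sequence ends.
    """
    if n_sites < 2:
        raise ValueError("need at least two sites to walk")
    pos = rng.randrange(n_sites)
    step = rng.choice((-1, 1))
    walk = [pos]
    while len(walk) < length:
        # Occasionally turn around (hypothetical reversal rule).
        if rng.random() < reverse_prob:
            step = -step
        # Reflect if the next step would leave the sequence.
        if not 0 <= pos + step < n_sites:
            step = -step
        pos += step
        walk.append(pos)
    return walk


def resample(alignment, indices):
    """Build a replicate by reading the given column indices from every row."""
    return ["".join(row[i] for i in indices) for row in alignment]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy aligned sequences, invented for this example.
    alignment = ["ACGTACGT", "ACGAACGT", "A-GTACTT"]
    n = len(alignment[0])
    print(resample(alignment, bootstrap_columns(n, rng)))
    print(resample(alignment, random_walk_columns(n, 2 * n, 0.1, rng)))
```

In the bootstrap replicate, neighbouring columns are unrelated draws, whereas in the random-walk replicate every column is adjacent in the original sequence to the one before it, so local sequential context is preserved in forward or mirrored order.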