Department of Mathematics, Florida Gulf Coast University, Fort Myers, FL, USA.
Department of Statistics, Florida State University, Tallahassee, FL, USA.
BMC Bioinformatics. 2018 Apr 11;19(1):131. doi: 10.1186/s12859-018-2140-3.
Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems.
We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences.
We have developed an efficient general-purpose segmentation tool and showed that it had comparable or more accurate results than many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions.
识别基因组的功能元素通常需要将基因组上的测量序列划分为具有不同属性的片段,例如不同的平均值。尽管在基因组学研究中已经开发了数十种算法来解决这个问题,但仍需要更精确和快速的方法来有效地解决现有的和新兴的基因组和表观基因组分割问题。
我们设计了一种名为 iSeg 的高效算法,用于基因组和表观基因组谱的分割。iSeg 首先利用动态规划来识别候选片段并测试其显著性。然后,它使用一种基于两个耦合平衡二叉树的数据结构来检测重叠的显著片段,并在搜索和细化阶段同时更新它们。最后,对显著片段进行细化和合并,以生成最终的片段集。通过使用基于片段 p 值的目标函数,该算法可以作为一个通用的计算框架,与数据分布的不同假设相结合。作为一种通用的分割方法,它可以分割不同类型的基因组和表观基因组数据,如 DNA 拷贝数变异、核小体占有率、核酸酶敏感性和差异核酸酶敏感性数据。我们使用简单的 t 检验来计算不同类型的多个数据集的 p 值,并用模拟数据集和实验数据集对 iSeg 进行了评估,并与一些其他常用方法进行了比较,结果表明,与一些经常使用更复杂统计模型的常用方法相比,iSeg 的性能令人满意。用 C++实现的 iSeg 也非常高效,非常适合处理大量输入的谱和具有非常长序列的数据集。
我们开发了一种高效的通用分割工具,并表明它的结果与当代基因组数据分析中使用的许多最流行的片段调用算法相当或更准确。iSeg 能够分析具有正负值的数据集。可调参数允许用户轻松调整统计严格程度,以最佳匹配各个数据集的生物学性质,包括广泛或稀疏映射的基因组数据集或具有非正态分布的数据集。