BACKGROUND

Identifying the functional elements of a genome often requires dividing a sequence of measurements along the genome into segments such that adjacent segments differ in some property, such as their mean value. Although dozens of algorithms have been developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to tackle both existing and emerging genomic and epigenomic segmentation problems.

RESULTS

We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first uses dynamic programming to identify candidate segments and test them for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during the searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. Because its objective function is based on the p-values of the segments, the algorithm can serve as a general computational framework that can be combined with different assumptions about the distribution of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values, we evaluated iSeg on both simulated and experimental datasets of several different types and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also computationally efficient and well suited to large numbers of input profiles and to data with very long sequences.
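The idea of ranking candidate segments by p-value can be illustrated with a minimal Python sketch. This is not the iSeg implementation: it swaps the t-test for a two-sided z-test against a zero baseline (using the profile's global standard deviation), scans fixed window lengths with prefix sums instead of the dynamic programming search, and resolves overlaps greedily rather than with the coupled balanced binary trees. The names `score_windows`, `pick_segments`, `lengths`, and `alpha` are hypothetical and chosen only for this illustration.

```python
import math
from itertools import accumulate


def score_windows(values, lengths):
    """Score every window of the given lengths with a two-sided z-test.

    Assumption: the null hypothesis is a mean of 0 with the profile's
    global standard deviation. iSeg's own test and search differ.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    prefix = [0.0] + list(accumulate(values))  # prefix[i] = sum(values[:i])
    scored = []
    for length in lengths:
        for start in range(n - length + 1):
            end = start + length  # half-open interval [start, end)
            seg_mean = (prefix[end] - prefix[start]) / length
            z = seg_mean / (sd / math.sqrt(length))
            p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
            scored.append((p, start, end, seg_mean))
    return scored


def pick_segments(scored, alpha):
    """Greedily keep the most significant windows that do not overlap."""
    chosen = []
    for p, start, end, seg_mean in sorted(scored):
        if p >= alpha:
            break  # remaining windows are all less significant
        if all(end <= s or start >= e for s, e, _, _ in chosen):
            chosen.append((start, end, seg_mean, p))
    return sorted(chosen)


# Toy profile: flat baseline with one raised and one lowered block.
profile = (
    [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
    + [2.0] * 4
    + [0.1, -0.1, 0.1, -0.1]
    + [-2.0] * 4
    + [0.1, -0.1]
)
segments = pick_segments(score_windows(profile, lengths=(2, 4)), alpha=0.01)
for start, end, seg_mean, p in segments:
    print(f"[{start}, {end}) mean={seg_mean:.2f} p={p:.4f}")
```

Because the score is a p-value rather than a raw mean, the same selection step handles both positive and negative deviations, and lowering `alpha` makes the selection more stringent.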

CONCLUSIONS

We have developed an efficient general-purpose segmentation tool and shown that its results are comparable to, or more accurate than, those of many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg can analyze datasets that contain both positive and negative values. Tunable parameters allow users to adjust the statistical stringency to match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets and those with non-normal distributions.
