MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK.
MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK.
Bioinformatics. 2022 Sep 15;38(18):4255-4263. doi: 10.1093/bioinformatics/btac525.
Genome sequencing experiments have revolutionized molecular biology by allowing researchers to identify important DNA-encoded elements genome-wide. Regions where these elements are found appear as peaks in the analog signal of an assay's coverage track, and although humans can categorize these patterns visually with ease, the size of many genomes necessitates algorithmic implementations. Commonly used methods focus on statistical tests to classify peaks, discounting the fact that the background signal does not completely follow any known probability distribution and reducing the information-dense peak shapes to their maximum height alone. Deep learning has been shown to be highly accurate for many pattern recognition tasks, on par with or even exceeding human capabilities, providing an opportunity to reimagine and improve peak calling.
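To make the limitation concrete, the following is a minimal Python sketch of the kind of height-based statistical test described above. It is not LanceOtron's method, nor the algorithm of any specific peak caller. The Poisson background model, the use of the track-wide mean as the background rate, the function names `poisson_sf` and `call_peaks`, and the default threshold are all illustrative assumptions.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), summed term by term."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # First term P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    j = k
    # Add successive terms until they no longer change the running sum.
    while term > 0.0 and term > total * 1e-17:
        total += term
        j += 1
        term *= lam / j
    return min(total, 1.0)


def call_peaks(coverage, threshold=1e-5):
    """Return (start, end, max_height, p_value) for enriched runs.

    coverage is a list of integer read depths, one per position.
    Assumption: the background rate is the mean depth of the whole track.
    Each run of positions above that rate is scored by its maximum height
    only, so all other information about the peak's shape is discarded.
    """
    if not coverage:
        return []
    lam = sum(coverage) / len(coverage)
    peaks = []
    start = None
    # A trailing sentinel of 0 closes a run that reaches the end of the track.
    for i, depth in enumerate(list(coverage) + [0]):
        above = depth > lam
        if above and start is None:
            start = i
        elif not above and start is not None:
            height = max(coverage[start:i])
            p_value = poisson_sf(int(height), lam)
            if p_value < threshold:
                peaks.append((start, i, height, p_value))
            start = None
    return peaks


if __name__ == "__main__":
    # Hypothetical toy track: flat background with one enriched region.
    track = [1] * 50 + [5, 12, 20, 12, 5] + [1] * 50
    for start, end, height, p_value in call_peaks(track):
        print(start, end, height, p_value)
```

In this sketch a sharp, narrow spike and a broad, well-formed peak of the same maximum height receive the same p-value, which is the loss of shape information the paragraph above refers to. Intervals are half-open, so `end` is the first position after the run.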
We present the peak calling framework LanceOtron, which combines deep learning for recognizing peak shape with multifaceted enrichment calculations for assessing significance. In benchmarking ATAC-seq, ChIP-seq and DNase-seq, LanceOtron outperforms long-standing, gold-standard peak callers through its improved selectivity and near-perfect sensitivity.
A fully featured web application is freely available from LanceOtron.molbiol.ox.ac.uk, a Python command-line interface is pip-installable from PyPI at https://pypi.org/project/lanceotron/, and source code and benchmarking tests are available at https://github.com/LHentges/LanceOtron.
Supplementary data are available at Bioinformatics online.