k-nonical space: sketching with reverse complements.

Affiliation

Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States.

Publication information

Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae629.

Abstract

MOTIVATION

Sequences that are equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis or non-biological string algorithms. Despite this striking difference, algorithms for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Only then, as a post-processing step, are they adapted to genomic sequences by folding a k-mer and its reverse complement into a single sequence: the canonical representation (k-nonical space).
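The folding step described above can be illustrated with a minimal Python sketch. It assumes the common convention of taking the lexicographically smaller of a k-mer and its reverse complement as the canonical form; the abstract does not say which convention is used, and the function names here are hypothetical.

```python
# Complement table for the DNA alphabet (uppercase A, C, G, T only;
# ambiguity codes such as N are not handled in this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Fold a k-mer and its reverse complement into one representative.

    Assumption: the representative is the lexicographic minimum of the two.
    """
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq: str, k: int) -> list[str]:
    """List the canonical form of every k-mer of seq, in order."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
```

Under this convention a k-mer and its reverse complement always map to the same string, so both strands of a double-stranded molecule yield the same multiset of canonical k-mers.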

RESULTS

The effect of using the canonical representation with sketching methods is understudied and poorly understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms that were not designed to accommodate them. In particular, we show that large stretches of the genome ("sketching deserts") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to downstream algorithms that use these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space.
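One way to look for undersampled stretches is to record which k-mer positions a sketching rule selects and measure the longest run it skips. The Python sketch below is illustrative only and is not the procedure or code from the paper. It reads "context-free" as meaning that the decision to select a k-mer depends only on that k-mer. The helper names, the toy selection rule, and the lexicographic-minimum canonical form are all assumptions.

```python
from typing import Callable, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Assumed convention: lexicographic minimum of k-mer and reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def sampled_positions(seq: str, k: int,
                      is_selected: Callable[[str], bool],
                      use_canonical: bool) -> List[int]:
    """Start positions of k-mers chosen by a context-free rule.

    is_selected sees only the k-mer itself (or its canonical form),
    never its neighbours in the sequence.
    """
    positions = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if use_canonical:
            kmer = canonical(kmer)
        if is_selected(kmer):
            positions.append(i)
    return positions


def longest_unsampled_run(positions: List[int], n_kmers: int) -> int:
    """Longest run of consecutive k-mer start positions that were not selected."""
    longest = 0
    prev = -1
    for p in positions:
        longest = max(longest, p - prev - 1)
        prev = p
    # Account for the tail after the last selected position.
    return max(longest, n_kmers - prev - 1)


def starts_with_a(kmer: str) -> bool:
    """Toy selection rule, purely hypothetical."""
    return kmer.startswith("A")
```

Running `sampled_positions` once with `use_canonical=False` and once with `use_canonical=True` on the same sequence, then comparing `longest_unsampled_run` for the two results, shows how folding into canonical form changes which regions a given rule covers.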

AVAILABILITY AND IMPLEMENTATION

The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4b53/11549021/483973dfab85/btae629f1.jpg
