Pacybara: accurate long-read sequencing for barcoded mutagenized allelic libraries.

Affiliations

Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON M5G 1X5, Canada.

Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada.

Publication information

Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae182.

DOI:10.1093/bioinformatics/btae182

PMID:38569896

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11021806/

Abstract

MOTIVATION

Long-read sequencing technologies, an attractive solution for many applications, often suffer from higher error rates. Alignment of multiple reads can improve base-calling accuracy, but some applications, e.g. sequencing mutagenized libraries where multiple distinct clones differ by one or a few variants, require the use of barcodes or unique molecular identifiers. Unfortunately, sequencing errors can interfere with correct barcode identification, and a given barcode sequence may be linked to multiple independent clones within a given library.
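
To illustrate why aligning multiple reads improves accuracy, the following is a minimal sketch of per-column majority voting. It assumes the reads have already been aligned to equal length; the function name `column_consensus` and the example reads are hypothetical and are not taken from Pacybara.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Majority-vote consensus over reads already aligned to equal length.

    Illustrative only: real consensus callers also handle gaps and
    base qualities, which this sketch ignores.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Pick the most frequent base observed at this position.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Hypothetical reads of one clone, each carrying at most one error.
example_reads = ["ACGTAC", "ACGTTC", "ACGTAC", "ACCTAC"]
print(column_consensus(example_reads))
```

Isolated errors are outvoted because they occur in different reads at different positions. This only works once reads have been grouped by clone, which is why barcodes are needed.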

RESULTS

Here we focus on the target application of sequencing mutagenized libraries in the context of multiplexed assays of variant effects (MAVEs). MAVEs are increasingly used to create comprehensive genotype-phenotype maps that can aid clinical variant interpretation. Many MAVE methods use long-read sequencing of barcoded mutant libraries for accurate association of barcode with genotype. Existing long-read sequencing pipelines do not account for inaccurate sequencing or nonunique barcodes. Here, we describe Pacybara, which handles these issues by clustering long reads based on the similarities of (error-prone) barcodes while also detecting barcodes that have been associated with multiple genotypes. Pacybara also detects recombinant (chimeric) clones and reduces false positive indel calls. In three example applications, we show that Pacybara identifies and correctly resolves these issues.
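
The two ideas named above, clustering reads by the similarity of error-prone barcodes and detecting barcodes linked to more than one genotype, can be sketched as follows. This is a simplified illustration under stated assumptions, not Pacybara's actual algorithm. The greedy seeding strategy, the function names (`edit_distance`, `cluster_reads`, `flag_nonunique`), the thresholds `max_dist` and `min_fraction`, and the example reads are all hypothetical.

```python
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cluster_reads(reads, max_dist=1):
    """Greedy clustering of (barcode, genotype) reads.

    Barcodes are visited from most to least abundant; each joins the
    first existing seed within max_dist edits, or becomes a new seed.
    """
    counts = Counter(barcode for barcode, _ in reads)
    seeds = []
    assignment = {}
    for barcode, _n in counts.most_common():
        match = next((s for s in seeds
                      if edit_distance(barcode, s) <= max_dist), None)
        if match is None:
            seeds.append(barcode)
            match = barcode
        assignment[barcode] = match

    clusters = defaultdict(list)
    for barcode, genotype in reads:
        clusters[assignment[barcode]].append(genotype)
    return dict(clusters)


def flag_nonunique(clusters, min_fraction=0.25):
    """Return seeds whose cluster holds more than one well-supported genotype."""
    flagged = {}
    for seed, genotypes in clusters.items():
        tally = Counter(genotypes)
        frequent = [g for g, n in tally.items()
                    if n / len(genotypes) >= min_fraction]
        if len(frequent) > 1:
            flagged[seed] = sorted(frequent)
    return flagged


# Hypothetical reads: (barcode, genotype called from the read).
example_reads = [
    ("AAAAAAAA", "p.A10T"), ("AAAAAAAA", "p.A10T"), ("AAAAAAAA", "p.A10T"),
    ("AAAAAAAT", "p.A10T"),  # barcode with a single sequencing error
    ("CCCCGGGG", "p.L5P"), ("CCCCGGGG", "p.L5P"),
    ("CCCCGGGG", "p.R7W"), ("CCCCGGGG", "p.R7W"),  # barcode shared by two clones
]
example_clusters = cluster_reads(example_reads)
print(flag_nonunique(example_clusters))
```

In this toy input, the erroneous barcode `AAAAAAAT` is absorbed into the `AAAAAAAA` cluster, while `CCCCGGGG` is flagged because two genotypes each account for half of its reads. A production pipeline would also need to distinguish genuine second genotypes from sequencing errors in the genotype itself, which this sketch does only crudely through `min_fraction`.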

AVAILABILITY AND IMPLEMENTATION

Pacybara, freely available at https://github.com/rothlab/pacybara, is implemented using R, Python, and bash for Linux. It runs on GNU/Linux HPC clusters via Slurm, PBS, or GridEngine schedulers. A single-machine simplex version is also available.
