Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, Canada.
PLoS Comput Biol. 2021 Mar 22;17(3):e1008815. doi: 10.1371/journal.pcbi.1008815. eCollection 2021 Mar.
Across the life sciences, processing next-generation sequencing data commonly relies upon a computationally expensive process in which reads are mapped onto a reference sequence. Prior to such processing, however, a vast amount of information can be ascertained from the reads themselves, potentially obviating the need for mapping or allowing optimized mapping approaches to be deployed. Here, we present a method termed FlexTyper, which facilitates a "reverse mapping" approach in which high-throughput sequence queries, in the form of k-mer searches, are run against indexed short-read datasets in order to extract useful information. This reverse mapping approach enables the rapid counting of target sequences of interest. We demonstrate FlexTyper's utility for recovering depth of coverage and for accurate genotyping of SNP sites across the human genome. We show that genotyping unmapped reads can correctly inform a sample's population, sex, and relatedness in a family setting. Detection of pathogen sequences within RNA-seq data was sensitive and accurate, performing comparably to existing methods but with increased flexibility. We present two examples of how this flexibility allows the analysis of genome features not well represented in a linear reference. First, we analyze contigs from African genome sequencing studies, showing how they distribute across families from three distinct populations. Second, we show how gene-marking k-mers for the killer immune receptor locus allow allele detection in a region that is challenging for standard read mapping pipelines. The future adoption of the reverse mapping approach represented by FlexTyper will be enabled by more efficient methods for FM-index generation and by biology-informed collections of reference queries. In the long term, selection of population-specific references or weighting of edges in pan-population reference genome graphs will be possible using the FlexTyper approach.
FlexTyper is available at https://github.com/wassermanlab/OpenFlexTyper.
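The reverse mapping idea described in the abstract can be illustrated with a minimal sketch. This is not FlexTyper's implementation or its command-line interface: a plain in-memory k-mer counter stands in for the FM-index, and every name here (`index_reads`, `kmers_spanning`, `count_query`, `call_genotype`), the toy sequences, and the `min_frac` threshold are hypothetical choices made for illustration. It shows only the general pattern of querying indexed reads with k-mers that overlap a SNP and comparing reference and alternate allele counts.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_reads(reads, k):
    """Count every k-mer in the reads (a toy stand-in for an FM-index)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmers_spanning(flank5, allele, flank3, k):
    """All k-mers of flank5 + allele + flank3 that overlap the allele's first base."""
    seq = flank5 + allele + flank3
    pos = len(flank5)
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first, last + 1)]


def count_query(index, kmers):
    """Sum hits on both strands. Use an odd k so that no k-mer equals its own
    reverse complement, which would otherwise be counted twice."""
    return sum(index[km] + index[revcomp(km)] for km in kmers)


def call_genotype(ref_count, alt_count, min_frac=0.2):
    """Naive genotype call from allele counts; min_frac is an assumed threshold."""
    total = ref_count + alt_count
    if total == 0:
        return "no-call"
    alt_frac = alt_count / total
    if alt_frac < min_frac:
        return "hom-ref"
    if alt_frac > 1 - min_frac:
        return "hom-alt"
    return "het"


# Toy data: a C/T SNP with 5-base flanks, and four unmapped "reads".
K = 5
FLANK5, FLANK3 = "GATTA", "GGCTT"
ref_read = FLANK5 + "C" + FLANK3
alt_read = FLANK5 + "T" + FLANK3
reads = [ref_read, revcomp(ref_read), alt_read, alt_read]

index = index_reads(reads, K)
ref_count = count_query(index, kmers_spanning(FLANK5, "C", FLANK3, K))
alt_count = count_query(index, kmers_spanning(FLANK5, "T", FLANK3, K))
print(ref_count, alt_count, call_genotype(ref_count, alt_count))
```

In this sketch the reads are indexed once and any number of k-mer queries can then be run against the same index, which is the property that makes the query side flexible. A real index over a sequencing run would need a compressed structure such as the FM-index named in the abstract rather than a Python dictionary.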