Rytsareva Inna, Campo David S, Zheng Yueli, Sims Seth, Thankachan Sharma V, Tetik Cansu, Chirag Jain, Chockalingam Sriram P, Sue Amanda, Aluru Srinivas, Khudyakov Yury
Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, Atlanta, GA, USA.
School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
BMC Genomics. 2017 May 24;18(Suppl 4):372. doi: 10.1186/s12864-017-3732-4.
Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis is frequently used to study HCV outbreaks and transmission chains: a cluster of sequences is identified as linked by transmission if their genetic distances fall below a previously defined threshold. However, HCV exists as a population of numerous variants in each infected individual, and minority variants in the source are often the ones responsible for transmission. Using a single sequence per individual would therefore miss many such transmissions. Next-generation sequencing greatly increases the sensitivity of transmission detection but poses a considerable computational challenge, because all sequences need to be compared among all pairs of samples.
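The threshold rule and the cost of the all-against-all comparison can be illustrated with a minimal Python sketch. It assumes that sequences are already aligned to equal length, that distance is a simple count of mismatching positions, and that two samples are linked when their closest pair of sequences is within the threshold; the function names and the toy data are hypothetical and are not taken from the authors' software.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(sample_a, sample_b):
    """Smallest distance over every sequence pair drawn from two samples.

    This is the expensive part: len(sample_a) * len(sample_b) comparisons.
    """
    return min(hamming(s, t) for s in sample_a for t in sample_b)


def linked_pairs(samples, max_diffs):
    """Return sample pairs whose closest sequences differ by <= max_diffs."""
    links = []
    for (name_a, seqs_a), (name_b, seqs_b) in combinations(sorted(samples.items()), 2):
        if min_distance(seqs_a, seqs_b) <= max_diffs:
            links.append((name_a, name_b))
    return links


# Toy data (hypothetical): samples A and B share a minority variant.
samples = {
    "A": ["ACGTACGT", "ACGTACGA"],
    "B": ["ACGTACGA", "TTTTACGA"],
    "C": ["GGGGGGGG"],
}
print(linked_pairs(samples, max_diffs=1))
```

In this toy case A and B are linked only through the second sequence of A, which a single consensus sequence per sample could miss. The brute-force search grows with the product of the sample sizes for every pair of samples, which is the workload a filtering strategy aims to reduce.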
We developed a three-step strategy that filters pairs of samples according to different criteria: (i) a k-mer Bloom filter, (ii) a Levenshtein filter and (iii) a filter of identical sequences. We applied these three filters to a set of samples that covers the spectrum of genetic relationships among HCV cases, from belonging to the same transmission cluster to belonging to different subtypes.
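The abstract does not describe how each filter works, so the following Python sketch is only one plausible reading of how three such checks could be chained. Several assumptions are made: a plain Python set stands in for the Bloom filter, the k-mer check relies on the q-gram lemma (two sequences of length L within d edits share at least L - k + 1 - k*d k-mers, so the check is only safe when that quantity is at least 1), the order of the steps is chosen for the sketch, and all function names are hypothetical.

```python
def kmer_set(seqs, k):
    """All k-mers found in a collection of sequences (set stand-in for a Bloom filter)."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def levenshtein_within(s, t, max_edits):
    """True if the Levenshtein distance between s and t is <= max_edits."""
    if abs(len(s) - len(t)) > max_edits:
        return False  # the length gap alone already exceeds the budget
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # match or substitution
        if min(curr) > max_edits:
            return False  # row minima never decrease, so stop early
        prev = curr
    return prev[-1] <= max_edits


def samples_related(sample_a, sample_b, k, max_edits):
    """Decide whether two samples contain sequences within max_edits of each other."""
    set_a, set_b = set(sample_a), set(sample_b)
    # Identical-sequence check: a shared sequence means distance 0.
    if set_a & set_b:
        return True
    # k-mer check: with no shared k-mer, no pair can be within max_edits
    # (assumes L - k + 1 - k * max_edits >= 1 for the sequence length L).
    if kmer_set(set_a, k).isdisjoint(kmer_set(set_b, k)):
        return False
    # Levenshtein check: bounded edit distance on the surviving pairs only.
    return any(levenshtein_within(s, t, max_edits) for s in set_a for t in set_b)


print(samples_related(["ACGTACGT"], ["ACGTACGA"], k=4, max_edits=1))
print(samples_related(["ACGTACGT"], ["GGGGGGGG"], k=4, max_edits=1))
```

The cheap set operations settle many sample pairs without any alignment, and the quadratic dynamic programming step runs only on pairs that neither shortcut could resolve. A real Bloom filter would trade a small false-positive rate in the k-mer check for much lower memory use; false positives only cost extra Levenshtein work and do not change the final answer.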
Our three-step filtering strategy rapidly removes 85.1% of all the pairwise sample comparisons and 91.0% of all pairwise sequence comparisons, accurately establishing which pairs of HCV samples are below the relatedness threshold.
We present a fast and efficient three-step filtering strategy that removes most sequence comparisons while accurately establishing the transmission links that any threshold-based method would identify. This workflow will allow a faster response and greater molecular detection capacity, improving the rate at which viral transmissions are detected from molecular data.