Deakin Claire T, Deakin Jeffrey J, Ginn Samantha L, Young Paul, Humphreys David, Suter Catherine M, Alexander Ian E, Hallwirth Claus V
Gene Therapy Research Unit, Children's Medical Research Institute and The Children's Hospital at Westmead, Westmead, New South Wales 2145, Australia.
Molecular Genetics Division, Victor Chang Cardiac Research Institute, Sydney, Darlinghurst, New South Wales 2010, Australia.
Nucleic Acids Res. 2014;42(16):e129. doi: 10.1093/nar/gku607. Epub 2014 Jul 10.
Barcoded vectors are promising tools for investigating clonal diversity and dynamics in hematopoietic gene therapy. Analysis of clones marked with barcoded vectors requires accurate identification of potentially large numbers of individually rare barcodes, when the exact number, sequence identity and abundance are unknown. This is an inherently challenging application, and the feasibility of using contemporary next-generation sequencing technologies is unresolved. To explore this potential application empirically, without prior assumptions, we sequenced barcode libraries of known complexity. Libraries containing 1, 10 and 100 Sanger-sequenced barcodes were sequenced using an Illumina platform, with a 100-barcode library also sequenced using a SOLiD platform. Libraries containing 1 and 10 barcodes were distinguished from false barcodes generated by sequencing error by a several log-fold difference in abundance. In 100-barcode libraries, however, expected and false barcodes overlapped and could not be resolved by bioinformatic filtering and clustering strategies. In independent sequencing runs multiple false-positive barcodes appeared to be represented at higher abundance than known barcodes, despite their confirmed absence from the original library. Such errors, which potentially impact barcoding studies in an application-dependent manner, are consistent with the existence of both stochastic and systematic error, the mechanism of which is yet to be fully resolved.
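The abundance-based separation described above can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes reads have already been trimmed to the barcode region, and it uses a hypothetical threshold, `min_log10_gap`, for the smallest log10 drop in abundance accepted as a clean boundary between expected and error-derived barcodes. The function name and the toy counts are likewise invented for illustration.

```python
import math
from collections import Counter


def split_by_abundance_gap(reads, min_log10_gap=2.0):
    """Split observed barcodes at the widest log-fold drop in abundance.

    reads: iterable of barcode sequences, one per sequencing read.
    min_log10_gap: hypothetical threshold (an assumption, not a value
        taken from the study).
    Returns (kept, rejected) as lists of (barcode, count) pairs, or None
    when no gap is wide enough to separate the two populations.
    """
    ranked = Counter(reads).most_common()  # sorted by count, descending
    if len(ranked) < 2:
        return None  # fewer than two distinct sequences: nothing to compare

    # log10 fold difference between each barcode and the next most abundant
    gaps = [
        math.log10(ranked[i][1] / ranked[i + 1][1])
        for i in range(len(ranked) - 1)
    ]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[widest] < min_log10_gap:
        return None  # abundances overlap; no clean boundary
    return ranked[: widest + 1], ranked[widest + 1:]


# Toy data (made up): two abundant barcodes plus rare error-like variants.
toy_reads = ["AAAA"] * 10000 + ["CCCC"] * 8000 + ["AAAT"] * 5 + ["CCCG"] * 3
result = split_by_abundance_gap(toy_reads)
```

A rule of this kind can only work when the two populations are well separated, as reported for the 1- and 10-barcode libraries. When abundances overlap, as in the 100-barcode libraries, the function returns `None` rather than guessing a cutoff, and a false barcode that is more abundant than a true one would be misclassified by any threshold of this type.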