Ontario Institute for Cancer Research, MaRS Centre, Toronto, Ontario, Canada.
BMC Bioinformatics. 2012 Aug 17;13:206. doi: 10.1186/1471-2105-13-206.
It is now well established that nearly 20% of human cancers are caused by infectious agents, and the list of human oncogenic pathogens is expected to grow across a variety of cancer types. Whole tumor transcriptome and genome sequencing with next-generation sequencing technologies presents an unparalleled opportunity for pathogen detection and discovery in human tissues, but it requires the development of new genome-wide bioinformatics tools.
Here we present CaPSID (Computational Pathogen Sequence IDentification), a comprehensive bioinformatics platform for identifying, querying and visualizing both exogenous and endogenous pathogen nucleotide sequences in tumor genomes and transcriptomes. CaPSID includes a scalable, high-performance database for data storage and a web application that integrates the genome browser JBrowse. CaPSID also provides useful metrics for sequence analysis of pre-aligned BAM files, such as gene and genome coverage, and is optimized to run efficiently on multiprocessor computers with low memory usage.
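To make the coverage metrics concrete, the following is a minimal sketch of one common way to define them: the fraction of bases in a genome or gene that is overlapped by at least one aligned read. It is not taken from CaPSID; the function names (`merge_intervals`, `coverage_fraction`) and the half-open coordinate convention are assumptions made for illustration, and the alignments are assumed to have already been extracted from a BAM file as (start, end) pairs on a single reference sequence.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: half-open interval [start, end) on one reference.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_fraction(
    alignments: Iterable[Interval], region_start: int, region_end: int
) -> float:
    """Fraction of [region_start, region_end) covered by at least one alignment."""
    length = region_end - region_start
    if length <= 0:
        raise ValueError("region must have positive length")
    # Keep only alignments that overlap the region, clipped to its bounds.
    clipped = [
        (max(start, region_start), min(end, region_end))
        for start, end in alignments
        if start < region_end and end > region_start
    ]
    covered = sum(end - start for start, end in merge_intervals(clipped))
    return covered / length


if __name__ == "__main__":
    reads = [(0, 50), (40, 100), (200, 250)]  # made-up alignments
    print(coverage_fraction(reads, 0, 1000))   # genome-level coverage
    print(coverage_fraction(reads, 180, 260))  # coverage of one gene
```

Under this definition, genome coverage uses the full reference length as the region, while gene coverage restricts the region to the gene's coordinates. Merging intervals first ensures that overlapping reads are not counted twice.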
To demonstrate the usefulness and efficiency of CaPSID, we carried out a comprehensive analysis of both a simulated dataset and transcriptome samples from ovarian cancer. CaPSID correctly identified all of the human and pathogen sequences in the simulated dataset, while in the ovarian dataset CaPSID's predictions were successfully validated in vitro.