Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers.

Authors

Friis-Nielsen Jens, Kjartansdóttir Kristín Rós, Mollerup Sarah, Asplund Maria, Mourier Tobias, Jensen Randi Holm, Hansen Thomas Arn, Rey-Iglesia Alba, Richter Stine Raith, Nielsen Ida Broman, Alquezar-Planas David E, Olsen Pernille V S, Vinner Lasse, Fridholm Helena, Nielsen Lars Peter, Willerslev Eske, Sicheritz-Pontén Thomas, Lund Ole, Hansen Anders Johannes, Izarzugaza Jose M G, Brunak Søren

Affiliations

Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark.

Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, DK-1350 Copenhagen, Denmark.

Publication information

Viruses. 2016 Feb 19;8(2):53. doi: 10.3390/v8020053.

Abstract

Virus discovery from high-throughput sequencing data often follows a bottom-up approach in which taxonomic annotation precedes association with disease. Although effective in some cases, this approach fails to detect novel pathogens and remote variants that are absent from reference databases. We have developed a species-independent pipeline that uses sequence clustering to identify nucleotide sequences that co-occur across multiple sequencing data instances. We applied the workflow to 686 sequencing libraries from 252 cancer samples of different cancer and tissue types, 32 non-template controls, and 24 test samples. Recurrent sequences were statistically associated with biological, methodological, or technical features, with the aim of identifying novel pathogens or plausible contaminants associated with a particular kit or method. We provide examples of identified inhabitants of the healthy tissue flora as well as experimental contaminants. Unmapped sequences that co-occur with high statistical significance potentially represent the unknown sequence space in which novel pathogens can be identified.
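The association step described in the abstract can be illustrated with a small sketch. The abstract does not name the statistical test used, so the one-sided Fisher's exact test below is an assumption chosen for illustration, and the function names, library identifiers, and the "kit A" feature are all hypothetical. The idea is to ask whether the libraries that contain a given sequence cluster are enriched for a particular feature, such as a sample type or an extraction kit.

```python
from math import comb


def enrichment_p_value(k, n_cluster, n_feature, n_total):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of seeing at least k libraries that both contain the
    cluster and carry the feature, given n_cluster libraries with the
    cluster and n_feature libraries with the feature among n_total.
    """
    upper = min(n_cluster, n_feature)
    total_ways = comb(n_total, n_cluster)
    tail = sum(
        comb(n_feature, i) * comb(n_total - n_feature, n_cluster - i)
        for i in range(k, upper + 1)
    )
    return tail / total_ways


def associate(cluster_libs, feature_libs, all_libs):
    """Test one cluster against one feature; inputs are sets of library IDs."""
    k = len(cluster_libs & feature_libs)
    return enrichment_p_value(k, len(cluster_libs), len(feature_libs), len(all_libs))


# Toy data (hypothetical): 10 libraries, 5 prepared with "kit A",
# and a sequence cluster observed in 4 libraries, all of them kit A.
all_libs = {f"L{i}" for i in range(10)}
kit_a_libs = {"L0", "L1", "L2", "L3", "L4"}
cluster_libs = {"L0", "L1", "L2", "L3"}

p = associate(cluster_libs, kit_a_libs, all_libs)
print(f"p = {p:.4f}")
```

In a real analysis each cluster would be tested against many features, so the resulting p-values would need a multiple-testing correction before any cluster is called a contaminant or a candidate pathogen.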


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/98c5/4776208/e341b9f35d47/viruses-08-00053-g001.jpg
