Analysis of 329,942 SARS-CoV-2 records retrieved from GISAID database.

Affiliations

Quantori, 625 Massachusetts Ave, Cambridge, MA, 02139, USA; Mental Health Research Center, Kashirskoe Shosse 34, 115522, Moscow, Russia.

Quantori, 625 Massachusetts Ave, Cambridge, MA, 02139, USA.

Publication information

Comput Biol Med. 2021 Dec;139:104981. doi: 10.1016/j.compbiomed.2021.104981. Epub 2021 Oct 26.

Abstract

BACKGROUND

The SARS-CoV-2 virus caused a worldwide pandemic, although none of its predecessors in the coronavirus family ever spread on such a scale. The key to understanding the global success of SARS-CoV-2 lies in its genome.

MATERIALS AND METHODS

We retrieved data for 329,942 SARS-CoV-2 records uploaded to the GISAID database from the beginning of the pandemic until January 8, 2021. A Python variant detection script was developed to process the data using pairwise2 from the Biopython library. Sequence alignments were performed for every gene separately (except ORF1ab, which was not studied). Genomes shorter than 26,000 nucleotides were excluded from the analysis. Clustering was performed using HDBSCAN.
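The filtering and variant-listing steps described above can be illustrated with a minimal sketch. It is not the authors' script: it assumes the alignment has already been produced (the abstract names pairwise2 for that step) and works on two gapped strings of equal length. The function names, the `offset` parameter, and the `del<start>-<end>` label format are hypothetical choices made for this example.

```python
from typing import List

# Length threshold stated in the abstract; shorter genomes are excluded.
MIN_GENOME_LENGTH = 26_000


def passes_length_filter(genome: str, min_length: int = MIN_GENOME_LENGTH) -> bool:
    """Return True if the genome is long enough to be kept."""
    return len(genome) >= min_length


def call_variants(aligned_ref: str, aligned_sample: str, offset: int = 0) -> List[str]:
    """List SNPs and deletions from two gapped, equal-length aligned strings.

    `offset` is the number of reference bases preceding the aligned region,
    so reported positions are 1-based genome coordinates (an assumption of
    this sketch, useful when each gene is aligned separately).
    """
    if len(aligned_ref) != len(aligned_sample):
        raise ValueError("aligned sequences must have equal length")

    variants: List[str] = []
    ref_pos = offset  # count of reference bases consumed so far
    i, n = 0, len(aligned_ref)
    while i < n:
        r, s = aligned_ref[i], aligned_sample[i]
        if r == "-":
            # Insertion relative to the reference: skipped here, since the
            # abstract reports only SNPs and deletions.
            i += 1
            continue
        ref_pos += 1
        if s == "-":
            # Merge a run of consecutive deleted reference bases into one event.
            start = ref_pos
            while i + 1 < n and aligned_sample[i + 1] == "-" and aligned_ref[i + 1] != "-":
                i += 1
                ref_pos += 1
            variants.append(f"del{start}-{ref_pos}")
        elif r != s and r in "ACGT" and s in "ACGT":
            # Ambiguous bases such as N are ignored rather than called as SNPs.
            variants.append(f"{r}{ref_pos}{s}")
        i += 1
    return variants
```

For instance, `call_variants("ACGTACGT", "ACATA--T", offset=100)` reports one substitution at position 103 and one two-base deletion covering positions 106 to 107.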

RESULTS

Here, we addressed the genetic variability of SARS-CoV-2 using 329,942 samples. The analysis yielded 155 SNPs and deletions, each present in more than 0.3% of the sequences. Clustering results suggested that a proportion of people (2.46%) was infected with a distinct subtype of the B.1.1.7 variant, which contained four to six additional mutations (G28881A, G28882A, G28883C, A23403G, A28095T, G25437T). Two clusters were formed by mutations in the samples uploaded predominantly by Denmark and Australia (1.48% and 2.51%, respectively). A correlation coefficient matrix revealed 160 pairs of mutations with a correlation coefficient greater than 0.7. We also examined the completeness of the GISAID database, as well as patient gender and age. Finally, we found ORF6 and E to be the most conserved genes (96.15% and 94.66% of the sequences, respectively, exactly match the reference). Our results indicate multiple areas for further research in both SARS-CoV-2 studies and health science.
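The two thresholds reported above (mutation frequency above 0.3% and correlation coefficient above 0.7) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the abstract does not say which correlation coefficient was used, so Pearson correlation on binary presence/absence vectors is assumed here, and the function names and the set-per-sample input format are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Set, Tuple


def frequent_mutations(samples: Sequence[Set[str]], min_fraction: float = 0.003) -> List[str]:
    """Return mutations present in more than `min_fraction` of the samples."""
    if not samples:
        return []
    counts: Dict[str, int] = {}
    for mutations in samples:
        for m in mutations:
            counts[m] = counts.get(m, 0) + 1
    total = len(samples)
    return sorted(m for m, c in counts.items() if c / total > min_fraction)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(
    samples: Sequence[Set[str]], mutations: Sequence[str], threshold: float = 0.7
) -> List[Tuple[str, str, float]]:
    """Return mutation pairs whose presence/absence vectors correlate above `threshold`."""
    # One binary column per mutation: 1 if the sample carries it, else 0.
    columns = {m: [1 if m in s else 0 for s in samples] for m in mutations}
    pairs = []
    for a, b in combinations(mutations, 2):
        r = pearson(columns[a], columns[b])
        if r > threshold:
            pairs.append((a, b, r))
    return pairs
```

With four toy samples in which G28881A and G28882A always occur together, `correlated_pairs` returns that single pair. At the scale of the real data set, a vectorized implementation would be the more practical choice.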

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f98d/8547852/b0f7f0faa45b/ga1_lrg.jpg
