Barbosa Raquel de M, Fernandes Marcelo A C
MIT Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02142, USA.
Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte, Natal 59078-970, Brazil.
Data Brief. 2020 Apr 25;30:105618. doi: 10.1016/j.dib.2020.105618. eCollection 2020 Jun.
As of April 16, 2020, the novel coronavirus disease (COVID-19) had spread to more than 185 countries/regions, with more than 2,000,000 confirmed cases and more than 142,000 deaths. In bioinformatics, one of the crucial tasks is the analysis of the virus nucleotide sequences using approaches such as data stream processing, digital signal processing, and machine learning techniques and algorithms. However, to make these approaches feasible, the nucleotide sequence strings must first be transformed into a numerical representation. To that end, this dataset provides a chaos game representation (CGR) of SARS-CoV-2 virus nucleotide sequences. It contains the CGR of 100 instances of the SARS-CoV-2 virus, 11,540 instances of other viruses from the Virus-Host DB dataset, and three instances of Riboviria viruses from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21).
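The transformation from a nucleotide string to numerical values can be illustrated with a minimal sketch of the classic chaos game iteration. This is not the dataset authors' implementation. The corner assignment in `CORNERS`, the starting point (0.5, 0.5), the handling of `U` and of ambiguous symbols, and the function name `cgr_points` are all assumptions made for illustration. Each nucleotide moves the current point halfway toward the corner of the unit square assigned to that nucleotide.

```python
from typing import List, Tuple

# Assumed corner assignment on the unit square (a common convention,
# not taken from the dataset description).
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(sequence: str) -> List[Tuple[float, float]]:
    """Return one (x, y) point per recognized nucleotide in `sequence`."""
    x, y = 0.5, 0.5  # assumed starting point: center of the square
    points: List[Tuple[float, float]] = []
    for base in sequence.upper():
        if base == "U":  # treat RNA uracil as thymine (assumption)
            base = "T"
        corner = CORNERS.get(base)
        if corner is None:
            continue  # skip ambiguous symbols such as N (assumption)
        # Move halfway from the current point toward the nucleotide's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACGT"):
        print(point)
```

Under these assumptions each sequence of length n maps to at most n points in the unit square, which can then be passed to signal processing or machine learning methods as ordinary numerical data.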