Department of Computer Science, Georgia State University, Atlanta, GA, USA.
IBM T. J. Watson Research Center, Yorktown Heights, Yorktown, NY, USA.
Sci Rep. 2023 Mar 13;13(1):4154. doi: 10.1038/s41598-023-31368-3.
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data for the SARS-CoV-2 genome: millions of sequences and counting. This amount of data is orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, yet it is a rich resource for machine learning (ML) approaches, which offer an alternative way to extract such information. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. Experiments on a wide array of ML models show that, for specific embedding methods and under certain simulated noise on the input sequences, some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
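To make the idea of perturbing sequences under a budget concrete, the following is a minimal sketch and not the procedure used in the paper. The function name `perturb_sequence` and the parameters `budget` and `indel_fraction` are hypothetical, and drawing error positions uniformly at random is a simplifying assumption. Loosely, a low `indel_fraction` resembles a substitution-dominated profile (as is typical of Illumina reads), while a high one resembles an indel-dominated profile (as is typical of PacBio long reads).

```python
import random

NUCLEOTIDES = "ACGT"


def perturb_sequence(seq, budget, indel_fraction=0.0, seed=None):
    """Return a noisy copy of `seq` with round(budget * len(seq)) errors.

    Hypothetical helper for illustration only.
    budget         -- fraction of positions that receive an error (0..1)
    indel_fraction -- probability that an error is an insertion or a
                      deletion rather than a substitution
    """
    rng = random.Random(seed)
    n_errors = int(round(budget * len(seq)))
    # Assumption: error positions are uniform over the sequence.
    positions = set(rng.sample(range(len(seq)), n_errors))

    out = []
    for i, base in enumerate(seq):
        if i not in positions:
            out.append(base)
            continue
        if rng.random() < indel_fraction:
            if rng.random() < 0.5:
                continue  # deletion: drop this base
            # insertion: keep the base and add a random one after it
            out.append(base)
            out.append(rng.choice(NUCLEOTIDES))
        else:
            # substitution: replace with a different nucleotide
            out.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
    return "".join(out)


if __name__ == "__main__":
    original = "ACGT" * 25
    noisy = perturb_sequence(original, budget=0.1, seed=0)
    mismatches = sum(a != b for a, b in zip(original, noisy))
    print(f"substitutions introduced: {mismatches}")
```

A robustness benchmark in this spirit would embed both the original and the perturbed sequences, then compare a model's accuracy across increasing values of `budget`.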