Ecological and Evolutionary Signal Processing & Informatics Laboratory, Drexel University, 3100 Chestnut St., Philadelphia, PA, 19104, United States of America.
Comput Biol Med. 2022 Oct;149:105969. doi: 10.1016/j.compbiomed.2022.105969. Epub 2022 Aug 17.
Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but these studies typically lack sequence-level information for the virus. Studies that do obtain viral genome sequences are generally limited in time, location, and population scope. Retrospective meta-analyses require time-consuming data extraction from heterogeneous formats and are limited to publicly available reports. Fortuitously, a subset of GISAID, the global SARS-CoV-2 sequence repository, includes "patient status" metadata that can indicate whether a sequence record is associated with mild or severe disease. While GISAID lacks data on comorbidities relevant to severity, such as obesity and chronic disease, it does include metadata for age and sex to use as additional attributes in modeling. With these caveats, previous efforts have demonstrated that genotype-patient status models can be fit to GISAID data, particularly when country-of-origin is used as an additional feature. But are these models robust and biologically meaningful? This paper shows that temporal and geographic biases in sequences submitted to GISAID, as well as the evolving pandemic response, particularly the reduction in severe disease due to vaccination, create complex issues for model development and interpretation. This paper proposes a potential solution: efficient mixed-effects machine learning using GPBoost, treating country as a random effect group. Training and validation using temporally split GISAID data and emerging Omicron variants demonstrate that GPBoost models are more predictive of the impact of spike protein mutations on patient outcomes than fixed-effect XGBoost, LightGBM, random forests, and elastic net logistic regression models.
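The temporal split mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the records are made-up toy values, and the field names (`collected`, `country`, `severe`) and the helper `temporal_split` are hypothetical rather than GISAID's actual schema. The sketch only shows the idea of training on earlier samples, validating on later ones, and noting which countries appear only after the cutoff.

```python
from datetime import date

# Toy, made-up records for illustration only; field names are hypothetical.
records = [
    {"collected": date(2021, 3, 1), "country": "A", "severe": 1},
    {"collected": date(2021, 6, 15), "country": "B", "severe": 0},
    {"collected": date(2021, 11, 30), "country": "A", "severe": 0},
    {"collected": date(2022, 1, 10), "country": "C", "severe": 0},
]


def temporal_split(rows, cutoff):
    """Return (train, test): samples collected before `cutoff`, and the rest."""
    train = [r for r in rows if r["collected"] < cutoff]
    test = [r for r in rows if r["collected"] >= cutoff]
    return train, test


train, test = temporal_split(records, date(2021, 11, 1))

# Countries seen only after the cutoff have no training samples at all.
unseen = {r["country"] for r in test} - {r["country"] for r in train}
print(sorted(unseen))
```

Splitting by date rather than at random keeps later samples, such as newly emerging variants, out of training. Countries that first appear after the cutoff are one case that a model with a country grouping has to handle at prediction time.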