Chen Lei, Li Zhandong, Zeng Tao, Zhang Yu-Hang, Feng KaiYan, Huang Tao, Cai Yu-Dong
School of Life Sciences, Shanghai University, Shanghai 200444, China.
College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
Biomed Res Int. 2021 Jul 6;2021:9939134. doi: 10.1155/2021/9939134. eCollection 2021.
COVID-19, a severe respiratory disease caused by the novel coronavirus SARS-CoV-2, has spread all over the world. Patients infected with SARS-CoV-2 may show no symptoms, as is the case for presymptomatic and asymptomatic patients. Both groups can still transmit the virus to other susceptible people, which makes COVID-19 difficult to control. The two major challenges for COVID-19 diagnosis at present are as follows: (1) patients may share symptoms with other respiratory infections, and (2) patients may have no symptoms at all but can still spread the virus. Therefore, new biomarkers at different omics levels are required for the large-scale screening and diagnosis of COVID-19. Although some initial analyses identified a group of candidate gene biomarkers for COVID-19, that previous work could not identify biomarkers suitable for clinical use, which requires a disease-specific diagnosis that separates COVID-19 from multiple other infectious diseases. As an extension of the previous study, the present study applied optimized machine learning models to identify specific qualitative host biomarkers associated with COVID-19 infection on the basis of a publicly released transcriptomic dataset, which included healthy controls and patients with bacterial infection, influenza, COVID-19, and other kinds of coronavirus. The dataset was first analysed with the Boruta and Max-Relevance and Min-Redundancy feature selection methods in sequence, resulting in a ranked feature list. This list was then fed into the incremental feature selection method, combined with a classification algorithm, to extract essential biomarkers and to build efficient classifiers and classification rules. The capacity of these findings to distinguish COVID-19 from other similar respiratory infectious diseases at the transcriptomic level was also validated, which may improve the efficacy and accuracy of COVID-19 diagnosis.
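The incremental feature selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifier or the performance measure, so a nearest-centroid classifier and plain cross-validated accuracy are used here as stand-ins, and the feature ranking (which in the study comes from Boruta followed by Max-Relevance and Min-Redundancy) is assumed to be given. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def nearest_centroid_predict(train_X, train_y, test_X):
    """Stand-in classifier (an assumption): pick the class with the closest mean vector."""
    rows_by_class = {}
    for row, label in zip(train_X, train_y):
        rows_by_class.setdefault(label, []).append(row)
    centroids = {
        label: [mean(col) for col in zip(*rows)]
        for label, rows in rows_by_class.items()
    }

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda c: dist2(row, centroids[c])) for row in test_X]


def cv_accuracy(X, y, n_folds=5, seed=0):
    """Accuracy over shuffled k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        preds = nearest_centroid_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in fold]
        )
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)


def incremental_feature_selection(X, y, ranked_features, step=1):
    """Evaluate the top-k ranked features for growing k and keep the best prefix."""
    best_k, best_acc, curve = 0, -1.0, []
    for k in range(step, len(ranked_features) + 1, step):
        cols = ranked_features[:k]
        X_k = [[row[c] for c in cols] for row in X]
        acc = cv_accuracy(X_k, y)
        curve.append((k, acc))
        if acc > best_acc:  # ties keep the smaller feature subset
            best_k, best_acc = k, acc
    return ranked_features[:best_k], best_acc, curve


# Toy data (hypothetical): feature 2 separates the classes, features 0 and 1 are noise.
rng = random.Random(0)
X, y = [], []
for c, label in enumerate(["healthy", "influenza", "COVID-19"]):
    for _ in range(10):
        X.append([rng.uniform(-1, 1), rng.uniform(-1, 1), 10 * c + rng.uniform(-1, 1)])
        y.append(label)

ranked = [2, 0, 1]  # assumed output of an upstream ranking step
selected, best_acc, curve = incremental_feature_selection(X, y, ranked)
print(selected, best_acc)
```

Because ties are resolved in favour of the smaller subset, the sketch returns the shortest prefix of the ranking that reaches the best score, which matches the goal of extracting a compact set of essential biomarkers.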