O'Connor Karen, Weissenbacher Davy, Elyaderani Amir, Lautenbach Ebbing, Scotch Matthew, Gonzalez-Hernandez Graciela
Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States.
JMIR Res Protoc. 2025 Apr 22;14:e58567. doi: 10.2196/58567.
There has been an unprecedented effort to sequence the SARS-CoV-2 virus and examine its molecular evolution. This effort has been facilitated by publicly accessible databases, such as GISAID (Global Initiative on Sharing All Influenza Data) and GenBank, which collectively hold millions of SARS-CoV-2 sequence records. Genomic epidemiology, however, seeks to go beyond phylogenetic analysis (the study of evolutionary relationships among biological entities) by linking genetic information to patient characteristics and disease outcomes, enabling a comprehensive understanding of transmission dynamics and disease impact. While these repositories include fields for patient-related metadata for a given sequence, demographic and clinical details are rarely provided. The extent and quality of patient-related metadata in published sequencing studies remain unexplored.
Our review aims to quantitatively assess the extent and quality of patient-related metadata in papers reporting original whole-genome sequencing of the SARS-CoV-2 virus and to analyze publication patterns using bibliometric analysis. We will also evaluate the efficacy and reliability of a machine learning classifier in identifying relevant papers for inclusion in the scoping review.
The National Institutes of Health's LitCovid collection will be used for the automated classification of papers that report depositing SARS-CoV-2 sequences in public repositories, while an independent search will be conducted in MEDLINE and PubMed Central for validation. Data extraction will be conducted using Covidence (Veritas Health Innovation Ltd). The extracted data will be synthesized and summarized to quantify the availability of patient metadata in the published literature on SARS-CoV-2 sequencing studies. For the bibliometric analysis, relevant data points, such as author affiliations, citation metrics, author keywords, and Medical Subject Headings terms, will be extracted.
This study is expected to be completed in early 2025. Our classification model has been developed, and we have classified publications in LitCovid published through February 2023. As of September 2024, papers published through August 2024 are being prepared for processing. Screening is underway for papers validated by the classifier. Direct literature searches and screening of the results began in October 2024. We will summarize and narratively describe our findings using tables, graphs, and charts where applicable.
This scoping review will report findings on the extent and types of patient-related metadata reported in genomic viral sequencing studies of SARS-CoV-2, identify gaps in the reporting of patient metadata, and make recommendations for improving the quality and consistency of reporting in this area. The bibliometric analysis will uncover trends and patterns in the reporting of patient-related metadata, including differences in reporting based on study types or geographic regions. The insights gained from this study may help improve the quality and consistency of reporting patient metadata, enhancing the utility of sequence metadata and facilitating future research on infectious diseases.
OSF Registries osf.io/wrh95; https://doi.org/10.17605/OSF.IO/WRH95.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/58567.