Department of Computer Science, Rice University, Houston, TX, USA.
Signature Science, LLC, 8329 North Mopac Expressway, Austin, TX, USA.
Genome Biol. 2022 Jun 20;23(1):133. doi: 10.1186/s13059-022-02695-x.
The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need, we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show that our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, and is available for download at www.gitlab.com/treangenlab/seqscreen.