Improving lab-of-origin prediction of genetically engineered plasmids via deep metric learning.

Author information

Soares Igor M, Camargo Fernando H F, Marques Adriano, Crook Oliver M

Affiliations

Amalgam, Chicago, IL, USA.

Oxford Protein Informatics Group, University of Oxford, Oxford, UK.

Publication information

Nat Comput Sci. 2022 Apr;2(4):253-264. doi: 10.1038/s43588-022-00234-z. Epub 2022 Apr 28.

DOI:10.1038/s43588-022-00234-z

PMID:38177551

Abstract

Genome engineering is undergoing unprecedented development and is now becoming widely available. Genetic engineering attribution can make sequence-lab associations and assist forensic experts in ensuring responsible biotechnology innovation and reducing misuse of engineered DNA sequences. Here we propose a method based on metric learning to rank the most likely labs of origin while simultaneously generating embeddings for plasmid sequences and labs. These embeddings can be used to perform various downstream tasks, such as clustering DNA sequences and labs, as well as using them as features in machine learning models. Our approach employs a circular shift augmentation method and can correctly rank the lab of origin 90% of the time within its top-10 predictions. We also demonstrate that we can perform few-shot learning and obtain 76% top-10 accuracy using only 10% of the sequences. Finally, our approach can also extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
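As a rough illustration of two ideas named in the abstract, the sketch below shows one way a circular shift augmentation and a top-k accuracy metric could be written. It is not the authors' implementation: the function names, the plain-string sequence representation, and the per-sequence dictionary of lab scores are all assumptions made for this example. The reasoning behind the augmentation is that a plasmid is circular, so the position at which its sequence string starts is arbitrary, and any rotation describes the same molecule.

```python
import random


def circular_shift(seq: str, offset: int) -> str:
    """Rotate a circular sequence so that it starts at position `offset`."""
    if not seq:
        return seq
    offset %= len(seq)  # wrap offsets larger than the sequence length
    return seq[offset:] + seq[:offset]


def random_circular_shift(seq: str, rng: random.Random) -> str:
    """Hypothetical augmentation step: rotate by a uniformly random offset."""
    if not seq:
        return seq
    return circular_shift(seq, rng.randrange(len(seq)))


def top_k_accuracy(scores, true_labs, k=10):
    """Fraction of sequences whose true lab is among the k highest-scored labs.

    `scores` is a list of dicts mapping lab name to score (assumed format),
    and `true_labs` is the matching list of ground-truth lab names.
    """
    hits = 0
    for lab_scores, truth in zip(scores, true_labs):
        ranked = sorted(lab_scores, key=lab_scores.get, reverse=True)[:k]
        hits += truth in ranked
    return hits / len(true_labs)
```

Under these assumptions, a training loop would call `random_circular_shift` on each sequence before encoding it, and `top_k_accuracy(..., k=10)` would correspond to the top-10 figure quoted in the abstract.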
