Systems, Synthetic, and Physical Biology (SSPB) Graduate Program, Rice University, Houston, Texas, 77005, USA.
Department of Computer Science, Rice University, Houston, Texas, 77005, USA.
Nat Commun. 2021 Feb 26;12(1):1167. doi: 10.1038/s41467-021-21180-w.
With advances in synthetic biology and genome engineering comes a heightened awareness of potential misuse related to biosafety concerns. A recent study employed machine learning to identify the lab-of-origin of DNA sequences to help mitigate some of these concerns. Despite its promising results, this deep-learning-based approach had limited accuracy, was computationally expensive to train, and could not provide the precise features that were used in its predictions. To address these shortcomings, we developed PlasmidHawk for lab-of-origin prediction. PlasmidHawk has higher prediction accuracy than a machine learning approach: it correctly predicts the depositing lab of an unknown sequence 76% of the time, and the correct lab is among its top 10 candidates 85% of the time. In addition, PlasmidHawk can precisely single out the signature sub-sequences that are responsible for the lab-of-origin detection. In summary, PlasmidHawk represents an explainable and accurate tool for lab-of-origin prediction of synthetic plasmid sequences. PlasmidHawk is available at https://gitlab.com/treangenlab/plasmidhawk.git .
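The two accuracy figures in the abstract are top-1 and top-10 accuracy: the fraction of query sequences whose true depositing lab is the first-ranked candidate, or appears anywhere among the first 10 ranked candidates. The sketch below illustrates how such a metric is typically computed. It is not PlasmidHawk code; the function name `top_k_accuracy`, the lab names, and the toy rankings are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    true_labs: Sequence[str],
    k: int,
) -> float:
    """Return the fraction of sequences whose true lab is among the first k candidates.

    ranked_predictions[i] is a list of candidate labs for sequence i,
    ordered from most to least likely (hypothetical input format).
    """
    if len(ranked_predictions) != len(true_labs):
        raise ValueError("need exactly one ranked candidate list per true label")
    if not true_labs:
        raise ValueError("cannot compute accuracy on an empty test set")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labs)
        if truth in ranked[:k]
    )
    return hits / len(true_labs)


# Hypothetical toy data: four query plasmids, three candidate labs each.
ranked = [
    ["lab_A", "lab_B", "lab_C"],
    ["lab_B", "lab_A", "lab_C"],
    ["lab_C", "lab_A", "lab_B"],
    ["lab_A", "lab_C", "lab_B"],
]
truth = ["lab_A", "lab_A", "lab_B", "lab_B"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
print(f"top-1 accuracy: {top1:.2f}")
print(f"top-2 accuracy: {top2:.2f}")
```

Top-k accuracy can never decrease as k grows, which is why the top-10 figure (85%) is higher than the top-1 figure (76%).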