Asma Hasiba, Liu Luna, Halfon Marc S
Departments of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY, United States of America.
Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY, United States of America.
PLoS One. 2024 Dec 5;19(12):e0311752. doi: 10.1371/journal.pone.0311752. eCollection 2024.
As the number of sequenced insect genomes continues to grow, there is a pressing need for rapid and accurate annotation of their regulatory component. SCRMshaw is a computational tool designed to predict cis-regulatory modules ("enhancers") in the genomes of various insect species. A key advantage of SCRMshaw is its accessibility. It requires minimal resources: just a genome sequence and training data from known Drosophila regulatory sequences, which are readily available for download. Even users with modest computational skills can run SCRMshaw on a desktop computer for basic applications, although a high-performance computing cluster is recommended for optimal results. SCRMshaw can be tailored to specific needs: users can employ a single set of training data to predict enhancers associated with a particular gene expression pattern, or use multiple sets to provide a first-pass regulatory annotation for a newly sequenced genome. This protocol is an extensive update to the previously published SCRMshaw protocol and aligns with the methods used in a recent annotation of over 30 insect regulatory genomes. It includes the most recent modifications to the SCRMshaw protocol and details an end-to-end pipeline that begins with a sequenced genome and ends with a fully annotated regulatory genome. Relevant scripts are available via GitHub, and a living protocol that will be updated as necessary is linked to this article at protocols.io.