WGSSAT: A High-Throughput Computational Pipeline for Mining and Annotation of SSR Markers From Whole Genomes.

Author information

Division of Molecular Biology and Biotechnology, ICAR-National Bureau of Fish Genetic Resources, Lucknow, India.

AMITY Institute of Biotechnology, AMITY University, Uttar Pradesh, Lucknow Campus, Lucknow, India.

Publication information

J Hered. 2018 Mar 16;109(3):339-343. doi: 10.1093/jhered/esx075.

DOI:10.1093/jhered/esx075

PMID:28992259

Abstract

Mining and characterization of simple sequence repeat (SSR) markers from whole genomes provides valuable information about the biological significance of SSR distribution and facilitates the development of markers for genetic analysis. The Whole Genome Sequencing (WGS)-SSR Annotation Tool (WGSSAT) is a graphical user interface pipeline, developed using Java NetBeans and Perl scripts, that simplifies SSR mining and characterization. WGSSAT takes input in FASTA format and automates the prediction of genes, noncoding RNA (ncRNA), core genes, repeats, and SSRs from whole genomes. It then maps the predicted SSRs onto the genome (classified by gene, ncRNA, repeat, exonic, intronic, and core gene region), identifies primers, and mines cross-species markers. The program also generates a detailed statistical report and visualizes the mapped SSRs, genes, core genes, and RNAs. The features of WGSSAT were demonstrated using Takifugu rubripes data. This yielded a total of 139 057 SSRs, of which 113 703 SSR primer pairs were uniquely amplified in silico on the T. rubripes (fugu) genome. Of the 113 703 mined SSRs, 81 463 were from coding regions (4286 exonic and 77 177 intronic), 7 from RNA, and 267 from core genes of fugu, whereas 105 641 SSRs and 601 SSR primer pairs were uniquely mapped onto the medaka genome. WGSSAT has been tested under Ubuntu Linux. The source code, documentation, user manual, example dataset, and scripts are available online at https://sourceforge.net/projects/wgssat-nbfgr.
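To make the SSR mining step more concrete, the following is a minimal Python sketch of how perfect SSRs might be located in FASTA input using regular expressions. It is an illustration only and not WGSSAT's own implementation: the names `read_fasta`, `find_ssrs`, and `SSR`, and the minimum-repeat thresholds in `MIN_REPEATS`, are assumptions chosen for this example and do not come from the abstract.

```python
import re
from typing import Iterator, List, NamedTuple, Tuple


class SSR(NamedTuple):
    seq_id: str
    motif: str
    repeats: int
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Hypothetical thresholds (motif length -> minimum number of repeats).
# These are illustrative defaults, not values taken from WGSSAT.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (sequence id, uppercase sequence) pairs from FASTA-formatted text."""
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            seq_id, chunks = (fields[0] if fields else ""), []
        elif seq_id is not None:
            chunks.append(line.upper())
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def find_ssrs(seq_id: str, seq: str) -> List[SSR]:
    """Return perfect mono- to hexanucleotide SSRs, sorted by start position."""
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-mer followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # since those runs are already reported under the shorter motif.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append(SSR(seq_id, motif, len(m.group(0)) // k, m.start() + 1, m.end()))
    return sorted(hits, key=lambda s: s.start)


def mine_fasta(text: str) -> List[SSR]:
    """Run find_ssrs over every record in a FASTA-formatted string."""
    results = []
    for seq_id, seq in read_fasta(text):
        results.extend(find_ssrs(seq_id, seq))
    return results
```

For example, calling `mine_fasta` on a record whose sequence is `GGC`, then seven copies of `AT`, then `CCG`, then twelve `A`s, then `T` reports two SSRs: an `AT` repeat (7 copies, positions 4 to 17) and an `A` repeat (12 copies, positions 21 to 32). A full pipeline of the kind described above would then intersect such coordinates with gene, exon, intron, and ncRNA annotations and design primers around each hit; those steps are outside the scope of this sketch.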
