BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.

Authors

Chen Jinxiang, Li Fuyi, Wang Miao, Li Junlong, Marquez-Lago Tatiana T, Leier André, Revote Jerico, Li Shuqin, Liu Quanzhong, Song Jiangning

Affiliations

Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China.

Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia.

Publication details

Front Big Data. 2022 Jan 18;4:727216. doi: 10.3389/fdata.2021.727216. eCollection 2021.

DOI:10.3389/fdata.2021.727216
PMID:35118375
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8805145/
Abstract

BACKGROUND

Simple sequence repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are therefore of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, sequenced genomes often miss several highly repetitive regions, and many non-model species have no complete genome at all. With recent advances in next-generation sequencing (NGS), large-scale sequence reads can be generated rapidly for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci directly from large collections of reads for non-model species. Because the most commonly used NGS platforms (e.g., Illumina) generally produce short paired-end reads, merging overlapping paired-end reads has become a common step before SSR loci are identified. Merging short read pairs and identifying SSRs from large-scale data therefore poses a big data analysis challenge for traditional stand-alone tools.
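
To make the read-merging step concrete, the following is a minimal Python sketch of joining one overlapping read pair. It is not the algorithm used by FLASH or BigFLASH: it accepts only exact overlaps, and the function names and the `min_overlap` default are assumptions chosen for illustration.

```python
from typing import Optional

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward: str, reverse: str, min_overlap: int = 10) -> Optional[str]:
    """Merge a read pair on the longest exact suffix/prefix overlap.

    `reverse` is given as sequenced (from the opposite strand), so it is
    reverse-complemented before comparison. Returns None when no exact
    overlap of at least `min_overlap` bases exists (an assumed threshold).
    """
    mate = reverse_complement(reverse)
    longest = min(len(forward), len(mate))
    # Try the longest candidate overlap first, then shrink.
    for size in range(longest, min_overlap - 1, -1):
        if forward[-size:] == mate[:size]:
            return forward + mate[size:]
    return None


if __name__ == "__main__":
    fragment = "ACGTACGGATTACAGATTACAGGCTTAACCGGTT"
    fwd = fragment[:22]                       # read from the 5' end
    rev = reverse_complement(fragment[12:])   # read from the opposite strand
    print(merge_pair(fwd, rev) == fragment)
```

Real read pairs contain sequencing errors, so a practical merger has to score imperfect overlaps rather than require exact matches; the sketch only shows why a merged read can span a repeat that neither mate covers alone.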

RESULTS

In this study, we present BigFiRSt, a new Hadoop-based software program that addresses this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, built on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH merges short read pairs and BigPERF mines SSRs, each in a parallel, distributed manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution time of read-pair merging and SSR mining on very large-scale DNA sequence data.
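
As an illustration of what mining SSRs from a sequence involves, here is a minimal Python sketch that reports perfect tandem repeats using regular expressions. It is not PERF's or BigPERF's implementation; the `Repeat` type, the function names, and the defaults (motifs of 1 to 6 bases, minimum repeat length of 12) are assumptions made for this example.

```python
import re
from typing import Iterator, NamedTuple


class Repeat(NamedTuple):
    start: int   # 0-based start offset
    end: int     # exclusive end offset
    motif: str   # repeating unit
    copies: int  # number of complete tandem copies


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    size = len(motif)
    return not any(
        size % k == 0 and motif == motif[:k] * (size // k)
        for k in range(1, size)
    )


def find_ssrs(seq: str, max_motif: int = 6, min_length: int = 12) -> Iterator[Repeat]:
    """Yield perfect tandem repeats, one pass per motif length."""
    seq = seq.upper()
    for k in range(1, max_motif + 1):
        min_copies = max(2, -(-min_length // k))  # ceiling division
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. "ATAT", which the k=2 pass already reports as "AT".
            if is_primitive(motif):
                span = match.end() - match.start()
                yield Repeat(match.start(), match.end(), motif, span // k)


if __name__ == "__main__":
    read = "GGC" + "AT" * 8 + "TTGACC" + "A" * 13 + "CG"
    for repeat in sorted(find_ssrs(read)):
        print(repeat)
```

The sketch reports only complete copies of a motif and does not treat rotations such as `AT` and `TA` as the same repeat class, both of which a production tool would need to handle.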

CONCLUSIONS

The excellent performance of BigFiRSt is mainly attributable to Hadoop, which allows read pairs to be merged and SSRs to be mined in parallel across a distributed cluster. We anticipate that BigFiRSt will be a valuable tool in the coming era of biological big data.
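
The map-and-reduce structure behind this kind of parallelism can be sketched in a few lines of Python. This example does not use Hadoop and is not BigFiRSt's code: a local thread pool stands in for cluster nodes purely to show the shape of the computation, and the mapper rule (dinucleotide motifs repeated at least six times) and all names are assumptions.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Iterator, List

# Assumed mapper rule: a dinucleotide motif repeated at least six times.
_DINUCLEOTIDE = re.compile(r"([ACGT]{2})\1{5,}")


def map_chunk(reads: List[str]) -> Counter:
    """Map step: count repeat loci per motif within one chunk of reads."""
    counts: Counter = Counter()
    for read in reads:
        for match in _DINUCLEOTIDE.finditer(read.upper()):
            motif = match.group(1)
            if motif[0] != motif[1]:  # skip homopolymer runs such as "AA"
                counts[motif] += 1
    return counts


def chunked(reads: List[str], size: int) -> Iterator[List[str]]:
    """Split the read list into independent chunks of at most `size` reads."""
    for i in range(0, len(reads), size):
        yield reads[i:i + size]


def count_motifs(reads: List[str], chunk_size: int = 2, workers: int = 2) -> Counter:
    """Run the map step on each chunk, then reduce by summing the counters."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(map_chunk, chunked(reads, chunk_size))
        return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    reads = ["GG" + "AT" * 6 + "C", "CA" * 7, "ACGTACGT", "CC" + "AT" * 8, "A" * 14]
    print(count_motifs(reads))
```

Because each chunk is processed without reference to any other, the map step can be spread over as many workers as are available, and only the small per-chunk counters need to be combined at the end.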

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2973/8805145/064b129f8b02/fdata-04-727216-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2973/8805145/3c0140fbdd1e/fdata-04-727216-g0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2973/8805145/5cba8bd7ab2f/fdata-04-727216-g0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2973/8805145/59527db67fdd/fdata-04-727216-g0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2973/8805145/6ad5a8a3b83e/fdata-04-727216-g0005.jpg
