Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals.

Authors

Reneker Jeff, Shyu Chi-Ren

Affiliation

Department of Computer Science, University of Missouri, Columbia, USA.

Publication

BMC Bioinformatics. 2005 May 3;6:111. doi: 10.1186/1471-2105-6-111.

DOI: 10.1186/1471-2105-6-111
PMID: 15869708
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1131890/
Abstract

BACKGROUND

Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1). SCA1 is a hereditary neurodegenerative disease caused by an expanded CAG repeat in the coding sequence of the gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 has identified 46 putative phase-variable genes among the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches -- most of which are irrelevant to the researcher.

RESULTS

We present our hash function as well as our search algorithm to locate small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for eight bases of sequence within 3.224 GBases on 49 different chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. Also, a cross-species search is presented to illustrate how potential hidden correlations in genomic data can be quickly discerned. The findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes -- without the main memory constraints present in other systems.
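As a rough illustration of hash-based lookup of short DNA sequences, the sketch below packs each k-mer into an integer using two bits per base and records where every k-mer starts. The abstract does not specify the authors' hash function or index layout, so the encoding, the function names (`kmer_hash`, `build_index`, `find_sites`), and the toy sequence are all assumptions made for this example.

```python
from collections import defaultdict

# Two bits per base. This is a common illustrative encoding and is an
# assumption here; it is not taken from the paper.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer):
    """Pack a k-mer over A/C/G/T into a single integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def build_index(sequence, k):
    """Map each k-mer hash to the 0-based positions where that k-mer starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        # Skip windows containing symbols such as N that have no 2-bit code.
        if all(base in BASE_CODE for base in window):
            index[kmer_hash(window)].append(start)
    return index


def find_sites(index, query):
    """Return all start positions of the query; the query length must equal k."""
    return index.get(kmer_hash(query), [])


# Made-up toy sequence containing both Pho4p core sites named in the abstract.
promoter = "TTACACGTGAACACGTTGG"
index = build_index(promoter, 6)
print(find_sites(index, "CACGTG"))
print(find_sites(index, "CACGTT"))
```

Because every distinct k-mer maps to a distinct integer, a lookup is a single dictionary access regardless of sequence length. An in-memory dictionary like this one would not scale to several gigabases, which is the main memory constraint the abstract says the authors' system avoids.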

CONCLUSION

We present a time-efficient algorithm to locate small segments of DNA and concurrently to search the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.
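The refinement step described above, narrowing sequence hits by searching their annotation text, can be sketched as a simple keyword filter. The `Hit` record, the `refine` function, and the sample hits are hypothetical placeholders; the abstract does not describe the authors' data model or their information retrieval scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One sequence match plus the annotation of the nearest gene (hypothetical)."""
    chromosome: str
    position: int
    gene: str
    annotation: str


def refine(hits, terms):
    """Keep hits whose annotation contains any search term, ignoring case."""
    wanted = [term.lower() for term in terms]
    return [
        hit for hit in hits
        if any(term in hit.annotation.lower() for term in wanted)
    ]


# Made-up hits for illustration only; these are not results from the paper.
hits = [
    Hit("chrA", 1200, "geneA", "phosphate transport"),
    Hit("chrA", 88000, "geneB", "ribosomal protein"),
    Hit("chrB", 450, "geneC", "polyphosphate metabolism"),
]
refined = refine(hits, ["phosphate"])
print([hit.gene for hit in refined])
```

Plain substring matching is the simplest possible choice; a production system would more likely rank hits with a term-weighting scheme or match against controlled vocabulary such as GO terms.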

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/47de/1131890/33d4508bb82f/1471-2105-6-111-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/47de/1131890/b3543df95d8f/1471-2105-6-111-1.jpg