Fondrat C, Dessen P
CIT12 (Centre Interuniversitaire de Traitement de l'Information), Universite Paris, France.
Comput Appl Biosci. 1995 Jun;11(3):273-9. doi: 10.1093/bioinformatics/11.3.273.
We present a codification structure, fully interfaced with the main packages for biomolecular database management, together with a new search algorithm for quickly retrieving a sequence from a database. The system is derived from a previously proposed method for homology searches in databanks, in which the entire database is preprocessed: every overlapping subsequence of a given length in each sequence is converted into a code and stored in a hash-coded file. The new algorithm is designed to make better use of this codification. It relies on identifying the rarest strings that characterize the query sequence and on intersecting the sorted lists read from the codification structure. The system applies to both nucleic acid and protein sequences and is used to find patterns in databanks or in large sets of sequences. A few example applications are given. In addition, a comparison with existing methods shows that this approach speeds up the search for query patterns in large data sets.
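The general idea described in the abstract (index every overlapping fixed-length substring, pick the rarest substrings of the query, then intersect their sorted occurrence lists) can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the function names `build_index`, `intersect_sorted`, and `search`, the parameter `n_rare`, and the use of a Python dictionary in place of the hash-coded file are all assumptions made for illustration, and the final verification step is an addition needed because only a few substrings are intersected.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Map each overlapping k-length substring to a sorted list of (seq_id, position)."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    # Lists are already sorted because sequences and positions are scanned in order.
    return dict(index)


def intersect_sorted(a, b):
    """Intersect two sorted lists with a linear two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(query, sequences, index, k, n_rare=3):
    """Return sorted (seq_id, start) pairs where `query` occurs exactly.

    Hypothetical strategy: use only the n_rare rarest k-length substrings
    of the query to narrow the candidates, then verify each candidate.
    """
    if len(query) < k:
        raise ValueError("query must be at least k characters long")
    # Every overlapping substring of the query, with its offset in the query.
    parts = [(query[o:o + k], o) for o in range(len(query) - k + 1)]
    # Rarest substrings first: shortest occurrence lists are cheapest to intersect.
    parts.sort(key=lambda item: len(index.get(item[0], [])))
    candidates = None
    for part, offset in parts[:n_rare]:
        # Shift each occurrence back to the implied start of the whole query;
        # subtracting a constant keeps the list sorted.
        starts = [(sid, pos - offset) for sid, pos in index.get(part, [])]
        candidates = starts if candidates is None else intersect_sorted(candidates, starts)
        if not candidates:
            return []
    # Verify surviving candidates against the actual sequences.
    return [
        (sid, start)
        for sid, start in candidates
        if start >= 0 and sequences[sid][start:start + len(query)] == query
    ]
```

For example, with `sequences = ["ACGTACGTGA", "TTACGTGACC", "GGGGGG"]` and `k = 3`, calling `search("ACGTGA", sequences, build_index(sequences, 3), 3)` returns `[(0, 4), (1, 2)]`. Starting from the rarest substrings keeps the intermediate candidate lists short, which is the intuition behind the speed-up the abstract reports.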