Sma3s: a three-step modular annotator for large sequence datasets.

Author information

Muñoz-Mérida Antonio, Viguera Enrique, Claros M Gonzalo, Trelles Oswaldo, Pérez-Pulido Antonio J

Affiliations

Integrated Bioinformatics, National Institute for Bioinformatics, University of Málaga, Campus de Teatinos, Spain.

Cellular Biology, Genetics and Physiology Department, University of Málaga, Campus de Teatinos, Spain.

Publication information

DNA Res. 2014 Aug;21(4):341-53. doi: 10.1093/dnares/dsu001. Epub 2014 Feb 5.

Abstract

Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes.
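The progressive use of the three modules can be pictured as a fall-through cascade. The sketch below is a minimal illustration and not the Sma3s implementation: it assumes that a query moves on to the next module only when the previous one yields no annotation, and the function names, lookup tables and sequence identifiers are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Annotation = List[str]
Module = Callable[[str], Optional[Annotation]]


def annotate_progressively(
    query: str, modules: Sequence[Tuple[str, Module]]
) -> Tuple[Optional[str], Annotation]:
    """Return (module name, annotations) from the first module that gives a result."""
    for name, module in modules:
        result = module(query)
        if result:  # stop at the first module that annotates the query
            return name, result
    return None, []  # no module could annotate this query


# Hypothetical stand-ins for the three modules named in the abstract.
# Each one is a toy lookup; a real annotator would search a reference database.
def very_similar_homologues(query: str) -> Optional[Annotation]:
    return {"seqA": ["kinase"]}.get(query)


def orthologous_sequences(query: str) -> Optional[Annotation]:
    return {"seqB": ["transporter"]}.get(query)


def enriched_terms(query: str) -> Optional[Annotation]:
    return {"seqC": ["membrane"]}.get(query)


PIPELINE = [
    ("very similar homologues", very_similar_homologues),
    ("orthologous sequences", orthologous_sequences),
    ("enriched terms", enriched_terms),
]

if __name__ == "__main__":
    for seq_id in ["seqA", "seqB", "seqC", "seqD"]:
        print(seq_id, annotate_progressively(seq_id, PIPELINE))
```

Ordering the modules from the strictest evidence to the loosest means that, under this assumption, each query keeps the annotation backed by the strongest available evidence.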
