A protocol for species delineation of public DNA databases, applied to the Insecta.

Author information

Key Laboratory of Zoological Systematics and Evolution (CAS), Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, PR China.

Publication information

Syst Biol. 2014 Sep;63(5):712-25. doi: 10.1093/sysbio/syu038. Epub 2014 Jun 14.

Abstract

Public DNA databases are composed of data from many different taxa, although the taxonomic annotation on sequences is not always complete, which impedes the utilization of mined data for species-level applications. There is much ongoing work on species identification and delineation based on the molecular data itself, although applying species clustering to whole databases requires consolidation of results from numerous undefined gene regions, and introduces significant obstacles in data organization and computational load. In the current paper, we demonstrate an approach for species delineation of a sequence database. All DNA sequences for the insects were obtained and processed. After filtration of duplicated data, delineation of the database into species or molecular operational taxonomic units (MOTUs) followed a three-step process in which (i) the genetic loci L are partitioned, (ii) the species S are delineated within each locus, then (iii) species units are matched across loci to form the matrix L × S, a set of global (multilocus) species units. Partitioning the database into a set of homologous gene fragments was achieved by Markov clustering using edge weights calculated from the amount of overlap between pairs of sequences, then delineation of species units and assignment of species names were performed for the set of genes necessary to capture most of the species diversity. The complexity of computing pairwise similarities for species clustering was substantial at the cytochrome oxidase subunit I locus in particular, but made feasible through the development of software that performs pairwise alignments within the taxonomic framework, while accounting for the different ranks at which sequences are labeled with taxonomic information. 
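
Step (i), partitioning sequences into homologous loci by Markov clustering on overlap-based edge weights, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the function names `overlap_weight` and `markov_cluster`, the weight definition (overlap length divided by the shorter sequence length), the inflation value, and the toy sequence identifiers. The abstract does not specify these details, and the authors' software is not reproduced.

```python
def overlap_weight(overlap_len, len_a, len_b):
    """Hypothetical edge weight: overlap as a fraction of the shorter sequence."""
    return overlap_len / min(len_a, len_b)


def normalize_columns(m):
    """Rescale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def markov_cluster(names, weights, inflation=2.0, iterations=20, eps=1e-6):
    """Minimal Markov clustering (MCL) on an undirected weighted graph.

    names:   list of sequence identifiers
    weights: dict mapping (name_a, name_b) to an edge weight
    Runs a fixed number of iterations; there is no convergence check.
    """
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    # Start from the identity so every node has a self-loop.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (a, b), w in weights.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = w
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        sq = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        m = normalize_columns([[v ** inflation for v in row] for row in sq])
    # Each row with non-negligible entries lists the members of one cluster.
    clusters = {frozenset(names[j] for j in range(n) if m[i][j] > eps)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Toy data: three mutually overlapping sequences and a separate overlapping pair.
lengths = {"s1": 600, "s2": 600, "s3": 600, "s4": 500, "s5": 500}
overlaps = {("s1", "s2"): 540, ("s1", "s3"): 540, ("s2", "s3"): 540,
            ("s4", "s5"): 400}
edge_weights = {pair: overlap_weight(o, lengths[pair[0]], lengths[pair[1]])
                for pair, o in overlaps.items()}
loci = markov_cluster(list(lengths), edge_weights)
print(loci)
```

In this toy input the two groups share no edges, so they separate trivially. On real data the inflation parameter controls how finely weakly connected groups are split, and it would need tuning.
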
Over 24 different homologs, the unidentified sequences numbered approximately 194,000, containing 41,525 species IDs (98.7% of all found in the insect database), and were grouped into 59,173 single-locus MOTUs by hierarchical clustering under parameters optimized independently for each locus. Species units from different loci were matched using a multipartite matching algorithm to form multilocus species units with minimal incongruence between loci. After matching, the insect database as represented by these 24 loci was found to be composed of 78,091 species units in total. 38,574 of these units contained only species labeled data, 34,891 contained only unlabeled data, leaving 4,626 units composed both of labeled and unlabeled sequences. In addition to giving estimates of species diversity of sequence repositories, the protocol developed here will facilitate species-level applications of modern-day sequence data sets. In particular, the L × S matrix represents a post-taxonomic framework that can be used for species-level organization of metagenomic data, and incorporation of these methods into phylogenetic pipelines will yield matrices more representative of species diversity.
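
Steps (ii) and (iii) can be sketched in the same spirit. The example below cuts a single-linkage hierarchy at a per-locus distance threshold to form MOTUs, then links MOTUs across loci to build a small L × S structure. Several parts are assumptions made only for illustration: the choice of single linkage, the threshold values, the `specimen_of` mapping used to connect sequences from different loci, and all function names. In particular, linking through shared specimens with a union-find is a simplified stand-in for the multipartite matching algorithm named in the abstract. It does not minimize incongruence between loci.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def single_linkage_motus(seq_ids, distances, threshold):
    """Group sequences whose pairwise distance is at or below `threshold`.

    Equivalent to cutting a single-linkage hierarchy at that height.
    """
    ds = DisjointSet()
    for s in seq_ids:
        ds.find(s)
    for (a, b), d in distances.items():
        if d <= threshold:
            ds.union(a, b)
    groups = defaultdict(list)
    for s in seq_ids:
        groups[ds.find(s)].append(s)
    return sorted(sorted(g) for g in groups.values())


def match_across_loci(motus_by_locus, specimen_of):
    """Link MOTUs from different loci that share a specimen (simplified).

    Returns one dict per multilocus species unit: locus -> list of MOTU indices.
    A unit holding more than one MOTU for a locus signals incongruence.
    """
    ds = DisjointSet()
    first_node = {}  # specimen -> first (locus, motu_index) seen with it
    for locus, motus in motus_by_locus.items():
        for idx, members in enumerate(motus):
            node = (locus, idx)
            ds.find(node)
            for seq in members:
                specimen = specimen_of.get(seq)
                if specimen is None:
                    continue
                if specimen in first_node:
                    ds.union(node, first_node[specimen])
                else:
                    first_node[specimen] = node
    units = defaultdict(lambda: defaultdict(list))
    for locus, motus in motus_by_locus.items():
        for idx in range(len(motus)):
            units[ds.find((locus, idx))][locus].append(idx)
    return [dict(u) for u in units.values()]


# Toy data; the thresholds are hypothetical and set independently per locus.
coi = single_linkage_motus(
    ["c1", "c2", "c3"],
    {("c1", "c2"): 0.01, ("c1", "c3"): 0.12, ("c2", "c3"): 0.11},
    threshold=0.03,
)
rrna = single_linkage_motus(["r1", "r2"], {("r1", "r2"): 0.05}, threshold=0.01)
specimen_of = {"c1": "spA", "r1": "spA", "c3": "spB", "r2": "spB", "c2": "spC"}
species_units = match_across_loci({"COI": coi, "28S": rrna}, specimen_of)
print(species_units)
```

Each returned dict corresponds to one column of the L × S matrix, with loci as rows and MOTU indices as cell contents.
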
