Similarity thresholds used in DNA sequence assembly from short reads can reduce the comparability of population histories across species.

Author information

Harvey Michael G, Judy Caroline Duffie, Seeholzer Glenn F, Maley James M, Graves Gary R, Brumfield Robb T

Affiliations

Museum of Natural Science, Louisiana State University, Baton Rouge, LA, USA; Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA.

Museum of Natural Science, Louisiana State University, Baton Rouge, LA, USA; Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA; Department of Vertebrate Zoology, MRC-116, National Museum of Natural History, Smithsonian Institution, Washington, D.C., USA.

Publication information

PeerJ. 2015 Apr 21;3:e895. doi: 10.7717/peerj.895. eCollection 2015.

DOI:10.7717/peerj.895

PMID:25922792

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4411482/

Abstract

Comparing inferences among datasets generated using short read sequencing may provide insight into the concerted impacts of divergence, gene flow and selection across organisms, but comparisons are complicated by biases introduced during dataset assembly. Sequence similarity thresholds allow the de novo assembly of short reads into clusters of alleles representing different loci, but the resulting datasets are sensitive to both the similarity threshold used and to the variation naturally present in the organism under study. Thresholds that require high sequence similarity among reads for assembly (stringent thresholds) as well as highly variable species may result in datasets in which divergent alleles are lost or divided into separate loci ('over-splitting'), whereas liberal thresholds increase the risk of paralogous loci being combined into a single locus ('under-splitting'). Comparisons among datasets or species are therefore potentially biased if different similarity thresholds are applied or if the species differ in levels of within-lineage genetic variation. We examine the impact of a range of similarity thresholds on assembly of empirical short read datasets from populations of four different non-model bird lineages (species or species pairs) with different levels of genetic divergence. We find that, in all species, stringent similarity thresholds result in fewer alleles per locus than more liberal thresholds, which appears to be the result of high levels of over-splitting. The frequency of putative under-splitting, conversely, is low at all thresholds. Inferred genetic distances between individuals, gene tree depths, and estimates of the ancestral mutation-scaled effective population size (θ) differ depending upon the similarity threshold applied. Relative differences in inferences across species differ even when the same threshold is applied, but may be dramatically different when datasets assembled under different thresholds are compared. 
These differences not only complicate comparisons across species, but also preclude the application of standard mutation rates for parameter calibration. We suggest some best practices for assembling short read data to maximize comparability, such as using more liberal thresholds and examining the impact of different thresholds on each dataset.
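
The over-splitting and under-splitting described in the abstract can be illustrated with a toy example. The sketch below is not the authors' assembly pipeline: it assumes equal-length reads with no indels, measures similarity as simple per-site identity, and uses a greedy seed-based clustering rule. The function names (`identity`, `cluster_reads`) and the three 20 bp sequences are hypothetical and were chosen only to make the effect visible.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_reads(reads: list[str], threshold: float) -> list[list[str]]:
    """Greedy clustering: each read joins the first cluster whose seed
    (its first member) it matches at >= threshold identity; otherwise
    it starts a new cluster (a new putative locus)."""
    clusters: list[list[str]] = []
    for read in reads:
        for cluster in clusters:
            if identity(read, cluster[0]) >= threshold:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


# Hypothetical toy data (assumption, not from the paper):
ALLELE_A1 = "ACGTACGTACGTACGTACGT"  # locus A, allele 1
ALLELE_A2 = "TCGTACGTACCTACGTACGT"  # locus A, allele 2: 2 mismatches (90% identity to A1)
PARALOG_B = "AAGTAAGTAAGTAAGTAAGT"  # different locus: 5 mismatches (75% identity to A1)
READS = [ALLELE_A1, ALLELE_A2, PARALOG_B]

for threshold in (0.95, 0.85, 0.70):
    n_loci = len(cluster_reads(READS, threshold))
    print(f"threshold={threshold:.2f} -> {n_loci} putative loci")
```

With these toy sequences, the stringent threshold (0.95) places the two alleles of locus A in separate clusters (over-splitting), the intermediate threshold (0.85) recovers the two true loci, and the liberal threshold (0.70) merges the paralog into locus A (under-splitting). Real assemblers also handle indels, read depth, and sequencing error, so this only shows the direction of the bias.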

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e756/4411482/12085237d2ab/peerj-03-895-g001.jpg

Similar articles

Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation.

Microbiome. 2014 Oct 28;2:39. doi: 10.1186/2049-2618-2-39. eCollection 2014.

An empirical examination of sample size effects on population demographic estimates in birds using single nucleotide polymorphism (SNP) data.

PeerJ. 2020 Sep 16;8:e9939. doi: 10.7717/peerj.9939. eCollection 2020.

Fragmentation and Coverage Variation in Viral Metagenome Assemblies, and Their Effect in Diversity Calculations.

Front Bioeng Biotechnol. 2015 Sep 17;3:141. doi: 10.3389/fbioe.2015.00141. eCollection 2015.

Challenges and advances for transcriptome assembly in non-model species.

PLoS One. 2017 Sep 20;12(9):e0185020. doi: 10.1371/journal.pone.0185020. eCollection 2017.

Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies.

Evolution. 2000 Dec;54(6):1839-54. doi: 10.1111/j.0014-3820.2000.tb01231.x.

Phylogenomic inferences from reference-mapped and de novo assembled short-read sequence data using RADseq sequencing of California white oaks (Quercus section Quercus).

Genome. 2017 Sep;60(9):743-755. doi: 10.1139/gen-2016-0202. Epub 2017 Mar 29.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

SMRT sequencing only de novo assembly of the sugar beet (Beta vulgaris) chloroplast genome.

BMC Bioinformatics. 2015 Sep 16;16(1):295. doi: 10.1186/s12859-015-0726-6.

Defining loci in restriction-based reduced representation genomic data from nonmodel species: sources of bias and diagnostics for optimal clustering.

Biomed Res Int. 2014;2014:675158. doi: 10.1155/2014/675158. Epub 2014 Jun 25.

Cited by

Speciation in the Peninsular Indian Flying Lizard (Draco dussumieri) Follows Climatic Transition and Not Physical Barriers.

Mol Ecol. 2025 Jun;34(12):e17800. doi: 10.1111/mec.17800. Epub 2025 May 20.

Widespread Deviant Patterns of Heterozygosity in Whole-Genome Sequencing Due to Autopolyploidy, Repeated Elements, and Duplication.

Genome Biol Evol. 2023 Dec 1;15(12). doi: 10.1093/gbe/evad229.

A New Assessment of Robust Capuchin Monkey () Evolutionary History Using Genome-Wide SNP Marker Data and a Bayesian Approach to Species Delimitation.

Genes (Basel). 2023 Apr 25;14(5):970. doi: 10.3390/genes14050970.

2b or not 2b? 2bRAD is an effective alternative to ddRAD for phylogenomics.

Ecol Evol. 2023 Mar 8;13(3):e9842. doi: 10.1002/ece3.9842. eCollection 2023 Mar.

Long-read genotyping with SLANG (Simple Long-read loci Assembly of Nanopore data for Genotyping).

Appl Plant Sci. 2022 Jun 14;10(3):e11484. doi: 10.1002/aps3.11484. eCollection 2022 May-Jun.

Population genomics for symbiotic anthozoans: can reduced representation approaches be used for taxa without reference genomes?

Heredity (Edinb). 2022 May;128(5):338-351. doi: 10.1038/s41437-022-00531-3. Epub 2022 Apr 13.

Genetic Differentiation and Demographic Trajectory of the Insular Formosan and Orii's Flying Foxes.

J Hered. 2021 Mar 29;112(2):192-203. doi: 10.1093/jhered/esab007.

Double-digest RAD-sequencing: do pre- and post-sequencing protocol parameters impact biological results?

Mol Genet Genomics. 2021 Mar;296(2):457-471. doi: 10.1007/s00438-020-01756-9. Epub 2021 Jan 20.

Parallel ddRAD and Genome Skimming Analyses Reveal a Radiative and Reticulate Evolutionary History of the Temperate Bamboos.

Syst Biol. 2021 Jun 16;70(4):756-773. doi: 10.1093/sysbio/syaa076.

Opening the door to greater phylogeographic inference in Southeast Asia: Comparative genomic study of five codistributed rainforest bird species using target capture and historical DNA.

Ecol Evol. 2020 Mar 6;10(7):3222-3247. doi: 10.1002/ece3.5964. eCollection 2020 Apr.
