Dang Zhenyu, Yang Jixuan, Wang Lin, Tao Qin, Zhang Fengjun, Zhang Yuxin, Luo Zewei
Laboratory of Population and Quantitative Genetics, Institute of Biostatistics, Fudan University, Shanghai 200433, China.
Qinghai Academy of Agricultural and Forestry Sciences, Xining 200433, China.
Plants (Basel). 2021 Feb 7;10(2):319. doi: 10.3390/plants10020319.
New sequencing technologies enable identification of genome-wide sequence-based variants at a population level and at a competitively low cost. Molecular markers based on sequence variants have motivated enormous interest in population and quantitative genetic analyses. Generating the sequence data involves a sophisticated experimental process that carries substantial non-biological variation. Statistically, the sequencing process amounts to sampling DNA fragments from an individual sequence. Sampling variation in sequence data generation is therefore a key statistical property, and adequate knowledge of it is needed for any downstream analysis of the data and for implementing statistically appropriate methods. This paper reports a thorough investigation into modeling the sampling variation of sequence data from optimized RAD-seq (restriction site-associated DNA sequencing) experiments with two parents and their offspring of diploid and autotetraploid potato (Solanum tuberosum L.). The analysis shows significant overdispersion in the sampling variation of the sequence data relative to that expected under the multinomial distribution, which is widely assumed in the literature, and provides statistical methods for modeling the variation and calculating the model parameters, which may be easily implemented for real sequence datasets. The optimized design of the RAD-seq experiments enabled effective control of the presence of undesirable chloroplast DNA and RNA genes in the sequence data generated.
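The notion of dispersion beyond the multinomial expectation can be illustrated with a small simulation. The sketch below is not the paper's method, which the abstract does not specify. It assumes the two-allele special case of the multinomial (a binomial with allele probability 0.5 at a heterozygous site) and uses a beta-binomial model as one common, assumed way to generate overdispersed counts. The function names, the read depth of 40, and the correlation parameter `rho` are all hypothetical choices made for illustration.

```python
import random


def simulate_ref_counts(num_sites, depth, p=0.5, rho=0.0, rng=None):
    """Simulate reference-allele read counts at `num_sites` loci.

    rho = 0 gives plain binomial sampling. rho > 0 first draws a
    per-locus allele probability from a Beta distribution with mean p
    (beta-binomial), which inflates the binomial variance by the
    factor 1 + (depth - 1) * rho.
    """
    rng = rng or random.Random()
    counts = []
    for _ in range(num_sites):
        if rho > 0:
            # Beta(a, b) with a + b = (1 - rho) / rho, so that the
            # intra-class correlation 1 / (a + b + 1) equals rho.
            scale = (1.0 - rho) / rho
            q = rng.betavariate(p * scale, (1.0 - p) * scale)
        else:
            q = p
        # Each of the `depth` reads carries the reference allele with probability q.
        counts.append(sum(rng.random() < q for _ in range(depth)))
    return counts


def dispersion_index(counts, depth, p=0.5):
    """Mean squared deviation from depth * p, divided by the binomial variance.

    Values near 1 are consistent with binomial sampling; values well
    above 1 indicate overdispersion.
    """
    expected = depth * p
    binom_var = depth * p * (1.0 - p)
    return sum((x - expected) ** 2 for x in counts) / (len(counts) * binom_var)


def moment_rho(counts, depth, p=0.5):
    """Method-of-moments estimate of rho from the dispersion index."""
    phi = dispersion_index(counts, depth, p)
    return max(0.0, (phi - 1.0) / (depth - 1))


rng = random.Random(2021)  # fixed seed so the run is repeatable
depth = 40                 # hypothetical read depth per locus
binom = simulate_ref_counts(5000, depth, rng=rng)
over = simulate_ref_counts(5000, depth, rho=0.05, rng=rng)

print("dispersion, binomial counts:     ", round(dispersion_index(binom, depth), 3))
print("dispersion, beta-binomial counts:", round(dispersion_index(over, depth), 3))
print("estimated rho, beta-binomial:    ", round(moment_rho(over, depth), 4))
```

Under these assumptions the binomial counts should give a dispersion index close to 1, while the beta-binomial counts should give one close to 1 + (40 - 1) × 0.05 = 2.95. With real data the allele probability and read depth vary across loci, so a per-locus or likelihood-based treatment would be needed in place of this fixed-depth moment estimate.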