Yang Shengping, Fang Zhide
Department of Pathology, School of Medicine, Texas Tech University Health Sciences Center, Lubbock, Texas, USA.
Biostatistics Program, School of Public Health, LSU Health Sciences Center, New Orleans, Louisiana, USA.
J Appl Stat. 2017;44(1):57-70. doi: 10.1080/02664763.2016.1158798. Epub 2016 Mar 16.
Paired sequencing data are commonly collected in genomic studies to control for biological variation. However, existing data processing strategies perform poorly in low-coverage regions, which are unavoidable given the limitations of current sequencing technology. Furthermore, the information contained in the absolute values of the read counts is commonly ignored. We propose a read count ratio processing/modification method that not only incorporates the information contained in the absolute values of the paired counts into a single variable, but also mitigates the discreteness artifact, especially when both counts are small. Simulations show that the processed variable is fit well by a Beta distribution, providing a simple tool for downstream inference.
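The abstract does not spell out the modification itself, so the following is only a minimal sketch of the problem it addresses, not the authors' method. It simulates low-coverage paired counts, shows that the raw ratio x / (x + y) takes only a handful of distinct values, and then applies a hypothetical stand-in smoothing (a draw from Beta(x + 1, y + 1), whose spread shrinks as the absolute counts grow) before fitting a Beta distribution by the method of moments. The function names, the coverage range of 1 to 8 reads, and the Beta(2, 2) distribution of true proportions are all assumptions made for illustration.

```python
import random
import statistics


def simulate_pairs(n_sites, rng, max_coverage=8):
    """Simulate low-coverage paired counts (x, y) at n_sites positions.

    Assumed setup: the true proportion at each site is Beta(2, 2) and the
    total coverage is uniform on 1..max_coverage.
    """
    pairs = []
    for _ in range(n_sites):
        p = rng.betavariate(2, 2)
        total = rng.randint(1, max_coverage)
        x = sum(rng.random() < p for _ in range(total))  # binomial draw
        pairs.append((x, total - x))
    return pairs


def raw_ratio(x, y):
    """Plain read count ratio; undefined (None) when both counts are zero."""
    total = x + y
    return x / total if total > 0 else None


def smoothed_ratio(x, y, rng):
    """Hypothetical smoothing, NOT the published method.

    Draws from Beta(x + 1, y + 1), so larger absolute counts give a value
    closer to x / (x + y) and small counts give a more diffuse value.
    """
    return rng.betavariate(x + 1, y + 1)


def beta_method_of_moments(values):
    """Estimate Beta(a, b) parameters from the sample mean and variance."""
    m = statistics.fmean(values)
    v = statistics.variance(values)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


rng = random.Random(0)
pairs = simulate_pairs(2000, rng)

raw = [r for r in (raw_ratio(x, y) for x, y in pairs) if r is not None]
smooth = [smoothed_ratio(x, y, rng) for x, y in pairs]
a_hat, b_hat = beta_method_of_moments(smooth)

print("distinct raw ratios:     ", len(set(raw)))
print("distinct smoothed ratios:", len(set(smooth)))
print("fitted Beta parameters:  ", a_hat, b_hat)
```

The count of distinct raw ratios makes the discreteness artifact concrete: with at most 8 reads per site, the raw ratio can only land on a small set of fractions, whereas the smoothed variable is continuous on (0, 1) and can be summarized by two Beta parameters.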