Oexle Konrad
Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland.
J Hum Genet. 2016 Jul;61(7):627-32. doi: 10.1038/jhg.2016.21. Epub 2016 Apr 14.
The evenness score (E) in next-generation sequencing (NGS) quantifies the homogeneity in coverage of the NGS targets. Here I clarify the mathematical description of E, which is 1 minus the integral from 0 to 1 over the cumulative distribution function F(x) of the normalized coverage x, where normalization means division by the mean, and derive a computationally more efficient formula; that is, 1 minus the integral from 0 to 1 over the probability density distribution f(x) times 1-x. An analogous formula for empirical coverage data is provided as well as fast R command line scripts. This new formula allows for a general comparison of E with the coefficient of variation (=standard deviation σ of normalized data) which is the conventional measure of the relative width of a distribution. For symmetrical distributions, including the Gaussian, E can be predicted closely as 1-σ(2)/2⩾E⩾1-σ/2 with σ⩽1 owing to normalization and symmetry. In case of the log-normal distribution as a typical representative of positively skewed biological data, the analysis yields E≈exp(-σ*/2) with σ*(2)=ln(σ(2)+1) up to large σ (⩽3), and E≈1-F(exp(-1)) for very large σ (⩾2.5). In the latter kind of rather uneven coverage, E can provide direct information on the fraction of well-covered targets that is not immediately delivered by the normalized σ. Otherwise, E does not appear to have major advantages over σ or over a simple score exp(-σ) based on it. Actually, exp(-σ) exploits a much larger part of its range for the evaluation of realistic NGS outputs.