Rempala Grzegorz A, Seweryn Michal
Department of Biostatistics and Cancer Research Center, Georgia Health Sciences University, Augusta, GA 30912, USA.
J Math Biol. 2013 Dec;67(6-7):1339-68. doi: 10.1007/s00285-012-0589-7. Epub 2012 Sep 25.
The paper presents novel approaches to the empirical analysis of diversity and similarity (overlap) in biological or ecological systems. The analysis is motivated by molecular studies of highly diverse mammalian T-cell receptor (TCR) populations and is related to the classical statistical problem of analyzing two-way contingency tables with missing cells and low cell counts. New measures of diversity and overlap are proposed, based on information-theoretic as well as geometric considerations, that can naturally up-weight or down-weight rare and abundant species in a population. Consistent estimates are derived by applying the Good-Turing sample-coverage correction. In particular, novel consistent estimates of the Shannon entropy function and the Morisita-Horn index are provided. Data from TCR populations in mice are used to illustrate the empirical performance of the proposed methods relative to existing alternatives.
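To make the quantities named in the abstract concrete, here is a minimal Python sketch. It is not the paper's estimators, whose exact form the abstract does not give. It assumes the textbook Good-Turing coverage estimate (one minus the fraction of singletons), a coverage-adjusted Shannon entropy in the style of the well-known Chao-Shen estimator, and the plain plug-in Morisita-Horn index with no coverage correction. All function names and the toy counts are hypothetical.

```python
import math


def good_turing_coverage(counts):
    """Good-Turing sample coverage: 1 - f1/n, with f1 the number of singletons."""
    n = sum(counts)
    if n <= 0:
        raise ValueError("counts must contain at least one observation")
    f1 = sum(1 for c in counts if c == 1)
    if f1 == n:
        # Common convention (an assumption here): avoid a coverage of exactly zero.
        f1 = n - 1
    return 1.0 - f1 / n


def plugin_entropy(counts):
    """Naive plug-in Shannon entropy (natural log) from observed frequencies."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def coverage_adjusted_entropy(counts):
    """Chao-Shen-style entropy: shrink frequencies by the estimated coverage,
    then weight each term by the inverse probability that the species is seen."""
    n = sum(counts)
    cov = good_turing_coverage(counts)
    h = 0.0
    for c in counts:
        if c == 0:
            continue
        p = cov * c / n
        h -= p * math.log(p) / (1.0 - (1.0 - p) ** n)
    return h


def morisita_horn(x, y):
    """Plug-in Morisita-Horn overlap: 2*sum(p*q) / (sum(p^2) + sum(q^2))."""
    if len(x) != len(y):
        raise ValueError("x and y must list counts for the same species, in order")
    nx, ny = sum(x), sum(y)
    p = [c / nx for c in x]
    q = [c / ny for c in y]
    cross = sum(a * b for a, b in zip(p, q))
    return 2.0 * cross / (sum(a * a for a in p) + sum(b * b for b in q))


if __name__ == "__main__":
    sample = [5, 3, 1, 1]  # toy clonotype counts, not data from the paper
    print(good_turing_coverage(sample))
    print(plugin_entropy(sample), coverage_adjusted_entropy(sample))
    print(morisita_horn([5, 3, 1, 1], [0, 2, 4, 4]))
```

On the toy sample the coverage-adjusted entropy comes out larger than the plug-in value. This reflects the usual motivation for such corrections: the plug-in estimate tends to be biased downward when many species are rare or unobserved.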