'Highly-Informative' Genetic Markers Can Bias Conclusions: Examples and General Solutions.

Author information

Lee Andy, Hemstrom William, Molea Natalie, Luikart Gordon, Christie Mark R

Affiliations

Department of Biological Sciences, Purdue University, West Lafayette, Indiana, USA.

Department of Biology, Colorado State University, Fort Collins, Colorado, USA.

Publication information

Mol Ecol Resour. 2025 Oct;25(7):e70011. doi: 10.1111/1755-0998.70011. Epub 2025 Jul 11.

Abstract

High-grading bias is the overestimation of power in a subset of loci caused by model overfitting. Using both empirical and simulated datasets, we show that high-grading bias can cause severe overestimation of population structure, and thus mislead investigators, whenever highly informative or high-F_ST markers are chosen (i.e., ascertained) and used for subsequent assessments, a common practice in population genetic studies. This problem can occur even in panmictic populations with no local adaptation. Biased results from choosing high-F_ST markers may have severe downstream implications for management and conservation, such as erroneous conservation unit delineation, which could squander limited conservation resources on protecting incorrectly defined 'populations'. Furthermore, we caution that high-grading is not limited to F_ST approaches; high-grading bias is a concern whenever a small subset of markers is first chosen to explain differences among groups based on their degree of difference and is subsequently reused to estimate the degree of difference among those same groups. For example, selecting high-F_ST loci for use in a GT-seq panel or using differentially expressed genes to plot sample membership in multivariate space can both result in spurious structure when none exists. We illustrate that using statistically based outlier tests in place of arbitrary F_ST cut-offs can reduce bias. Alternatively, permutation tests or cross-evaluation can be used to detect high-grading bias. We provide an R package, PCAssess, to help researchers detect and prevent high-grading bias in genetic datasets by automating permutation tests and principal component analyses (https://github.com/hemstrow/PCAssess).
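The mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code and does not use the PCAssess API; it is a minimal, standard-library Python illustration under stated assumptions: a single panmictic population, arbitrary group labels, and an uncorrected Nei-style per-locus F_ST as a stand-in estimator. All function names (`simulate_genotypes`, `locus_fst`, `top_loci`, `high_graded_fst`, `demo`) and parameter values are hypothetical. It compares the F_ST of the top-ranked loci when re-estimated on the same individuals used to choose them against a held-out half (cross-evaluation), and runs a label-permutation test in which the selection step is repeated for every permutation.

```python
import random
from statistics import mean

def simulate_genotypes(n_ind, n_loci, rng):
    """Alt-allele counts (0/1/2) for individuals drawn from ONE panmictic population."""
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n_ind)]

def locus_fst(genos, groups, locus):
    """Uncorrected Nei-style F_ST = (H_T - H_S) / H_T for one locus, two groups."""
    p = []
    for g in (0, 1):
        counts = [row[locus] for row, lab in zip(genos, groups) if lab == g]
        p.append(sum(counts) / (2 * len(counts)))
    p_bar = mean(p)
    h_t = 2 * p_bar * (1 - p_bar)
    h_s = mean(2 * q * (1 - q) for q in p)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0

def mean_fst(genos, groups, loci):
    """Mean per-locus F_ST over the given locus indices."""
    return mean(locus_fst(genos, groups, l) for l in loci)

def top_loci(genos, groups, k):
    """Indices of the k loci with the highest F_ST (the 'high-grading' step)."""
    ranked = sorted(range(len(genos[0])),
                    key=lambda l: locus_fst(genos, groups, l), reverse=True)
    return ranked[:k]

def high_graded_fst(genos, groups, k):
    """Select the top-k loci and re-estimate F_ST on the SAME individuals."""
    return mean_fst(genos, groups, top_loci(genos, groups, k))

def demo(n_ind=80, n_loci=500, k=20, n_perm=50, seed=1):
    rng = random.Random(seed)
    genos = simulate_genotypes(n_ind, n_loci, rng)
    groups = [i % 2 for i in range(n_ind)]  # arbitrary labels: no true structure

    # Cross-evaluation: choose loci on one half, evaluate on the other half.
    train = [i for i in range(n_ind) if (i // 2) % 2 == 0]
    test = [i for i in range(n_ind) if (i // 2) % 2 == 1]
    g_tr, lab_tr = [genos[i] for i in train], [groups[i] for i in train]
    g_te, lab_te = [genos[i] for i in test], [groups[i] for i in test]
    chosen = top_loci(g_tr, lab_tr, k)

    # Permutation test: shuffle labels, repeat selection + estimation each time.
    observed = high_graded_fst(genos, groups, k)
    null = []
    for _ in range(n_perm):
        shuffled = groups[:]
        rng.shuffle(shuffled)
        null.append(high_graded_fst(genos, shuffled, k))
    p_value = (1 + sum(x >= observed for x in null)) / (1 + n_perm)

    return {
        "all_loci": mean_fst(g_te, lab_te, range(n_loci)),  # baseline, test half
        "reused": mean_fst(g_tr, lab_tr, chosen),           # same data reused
        "held_out": mean_fst(g_te, lab_te, chosen),         # independent data
        "observed_full": observed,
        "perm_p": p_value,
    }

if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name:>14}: {value:.4f}")
```

In this setup the `reused` value should come out well above both `all_loci` and `held_out`, even though the groups are arbitrary. The key design choice in the permutation loop is that loci are re-selected for every shuffled labelling; permuting labels only after selection would leave the ascertainment step outside the null distribution.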

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3605/12415817/9cf6da5a8b95/MEN-25-e70011-g003.jpg
