Department of Mathematical Sciences, SUNY Binghamton University, Vestal, NY 13850, USA.
Genes (Basel). 2021 May 13;12(5):736. doi: 10.3390/genes12050736.
Despite the fact that imbalance between case and control groups is prevalent in genome-wide association studies (GWAS), it is often overlooked. This imbalance is getting more significant and urgent as the rapid growth of biobanks and electronic health records have enabled the collection of thousands of phenotypes from large cohorts, in particular for diseases with low prevalence. The unbalanced binary traits pose serious challenges to traditional statistical methods in terms of both genomic selection and disease prediction. For example, the well-established linear mixed models (LMM) yield inflated type I error rates in the presence of unbalanced case-control ratios. In this article, we review multiple statistical approaches that have been developed to overcome the inaccuracy caused by the unbalanced case-control ratio, with the advantages and limitations of each approach commented. In addition, we also explore the potential for applying several powerful and popular state-of-the-art machine-learning approaches, which have not been applied to the GWAS field yet. This review paves the way for better analysis and understanding of the unbalanced case-control disease data in GWAS.
尽管病例对照组之间的不平衡在全基因组关联研究(GWAS)中很常见,但往往被忽视。随着生物库和电子健康记录的快速增长,使得从大型队列中收集数千种表型成为可能,特别是对于患病率较低的疾病,这种不平衡变得更加显著和紧迫。不平衡的二元特征对传统的统计方法在基因组选择和疾病预测方面都提出了严峻的挑战。例如,在存在不平衡的病例对照比例的情况下,成熟的线性混合模型(LMM)会导致过高的Ⅰ型错误率。在本文中,我们综述了多种已开发的统计方法,这些方法旨在克服病例对照比例不平衡所带来的不准确性,并对每种方法的优缺点进行了评论。此外,我们还探讨了应用几种强大且流行的最新机器学习方法的可能性,这些方法尚未应用于 GWAS 领域。本综述为更好地分析和理解 GWAS 中不平衡的病例对照疾病数据铺平了道路。