Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data.

Authors

Gauran Iris Ivy M, Park Junyong, Lim Johan, Park DoHwan, Zylstra John, Peterson Thomas, Kann Maricel, Spouge John L

Affiliations

Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.

School of Statistics, University of the Philippines Diliman, Quezon City, 1101, Philippines.

Publication information

Biometrics. 2018 Jun;74(2):458-471. doi: 10.1111/biom.12779. Epub 2017 Sep 22.

DOI:10.1111/biom.12779

PMID:28940296

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5862774/

Abstract

In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene-centric approaches, since the latter are limited in how well they account for the functional context that the position of a mutation provides. This presents a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This article aims to select significant mutation counts while controlling Type I error at a given level via false discovery rate (FDR) procedures. One main assumption is that the mutation counts follow a zero-inflated model, which accounts for both the true zeros in the count model and the excess zeros. The class of models considered is the Zero-inflated Generalized Poisson (ZIGP) distribution. Furthermore, we assume that there exists a cut-off value such that counts smaller than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on a screening process, in which mutation counts exceeding a certain value are considered significant. Simulated and protein domain data sets are used to illustrate the procedure for estimating the empirical null using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed two-stage testing procedure has superior empirical power.
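To make the zero-inflated model named in the abstract concrete, here is a minimal Python sketch of a ZIGP probability mass function. It assumes Consul's parameterization of the generalized Poisson distribution, P(Y = y) = θ(θ + λy)^(y-1) exp(-θ - λy) / y! with θ > 0 and 0 ≤ λ < 1, which may differ from the parameterization used in the article. The names `gp_pmf`, `zigp_pmf`, `pi0`, `theta`, and `lam` are illustrative and do not come from the source.

```python
import math


def gp_pmf(y, theta, lam):
    """Generalized Poisson pmf (Consul's parameterization, assumed here).

    theta > 0 is the rate-like parameter; 0 <= lam < 1 controls
    overdispersion, and lam = 0 recovers the ordinary Poisson pmf.
    """
    if y < 0:
        return 0.0
    # Work on the log scale to avoid overflow for large counts.
    log_p = (
        math.log(theta)
        + (y - 1) * math.log(theta + lam * y)
        - theta
        - lam * y
        - math.lgamma(y + 1)
    )
    return math.exp(log_p)


def zigp_pmf(y, pi0, theta, lam):
    """Zero-inflated generalized Poisson pmf.

    With probability pi0 the count is an excess (structural) zero;
    otherwise it is drawn from the generalized Poisson component,
    which can itself produce a "true" zero.
    """
    base = (1.0 - pi0) * gp_pmf(y, theta, lam)
    return pi0 + base if y == 0 else base
```

Under these assumptions, the probability of observing zero is `pi0 + (1 - pi0) * exp(-theta)`, which separates the excess zeros from the zeros produced by the count component, and the mean of the mixture is `(1 - pi0) * theta / (1 - lam)`.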

Similar articles

On performance of parametric and distribution-free models for zero-inflated and over-dispersed count responses.

Stat Med. 2015 Oct 30;34(24):3235-45. doi: 10.1002/sim.6560. Epub 2015 Jun 15.

Zero-inflated Conway-Maxwell Poisson Distribution to Analyze Discrete Data.

Int J Biostat. 2018 Jan 9;14(1):ijb-2016-0070. doi: 10.1515/ijb-2016-0070.

Marginalized multilevel hurdle and zero-inflated models for overdispersed and correlated count data with excess zeros.

Stat Med. 2014 Nov 10;33(25):4402-19. doi: 10.1002/sim.6237. Epub 2014 Jun 23.

A simulation study of the performance of statistical models for count outcomes with excessive zeros.

Stat Med. 2024 Oct 30;43(24):4752-4767. doi: 10.1002/sim.10198. Epub 2024 Aug 28.

Nonlinear mixed-effects modeling of longitudinal count data: Bayesian inference about median counts based on the marginal zero-inflated discrete Weibull distribution.

Stat Med. 2021 Oct 15;40(23):5078-5095. doi: 10.1002/sim.9112. Epub 2021 Jun 21.

Modelling count data with excessive zeros: the need for class prediction in zero-inflated models and the issue of data generation in choosing between zero-inflated and generic mixture models for dental caries data.

Stat Med. 2009 Dec 10;28(28):3539-53. doi: 10.1002/sim.3699.

Bivariate zero-inflated regression for count data: a Bayesian approach with application to plant counts.

Int J Biostat. 2010;6(1):Article 27. doi: 10.2202/1557-4679.1229.

Resampling-based empirical Bayes multiple testing procedures for controlling generalized tail probability and expected value error rates: focus on the false discovery rate and simulation study.

Biom J. 2008 Oct;50(5):716-44. doi: 10.1002/bimj.200710473.

Semiparametric analysis of zero-inflated count data.

Biometrics. 2006 Dec;62(4):996-1003. doi: 10.1111/j.1541-0420.2006.00575.x.

Cited by

Bioinformatic RNA-Seq Functional Profiling of the Tumor Suppressor Gene OPCML in Ovarian Cancers: The Multifunctional, Pleiotropic Impacts of Having Three Ig Domains.

Curr Issues Mol Biol. 2025 May 29;47(6):405. doi: 10.3390/cimb47060405.

C-ziptf: stable tensor factorization for zero-inflated multi-dimensional genomics data.

BMC Bioinformatics. 2024 Oct 5;25(1):323. doi: 10.1186/s12859-024-05886-4.

Individualized empirical null estimation for exact tests of healthcare quality.

Stat Med. 2024 May 30;43(12):2403-2420. doi: 10.1002/sim.10074. Epub 2024 Apr 8.

Thresholding Gini variable importance with a single-trained random forest: An empirical Bayes approach.

Comput Struct Biotechnol J. 2023 Sep 1;21:4354-4360. doi: 10.1016/j.csbj.2023.08.033. eCollection 2023.

Adaptive local false discovery rate procedures for highly spiky data and their application RNA sequencing data of yeast SET4 deletion mutants.

Biom J. 2021 Dec;63(8):1729-1744. doi: 10.1002/bimj.202000256. Epub 2021 Jul 28.

Oncodomains: A protein domain-centric framework for analyzing rare variants in tumor samples.

PLoS Comput Biol. 2017 Apr 20;13(4):e1005428. doi: 10.1371/journal.pcbi.1005428. eCollection 2017 Apr.
