XeroGraph: enhancing data integrity in the presence of missing values with statistical and predictive analysis.

Author information

Mousafi Alasal Laila, Hammarlund Emma U, Pienta Kenneth J, Rönnstrand Lars, Kazi Julhash U

Affiliations

Division of Translational Cancer Research, Department of Laboratory Medicine, Lund University, Lund, 22363, Sweden.

Lund Stem Cell Center, Department of Laboratory Medicine, Lund University, Lund, 22184, Sweden.

Publication information

Bioinform Adv. 2025 Feb 21;5(1):vbaf035. doi: 10.1093/bioadv/vbaf035. eCollection 2025.

DOI:10.1093/bioadv/vbaf035

PMID:40061871

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11889451/

Abstract

MOTIVATION

Missing data present a pervasive challenge in data analysis, potentially biasing outcomes and undermining conclusions if not addressed properly. Missing data are commonly classified into Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). While MCAR poses a minimal risk of data distortion, both MAR and MNAR can seriously affect the results of subsequent analyses. Therefore, it is important to know the type of missing data and appropriately handle them.
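To make the three mechanisms concrete, the following is a minimal simulation sketch using only the Python standard library. It is not part of XeroGraph and does not use its API; the variable names, the 0.8 slope, and the missingness probabilities are all illustrative assumptions. It removes values of `y` under each mechanism and compares the mean of the remaining values with the mean of the full data.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Draw (x, y) pairs where y depends linearly on x plus noise (assumed toy model)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 0.8 * x + rng.gauss(0.0, 0.6)
        rows.append((x, y))
    return rows


def mask(rows, mechanism, seed=1):
    """Replace some y values with None according to the chosen mechanism."""
    rng = random.Random(seed)
    masked = []
    for x, y in rows:
        if mechanism == "MCAR":
            p_missing = 0.3                      # unrelated to x or y
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.0    # depends only on the observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.0    # depends on the unobserved y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append((x, None if rng.random() < p_missing else y))
    return masked


def observed_mean(data):
    """Mean of y over the rows where y was not removed."""
    return statistics.fmean(y for _, y in data if y is not None)


rows = simulate()
full_mean = statistics.fmean(y for _, y in rows)
means = {m: observed_mean(mask(rows, m)) for m in ("MCAR", "MAR", "MNAR")}

print(f"full data mean of y: {full_mean:+.3f}")
for mechanism, value in means.items():
    print(f"{mechanism:>4} observed mean: {value:+.3f}")
```

In this toy setup the MCAR mean stays close to the full-data mean, while the MAR and MNAR means are pulled downward because larger values are removed more often. Under MAR the shift can in principle be corrected by conditioning on `x`; under MNAR it cannot be corrected without further assumptions about the missingness process.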

RESULTS

To facilitate efficient handling of missing data, we introduce a Python package named XeroGraph that is designed to evaluate data quality, categorize the nature of missingness, and guide imputation decisions. By comparing how various imputation methods influence underlying distributions, XeroGraph provides a systematic framework that supports more accurate and transparent analyses. Through its comprehensive preliminary assessments and user-friendly interface, this package facilitates the selection of optimal strategies tailored to the specific missing data mechanisms present in a dataset. In doing so, XeroGraph may significantly improve the validity and reproducibility of research findings, making it a valuable tool for professionals in data-intensive fields.
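The idea of comparing how imputation methods affect a distribution can be sketched as follows. This is an illustrative standard-library example, not XeroGraph's implementation or API: the two imputation methods (mean and simple linear regression on `x`), the hand-written two-sample Kolmogorov-Smirnov statistic, and the simulated MAR data are all assumptions chosen for the sketch.

```python
import bisect
import random
import statistics


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in a + b:
        fa = bisect.bisect_right(a, v) / len(a)
        fb = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap


rng = random.Random(0)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_true = [0.8 * xi + rng.gauss(0.0, 0.6) for xi in x]

# MAR (assumed): y is missing with probability 0.6 whenever the observed x is positive.
y_obs = [None if (xi > 0 and rng.random() < 0.6) else yi
         for xi, yi in zip(x, y_true)]

# Fit y = intercept + slope * x by least squares on the complete cases.
pairs = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
mean_x = statistics.fmean(p[0] for p in pairs)
mean_y = statistics.fmean(p[1] for p in pairs)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
         / sum((xi - mean_x) ** 2 for xi, _ in pairs))
intercept = mean_y - slope * mean_x

# Method 1: fill every gap with the observed mean of y.
y_mean_imputed = [mean_y if yi is None else yi for yi in y_obs]
# Method 2: fill every gap with the regression prediction from x.
y_reg_imputed = [intercept + slope * xi if yi is None else yi
                 for xi, yi in zip(x, y_obs)]

ks_mean = ks_statistic(y_true, y_mean_imputed)
ks_reg = ks_statistic(y_true, y_reg_imputed)
print(f"KS distance to true y, mean imputation:       {ks_mean:.3f}")
print(f"KS distance to true y, regression imputation: {ks_reg:.3f}")
```

A smaller distance means the imputed variable stays closer to the original distribution. Comparing against `y_true` is only possible here because the data are simulated; with real data the true values are unknown, so a comparison of this kind would have to rely on the observed values or on values masked deliberately for evaluation.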

AVAILABILITY AND IMPLEMENTATION

XeroGraph is compatible with all operating systems and requires Python version 3.9 or higher. It can be freely downloaded from PyPI (https://pypi.org/project/XeroGraph). The source code is accessible on GitHub (https://github.com/kazilab/XeroGraph), and comprehensive documentation is available at Read the Docs (https://xerograph.readthedocs.io). This software is distributed under the Apache License 2.0.
