Mousafi Alasal Laila, Hammarlund Emma U, Pienta Kenneth J, Rönnstrand Lars, Kazi Julhash U
Division of Translational Cancer Research, Department of Laboratory Medicine, Lund University, Lund, 22363, Sweden.
Lund Stem Cell Center, Department of Laboratory Medicine, Lund University, Lund, 22184, Sweden.
Bioinform Adv. 2025 Feb 21;5(1):vbaf035. doi: 10.1093/bioadv/vbaf035. eCollection 2025.
Missing data are a pervasive challenge in data analysis: if not addressed properly, they can bias outcomes and undermine conclusions. Missing data are commonly classified as Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR). MCAR poses minimal risk of data distortion, whereas both MAR and MNAR can seriously affect the results of subsequent analyses. It is therefore important to identify the type of missingness and handle the missing values accordingly.
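The three mechanisms can be illustrated with a small simulation. The following is a minimal sketch using NumPy only; it is not part of XeroGraph, and the function name `simulate_missingness`, the variables `x` and `y`, and the 30% missing rate are assumptions chosen for illustration. It shows that the mean of the observed values stays close to the true mean under MCAR, but shifts when missingness depends on another variable (MAR) or on the value itself (MNAR).

```python
import numpy as np

def simulate_missingness(n=10_000, rate=0.3, seed=0):
    """Hypothetical helper: build one variable and three missingness masks."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)       # fully observed covariate
    y = x + rng.normal(size=n)   # variable that will lose values

    masks = {
        # MCAR: every value is dropped with the same probability
        "MCAR": rng.random(n) < rate,
        # MAR: drop probability depends only on the observed covariate x
        "MAR": rng.random(n) < 2 * rate * (x > 0),
        # MNAR: drop probability depends on the (unobserved) value of y itself
        "MNAR": rng.random(n) < 2 * rate * (y > 0),
    }
    return y, masks

y, masks = simulate_missingness()
print(f"complete data: mean = {y.mean():+.3f}")
for name, mask in masks.items():
    # Mean of the values that remain after applying the mask
    print(f"{name:>13}: observed mean = {y[~mask].mean():+.3f}")
```

In this sketch the MAR and MNAR masks remove mostly large values, so a naive analysis of the remaining data would underestimate the mean, whereas the MCAR mask leaves the estimate essentially unchanged.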
To support efficient handling of missing data, we introduce XeroGraph, a Python package designed to evaluate data quality, categorize the nature of missingness, and guide imputation decisions. By comparing how various imputation methods influence the underlying distributions, XeroGraph provides a systematic framework that supports more accurate and transparent analyses. Its comprehensive preliminary assessments and user-friendly interface help users select strategies tailored to the specific missing-data mechanisms present in a dataset. In doing so, XeroGraph may significantly improve the validity and reproducibility of research findings, making it a valuable tool for professionals in data-intensive fields.
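To make concrete what it means for an imputation method to influence the underlying distribution, here is a minimal sketch in plain NumPy. It does not use or reproduce XeroGraph's API; the helpers `mean_impute` and `hot_deck_impute` and the simulated data are assumptions introduced only for illustration. It compares the standard deviation of the complete data with that of two imputed versions.

```python
import numpy as np

def mean_impute(values):
    """Hypothetical helper: replace NaNs with the mean of the observed values."""
    filled = values.copy()
    filled[np.isnan(filled)] = np.nanmean(values)
    return filled

def hot_deck_impute(values, rng):
    """Hypothetical helper: replace NaNs with random draws from the observed values."""
    filled = values.copy()
    missing = np.isnan(filled)
    filled[missing] = rng.choice(filled[~missing], size=missing.sum())
    return filled

rng = np.random.default_rng(0)
complete = rng.normal(loc=50, scale=10, size=5_000)   # simulated "true" data
with_gaps = complete.copy()
with_gaps[rng.random(complete.size) < 0.3] = np.nan   # remove about 30% (MCAR)

candidates = {
    "mean": mean_impute(with_gaps),
    "hot deck": hot_deck_impute(with_gaps, rng),
}
print(f"complete: std = {complete.std():.2f}")
for name, filled in candidates.items():
    print(f"{name:>8}: std = {filled.std():.2f}")
```

Mean imputation places all filled values at a single point and therefore shrinks the spread of the variable, while random draws from the observed values roughly preserve it under MCAR. Comparisons of this kind are one way to judge whether an imputation method distorts the data before running downstream analyses.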
XeroGraph is compatible with all operating systems and requires Python version 3.9 or higher. It can be freely downloaded from PyPI (https://pypi.org/project/XeroGraph). The source code is accessible on GitHub (https://github.com/kazilab/XeroGraph), and comprehensive documentation is available at Read the Docs (https://xerograph.readthedocs.io). This software is distributed under the Apache License 2.0.