ASAS-NANP symposium: mathematical modeling in animal nutrition: synthetic database generation for non-normal multivariate distributions: a rank-based method with application to ruminant methane emissions.

Author information

Tedeschi Luis O

Affiliation

Department of Animal Science, Texas A&M University, College Station, TX, USA.

Publication information

J Anim Sci. 2025 Jan 4;103. doi: 10.1093/jas/skaf136.

Abstract

This study addresses the challenge of limited data availability in animal science, particularly in modeling complex biological processes such as methane emissions from ruminants. We propose a novel rank-based method for generating synthetic databases with correlated non-normal multivariate distributions, aimed at enhancing the accuracy and reliability of predictive modeling tools. Our rank-based approach involves a four-step process: 1) fitting distributions to variables using normal or best-fit non-normal distributions, 2) generating synthetic databases, 3) preserving relationships among variables using Spearman correlations, and 4) cleaning datasets to ensure biological plausibility. We compare this method with copula-based approaches for maintaining a preestablished correlation structure. The rank-based method demonstrated superior performance in preserving original distribution moments (mean, variance, skewness, kurtosis) and correlation structures compared to copula-based methods. We generated two synthetic databases (normal and non-normal distributions) and applied random forest (RF) and multiple linear model (LM) regression analyses. RF regression outperformed LM in predicting methane emissions, showing higher R² values (0.927 vs. 0.622) and lower standard errors. However, cross-testing revealed that RF regressions exhibit high specificity to distribution types, underperforming when applied to data with differing distributions. In contrast, LM regressions showed robustness across different distribution types. Our findings highlight the importance of understanding distributional assumptions in regression techniques when generating synthetic databases. The study also underscores the potential of synthetic data in augmenting limited samples, addressing class imbalances, and simulating rare scenarios.
While our method effectively preserves descriptive statistical properties, we acknowledge the possibility of introducing artificial (unknown) relationships within subsets of the synthetic database. This research offers a practical solution for creating realistic, statistically sound datasets when original data are scarce or sensitive. Its application in predicting methane emissions demonstrates the potential to enhance modeling accuracy in animal science. Future research directions include integrating this approach with deep learning, exploring real-world applications, and developing adaptive machine-learning models for diverse data distributions.
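The four steps listed in the abstract can be illustrated with a small sketch. The abstract does not give the paper's algorithm, so what follows is a generic rank-reordering scheme (in the spirit of Iman-Conover) and not the author's implementation. It draws each variable independently from its marginal, then permutes the draws so that their ranks follow a correlated normal "template". All names (`reorder_to_ranks`, `generate`), the two variables, their distribution parameters, and the plausibility bounds are hypothetical. The conversion from a target Spearman correlation to the Pearson correlation of the template assumes a bivariate normal template.

```python
import math
import random


def ranks(values):
    """0-based rank of each value (continuous draws, so ties are not expected)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def reorder_to_ranks(sample, template):
    """Permute `sample` so its rank order matches `template`.

    The values themselves are unchanged, so the marginal moments
    (mean, variance, skewness, kurtosis) of `sample` are preserved.
    """
    sorted_sample = sorted(sample)
    return [sorted_sample[r] for r in ranks(template)]


def generate(n, target_spearman, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: independent draws from assumed (hypothetical) fitted marginals.
    intake = [rng.lognormvariate(math.log(18.0), 0.25) for _ in range(n)]  # kg/d
    methane = [rng.gammavariate(9.0, 40.0) for _ in range(n)]              # g/d
    # Step 3: correlated normal scores used only as a rank template.
    # For a bivariate normal, Spearman r_s = (6/pi) * asin(rho / 2).
    rho = 2.0 * math.sin(math.pi * target_spearman / 6.0)
    z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z2 = [rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for a in z1]
    intake = reorder_to_ranks(intake, z1)
    methane = reorder_to_ranks(methane, z2)
    # Step 4: drop rows outside assumed biological plausibility bounds.
    return [(i, m) for i, m in zip(intake, methane)
            if 5.0 <= i <= 40.0 and 50.0 <= m <= 900.0]


if __name__ == "__main__":
    rows = generate(5000, 0.7, seed=1)
    x, y = zip(*rows)
    print(len(rows), round(spearman(x, y), 3))
```

Because the reordering only permutes values, each marginal keeps its moments exactly until the cleaning step removes rows. The cleaning step can shift both the moments and the correlation slightly, so it is worth rechecking them afterwards.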

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/115e/12351256/5441e575ffa1/skaf136_fig1.jpg
