Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling).

Affiliations

Institute of Clinical Pharmacology, Goethe-University, Frankfurt am Main, Germany.

Fraunhofer Institute for Translational Medicine and Pharmacology ITMP, Frankfurt am Main, Germany.

Publication information

PLoS One. 2021 Aug 5;16(8):e0255838. doi: 10.1371/journal.pone.0255838. eCollection 2021.

Abstract

MOTIVATION

The size of today's biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this can be optimized to obtain samples that better reflect the entire data set than those obtained using the current standard method.

RESULTS

By repeating the random sampling and comparing the distribution of each drawn sample with the distribution of the original data, it was possible to establish a method for obtaining data subsets that reflect the entire data set better than the first randomly selected subsample, which is what the current standard method uses. Experiments on artificial and real biomedical data sets showed that the remaining data of the original data set could be reconstructed significantly better from the downsampled data. This was observed with both principal component analysis and autoencoding neural networks. The fidelity depended on both the number of cases drawn from the original data and the number of samples drawn.
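The repeated-sampling idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the opdisDownsampling package itself: the function names, the `fraction` and `n_trials` parameters, and the choice of the two-sample Kolmogorov-Smirnov statistic (taking the worst value across features) as the measure of distributional similarity are all assumptions made for this example.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def ks_statistic(sample, reference):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical CDFs of the two samples (assumed similarity measure)."""
    a, b = sorted(sample), sorted(reference)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def class_proportional_sample(labels, fraction, rng):
    """Draw the same fraction of row indices, uniformly at random, per class."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    chosen = []
    for indices in by_class.values():
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


def worst_feature_distance(rows, chosen):
    """Largest per-feature KS distance between the subsample and all rows."""
    n_features = len(rows[0])
    return max(
        ks_statistic([rows[i][j] for i in chosen], [row[j] for row in rows])
        for j in range(n_features)
    )


def best_of_n_downsample(rows, labels, fraction, n_trials, seed=0):
    """Repeat class-proportional sampling n_trials times and keep the draw
    whose distribution is closest to that of the full data set."""
    rng = random.Random(seed)
    best_idx, best_dist = None, float("inf")
    for _ in range(n_trials):
        idx = class_proportional_sample(labels, fraction, rng)
        dist = worst_feature_distance(rows, idx)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist


if __name__ == "__main__":
    data_rng = random.Random(1)
    rows = [[data_rng.gauss(0, 1), data_rng.gauss(5, 2)] for _ in range(200)]
    labels = [0] * 150 + [1] * 50
    # n_trials=1 corresponds to the standard single random draw.
    _, single = best_of_n_downsample(rows, labels, 0.1, n_trials=1, seed=42)
    _, best = best_of_n_downsample(rows, labels, 0.1, n_trials=20, seed=42)
    print(f"single draw: {single:.3f}, best of 20: {best:.3f}")
```

With the same seed, the first trial of the 20-trial run is identical to the single draw, so the selected subsample can never be further from the full data than the single draw under this distance. The selection uses only the feature distributions and class labels, not any downstream analysis result.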

CONCLUSIONS

Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data better than those obtained with the standard method. By using distributional similarity as the only selection criterion, the proposed method does not in any way affect the results of a later planned analysis.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82b1/8341664/2b92ea50cf89/pone.0255838.g001.jpg
