Upscaling Statistical Patterns from Reduced Storage in Social and Life Science Big Datasets.

Author information

Garlaschi Stefano, Fochesato Anna, Tovo Anna

Affiliations

Dipartimento di Fisica e Astronomia "Galileo Galilei", Università degli studi di Padova, Via Marzolo 8, 35131 Padova, Italy.

Fondazione The Microsoft Research-University of Trento, Centre for Computational and Systems Biology (COSBI), Piazza Manifattura 1, 38068 Rovereto, Italy.

Publication information

Entropy (Basel). 2020 Sep 26;22(10):1084. doi: 10.3390/e22101084.

Abstract

Recent technological and computational advances have enabled the collection of data at an unprecedented rate. On the one hand, the large amount of data suddenly available has opened up new opportunities for data-driven research; on the other hand, it has brought to light new obstacles and challenges related to storage and analysis limits. Here, we strengthen an upscaling approach borrowed from theoretical ecology that allows us to infer, with small errors, relevant patterns of a dataset in its entirety, even though only a limited fraction of it has been analysed. In particular, we show that, after reducing the amount of input information on the system under study, our framework can still recover two statistical patterns of interest for the entire dataset. Tested against big ecological, human activity and genomics data, our framework successfully reconstructed global statistics related to both the number of types and their abundances, starting only from limited presence/absence information on small random samples of the datasets. These results pave the way for future applications of our procedure in different life science contexts, from social activities to natural ecosystems.
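To make the input setting concrete, the following is a minimal illustrative sketch, not the authors' procedure. It builds a synthetic dataset of typed individuals, draws small random samples, and keeps only presence/absence information (which types appear), the kind of reduced input the abstract describes. The function names `make_population` and `presence_absence_sample`, the Pareto-distributed abundances, and the sampling fractions are all assumptions chosen for illustration; the upscaling estimator itself is not reproduced here.

```python
import random


def make_population(num_types=200, seed=0):
    """Build a synthetic list of individuals, each labelled by its type.

    Abundances follow a heavy-tailed (Pareto) draw: an assumption made
    only so that a few types are common and many are rare.
    """
    rng = random.Random(seed)
    population = []
    for type_id in range(num_types):
        abundance = 1 + int(rng.paretovariate(1.2))  # always >= 2
        population.extend([type_id] * abundance)
    return population


def presence_absence_sample(population, fraction, seed=0):
    """Sample a random fraction of individuals without replacement and
    return only the set of types present (abundances are discarded)."""
    rng = random.Random(seed)
    k = max(1, int(len(population) * fraction))
    return set(rng.sample(population, k))


population = make_population()
total_types = len(set(population))

# Small samples see only part of the types; recovering the total from
# such samples is the upscaling problem.
for fraction in (0.01, 0.05, 0.10):
    observed = presence_absence_sample(population, fraction)
    print(f"p={fraction:.2f}: {len(observed)} of {total_types} types observed")
```

The gap between the observed and total number of types at each sampling fraction is what an upscaling method has to bridge.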

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c5fa/7597173/94a982ebf6f0/entropy-22-01084-g001.jpg
