Ge Kenneth, Nguyen Phuc, Arnaout Ramy
Department of Pathology, Beth Israel Deaconess Medical Center (BIDMC); also a student at Carnegie Mellon University.
Department of Pathology at BIDMC, Boston, MA 02215.
bioRxiv. 2024 Oct 21:2024.10.18.618994. doi: 10.1101/2024.10.18.618994.
The University of California-Irvine (UCI) Machine Learning (ML) Repository (UCIMLR) is consistently cited as one of the most popular dataset repositories, hosting hundreds of high-impact datasets. However, a significant portion, including 28.4% of the top 250, cannot be imported via the package that is provided and recommended by the UCIMLR website. Instead, they are hosted as .zip files containing nonstandard formats that are difficult to import without additional ad hoc processing. To address this issue, we present lucie (load University California Irvine examples), a utility that automatically determines the data format and imports many of these previously non-importable datasets while preserving as much of the tabular data structure as possible. lucie was designed using the top 100 most popular datasets and benchmarked on the next 130, where it achieved a success rate of 95.4% vs. 73.1% for the package recommended by the UCIMLR website. lucie is available as a Python package on PyPI with 98% code coverage.
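To make the import problem concrete, the following is a minimal sketch of what "automatically determining the data format" can involve for a single file inside a .zip archive. It is not the utility's actual implementation, which the abstract does not describe. The function names (`guess_delimiter`, `load_first_table`, `make_demo_zip`), the candidate delimiter list, and the demo file contents are all assumptions made for illustration, and only the Python standard library is used.

```python
import csv
import io
import zipfile

# Hypothetical candidate delimiters, tried in this order (an assumption).
CANDIDATES = [",", ";", "\t", " "]


def guess_delimiter(lines):
    """Return the first candidate whose per-line count is nonzero and
    identical on every line, or None if no candidate qualifies."""
    for delim in CANDIDATES:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1 and counts != {0}:
            return delim
    return None


def load_first_table(zip_bytes):
    """Read the first member of a zip archive and split it into rows."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = zf.namelist()[0]
        text = zf.read(name).decode("utf-8", errors="replace")
    # Skip blank lines so they do not distort the delimiter counts.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    delim = guess_delimiter(lines)
    if delim is None:
        raise ValueError(f"could not determine a delimiter for {name}")
    rows = list(csv.reader(lines, delimiter=delim))
    return name, delim, rows


def make_demo_zip():
    """Build a small in-memory archive with a semicolon-separated file
    (made-up contents, used only to exercise the sketch)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("demo.data", "5.1;3.5;setosa\n4.9;3.0;versicolor\n")
    return buf.getvalue()


name, delim, rows = load_first_table(make_demo_zip())
print(name, repr(delim), rows[0])
```

Real repository archives can contain several files, header-less tables, fixed-width columns, or non-tabular data, so a heuristic this simple would fail on many of them; it only illustrates why ad hoc processing is otherwise needed for each dataset.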