Department of Biology, Brigham Young University, Provo, Utah 84602, USA.
Northeast Ohio Medical University, Rootstown, Ohio 44272, USA.
Sci Data. 2018 Apr 17;5:180066. doi: 10.1038/sdata.2018.66.
One important use of genome-wide transcriptional profiles is to identify relationships between transcription levels and patient outcomes. These translational insights can guide the development of biomarkers for clinical application. Data from thousands of translational-biomarker studies have been deposited in public repositories, enabling reuse. However, data-reuse efforts require considerable time and expertise because transcriptional data are generated using heterogeneous profiling technologies, preprocessed using diverse normalization procedures, and annotated in non-standard ways. To address this problem, we curated 45 publicly available, translational-biomarker datasets from a variety of human diseases. To increase the data's utility, we reprocessed the raw expression data using a uniform computational pipeline, addressed quality-control problems, mapped the clinical annotations to a controlled vocabulary, and prepared consistently structured, analysis-ready data files. These data, along with the scripts we used to prepare them, are available in a public repository. We believe these data will be particularly useful to researchers seeking to perform benchmarking studies, for example to compare and optimize the ability of machine-learning algorithms to predict biomedical outcomes.