Hu Zhiyue Tom, Ye Yuting, Newbury Patrick A, Huang Haiyan, Chen Bin
Department of Biostatistics, University of California, Berkeley, USA
Pac Symp Biocomput. 2019;24:248-259.
Inconsistency among open pharmacogenomics datasets produced by different studies limits their use in many tasks, such as biomarker discovery. Our investigation of multiple pharmacogenomics datasets confirmed that the pairwise correlation of sensitivity data between drugs (rows) across studies, the drug-wise correlation, is relatively low, whereas the pairwise correlation between cell lines (columns) across studies, the cell-wise correlation, is strong. This observation, common to multiple pharmacogenomics datasets, suggests that a subtle consistency exists among the studies (i.e., strong cell-wise correlation). However, substantial noise is also present (i.e., weak drug-wise correlation) and has prevented researchers from comfortably using the data directly. Motivated by this observation, we propose a novel framework for addressing the inconsistency between large-scale pharmacogenomics datasets. Our method can significantly boost the drug-wise correlation and can be easily applied to the re-summarized and normalized datasets proposed by others. We also evaluate our algorithm against many different criteria to demonstrate that the corrected datasets are not only consistent but also biologically meaningful. Finally, we propose extending our main algorithm into a framework so that, as more datasets become publicly available, it can hopefully offer "ground-truth" guidance for reference.
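To make the two correlation measures concrete, the sketch below computes them for a pair of toy drug-by-cell-line sensitivity matrices. It is an illustration only and not the correction method proposed in the paper. It assumes that "drug-wise" means correlating each drug's row in one study with the same drug's row in the other study, and that "cell-wise" means the same for columns. The helper `rowwise_pearson`, the matrix sizes, and the synthetic data are all hypothetical; the toy data are deliberately constructed so that differences between drugs dominate, and they say nothing about why real datasets behave this way.

```python
import numpy as np


def rowwise_pearson(a, b):
    """Pearson correlation between each row of `a` and the matching row of `b`."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return num / den


# Synthetic toy data (an assumption for illustration, not real measurements):
# rows are drugs, columns are cell lines, as in the abstract.
rng = np.random.default_rng(0)
n_drugs, n_cells = 30, 40
drug_effect = rng.normal(0.0, 1.0, size=(n_drugs, 1))   # large spread across drugs
cell_effect = rng.normal(0.0, 0.1, size=(1, n_cells))   # small spread across cell lines
shared = drug_effect + cell_effect                       # signal common to both studies

# Each "study" observes the shared signal with its own independent noise.
study_a = shared + rng.normal(0.0, 0.3, size=shared.shape)
study_b = shared + rng.normal(0.0, 0.3, size=shared.shape)

# Drug-wise: one correlation per drug (row), computed across cell lines.
drug_wise = rowwise_pearson(study_a, study_b)

# Cell-wise: one correlation per cell line (column), computed across drugs.
cell_wise = rowwise_pearson(study_a.T, study_b.T)

print("median drug-wise correlation:", np.median(drug_wise))
print("median cell-wise correlation:", np.median(cell_wise))
```

With real data, `study_a` and `study_b` would first have to be restricted to the drugs and cell lines that both studies share, in the same order, with missing values handled before the correlations are computed.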