Tom Hu Zhiyue, Yu Yaodong, Chen Ruoqiao, Yeh Shan-Ju, Chen Bin, Huang Haiyan
Division of Biostatistics, University of California Berkeley, Berkeley, CA 94720, United States.
Department of Electrical Engineering and Computer Science, University of California Berkeley, Berkeley, CA 94720, United States.
Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf226.
Pharmacogenomics studies are attracting an increasing amount of interest from researchers in precision medicine. The advances in high-throughput experiments and multiplexed approaches allow the large-scale quantification of drug sensitivities in molecularly characterized cancer cell lines (CCLs), resulting in a number of open drug sensitivity datasets for drug biomarker discovery. However, a significant inconsistency in drug sensitivity values among these datasets has been noted. Such inconsistency indicates the presence of substantial noise, subsequently hindering downstream analyses. To address the noise in drug sensitivity data, we introduce a robust and scalable deep learning framework, Residual Thresholded Deep Matrix Factorization (RT-DMF). This method takes a single drug sensitivity data matrix as its sole input and outputs a corrected and imputed matrix. Deep matrix factorization (DMF) excels at uncovering subtle patterns, due to its minimal reliance on data structure assumptions. This attribute significantly boosts DMF's ability to identify complex hidden patterns among nuisance effects in the data, thereby facilitating the detection of signals that are therapeutically relevant. Furthermore, RT-DMF incorporates an iterative residual thresholding procedure, which plays a crucial role in retaining signals more likely to hold therapeutic importance. Validation using simulated datasets and real pharmacogenomics datasets demonstrates the effectiveness of our approach in correcting noise and imputing missing data in drug sensitivity datasets (open-source package available at https://github.com/tomwhoooo/rtdmf).
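The general idea of pairing a factorization model with iterative residual thresholding can be illustrated with a small sketch. This is not the RT-DMF implementation: a truncated-SVD imputation loop stands in for the deep matrix factorization, and the function names (`low_rank_fit`, `residual_thresholded_fit`), the MAD-based threshold, and the parameters `rank`, `k`, and `n_rounds` are all assumptions made for illustration. The sketch takes one matrix with missing entries, fits a low-rank structure, flags residuals that are large on a robust scale, and refits without the flagged entries.

```python
import numpy as np

def low_rank_fit(X, mask, rank, n_iter=50):
    """Rank-`rank` approximation fitted only to entries where mask is True.

    Plain iterative truncated-SVD imputation, used here as a simple
    stand-in for a deep factorization model (an assumption of this sketch).
    """
    filled = np.where(mask, X, X[mask].mean())
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep trusted entries fixed, replace the rest with the current fit.
        filled = np.where(mask, X, approx)
    return approx

def residual_thresholded_fit(X, rank=2, k=3.0, n_rounds=5):
    """Alternate between low-rank fitting and residual thresholding.

    X may contain NaN for missing entries. Returns the low-rank fit
    (defined at every entry, so it also imputes) and a sparse matrix of
    large residuals, treated here as candidate signals.
    """
    observed = ~np.isnan(X)
    X0 = np.where(observed, X, 0.0)
    fit_mask = observed.copy()
    for _ in range(n_rounds):
        approx = low_rank_fit(X0, fit_mask, rank)
        resid = np.where(observed, X0 - approx, 0.0)
        r = resid[observed]
        # Robust residual scale: median absolute deviation, rescaled so it
        # matches the standard deviation under Gaussian noise.
        scale = 1.4826 * np.median(np.abs(r - np.median(r)))
        flagged = observed & (np.abs(resid) > k * scale)
        # Flagged entries are excluded from the next low-rank fit.
        fit_mask = observed & ~flagged
    return approx, np.where(flagged, resid, 0.0)
```

Calling `residual_thresholded_fit(X)` on a cell-line-by-drug matrix `X` with `NaN` for unmeasured pairs returns a dense fitted matrix and a sparse matrix of retained residuals. How the two parts are combined into a final corrected matrix is a design choice that this sketch leaves open.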