Lu Jin, Sun Jiangwen, Wang Xinyu, Kranzler Henry, Gelernter Joel, Bi Jinbo
Department of Computer Science and Engineering, University of Connecticut, 371 Fairfield Way, Unit 4155, Storrs, CT, USA.
Department of Psychiatry, University of Pennsylvania Perelman School of Medicine, 3535 Market Street, Suite 500 and Crescenz Veterans Affairs Medical Center, Philadelphia, PA, USA.
BMC Syst Biol. 2018 Nov 22;12(Suppl 6):104. doi: 10.1186/s12918-018-0623-5.
Although substance use disorders (SUDs) are heritable, few genetic risk factors for them have been identified, in part because study populations have been small. To address this limitation, researchers have aggregated subjects from multiple existing genetic studies, but these subjects can have missing phenotypic information, including diagnostic criteria for substances that were not an original focus of study. Recent advances in addiction neurobiology have shown that comorbid SUDs (e.g., the abuse of multiple substances) have similar genetic determinants. This makes it possible to infer, through statistical modeling, the missing diagnostic criteria for one SUD from the criteria for another SUD and the patient's genotype.
We propose a new approach based on matrix completion techniques that integrates features of comorbid health conditions and individuals' genotypes to infer unreported diagnostic criteria for a disorder. This approach optimizes a bi-linear model that uses the interactions between known disease correlations and candidate genes to impute missing criteria. We developed an efficient stochastic and parallel algorithm that optimizes the model 20 times faster than the classic sequential algorithm. The method was tested on 3441 subjects who had both cocaine and opioid use disorders, and it inferred missing diagnostic criteria with consistently better accuracy than other recent statistical methods.
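To make the idea of a bi-linear imputation model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a generic form in which an entry M[i, j] of the subject-by-criterion matrix is approximated by X[i] @ W @ Y[j], where X holds per-subject features (for example, genotypes and criteria from a comorbid disorder) and Y holds per-criterion features. The function names, the squared-error loss, the ridge penalty `lam`, and the step size `lr` are all assumptions made for illustration. The loop is a plain sequential stochastic gradient descent; the parallel scheme mentioned above is not reproduced here.

```python
import numpy as np

def fit_bilinear(M, mask, X, Y, lam=1e-3, lr=0.005, epochs=50, seed=0):
    """Fit W so that M[i, j] ~ X[i] @ W @ Y[j] on observed entries only.

    Hypothetical helper: squared-error loss with a ridge penalty,
    optimized by sequential stochastic gradient descent.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], Y.shape[1]))
    observed = np.argwhere(mask)          # (row, column) pairs with known values
    for _ in range(epochs):
        rng.shuffle(observed)             # visit observed entries in random order
        for i, j in observed:
            err = X[i] @ W @ Y[j] - M[i, j]
            # Gradient of 0.5 * err**2 + 0.5 * lam * ||W||^2 with respect to W
            W -= lr * (err * np.outer(X[i], Y[j]) + lam * W)
    return W

def impute(M, mask, X, Y, W):
    """Keep observed entries of M and fill the rest with model predictions."""
    return np.where(mask, M, X @ W @ Y.T)
```

Calling `fit_bilinear` with a boolean `mask` that marks the reported criteria, then passing the fitted `W` to `impute`, returns a completed matrix. Real diagnostic criteria are typically binary, so a practical version would likely swap the squared-error loss for a logistic one and threshold the predictions.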
The proposed matrix completion imputation method is a promising tool to impute unreported or unobserved symptoms or criteria for disease diagnosis. Integrating data at multiple scales or from heterogeneous sources may help improve the accuracy of phenotype imputation.