Pamela Sklar Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
Hum Mol Genet. 2020 Sep 30;29(R1):R33-R41. doi: 10.1093/hmg/ddaa192.
The 'discovery' stage of genome-wide association studies required amassing large, homogeneous cohorts. To attain clinically useful insights, we must now consider how disease presents within our clinics and, by extension, within our medical records. Large-scale use of electronic health record (EHR) data can help us understand phenotypes in a scalable manner, incorporating lifelong and whole-phenome context. However, extending analyses to incorporate EHR and biobank data will require careful consideration of phenotype definition. Judgements and clinical decisions that occur 'outside' the system inevitably contain some degree of bias and become encoded in EHR data. Any algorithmic approach to phenotypic characterization that assumes unbiased variables will compound that bias in its conclusions. Here, we discuss and illustrate potential biases inherent in EHR analyses and how these may be compounded over time, and we suggest frameworks for large-scale phenotypic analysis that minimize and uncover encoded bias.