Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02114, USA.
Partners Healthcare, Somerville, MA 02145, USA.
Bioinformatics. 2020 May 1;36(10):3200-3206. doi: 10.1093/bioinformatics/btaa088.
Expert-labeled data are essential to train phenotyping algorithms for cohort identification. However, expert labeling is time- and labor-intensive, and the costs remain prohibitive for scaling phenotyping to wider use cases.
We present an approach, referred to as polar labeling (PL), to create a silver standard for training machine learning (ML) models for disease classification. We test the hypothesis that ML models trained on the silver standard, created by applying PL to unlabeled patient records, are comparable in performance to ML models trained on the gold standard, created by clinical experts through manual review of patient records. We perform experimental validation using health records of 38 023 patients spanning six diseases. Our results demonstrate the superior performance of the proposed approach.
We provide a Python implementation of the algorithm and the Python code developed for this study on GitHub.
Supplementary data are available at Bioinformatics online.