Aphinyanaphongs Yin, Fu Lawrence D, Aliferis Constantin F
Center for Health Informatics and Bioinformatics, NYU Langone Medical Center, NY, NY, USA.
Stud Health Technol Inform. 2013;192:667-71.
Building machine learning models that identify unproven cancer treatments on the Health Web is a promising approach to countering the dissemination of false and dangerous information to vulnerable health consumers. Beyond the obvious requirement of accuracy, two issues are of practical importance when deploying these models in real-world applications. (a) Generalizability: the models must generalize to all treatments, not just the ones used to train them. (b) Scalability: the models must be efficiently applicable to billions of documents on the Health Web. First, we provide methods and related empirical data demonstrating strong accuracy and generalizability. Second, by combining the MapReduce distributed architecture with high-dimensionality compression via Markov Boundary feature selection, we show how to scale the application of the models to WWW-scale corpora. The present work provides evidence that (a) a very small subset of unproven cancer treatments is sufficient to build a model that identifies unproven treatments on the web; (b) unproven treatments use distinct language to market their claims, and this language is learnable; (c) through distributed parallelization and state-of-the-art feature selection, it is possible to prepare the corpora and to build and apply models at large scale.
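The scalability idea in the abstract (apply a model that uses only a small, selected feature set to each document independently, then aggregate) can be illustrated with a minimal local sketch. This is not the authors' implementation: the feature tokens, weights, bias, threshold, and all function names below are hypothetical, a simple linear score stands in for whatever classifier the paper used, and the map and reduce steps run in a single process rather than on a MapReduce cluster.

```python
import re
from collections import defaultdict

# Hypothetical weights for a handful of selected tokens. In the setting the
# abstract describes, the retained features would come from Markov Boundary
# feature selection and the weights from a trained model; these values are
# invented purely for illustration.
SELECTED_WEIGHTS = {
    "miracle": 1.5,
    "cure": 1.0,
    "toxins": 0.8,
    "randomized": -1.2,
    "trial": -0.9,
}
BIAS = -0.5        # hypothetical intercept
THRESHOLD = 0.0    # hypothetical decision threshold


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score(text):
    """Linear score over the selected features only (binary presence)."""
    present = [t for t in set(tokenize(text)) if t in SELECTED_WEIGHTS]
    return BIAS + sum(SELECTED_WEIGHTS[t] for t in present)


def map_document(doc_id, text):
    """Map step: score one document on its own and emit (label, doc_id)."""
    label = "flagged" if score(text) > THRESHOLD else "not_flagged"
    yield label, doc_id


def reduce_labels(label, doc_ids):
    """Reduce step: collect the document ids emitted under each label."""
    return label, sorted(doc_ids)


def run_local(docs):
    """Simulate the shuffle between map and reduce in a single process."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_document(doc_id, text):
            groups[key].append(value)
    return dict(reduce_labels(k, v) for k, v in groups.items())


if __name__ == "__main__":
    docs = {
        "a": "Miracle cure removes toxins!",
        "b": "A randomized trial showed no benefit.",
    }
    print(run_local(docs))
```

Because each document is scored independently and only the selected features are looked up, the map step parallelizes trivially across machines and the per-document cost depends on the size of the reduced feature set rather than on the full vocabulary.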