Data mining and machine learning approaches for prediction modelling of disease vectors: Epidemic disease prediction modelling.

Author information

Fusco Terence, Bi Yaxin, Wang Haiying, Browne Fiona

Affiliation

Faculty of Computing and Engineering, University of Ulster, Newtownabbey, UK.

Publication information

Int J Mach Learn Cybern. 2020;11(6):1159-1178. doi: 10.1007/s13042-019-01029-x. Epub 2019 Nov 18.

DOI:10.1007/s13042-019-01029-x

PMID:33727985

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC7224118/

Abstract

This research presents viable solutions for prediction modelling of disease based on vector density. The novel training models proposed in this work aim to address various aspects of interest in the artificial intelligence applications domain. Topics discussed include data imputation, semi-supervised labelling and synthetic instance simulation when using sparse training data. Innovative semi-supervised ensemble learning paradigms are proposed, focusing on labelling threshold selection and the stringency of classification confidence levels. A regression-correlation combination (RCC) data imputation method is also introduced for handling partially complete training data. Results show that the proposed RCC improves data imputation precision over benchmark value replacement on 70% of test cases. Proposed novel incremental transductive models such as ITSVM provided interesting findings based on threshold constraints, outperforming standard SVM application on 21% of test cases, and can be applied to alternative environment-based epidemic disease domains. The proposed incremental transductive ensemble approach enables the combination of complementary algorithms to provide labelling for unlabelled vector density instances. Liberal (LTA) and strict training approaches provided varied results, with LTA outperforming the Stacking ensemble on 29.1% of test cases. The proposed novel synthetic minority over-sampling technique (SMOTE) equilibrium approach yielded subtle classification performance increases, which can be investigated further to assess how classification performance and efficiency relate to synthetic instance generation.
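
The abstract's idea of labelling unlabelled instances incrementally under a confidence threshold can be illustrated with a minimal sketch. This is not the authors' ITSVM or ensemble method: a simple nearest-centroid classifier stands in for the SVM, the confidence score (a softmax over negative distances) is an assumption, and all function names and the toy data are hypothetical. It only shows how a liberal versus a strict threshold changes how many unlabelled instances are absorbed into the training set.

```python
import math


def centroids(X, y):
    """Compute the mean feature vector for each class label."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict_with_confidence(cents, x):
    """Return (label, confidence) for one instance.

    Confidence is a softmax over negative Euclidean distances to the
    class centroids (an illustrative choice, not taken from the paper).
    """
    weights = {c: math.exp(-math.dist(x, m)) for c, m in cents.items()}
    total = sum(weights.values())
    label = max(weights, key=weights.get)
    return label, weights[label] / total


def incremental_self_training(X_lab, y_lab, X_unlab, threshold, max_rounds=10):
    """Repeatedly label unlabelled instances whose confidence meets the threshold.

    Returns the enlarged labelled set and the instances left unlabelled.
    """
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        cents = centroids(X_lab, y_lab)  # retrain on the current labelled set
        accepted, remaining = [], []
        for x in pool:
            label, conf = predict_with_confidence(cents, x)
            if conf >= threshold:
                accepted.append((x, label))
            else:
                remaining.append(x)
        if not accepted:  # nothing confident enough: stop early
            break
        for x, label in accepted:
            X_lab.append(x)
            y_lab.append(label)
        pool = remaining
    return X_lab, y_lab, pool


# Toy vector-density-style data (hypothetical values).
X_labelled = [(0.0, 0.0), (4.0, 4.0)]
y_labelled = ["low", "high"]
X_unlabelled = [(0.5, 0.5), (3.5, 3.5), (2.0, 2.0)]

liberal = incremental_self_training(X_labelled, y_labelled, X_unlabelled, threshold=0.6)
strict = incremental_self_training(X_labelled, y_labelled, X_unlabelled, threshold=0.99)
print("liberal: labelled", len(liberal[0]), "unlabelled left", len(liberal[2]))
print("strict:  labelled", len(strict[0]), "unlabelled left", len(strict[2]))
```

In this toy setting the lower threshold absorbs the two instances near the existing classes and leaves the ambiguous midpoint unlabelled, while the higher threshold accepts none. That mirrors, in a very simplified way, the trade-off between liberal and strict labelling that the abstract describes.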
