Department of Preventive Medicine, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea.
Department of Data and HPC Science, University of Science and Technology, Daejeon 34113, Korea.
Int J Environ Res Public Health. 2020 Dec 17;17(24):9467. doi: 10.3390/ijerph17249467.
Collecting valid information from electronic sources to detect potential outbreaks of infectious disease is time-consuming and labor-intensive. Automated identification of relevant information using machine learning is necessary to respond to a potential disease outbreak. A total of 2864 documents were collected from various websites and then manually categorized and labeled by two reviewers. Labels for the training and test data were assigned based on reviewer consensus. Two machine learning algorithms, ConvNet and bidirectional long short-term memory (BiLSTM), and two classification methods, DocClass and SenClass, were used to classify the documents. Precision, recall, F1, accuracy, and area under the curve were measured to evaluate the performance of each model. With DocClass, ConvNet yielded higher average, minimum, and maximum accuracies (87.6%, 85.2%, and 91.1%, respectively) than BiLSTM, whereas with SenClass, BiLSTM performed better than ConvNet, with average, minimum, and maximum accuracies of 92.8%, 92.6%, and 93.3%, respectively. BiLSTM with SenClass yielded an overall accuracy of 92.9% in classifying infectious disease occurrences. Given a particular text extraction system, machine learning achieved performance comparable to that of a human expert. This study suggests that analyzing information from websites using machine learning can achieve high accuracy when abundant articles/documents are available.
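The evaluation metrics named in the abstract (precision, recall, F1, and accuracy) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `binary_metrics`, the binary label encoding (1 = document reports an infectious disease occurrence, 0 = it does not), and the toy labels are assumptions made purely for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, F1, and accuracy for binary labels.

    Assumed encoding (hypothetical, not from the paper): 1 = relevant, 0 = not relevant.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels invented for illustration only; they are not data from the study.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting average, minimum, and maximum accuracy, as the abstract does, would then amount to calling such a function once per run or data split and summarizing the resulting accuracies.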