Invasion Science & Wildlife Ecology Lab, University of Adelaide, Adelaide, SA, Australia.
School of Mathematical Sciences, University of Adelaide, Adelaide, SA, Australia.
PLoS One. 2021 Jul 9;16(7):e0254007. doi: 10.1371/journal.pone.0254007. eCollection 2021.
Automated monitoring of websites that trade wildlife is increasingly necessary to inform conservation and biosecurity efforts. However, e-commerce and wildlife trading websites can contain a vast number of advertisements, an unknown proportion of which may be irrelevant to researchers and practitioners. Because many wildlife-trade advertisements are written as unstructured text, automated identification of relevant listings has traditionally been neither possible nor attempted. Other scientific disciplines have solved similar problems using machine learning and natural language processing models, such as text classifiers. Here, we test the ability of a suite of text classifiers to extract relevant advertisements from wildlife trade occurring on the Internet. We collected data from an Australian classifieds website where people can post advertisements for their pet birds (n = 16.5k advertisements). We found that text classifiers can predict, with a high degree of accuracy, which listings are relevant (ROC AUC ≥ 0.98, F1 score ≥ 0.77). Furthermore, to address the question 'how much data is required to have an adequately performing model?', we conducted a sensitivity analysis by simulating decreases in sample size and measuring the subsequent change in model performance. From this sensitivity analysis, we found that text classifiers required at least 33% of the full dataset (c. 5.5k listings) to accurately identify relevant listings (for our dataset), providing a reference point for future applications of this sort. Our results suggest that text classification is a viable tool that can be applied to the online trade of wildlife to reduce time dedicated to data cleaning. However, the success of text classifiers will vary depending on the advertisements and websites, and will therefore be context dependent.
Further work to integrate other machine learning tools, such as image classification, may provide better predictive abilities in the context of streamlining data processing for wildlife trade related online data.
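As a rough illustration of the workflow described in the abstract (train a text classifier on labelled listings, score it with F1, then shrink the training set to see how performance responds), a minimal sketch follows. It is not the authors' pipeline: the abstract does not say which classifiers were in the suite, so a multinomial naive Bayes model with Laplace smoothing is used here as a stand-in, and the listings, labels, and names such as `NaiveBayes` and `sensitivity_analysis` are invented for the example.

```python
import math
import random
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real listings need more care)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes for binary labels (1 = relevant, 0 = irrelevant)."""

    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for c in (0, 1):
            n_words = sum(self.word_counts[c].values())
            # Smoothed class prior, so a class missing from a small subsample is still defined.
            score = math.log((self.class_counts[c] + 1) / (total + 2))
            for tok in tokenize(text):
                # Laplace smoothing; the extra +1 reserves mass for unseen tokens.
                score += math.log(
                    (self.word_counts[c][tok] + 1) / (n_words + len(self.vocab) + 1)
                )
            scores[c] = score
        return max(scores, key=scores.get)


def f1_score(y_true, y_pred):
    """F1 for the positive (relevant) class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sensitivity_analysis(train, test, fractions, seed=0):
    """Refit on random subsamples of the training data and record F1 on a fixed test set."""
    rng = random.Random(seed)
    results = {}
    for frac in fractions:
        k = max(1, round(frac * len(train)))
        subset = rng.sample(train, k)
        model = NaiveBayes().fit([t for t, _ in subset], [y for _, y in subset])
        preds = [model.predict(t) for t, _ in test]
        results[frac] = f1_score([y for _, y in test], preds)
    return results


# Invented toy listings: 1 = a live bird is advertised, 0 = anything else.
train = [
    ("hand raised cockatiel for sale", 1),
    ("budgie pair for sale", 1),
    ("young galah for sale", 1),
    ("finches breeding pair available", 1),
    ("large cage with stand", 0),
    ("aviary and nest boxes wanted", 0),
    ("seed mix bulk bags", 0),
    ("lost pet reward offered", 0),
]
test = [("cockatiel pair for sale", 1), ("cage and seed mix", 0)]

results = sensitivity_analysis(train, test, fractions=[0.25, 0.5, 1.0])
for frac, score in results.items():
    print(f"training fraction {frac:.2f}: F1 = {score:.2f}")
```

With a real dataset, the same loop would be repeated over many random subsamples per fraction and summarised, since a single draw at each size is noisy. The abstract's reference point of roughly 33% (c. 5.5k of 16.5k listings) is specific to that dataset and should not be expected to transfer unchanged.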