Department of Computer and Information Science, University of Delaware, Newark, Delaware, United States of America.
Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America.
PLoS One. 2019 Jul 30;14(7):e0216913. doi: 10.1371/journal.pone.0216913. eCollection 2019.
Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require large amounts of annotated training data, while only small labeled datasets are available for many natural language processing tasks in the biomedical literature. Building large datasets for deep learning is expensive because it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained using distant supervision. Because data obtained by distant supervision is often noisy, we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
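To make the idea of distant supervision followed by heuristic filtering concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used in this work: the knowledge base `KNOWN_PAIRS`, the relation type, the helper names, and the token-distance heuristic with its `max_gap` threshold are all hypothetical and chosen only for demonstration.

```python
from typing import List, NamedTuple, Set, FrozenSet


class Example(NamedTuple):
    sentence: str
    entity_a: str
    entity_b: str
    label: int  # 1 = pair is listed in the knowledge base, 0 = not listed


# Hypothetical knowledge base of entity pairs assumed to be related.
KNOWN_PAIRS: Set[FrozenSet[str]] = {frozenset(("BRCA1", "BARD1"))}


def distant_label(sentence: str, entities: List[str],
                  known_pairs: Set[FrozenSet[str]]) -> List[Example]:
    """Label each entity pair co-occurring in the sentence using the knowledge base."""
    present = [e for e in entities if e in sentence]
    examples = []
    for i, a in enumerate(present):
        for b in present[i + 1:]:
            label = 1 if frozenset((a, b)) in known_pairs else 0
            examples.append(Example(sentence, a, b, label))
    return examples


def token_gap(example: Example) -> int:
    """Distance in whitespace tokens between the first mentions of the two entities."""
    tokens = example.sentence.split()
    pos_a = next(i for i, t in enumerate(tokens) if example.entity_a in t)
    pos_b = next(i for i, t in enumerate(tokens) if example.entity_b in t)
    return abs(pos_a - pos_b)


def filter_noisy(examples: List[Example], max_gap: int = 10) -> List[Example]:
    """Assumed heuristic: drop positive examples whose entities are far apart,
    since such sentences are less likely to actually express the relation."""
    return [e for e in examples if e.label == 0 or token_gap(e) <= max_gap]


close = "BRCA1 binds BARD1 in vitro ."
far = ("BRCA1 was studied in many cohorts across several countries "
       "and years while separately BARD1 was measured .")

labeled = []
for s in (close, far):
    labeled.extend(distant_label(s, ["BRCA1", "BARD1"], KNOWN_PAIRS))

filtered = filter_noisy(labeled)
```

In a transfer-learning-style setup, a filtered set like `filtered` could be used to train a model first, with the smaller manually annotated set reserved for a later fine-tuning stage; the order and details of such stages are an assumption here and are not specified by the abstract.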