Deep Multilabel Multilingual Document Learning for Cross-Lingual Document Retrieval.

Author information

Feng Kai, Huang Lan, Xu Hao, Wang Kangping, Wei Wei, Zhang Rui

Affiliations

College of Computer Science and Technology, Jilin University, Changchun 130012, China.

School of International Economics and Trade, Changchun University of Finance and Economics, Changchun 130012, China.

Publication information

Entropy (Basel). 2022 Jul 7;24(7):943. doi: 10.3390/e24070943.

DOI: 10.3390/e24070943
PMID: 35885166
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9318374/
Abstract

Cross-lingual document retrieval, which aims to take a query in one language to retrieve relevant documents in another, has attracted strong research interest in the last decades. Most studies on this task start with cross-lingual comparisons at the word level and then represent documents via word embeddings, which leads to insufficient structure information. In this work, the cross-lingual comparison at the document level is achieved through the cross-lingual semantic space. Our method, MDL (deep multilabel multilingual document learning), leverages a six-layer fully connected network to project cross-lingual documents into a shared semantic space. The semantic distances can be calculated when the cross-lingual documents are transformed into embeddings in semantic space. The supervision signals are automatically extracted from the data and then used to construct the semantic space via a linear classifier. The ambiguity of manual labels could be avoided and the multilabel supervision signals can be acquired instead of a single label. The representation of the semantic space is enriched by multilabel supervision signals, which improves the discriminative ability of the embeddings. The MDL is easy to extend to other fields since it does not depend on specific data. Furthermore, MDL is more efficient than the models training all languages jointly, since each language is trained individually. Experiments on Wikipedia data showed that the proposed method outperforms the state-of-the-art cross-lingual document retrieval methods.

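The retrieval mechanism the abstract describes can be sketched in a few lines: a per-language fully connected network projects each document vector into a shared semantic space, and ranking is done by semantic distance there. The sketch below is illustrative only; the layer widths, the 6-layer depth, the input dimensionality, and the random (untrained) weights are stand-in assumptions, not the authors' trained MDL parameters, and the multilabel training step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(dims):
    # Random He-initialized weights stand in for trained MDL parameters.
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def project(x, layers):
    # Fully connected layers with ReLU on hidden layers, mapping a
    # document vector into the shared semantic space, then L2-normalized
    # so that dot product equals cosine similarity.
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9)

# One projection network per language (MDL trains each language
# individually rather than all languages jointly).
dims = [300, 256, 256, 256, 256, 256, 128]  # 6 weight layers -> 128-d space
net_en = make_mlp(dims)
net_de = make_mlp(dims)

# Toy document vectors (e.g. averaged word embeddings per document).
query_en = rng.standard_normal(300)          # English query document
docs_de = rng.standard_normal((5, 300))      # candidate German documents

q = project(query_en, net_en)
D = project(docs_de, net_de)

# Cosine similarity in the shared space ranks the candidates.
scores = D @ q
ranking = np.argsort(-scores)
print(ranking)
```

In the paper's setting the two networks would be trained so that documents sharing the automatically extracted multilabel supervision signals land close together in the shared space; with random weights, as here, the ranking is of course meaningless and only the mechanics are shown.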

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/883c/9318374/f9da8855dced/entropy-24-00943-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/883c/9318374/eda47d903ffc/entropy-24-00943-g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/883c/9318374/dfb4590bd6c5/entropy-24-00943-g003.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/883c/9318374/a2c0d2add3f6/entropy-24-00943-g004.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/883c/9318374/aa17f251905a/entropy-24-00943-g005.jpg

Similar articles

1
Deep Multilabel Multilingual Document Learning for Cross-Lingual Document Retrieval.
Entropy (Basel). 2022 Jul 7;24(7):943. doi: 10.3390/e24070943.
2
On cross-lingual retrieval with multilingual text encoders.
Inf Retr Boston. 2022;25(2):149-183. doi: 10.1007/s10791-022-09406-x. Epub 2022 Mar 7.
3
Leveraging Wikipedia knowledge to classify multilingual biomedical documents.
Artif Intell Med. 2018 Jun;88:37-57. doi: 10.1016/j.artmed.2018.04.007. Epub 2018 May 3.
4
Multi-level multilingual semantic alignment for zero-shot cross-lingual transfer learning.
Neural Netw. 2024 May;173:106217. doi: 10.1016/j.neunet.2024.106217. Epub 2024 Feb 27.
5
A Bag of Concepts Approach for Biomedical Document Classification Using Wikipedia Knowledge*. Spanish-English Cross-language Case Study.
Methods Inf Med. 2017 Oct 26;56(5):370-376. doi: 10.3414/ME17-01-0028. Epub 2017 Aug 16.
6
Multilabel Prediction via Cross-View Search.
IEEE Trans Neural Netw Learn Syst. 2018 Sep;29(9):4324-4338. doi: 10.1109/TNNLS.2017.2763967. Epub 2017 Nov 7.
7
A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
8
HC²L: Hybrid and Cooperative Contrastive Learning for Cross-Lingual Spoken Language Understanding.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8094-8105. doi: 10.1109/TPAMI.2024.3402746. Epub 2024 Nov 6.
9
CODER: Knowledge-infused cross-lingual medical term embedding for term normalization.
J Biomed Inform. 2022 Feb;126:103983. doi: 10.1016/j.jbi.2021.103983. Epub 2022 Jan 4.
10
A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification.
JMIR Med Inform. 2023 Oct 5;11:e44892. doi: 10.2196/44892.

References cited in this article

1
On cross-lingual retrieval with multilingual text encoders.
Inf Retr Boston. 2022;25(2):149-183. doi: 10.1007/s10791-022-09406-x. Epub 2022 Mar 7.