Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets.

Authors

Hidayatullah Ahmad Fathan, Apong Rosyzie Anna, Lai Daphne T C, Qazi Atika

Affiliations

School of Digital Science, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei.

Department of Informatics, Universitas Islam Indonesia, Sleman, Yogyakarta, Indonesia.

Publication information

PeerJ Comput Sci. 2023 Jun 22;9:e1312. doi: 10.7717/peerj-cs.1312. eCollection 2023.

DOI: 10.7717/peerj-cs.1312
PMID: 37409088
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10319257/

Abstract

With the massive use of social media today, mixing between languages in social media text is prevalent. In linguistics, the phenomenon of mixing languages is known as code-mixing. The prevalence of code-mixing exposes various concerns and challenges in natural language processing (NLP), including language identification (LID) tasks. This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we provide full details of the data collection and annotation standards construction procedures. Some challenges encountered during corpus creation are also discussed in this paper. Then, we investigate several strategies for developing code-mixed language identification models, such as fine-tuning BERT, BLSTM-based, and CRF. Our results show that fine-tuned IndoBERTweet models can identify languages better than the other techniques. This is the result of BERT's ability to understand each word's context from the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts.
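
The abstract describes word-level labels predicted by a model that operates on sub-word units. A minimal sketch of how such an alignment is commonly handled follows. Everything in it is an assumption for illustration: the tag set (ID, JV, EN), the toy fixed-length splitter standing in for a real sub-word tokenizer, the helper names, the example words, and the "label the first sub-word, ignore the rest" convention are not taken from the paper, and the authors may have used a different scheme.

```python
from typing import Dict, List, Tuple

# Conventional "skip this position" label id used in many token-classification setups.
IGNORE = -100


def toy_subword_split(word: str, piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a real sub-word tokenizer.

    Splits a word into fixed-size chunks and marks continuation
    pieces with '##', mimicking WordPiece-style output.
    """
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)] or [word]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(
    words: List[str], labels: List[str], label2id: Dict[str, int]
) -> Tuple[List[str], List[int], List[int]]:
    """Expand word-level labels to sub-word positions.

    Only the first sub-word of each word carries the word's label;
    continuation pieces get IGNORE so a loss function can skip them.
    """
    tokens: List[str] = []
    label_ids: List[int] = []
    word_index: List[int] = []
    for w_idx, (word, label) in enumerate(zip(words, labels)):
        for p_idx, piece in enumerate(toy_subword_split(word)):
            tokens.append(piece)
            word_index.append(w_idx)
            label_ids.append(label2id[label] if p_idx == 0 else IGNORE)
    return tokens, label_ids, word_index


def merge_predictions(
    pred_ids: List[int], word_index: List[int], id2label: Dict[int, str]
) -> List[str]:
    """Collapse sub-word predictions back to one label per word.

    Takes the prediction at each word's first sub-word and discards
    predictions at continuation pieces.
    """
    word_labels: List[str] = []
    seen = set()
    for pred, w_idx in zip(pred_ids, word_index):
        if w_idx not in seen:
            seen.add(w_idx)
            word_labels.append(id2label[pred])
    return word_labels


# Made-up code-mixed example; the language tags are illustrative only.
label2id = {"ID": 0, "JV": 1, "EN": 2}
id2label = {v: k for k, v in label2id.items()}
words = ["aku", "wis", "submit", "tugas"]
labels = ["ID", "JV", "EN", "ID"]

tokens, label_ids, word_index = align_labels(words, labels, label2id)
print(tokens)
print(label_ids)

# Pretend model output: one predicted label id per sub-word position.
fake_predictions = [0, 1, 2, 0, 0, 1]
print(merge_predictions(fake_predictions, word_index, id2label))
```

With a real tokenizer, the `word_index` list would typically come from the tokenizer's own word-to-token mapping instead of being built by hand; the aggregation step stays the same.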

Similar articles

1
Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets.
PeerJ Comput Sci. 2023 Jun 22;9:e1312. doi: 10.7717/peerj-cs.1312. eCollection 2023.
2
Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT.
PeerJ Comput Sci. 2023 Oct 18;9:e1617. doi: 10.7717/peerj-cs.1617. eCollection 2023.
3
Deep learning based sentiment analysis and offensive language identification on multilingual code-mixed data.
Sci Rep. 2022 Dec 13;12(1):21557. doi: 10.1038/s41598-022-26092-3.
4
Building lexicon-based sentiment analysis model for low-resource languages.
MethodsX. 2023 Oct 22;11:102460. doi: 10.1016/j.mex.2023.102460. eCollection 2023 Dec.
5
Unlocking the Secrets Behind Advanced Artificial Intelligence Language Models in Deidentifying Chinese-English Mixed Clinical Text: Development and Validation Study.
J Med Internet Res. 2024 Jan 25;26:e48443. doi: 10.2196/48443.
6
Interdisciplinary Approach to Identify and Characterize COVID-19 Misinformation on Twitter: Mixed Methods Study.
JMIR Form Res. 2023 Jun 28;7:e41134. doi: 10.2196/41134.
7
Sentiment analysis in tweets: an assessment study from classical to modern word representation models.
Data Min Knowl Discov. 2023;37(1):318-380. doi: 10.1007/s10618-022-00853-0. Epub 2022 Nov 15.
8
An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian.
Sensors (Basel). 2020 Dec 28;21(1):133. doi: 10.3390/s21010133.
9
Harnessing Indigenous Tweets: The Reo Māori Twitter corpus.
Lang Resour Eval. 2022;56(4):1229-1268. doi: 10.1007/s10579-022-09580-w. Epub 2022 Feb 14.
10
Protected Health Information Recognition of Unstructured Code-Mixed Electronic Health Records in Taiwan.
Stud Health Technol Inform. 2022 Jun 6;290:627-631. doi: 10.3233/SHTI220153.

Cited by

1
Special issue on analysis and mining of social media data.
PeerJ Comput Sci. 2024 Feb 29;10:e1909. doi: 10.7717/peerj-cs.1909. eCollection 2024.