Similar articles

1. A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution.
   Data Brief. 2024 Jan 8;52:110034. doi: 10.1016/j.dib.2024.110034. eCollection 2024 Feb.
2. Oral diadochokinetic rates across languages: Multilingual speakers comparison.
   Int J Lang Commun Disord. 2021 Sep;56(5):1026-1036. doi: 10.1111/1460-6984.12653. Epub 2021 Jul 31.
3. Dataset from Code-switching between English and Malay Languages in Malaysian Premier Polytechnics ESL Classrooms.
   Data Brief. 2022 Oct 29;45:108709. doi: 10.1016/j.dib.2022.108709. eCollection 2022 Dec.
4. Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets.
   PeerJ Comput Sci. 2023 Jun 22;9:e1312. doi: 10.7717/peerj-cs.1312. eCollection 2023.
5. COVID-19 and cyberbullying: deep ensemble model to identify cyberbullying from code-switched languages during the pandemic.
   Multimed Tools Appl. 2023;82(6):8773-8789. doi: 10.1007/s11042-021-11601-9. Epub 2022 Jan 8.
6. An open-source dataset for arabic fine-grained emotion recognition of online content amid COVID-19 pandemic.
   Data Brief. 2023 Oct 31;51:109745. doi: 10.1016/j.dib.2023.109745. eCollection 2023 Dec.
7. Authorship attribution of source code by using back propagation neural network based on particle swarm optimization.
   PLoS One. 2017 Nov 2;12(11):e0187204. doi: 10.1371/journal.pone.0187204. eCollection 2017.
8. BTSD: A curated transformation of sentence dataset for text classification in Bangla language.
   Data Brief. 2023 Jul 24;50:109445. doi: 10.1016/j.dib.2023.109445. eCollection 2023 Oct.
9. Translation, adaptation and validation of two versions of the Chronic Liver Disease Questionnaire in Malaysian patients for speakers of both English and Malay languages: a cross-sectional study.
   BMJ Open. 2017 May 25;7(5):e013873. doi: 10.1136/bmjopen-2016-013873.
10. Psychometric performance assessment of Malay and Malaysian English version of EQ-5D-5L in the Malaysian population.
    Qual Life Res. 2019 Jan;28(1):153-162. doi: 10.1007/s11136-018-2027-9. Epub 2018 Oct 13.

A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution.

Authors

Maskat Ruhaila, Azman Norazmiera Ayunie, Nulizairos Nur Shaheera Shastera, Zahidin Nurul Athirah, Mahadi Adibah Humairah, Norshamsul Siti Rubaya, Sharif Mohd Mukhlis Mohd, Mahdin Hairulnizam

Affiliations

College of Computing, Informatics and Mathematics of Universiti Teknologi MARA Shah Alam, 40450, Selangor, Malaysia.

Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Johor, Malaysia.

Publication information

Data Brief. 2024 Jan 8;52:110034. doi: 10.1016/j.dib.2024.110034. eCollection 2024 Feb.

DOI:10.1016/j.dib.2024.110034
PMID:38282916
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10820639/
Abstract

Low-resource languages, like Malay, face the threat of extinction when linguistic resources become scarce. This paper addresses the scarcity issue by contributing to the inventory of low-resource languages, specifically focusing on Malay-English code-switching, known as Manglish. Manglish speakers are primarily located in Malaysia, Indonesia, Brunei, and Singapore. As global adoption of second languages and social media usage increases, language code-switching, such as Spanglish and Chinglish, becomes more prevalent. To enhance the status of the Malay language and support its transition out of the low-resource category, this unique text corpus, with binary annotations for biological gender and anonymized author identities, is presented. This bi-annotated dataset offers valuable applications in various fields, including the investigation of cyberbullying, combating gender bias, and providing targeted recommendations for gender-specific products. The corpus can be used with either annotation or with both combined. The dataset comprises posts from 50 Malaysian public figures, equally split between biological males and females. It contains a total of 709,012 raw posts from X (formerly Twitter), with a relatively balanced distribution of 53.72% from biological female authors and 46.28% from biological male authors. The Twitter API was used to scrape the posts. After pre-processing, the total fell to 650,409 posts, widening the gap between the genders to 56.88% from biological female authors and 43.12% from biological male authors. This dataset is a valuable resource for researchers in the field of Malay-English code-switching Natural Language Processing (NLP) and can be used to train or enhance existing and future Manglish language transformers.
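The figures reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, and the per-gender post counts it derives are approximations reconstructed from the rounded percentages in the abstract, not values stated by the authors.

```python
# Totals and percentages as reported in the abstract.
RAW_TOTAL = 709_012      # raw posts collected
CLEAN_TOTAL = 650_409    # posts remaining after pre-processing
RAW_FEMALE_PCT, RAW_MALE_PCT = 53.72, 46.28
CLEAN_FEMALE_PCT, CLEAN_MALE_PCT = 56.88, 43.12


def implied_count(total: int, pct: float) -> int:
    """Approximate post count implied by a rounded percentage (assumption)."""
    return round(total * pct / 100)


def removed_posts(raw: int, clean: int) -> int:
    """Number of posts dropped during pre-processing."""
    return raw - clean


def gap_points(female_pct: float, male_pct: float) -> float:
    """Female-minus-male share, in percentage points."""
    return round(female_pct - male_pct, 2)


if __name__ == "__main__":
    dropped = removed_posts(RAW_TOTAL, CLEAN_TOTAL)
    print("posts removed:", dropped)
    print("share removed (%):", round(100 * dropped / RAW_TOTAL, 2))
    print("implied raw female posts:", implied_count(RAW_TOTAL, RAW_FEMALE_PCT))
    print("implied clean female posts:", implied_count(CLEAN_TOTAL, CLEAN_FEMALE_PCT))
    print("raw gap (points):", gap_points(RAW_FEMALE_PCT, RAW_MALE_PCT))
    print("clean gap (points):", gap_points(CLEAN_FEMALE_PCT, CLEAN_MALE_PCT))
```

Under these assumptions, pre-processing removes 58,603 posts, and the female-male gap grows from 7.44 to 13.76 percentage points, which matches the abstract's statement that the gap widens.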
