Maskat Ruhaila, Azman Norazmiera Ayunie, Nulizairos Nur Shaheera Shastera, Zahidin Nurul Athirah, Mahadi Adibah Humairah, Norshamsul Siti Rubaya, Sharif Mohd Mukhlis Mohd, Mahdin Hairulnizam
College of Computing, Informatics and Mathematics of Universiti Teknologi MARA Shah Alam, 40450, Selangor, Malaysia.
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Johor, Malaysia.
Data Brief. 2024 Jan 8;52:110034. doi: 10.1016/j.dib.2024.110034. eCollection 2024 Feb.
Low-resource languages such as Malay face the threat of extinction when linguistic resources become scarce. This paper addresses that scarcity by contributing to the inventory of resources for low-resource languages, focusing specifically on Malay-English code-switching, known as Manglish. Manglish speakers are primarily located in Malaysia, Indonesia, Brunei, and Singapore. As global adoption of second languages and social media usage increase, language code-switching, as seen in Spanglish and Chinglish, becomes more prevalent. To enhance the status of the Malay language and support its transition out of the low-resource category, this text corpus is presented with two annotations: binary biological gender and anonymized author identity. This bi-annotated dataset has applications in various fields, including the investigation of cyberbullying, combating gender bias, and providing targeted recommendations for gender-specific products. The corpus can be used with either annotation or with both combined. The dataset comprises posts from 50 Malaysian public figures, equally split between biological males and females. It contains a total of 709,012 raw X (formerly Twitter) posts, with a relatively balanced distribution of 53.72% from biological female authors and 46.28% from biological male authors. The Twitter API was used to scrape the posts. After pre-processing, the total fell to 650,409 posts, widening the gap between the genders to 56.88% from biological female authors and 43.12% from biological male authors. This dataset is a valuable resource for researchers working on Natural Language Processing (NLP) for Malay-English code-switching and can be used to train or enhance existing and future Manglish transformer language models.
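The reported totals and gender shares can be turned into approximate per-gender post counts. The following is a minimal sketch that uses only the figures quoted in the abstract; because the percentages are rounded to two decimals, the derived counts are estimates rather than values taken from the dataset, and the function and variable names are illustrative.

```python
# Figures quoted in the abstract. Percentages are rounded to two decimals,
# so every derived count below is an approximation, not a dataset value.
RAW_TOTAL = 709_012
CLEAN_TOTAL = 650_409
RAW_FEMALE_PCT = 53.72
CLEAN_FEMALE_PCT = 56.88


def approx_counts(total, female_pct):
    """Return approximate (female, male) post counts for a total and a female share."""
    female = round(total * female_pct / 100)
    return female, total - female


raw_f, raw_m = approx_counts(RAW_TOTAL, RAW_FEMALE_PCT)
clean_f, clean_m = approx_counts(CLEAN_TOTAL, CLEAN_FEMALE_PCT)

# Posts dropped by pre-processing, in absolute terms and as a share of the raw set.
removed = RAW_TOTAL - CLEAN_TOTAL
removed_pct = 100 * removed / RAW_TOTAL

# Female-minus-male gap in percentage points, before and after pre-processing.
raw_gap = RAW_FEMALE_PCT - (100 - RAW_FEMALE_PCT)
clean_gap = CLEAN_FEMALE_PCT - (100 - CLEAN_FEMALE_PCT)

print(f"raw:   ~{raw_f:,} female, ~{raw_m:,} male")
print(f"clean: ~{clean_f:,} female, ~{clean_m:,} male")
print(f"removed: {removed:,} posts ({removed_pct:.2f}% of raw)")
print(f"gap: {raw_gap:.2f} -> {clean_gap:.2f} percentage points")
```

Under these assumptions, the estimates suggest that most of the posts removed during pre-processing came from male authors, which is consistent with the widening gap described in the abstract.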