
Native language identification from text using a fine-tuned GPT-2 model.

Author Information

Nie Yuzhe

Affiliation

School of Foreign Languages, Shanghai University, Shanghai, China.

Publication Information

PeerJ Comput Sci. 2025 May 28;11:e2909. doi: 10.7717/peerj-cs.2909. eCollection 2025.

DOI: 10.7717/peerj-cs.2909
PMID: 40567659
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12192634/
Abstract

Native language identification (NLI) is a critical task in computational linguistics, supporting applications such as personalized language learning, forensic analysis, and machine translation. This study investigates the use of a fine-tuned GPT-2 model to enhance NLI accuracy. Using the NLI-PT dataset, we preprocess and fine-tune GPT-2 to classify the native language of learners based on their Portuguese-written texts. Our approach leverages deep learning techniques, including tokenization, embedding extraction, and multi-layer transformer-based classification. Experimental results show that our fine-tuned GPT-2 model significantly outperforms traditional machine learning methods (e.g., SVM, Random Forest) and other pre-trained language models (e.g., BERT, RoBERTa, BioBERT), achieving a weighted F1 score of 0.9419 and an accuracy of 94.65%. These results show that large transformer models work well for native language identification and can help guide future research in personalized language tools and artificial intelligence (AI)-based education.
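The pipeline the abstract describes (tokenization, embedding extraction, and a transformer-based classification head on top of GPT-2) corresponds to the standard sequence-classification setup in the Hugging Face transformers library. Below is a minimal sketch of that setup, assuming transformers and PyTorch; the label set, learning rate, and sequence length are illustrative assumptions, not the authors' actual configuration.

```python
import torch
from torch.optim import AdamW
from transformers import GPT2TokenizerFast, GPT2ForSequenceClassification

# Hypothetical L1 label set: NLI-PT covers several source languages,
# but this exact list is an assumption for illustration.
LABELS = ["Chinese", "English", "German", "Italian", "Spanish"]

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

model = GPT2ForSequenceClassification.from_pretrained(
    "gpt2", num_labels=len(LABELS))
# The classification head reads the last non-padding token, so the
# model must know which id marks padding in batched inputs.
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = AdamW(model.parameters(), lr=2e-5)  # illustrative learning rate

def train_step(texts, label_ids):
    """One fine-tuning step on a batch of learner-written texts."""
    model.train()
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(label_ids))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

def predict(texts):
    """Classify the native language of each text."""
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return [LABELS[i] for i in logits.argmax(dim=-1).tolist()]
```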

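On the reported metrics: accuracy is the fraction of texts classified correctly, while weighted F1 averages per-class F1 scores with weights proportional to each class's support, which keeps the score meaningful under class imbalance. A short sketch of how such figures are conventionally computed with scikit-learn, on hypothetical labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels and model predictions, for illustration only.
y_true = ["Chinese", "Spanish", "Italian", "Spanish", "German"]
y_pred = ["Chinese", "Spanish", "Italian", "German", "German"]

accuracy = accuracy_score(y_true, y_pred)
# Weighted F1: per-class F1 averaged with weights equal to each class's
# number of true instances (its "support").
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"accuracy={accuracy:.4f}, weighted F1={weighted_f1:.4f}")
```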

Figures
Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d429/12192634/107ed95fbdbc/peerj-cs-11-2909-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d429/12192634/e737df321577/peerj-cs-11-2909-g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d429/12192634/2910af771e5e/peerj-cs-11-2909-g003.jpg

Similar Articles

1. Native language identification from text using a fine-tuned GPT-2 model.
PeerJ Comput Sci. 2025 May 28;11:e2909. doi: 10.7717/peerj-cs.2909. eCollection 2025.

2. Sentiment Analysis Using a Large Language Model-Based Approach to Detect Opioids Mixed With Other Substances Via Social Media: Method Development and Validation.
JMIR Infodemiology. 2025 Jun 19;5:e70525. doi: 10.2196/70525.

3. A deep learning approach to direct immunofluorescence pattern recognition in autoimmune bullous diseases.
Br J Dermatol. 2024 Jul 16;191(2):261-266. doi: 10.1093/bjd/ljae142.

4. The geometry of meaning: evaluating sentence embeddings from diverse transformer-based models for natural language inference.
PeerJ Comput Sci. 2025 Jun 16;11:e2957. doi: 10.7717/peerj-cs.2957. eCollection 2025.

5. Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report.
J Med Internet Res. 2025 Jun 11;27:e72638. doi: 10.2196/72638.

6. Large Language Model Architectures in Health Care: Scoping Review of Research Perspectives.
J Med Internet Res. 2025 Jun 19;27:e70315. doi: 10.2196/70315.

7. Trajectory-Ordered Objectives for Self-Supervised Representation Learning of Temporal Healthcare Data Using Transformers: Model Development and Evaluation Study.
JMIR Med Inform. 2025 Jun 4;13:e68138. doi: 10.2196/68138.

8. Enhancing Relation Extraction for COVID-19 Vaccine Shot-Adverse Event Associations with Large Language Models.
Res Sq. 2025 Mar 17:rs.3.rs-6201919. doi: 10.21203/rs.3.rs-6201919/v1.

9. The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study.
Lancet Digit Health. 2025 Jan;7(1):e35-e43. doi: 10.1016/S2589-7500(24)00246-2.

10. Using Traditional and Deep Machine Learning to Predict Emergency Room Triage Levels.
J Comput Biol. 2025 Jun;32(6):584-600. doi: 10.1089/cmb.2024.0632. Epub 2025 May 22.

References Cited in This Article

1. From SMILES to Enhanced Molecular Property Prediction: A Unified Multimodal Framework with Predicted 3D Conformers and Contrastive Learning Techniques.
J Chem Inf Model. 2024 Dec 23;64(24):9173-9195. doi: 10.1021/acs.jcim.4c01240. Epub 2024 Dec 6.

2. Predicting Antimalarial Activity in Natural Products Using Pretrained Bidirectional Encoder Representations from Transformers.
J Chem Inf Model. 2022 Nov 14;62(21):5050-5058. doi: 10.1021/acs.jcim.1c00584. Epub 2021 Aug 16.

3. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.

4. Linguistic Predictors of Cultural Identification in Bilinguals.
Appl Linguist. 2017 Aug 1;38(4):463-488. doi: 10.1093/applin/amv049. Epub 2015 Oct 31.