Dataset for legal question answering system in the Indian judiciary context.

Author information

K Veningston, Mishra Apratim

Affiliation

Department of Computer Science and Engineering, National Institute of Technology Srinagar, Jammu and Kashmir 190006, India.

Publication information

Data Brief. 2025 May 12;60:111647. doi: 10.1016/j.dib.2025.111647. eCollection 2025 Jun.

Abstract

Legal documents, such as court judgments and statutes, are vital for understanding judicial decisions, legal principles, and procedural details. However, these documents are often dense, complex, and voluminous, making it difficult for lawyers, researchers, and citizens to locate and retrieve relevant information quickly. The Legal Question Answering (LQA) [1] task involves developing systems that can automatically answer legal questions based on relevant legal documents centred on the and , preferably from delivered judgments that are considered public property of the nation. The need for specialized LQA datasets is particularly pressing in countries like India, where legal texts follow distinct judicial structures, specialized terminology, and procedural intricacies [2]. Given the lack of a relevant dataset for an LQA system [3], this paper presents a comprehensive dataset designed for LQA in the Indian judiciary context that facilitates efficient legal information retrieval. The dataset, available on Mendeley Data [4], comprises 10,000 question-answer (QA) pairs derived from 1,256 Indian Supreme Court judgments (538 criminal and 718 civil cases) across various legal domains. Each QA pair is derived from detailed judgments of the apex court (i.e., the Supreme Court of India), with questions framed to capture essential legal issues, principles, or facts, and answers extracted directly from the text. The dataset covers a balanced mix of legal topics in criminal and civil law, such as constitutional matters, property disputes, criminal offences, procedural matters, family disputes, employment matters, financial and taxation issues, and public welfare concerns. It also includes metadata such as case name and judgment date. The dataset supports the development of AI-driven LQA systems that improve access to precise legal information and help legal professionals and ordinary citizens navigate India's complex legal system.
To evaluate the dataset's effectiveness for legal question answering, the "meta-llama/Llama-2-7b-hf" model [5] is fine-tuned on the IndicLegalQA dataset using Parameter-Efficient Fine-Tuning (PEFT), specifically the Low-Rank Adaptation (LoRA) technique [6], [7], [8]. The fine-tuned model is evaluated using Sentence-BERT (SBERT) [9] embeddings from the "paraphrase-MiniLM-L6-v2" model. Cosine similarity between the actual and generated answers measures how well the model captures the nuances of legal language. This evaluation indicates that the dataset is well suited to real-world legal applications, making it a valuable resource for improving AI-driven legal information retrieval systems.
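
The similarity-based evaluation described above can be illustrated with a minimal sketch; it is not the authors' evaluation code. It computes the cosine similarity between the embedding of each reference answer and that of the corresponding generated answer, then averages the scores. To keep the example self-contained, `toy_embed` is a hypothetical stand-in (a hashed bag-of-words vector), not SBERT, and the function names and sample sentences are assumptions made for illustration rather than dataset content.

```python
import math
import zlib
from typing import Callable, List, Sequence


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in embedding (NOT SBERT): hashed bag-of-words counts."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_answer_similarity(
    references: Sequence[str],
    generated: Sequence[str],
    embed: Callable[[str], Sequence[float]] = toy_embed,
) -> float:
    """Average cosine similarity over paired reference and generated answers."""
    if len(references) != len(generated):
        raise ValueError("references and generated must be paired one-to-one")
    if not references:
        raise ValueError("at least one answer pair is required")
    scores = [
        cosine_similarity(embed(ref), embed(gen))
        for ref, gen in zip(references, generated)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative sentences only; they are not taken from the dataset.
    refs = ["The appeal was dismissed by the court."]
    gens = ["The court dismissed the appeal."]
    print(f"mean similarity: {mean_answer_similarity(refs, gens):.3f}")
```

To approximate the setup named in the abstract, the `embed` argument could be replaced with the `encode` method of a `SentenceTransformer("paraphrase-MiniLM-L6-v2")` instance from the sentence-transformers package, assuming that package and the model weights are available; the pairing and averaging logic would stay the same.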

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b059/12166829/9ef05457fa81/gr1.jpg
