*Update: We updated our benchmark report in February 2025 to assess OpenAI o1 and Gemini 2.0, both of which show significant improvement. The updated report is here.*
We have created a comprehensive new benchmark to test the ability of Large Language Models (LLMs) to answer legal questions. It demonstrates that the AI tools we tested should not be used for English law advice without human supervision: their superficially convincing answers are often wrong and include fictitious citations.
The benchmark comprises 50 questions from 10 different areas of legal practice. The questions are hard: they are the sort that would require advice from a competent mid-level lawyer (two years' post-qualification experience) specialised in the relevant practice area, that is, someone traditionally four years out of law school. The intention is to test whether the AI models can reasonably replace advice from a human lawyer.
We tested each of those 50 questions against four different models: GPT 2, GPT 3, GPT 4 and Bard. None of these models was specially trained to provide legal advice; they are simply general-purpose LLMs.
Other LLMs were not tested as part of this exercise; they may perform better on our benchmark.
The answers were marked by senior lawyers from each practice area. Each answer was given a mark out of 10: 5 marks for substance (“is the answer right?”), 3 marks for citations (“is the answer supported by relevant statute, case law or regulations?”) and 2 marks for clarity.
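By way of illustration only, the sketch below shows how that 10-mark rubric could be recorded and averaged across a model's 50 answers. The actual marking was done by senior lawyers by hand; the class, field names and aggregation here are a hypothetical illustration of the arithmetic, not part of the benchmark itself.

```python
from dataclasses import dataclass

# Rubric caps from the benchmark: 5 substance + 3 citations + 2 clarity = 10.
MAX_SUBSTANCE, MAX_CITATIONS, MAX_CLARITY = 5, 3, 2

@dataclass
class Mark:
    substance: float  # "is the answer right?" (out of 5)
    citations: float  # supported by statute, case law, regulations? (out of 3)
    clarity: float    # how clear and lucid is the answer? (out of 2)

    def total(self) -> float:
        # A mark out of 10 is simply the sum of the three components.
        assert 0 <= self.substance <= MAX_SUBSTANCE
        assert 0 <= self.citations <= MAX_CITATIONS
        assert 0 <= self.clarity <= MAX_CLARITY
        return self.substance + self.citations + self.clarity

def averages(marks: list[Mark]) -> dict[str, float]:
    """Mean score per component across one model's answers."""
    n = len(marks)
    return {
        "substance": sum(m.substance for m in marks) / n,
        "citations": sum(m.citations for m in marks) / n,
        "clarity": sum(m.clarity for m in marks) / n,
        "total": sum(m.total() for m in marks) / n,
    }

# Example: two hypothetical answers marked 4/10 and 6/10 overall.
print(averages([Mark(2, 1, 1), Mark(3, 1, 2)]))
```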
In summary:
- The table at the end of this article summarises the performance of these models.
- On the one hand, the fact that the LLMs we tested are capable of providing anything close to a sensible answer is amazing, particularly considering that only four years ago the state-of-the-art model (GPT 2) produced gibberish.
- On the other hand, the substance of the answers they produced was seriously wrong (scoring 1.3 out of 5) and they frequently produced incorrect or entirely fictitious citations (scoring 0.8 out of 3).
- Importantly, they are capable of providing clear and lucid answers (scoring 1.4 out of 2). However, this combination of apparent clarity and technical inaccuracy only makes their use more problematic: the superficially lucid and convincing responses that these models provide give their answers an air of authority they do not deserve.
- This indicates that they should not be used for English law advice without human supervision.
The improvement in LLM technology over the last 18 months has been incredible. Whether GPT 5 and other next-generation LLMs will show the same tremendous leap in capability is an open question.
In any event, we will reapply the LinksAI English law benchmark to GPT 5 and significant future iterations of this technology and update this report.
Our report is here.
The annexes containing the questions and answers, and other supporting materials are here.
Summary of the benchmarking exercise