UK – The LinksAI English law benchmark
We have created a comprehensive new benchmark to test the ability of Large Language Models (LLMs) to answer legal questions. It demonstrates that the AI tools we tested should not be used for English law advice without human supervision: their superficially convincing answers are often wrong and include fictitious citations.
What questions did we use?
The benchmark comprises 50 questions from 10 different areas of legal practice. The questions are hard.
They are the sort of questions that would require advice from a competent mid-level lawyer specialised in that practice area: someone with two years' post-qualification experience, traditionally four years out of law school. The intention is to test whether the AI models can reasonably replace advice from a human lawyer.
Which LLMs were tested?
We tested each of those 50 questions against four different models: GPT 2, GPT 3, GPT 4 and Bard. None of these models was specially trained to provide legal advice; they are general-purpose LLMs.
Other LLMs were not tested as part of this benchmarking exercise; they may perform better in our benchmark.
How were the answers marked?
The answers were marked by senior lawyers from each practice area. Each answer was given a mark out of 10, comprising 5 marks for substance (“is the answer right?”), 3 for citations (“is the answer supported by relevant statute, case law or regulations?”) and 2 for clarity.
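For readers who want to apply the same marking scheme to their own tests, the arithmetic can be sketched in a few lines of Python. This is only an illustration of the 5 + 3 + 2 split described above: the names `Mark` and `average_total` are hypothetical, the example marks are invented for the illustration and are not taken from our results, and the assumption that a model's overall mark is a simple mean of its per-answer totals is ours for this sketch.

```python
from dataclasses import dataclass

# Maximum marks per component, as described in the marking scheme above.
MAX_SUBSTANCE = 5  # "is the answer right?"
MAX_CITATIONS = 3  # "is the answer supported by relevant authority?"
MAX_CLARITY = 2


@dataclass(frozen=True)
class Mark:
    """One marker's score for one answer (hypothetical structure)."""
    substance: float
    citations: float
    clarity: float

    def __post_init__(self):
        # Reject marks outside the range allowed for each component.
        limits = (
            ("substance", self.substance, MAX_SUBSTANCE),
            ("citations", self.citations, MAX_CITATIONS),
            ("clarity", self.clarity, MAX_CLARITY),
        )
        for name, value, cap in limits:
            if not 0 <= value <= cap:
                raise ValueError(f"{name} must be between 0 and {cap}, got {value}")

    @property
    def total(self) -> float:
        """Mark out of 10: substance + citations + clarity."""
        return self.substance + self.citations + self.clarity


def average_total(marks):
    """Mean mark out of 10 across a list of answers (assumed aggregation)."""
    if not marks:
        raise ValueError("at least one mark is required")
    return sum(m.total for m in marks) / len(marks)


# Invented example marks, for illustration only.
example = [Mark(2, 1, 1.5), Mark(1, 0, 2), Mark(5, 3, 2)]
print(average_total(example))
```

Keeping the three components separate, rather than recording only the total, makes it possible to see whether an answer scored on clarity alone while failing on substance and citations.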
How did each model perform?
- The answers produced by GPT 2 were nonsense and were not individually marked. They were given a presumptive mark of 0 across the board.
- There was relatively little difference in the marks for GPT 3 (3.3 out of 10) and GPT 4 (3.2 out of 10). In other words, moving from GPT 3 to GPT 4 brought no real improvement.
- Bard achieved the highest mark (4.4 out of 10). It is based on Google’s LaMDA family of LLMs.
The table at the end of this article summarises the performance of these models.
On the one hand, the fact that the LLMs we tested are capable of providing anything close to a sensible answer is amazing – particularly considering that only four years ago the state-of-the-art model (GPT 2) produced gibberish.
On the other hand, the substance of the answers they produced was seriously wrong (scoring 1.3 out of 5) and they frequently produced incorrect or entirely fictitious citations (scoring 0.8 out of 3).
Importantly, they are capable of providing clear and lucid answers (scoring 1.4 out of 2). However, this combination of apparent clarity and technical inaccuracy makes their use more problematic. The superficially lucid and convincing style of these models gives their answers an air of authority they do not deserve.
This indicates that they should not be used for English law advice without human supervision.
What about GPT 5?
The improvement in LLM technology over the last 18 months has been incredible. Whether GPT 5 and other next generation LLMs will show the same tremendous leap in capability is an open question.
In any event, we will reapply the LinksAI English law benchmark to GPT 5 and to other significant future iterations of this technology, and update this report.
Our report is here.
The annexes containing the questions and answers, and other supporting materials are here.
Summary of the benchmarking exercise