The last 24 months have seen generative AI advance in leaps and bounds, exemplified by remarkable developments in large language models (LLMs). Their new capabilities are having a significant impact on the way businesses operate, including the legal function. However, is generative AI any good when it comes to the law and, particularly, Australian law?
Our alliance partner in Australia, Allens, has developed the Allens AI Australian Law Benchmark to test the ability of LLMs to answer legal questions. They tested general-purpose implementations of market-leading LLMs, approximating how a lay user might try to answer legal questions using AI, instead of a human lawyer.
One of the more interesting points in the benchmark is confirmation that the answers for smaller jurisdictions such as Australia are “infected” by legal analysis from larger jurisdictions with similar, but different, laws.
Despite being asked to answer from an Australian law perspective, many of the responses cited authorities from UK and EU law, or incorporated UK and EU law analysis that is not correct for Australian law.
In one notable example, Claude 2 was asked whether intellectual property rights could be used to prevent the re-use of a price index. It responded by citing “Relevant case law interpreting these laws (e.g., Navitaire Inc v Jetstar Airways Pty Ltd [2022] FCAFC 84)”. This is a fictitious case with remarkable similarities to the seminal English law case Navitaire Inc v EasyJet Airline Co. [2004] EWHC 1725 (Ch).
It appears that Claude 2 was seeking to “Australianise” the case name by swapping EasyJet for Jetstar (an Australian airline) as the respondent, and by replacing the English High Court citation with a more recent citation to the Full Court of the Federal Court of Australia.
It is hard not to applaud Claude 2’s initiative, but creating fictitious cases is not a helpful feature in an electronic legal assistant. (In any event, it is also hard to see why the English law Navitaire decision – which focused on copyright in software – is relevant to this question).
Allens’ findings are broadly consistent with our LinksAI English law benchmark from October last year.
Since we published that report, we have benchmarked a number of other systems. Some of those newer systems perform at a significantly higher level. One system we tested appeared capable of – for example – preparing a sensible first draft of a summary of the law in a particular area, albeit one that still needed careful review to verify its accuracy and the absence of fictitious citations. However, even that system struggled with questions involving critical reasoning or clause analysis.
There has been some discussion about whether these problems can be addressed through AI tools specifically designed to answer legal questions – particularly those using retrieval-augmented generation (“RAG”). RAG works by taking a query, searching a knowledge base of trusted materials, ranking the retrieved material and then using the LLM to synthesise a response from it.
This approach should deliver more reliable results, as it more tightly constrains the materials the LLM works from and, importantly, allows the LLM to link back to the source material so the reader can verify its conclusions.
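To illustrate the mechanics only – this is a simplified sketch, not the pipeline of any of the tools discussed in this post – a retrieve-then-generate loop might look like the following Python, in which the document store, the toy word-overlap ranking and the generate stub are all hypothetical placeholders:

```python
# Minimal retrieval-augmented generation (RAG) sketch, for illustration only.
# The toy ranking and the generate() stub are hypothetical placeholders,
# not the API of any specific legal AI tool.

from dataclasses import dataclass


@dataclass
class Document:
    source: str  # e.g. a case citation or legislation reference
    text: str


def rank(query: str, docs: list[Document], top_k: int = 3) -> list[Document]:
    """Toy relevance ranking: score each document by word overlap with the query."""
    query_words = set(query.lower().split())

    def score(doc: Document) -> int:
        return len(query_words & set(doc.text.lower().split()))

    return sorted(docs, key=score, reverse=True)[:top_k]


def generate(prompt: str) -> str:
    """Placeholder for the LLM call; a real tool would invoke a model here."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"


def answer(query: str, knowledge_base: list[Document]) -> str:
    # 1. Retrieve and rank trusted materials relevant to the query.
    retrieved = rank(query, knowledge_base)

    # 2. Constrain the LLM to the retrieved sources and ask it to cite them,
    #    so the reader can trace each proposition back to its source.
    context = "\n\n".join(f"[{d.source}]\n{d.text}" for d in retrieved)
    prompt = (
        "Answer the question using only the sources below, citing the "
        f"source for each proposition.\n\nSources:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)
```

The key point is the second step: the model is limited to the retrieved, trusted sources and asked to cite them, which is what allows a reader to check its output against the underlying material.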
Whether this will eliminate hallucinations entirely remains to be seen. A recent study by Stanford University’s Human-Centered Artificial Intelligence institute found that specialised legal AI tools reduce the hallucination problem associated with general-purpose LLMs, but are a long way from eliminating it entirely (here). However, Stanford’s findings have been contested by some of the providers of these tools (here).
The Allens AI Australian Law Benchmark can be found here.