The last 24 months have seen generative AI advance in leaps and bounds, exemplified by remarkable developments in large language models (LLMs). Their new capabilities are having a significant impact on the way businesses operate, including the legal function. However, is generative AI any good when it comes to the law and, particularly, Australian law?
Our alliance partner in Australia, Allens, has developed the Allens AI Australian Law Benchmark to test the ability of LLMs to answer legal questions. They tested general-purpose implementations of market-leading LLMs, approximating how a lay user might try to answer legal questions using AI, instead of a human lawyer.
One of the more interesting points in the benchmark is confirmation that the answers for smaller jurisdictions such as Australia are “infected” by legal analysis from larger jurisdictions with similar, but different, laws.
Despite being asked to answer from an Australian law perspective, many of the responses cited authorities from UK and EU law, or incorporated UK and EU law analysis that is not correct for Australian law.
In one notable example, Claude 2 was asked whether intellectual property rights could be used to prevent the re-use of a price index. It responded by citing “Relevant case law interpreting these laws (e.g., Navitaire Inc v Jetstar Airways Pty Ltd [2022] FCAFC 84)”. This is a fictitious case with remarkable similarities to the seminal English law case Navitaire Inc v EasyJet Airline Co. [2004] EWHC 1725 (Ch).
It appears that Claude 2 was seeking to “Australianise” the case name by swapping EasyJet for Jetstar (an Australian airline) as the respondent, and by replacing the English High Court citation with a more recent citation to the Full Court of the Federal Court of Australia.
It is hard not to applaud Claude 2’s initiative, but creating fictitious cases is not a helpful feature in an electronic legal assistant. (In any event, it is also hard to see why the English law Navitaire decision – which focused on copyright in software – is relevant to this question).
Allens’ findings are broadly consistent with our LinksAI English law benchmark from October last year.
Since we published that report, we have benchmarked a number of other systems. Some of those newer systems perform at a significantly higher level. One system we tested appeared capable of – for example – preparing a sensible first draft of a summary of the law in a particular area, albeit one that still needed careful review to verify its accuracy and the absence of fictitious citations. However, even that system struggled with questions involving critical reasoning or clause analysis.
There has been some discussion about whether these problems can be addressed through AI tools specifically designed to answer legal questions – particularly those using retrieval-augmented generation (“RAG”). RAG works by taking a query, searching a knowledge base of trusted materials, ranking the retrieved material and then using the LLM to synthesise a response from it.
This approach should deliver more reliable results, as it more tightly constrains the materials the LLM works from and, importantly, allows the LLM to link back to the source material so the reader can verify its conclusions.
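To illustrate the mechanics only – this is a simplified sketch, not the pipeline of any of the tools discussed in this post – a retrieve-then-generate loop might look like the following Python, in which the document store, the toy word-overlap ranking and the generate stub are all hypothetical placeholders:

```python
# Minimal retrieval-augmented generation (RAG) sketch, for illustration only.
# The toy ranking and the generate() stub are hypothetical placeholders,
# not the API of any specific legal AI tool.

from dataclasses import dataclass


@dataclass
class Document:
    source: str  # e.g. a case citation or legislation reference
    text: str


def rank(query: str, docs: list[Document], top_k: int = 3) -> list[Document]:
    """Toy relevance ranking: score each document by word overlap with the query."""
    query_words = set(query.lower().split())

    def score(doc: Document) -> int:
        return len(query_words & set(doc.text.lower().split()))

    return sorted(docs, key=score, reverse=True)[:top_k]


def generate(prompt: str) -> str:
    """Placeholder for the LLM call; a real tool would invoke a model here."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"


def answer(query: str, knowledge_base: list[Document]) -> str:
    # 1. Retrieve and rank trusted materials relevant to the query.
    retrieved = rank(query, knowledge_base)

    # 2. Constrain the LLM to the retrieved sources and ask it to cite them,
    #    so the reader can trace each proposition back to its source.
    context = "\n\n".join(f"[{d.source}]\n{d.text}" for d in retrieved)
    prompt = (
        "Answer the question using only the sources below, citing the "
        f"source for each proposition.\n\nSources:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)
```

The key point is the second step: the model is limited to the retrieved, trusted sources and asked to cite them, which is what allows a reader to check its output against the underlying material.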
Whether this will eliminate hallucinations entirely remains to be seen. A recent study by Stanford University’s Human-Centered Artificial Intelligence institute found that specialised legal AI tools reduce the hallucination problem associated with general-purpose LLMs, but are a long way from eliminating it entirely (here). However, Stanford’s findings have been contested by some of the providers of these tools (here).
The Allens AI Australian Law Benchmark can be found here.