The UK Financial Reporting Council’s recent review of the use of AI by audit firms highlights a range of good practices for assessing the quality of AI tools. However, this is a work in progress. For example, the firms did not use formal monitoring to quantify the audit quality impact.
This reflects a wider issue. Many companies are moving beyond the “experimental phase” and into real-life deployment of AI tools in everyday business processes. Proper governance requires a serious assessment of the quality of their output.
Linklaters is at the forefront of AI innovation, both in the delivery of our own services and advising clients on their use of these tools. In this article we explore some of the quality assurance challenges and explain how we are addressing them in our own work, including our automated MLOps platform and manual benchmarking process.
The Financial Reporting Council has issued findings from a thematic review of six firms. It notes that automated tools and techniques have been used by audit firms for many years, but that these firms are increasingly also using AI (e.g. to better hunt through financial data to identify fraud or high-risk transactions). The thematic review reflects the specific requirements in the UK audit sector, particularly ISQM (UK), which requires firms to establish quality objectives so that any tools used for audit are appropriately developed, implemented, maintained and used. With this in mind, the Financial Reporting Council highlights a wide range of good practice by these firms, including:
However, practice varied between firms. None of the firms had formal monitoring processes to quantify the audit quality impact, and they did not use key performance indicators (with the exception of the one firm that tracked use of these tools against take-up targets). The Financial Reporting Council has therefore issued new guidance with a series of recommendations for the way firms should document their use of these tools.
The use of AI is also becoming increasingly common in legal services. Both the Solicitors Regulation Authority and Bar Standards Board have issued guidance but it says little about the need to assess the quality of these tools, and nothing about documenting this process.
Why is that guidance so light on quality assurance? There appear to be three key reasons.
However, we think it is important that we carry out a quality assurance assessment of the tools we use.
One of the steps we have taken to assess these tools is a dedicated Legal Machine Learning Operations (“ML Ops”) platform, developed by our in-house Data Engineering and Data Science teams and designed to provide systematic, objective, and continuous quality assurance for our AI tools.
This bespoke platform is engineered to evaluate the accuracy of AI models against pre-defined legal tasks. At its core, the system continuously assesses model performance by comparing outputs to a manually created 'ground truth' – our established gold standard for accuracy. Beyond simple keyword matching, the platform leverages a Large Language Model (“LLM”) to act as an intelligent 'judge'.
This LLM evaluates the contextual similarity of model outputs to the ground-truth answers, supplementing traditional metrics such as verbosity and Levenshtein distance. This multi-faceted approach provides a nuanced understanding of model quality, capturing both semantic accuracy and ‘conciseness’.
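By way of illustration only, the sketch below shows one way an evaluation step of this kind could be assembled. It is a simplified sketch rather than a description of our actual implementation: the `judge` callable stands in for the LLM ‘judge’, and the result fields and verbosity measure are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + substitution_cost))
        previous = current
    return previous[-1]


@dataclass
class EvalResult:
    semantically_equivalent: bool  # the LLM judge's view of contextual similarity
    edit_distance: int             # Levenshtein distance to the ground truth
    verbosity_ratio: float         # output length relative to the ground truth


def evaluate(output: str, ground_truth: str,
             judge: Callable[[str, str], bool]) -> EvalResult:
    """Compare a model output against a manually created ground-truth answer."""
    return EvalResult(
        semantically_equivalent=judge(output, ground_truth),
        edit_distance=levenshtein(output, ground_truth),
        verbosity_ratio=len(output.split()) / max(len(ground_truth.split()), 1),
    )
```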
The output of this comprehensive evaluation is a binary determination: a clear 'match' or 'mismatch' signal. This clear indicator enables ongoing, automated monitoring of model performance over time. It also facilitates effective benchmarking of new or updated models against established baselines, ensuring that any changes do not degrade overall quality.
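To illustrate the idea (and only the idea), the sketch below shows how a stream of binary match/mismatch signals could be rolled up into a match rate and compared against a baseline. The tolerance and function names are assumptions made for the example rather than features of our platform.

```python
from typing import Sequence


def match_rate(matches: Sequence[bool]) -> float:
    """Fraction of evaluation cases where the output matched the ground truth."""
    return sum(matches) / len(matches) if matches else 0.0


def no_regression(candidate: Sequence[bool], baseline: Sequence[bool],
                  tolerance: float = 0.01) -> bool:
    """A new or updated model passes if its match rate does not fall more than
    `tolerance` below the established baseline."""
    return match_rate(candidate) >= match_rate(baseline) - tolerance


# Illustrative usage: compare a candidate model's results with the current baseline.
baseline_results = [True, True, False, True, True]    # 80% match rate
candidate_results = [True, True, True, True, False]   # 80% match rate
assert no_regression(candidate_results, baseline_results)
```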
Critically, this platform is integrated into our development lifecycle. Before any code is committed, data scientists are required to run their changes against the ground truth within the ML Ops platform. This pre-commit validation ensures that the accuracy of our AI tools remains unaffected, embedding quality assurance directly into the development pipeline and upholding the highest standards of reliability.
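Purely by way of illustration, a pre-commit gate of this general kind could look something like the sketch below. The baseline figure and command-line interface are assumptions made for the example, not a description of our pipeline.

```python
import sys

# Assumed baseline; a real gate would read this from the platform's configuration.
BASELINE_MATCH_RATE = 0.90


def meets_baseline(match_rate: float, baseline: float = BASELINE_MATCH_RATE) -> bool:
    """Return True if the proposed change keeps accuracy at or above the baseline."""
    return match_rate >= baseline


if __name__ == "__main__":
    # e.g. `python precommit_gate.py 0.93` after running the ground-truth evaluation
    observed = float(sys.argv[1])
    if not meets_baseline(observed):
        print(f"Match rate {observed:.2%} is below the {BASELINE_MATCH_RATE:.2%} baseline.")
        sys.exit(1)
    print(f"Match rate {observed:.2%} meets the baseline.")
```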
We also assess these tools through a slower, manual process: the LinksAI English law benchmark. This tests the ability of Large Language Models to answer legal questions.
The benchmark comprises 50 questions from 10 different areas of legal practice. The questions are hard and are marked by experts in each practice area using a formal mark scheme.
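Purely to illustrate how marks of this kind can be recorded and rolled up into headline figures, the sketch below uses a hypothetical data structure; it is not the LinksAI mark scheme itself, and the field names are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class MarkedAnswer:
    practice_area: str         # one of the 10 areas of legal practice
    question_id: int           # one of the 50 benchmark questions
    score: float               # expert's mark out of 10 under the formal mark scheme
    fictitious_citation: bool  # whether the answer invented an authority


def summarise(answers: list[MarkedAnswer]) -> dict:
    """Roll individual marks up into overall and per-practice-area averages."""
    by_area: dict[str, list[float]] = defaultdict(list)
    for answer in answers:
        by_area[answer.practice_area].append(answer.score)
    return {
        "overall_score": mean(a.score for a in answers),
        "score_by_area": {area: mean(scores) for area, scores in by_area.items()},
        "fictitious_citation_rate": mean(a.fictitious_citation for a in answers),
    }
```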
The results indicate significant progress by AI tools. The updated benchmark we released in February 2025 showed the top performer to be OpenAI o1, with a score of 6.4 out of 10. Given the difficulty of the questions, that performance is astonishing. However, o1’s answers were still entirely or mostly wrong in 20% of cases. More worryingly, 8% of answers included fictitious citations.
Our alliance partner issued its Allens AI Australian Law Benchmark in June 2025, which applies a very similar approach. The top score went to OpenAI’s 4o (with o1 and DeepSeek R1 close runners-up). Most impressively, 4o provided a fictitious citation in only one answer.
There are criticisms that can be made of the LinksAI English law benchmark, such as the fact that it relies on a one-shot approach (with no follow-up questions) and that the prompt engineering is rudimentary. The manual scoring process is also slow and subjective, which means the benchmarking process can only be carried out periodically.
However, the combination of a ‘fast’ automated process using our ML Ops platform and ‘slow’ manual process allows for a more rounded assessment of these tools. The ML Ops platform allows a rapid quality assessment of new AI tools, while the laborious manual process provides a gold-standard cross-check.
The approach to any quality assurance process must, of course, reflect the underlying concepts of quality. Some use cases have right and wrong answers, whereas for others the issue is much more subjective.
For example, assessing the answers to legal questions often involves a degree of subjective judgement – but there are answers that are very definitely wrong and potentially very damaging. A lawyer who relies on an AI-generated answer could find themselves facing an expensive negligence claim, disbarment or even contempt of court.
In other situations, the concept of quality may be more elusive. For example, it may be hard to objectively assess if AI-generated marketing collateral is “good” or “bad”, and the implications of getting it wrong are less well defined. This is less a question of incurring liability and more whether sending out endless “AI slop” damages your brand. This will require a more subjective assessment and data on the effectiveness of this content in practice.
It is also worth considering whether you are going to use a human counterfactual in your quality assessment. After all, AI-generated legal advice is not perfect, but human lawyers are slow and expensive, and can still get the answer wrong.
Do you expect the AI system to operate at a sub-human, human or super-human level? Should you benchmark against “human-level” performance? The answer depends, but it is not uncommon for AI to be held to a higher standard. Driverless cars are generally safer than human-driven cars, but the safety expectations for automated vehicles are higher and crashes involving a robot chauffeur are more newsworthy. Similarly, a human lawyer may not always get the law right and could mix up a case citation, but the sanctions are likely to be worse for a human lawyer who lets an AI-generated falsehood slip through the net.
Proper governance of AI should normally include some form of quality assurance; albeit that will need to reflect the context and industry in which the tool is used. For example: