AI governance and quality assurance: Lessons from Linklaters and the audit sector

The UK Financial Reporting Council’s recent review of the use of AI by audit firms highlights a range of good practices for assessing the quality of AI tools. However, this is a work in progress. For example, none of the firms reviewed used formal monitoring to quantify the impact on audit quality.

This reflects a wider issue. Many companies are moving beyond the “experimental phase” and into real-life deployment of AI tools in everyday business processes. Proper governance requires a serious assessment of the quality of their output.

Linklaters is at the forefront of AI innovation, both in the delivery of our own services and advising clients on their use of these tools. In this article we explore some of the quality assurance challenges and explain how we are addressing them in our own work, including our automated MLOps platform and manual benchmarking process.

Use of AI in audit – Good and bad practice

The Financial Reporting Council has issued findings from a thematic review of six firms. This identifies that automated tools and techniques have been used by audit firms for many years, but that these firms are increasingly also using AI (e.g. to analyse financial data and identify fraud or high-risk transactions). The thematic review reflects the specific requirements in the UK audit sector, particularly ISQM (UK) 1, which requires firms to establish quality objectives so that any tools used for audit are appropriately developed, implemented, maintained and used. With this in mind, the Financial Reporting Council highlights a wide range of good practice by these firms, including:

  • Using formal mechanisms to identify and assess new use cases, including innovation portals and meetings.
  • Requiring each system to have a formal “owner” and to go through a formal certification process. These certification processes include applying limitations and restrictions on use of these tools. Some firms also require annual re-certification.
  • Implementing formal processes to standardise and promote certain common data analytics routines for use in audits.
  • Providing a variety of targeted training so that audit teams can identify the correct tools to use.

However, practice varied between firms. None of the firms had formal monitoring processes to quantify the audit quality impact, and they did not use key performance indicators (with the exception of one firm, which tracked use of these tools against take-up targets). The Financial Reporting Council has therefore issued new guidance with a series of recommendations on how firms should document their use of these tools.

Use of AI in the legal sector

The use of AI is also becoming increasingly common in legal services. Both the Solicitors Regulation Authority and the Bar Standards Board have issued guidance, but it says little about the need to assess the quality of these tools and nothing about documenting that process.

Why is that guidance so light on quality assurance? There appear to be three key reasons.

  • First, the legal sector has a general “zero trust” model for AI-generated legal advice. Put differently, the working assumption is that everything the AI says is wrong until proved otherwise.
  • Second, law firms range from the very large to the very small. Expecting sole practitioners to put formal quality assurance measures in place is not realistic. Similarly, most barristers do not have the capacity or capability to conduct their own quality assurance assessment. It is neither efficient nor realistic to expect every market player to conduct this type of assessment.
  • Third, the technology is still moving fast as new agentic, chain-of-thought and retrieval-augmented generation (RAG) models become available. It is difficult to make firm regulatory interventions when the underlying market is changing so rapidly.

Linklaters’ approach: Automated Machine Learning Ops platform

However, we think it is important that we carry out a quality assurance assessment of the tools we use.

One of the steps we have taken to assess these tools is to build a dedicated Legal Machine Learning Operations (“ML Ops”) platform. Developed by our in-house Data Engineering and Data Science teams, it is designed to provide systematic, objective and continuous quality assurance for our AI tools.

This bespoke platform is engineered to evaluate the accuracy of AI models against pre-defined legal tasks. At its core, the system continuously assesses model performance by comparing outputs to a manually created 'ground truth' – our established gold standard for accuracy. Beyond simple keyword matching, the platform leverages a Large Language Model (“LLM”) to act as an intelligent 'judge'.

This LLM evaluates the contextual similarity of model outputs to the ground-truth answers, supplementing traditional metrics such as verbosity and Levenshtein distance. This multi-faceted approach provides a nuanced understanding of model quality, capturing both semantic accuracy and ‘conciseness’.

The output of this comprehensive evaluation is a binary determination – a clear 'match' or 'mismatch' signal. This indicator enables ongoing, automated monitoring of model performance over time. It also facilitates effective benchmarking of new or updated models against established baselines, ensuring that any changes do not degrade overall quality.
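
For illustration, the sketch below shows how an evaluation loop of this kind might be put together in Python. It is a simplified outline under stated assumptions rather than a description of our platform: the prompt wording, the injected judge callable and the EvalResult structure are all illustrative.

```python
from dataclasses import dataclass
from typing import Callable

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

@dataclass
class EvalResult:
    match: bool             # binary match/mismatch signal from the LLM judge
    edit_distance: int      # Levenshtein distance to the ground truth
    verbosity_ratio: float  # output length relative to the ground truth

def evaluate(
    question: str,
    ground_truth: str,
    model_output: str,
    judge: Callable[[str], str],  # caller supplies the actual LLM API call
) -> EvalResult:
    """Score one model output against the manually created ground truth."""
    prompt = (
        "You are grading a legal AI assistant.\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Candidate answer: {model_output}\n"
        "Reply MATCH if the candidate is substantively equivalent to the "
        "reference answer, otherwise reply MISMATCH."
    )
    verdict = judge(prompt).strip().upper()
    return EvalResult(
        match=verdict.startswith("MATCH"),
        edit_distance=levenshtein(model_output, ground_truth),
        verbosity_ratio=len(model_output) / max(len(ground_truth), 1),
    )
```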

Critically, this platform is integrated into our development lifecycle. Before any code is committed, data scientists are required to run their changes against the ground truth within the ML Ops platform. This pre-commit validation ensures that the accuracy of our AI tools remains unaffected, embedding quality assurance directly into the development pipeline and upholding the highest standards of reliability.
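
By way of illustration, the following sketch shows how a pre-commit gate of this kind could be wired up: it re-runs the ground-truth evaluation set and blocks the commit if the aggregate match rate falls below a recorded baseline. The file names, the run_model placeholder and the reuse of the evaluate helper from the previous sketch are assumptions, not details of our pipeline.

```python
import json
import sys

BASELINE_FILE = "eval_baseline.json"   # assumed format: {"match_rate": 0.92}
EVAL_SET_FILE = "ground_truth.jsonl"   # assumed format: one {"question", "answer"} object per line

def run_model(question: str) -> str:
    """Placeholder for the AI tool under test; connect the real system here."""
    raise NotImplementedError

def main() -> int:
    with open(BASELINE_FILE) as f:
        baseline = json.load(f)["match_rate"]

    matches = []
    with open(EVAL_SET_FILE) as f:
        for line in f:
            case = json.loads(line)
            output = run_model(case["question"])
            # evaluate() is the LLM-judge helper sketched above; judge_call
            # stands in for the real LLM API client supplied by the caller.
            matches.append(evaluate(case["question"], case["answer"], output, judge_call).match)

    rate = sum(matches) / len(matches)
    if rate < baseline:
        print(f"FAIL: match rate {rate:.0%} is below the baseline of {baseline:.0%}")
        return 1
    print(f"OK: match rate {rate:.0%} (baseline {baseline:.0%})")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```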

Linklaters’ approach: Manual benchmarking

We also assess these tools through a slower, manual process: the LinksAI English law benchmark. This tests the ability of Large Language Models to answer legal questions.

The benchmark comprises 50 questions from 10 different areas of legal practice. The questions are hard and are marked by experts in each practice area using a formal mark scheme.

The benchmark indicates significant progress by AI tools. The updated version we released in February 2025 showed the top performer to be OpenAI’s o1, with a score of 6.4 out of 10. Given the difficulty of the questions, that performance is astonishing. However, o1’s answers were still entirely or mostly wrong in 20% of cases. More worryingly, 8% of answers included fictitious citations.
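
Purely for illustration, the short sketch below shows how per-question marks could be rolled up into headline figures of the kind quoted above (mean score out of 10, share of answers entirely or mostly wrong, share containing fictitious citations). The record format is an assumption made for the example; it is not the benchmark’s actual mark scheme.

```python
from statistics import mean

# One record per question, as completed by the human markers
# (hypothetical format for illustration; scores are out of 10).
marks = [
    {"area": "tax", "score": 7, "mostly_wrong": False, "fictitious_citation": False},
    {"area": "competition", "score": 3, "mostly_wrong": True, "fictitious_citation": True},
    # ... further records covering the 10 practice areas
]

def summarise(records: list[dict]) -> dict:
    """Roll per-question marks up into headline benchmark figures."""
    n = len(records)
    return {
        "mean_score_out_of_10": round(mean(r["score"] for r in records), 1),
        "pct_entirely_or_mostly_wrong": round(100 * sum(r["mostly_wrong"] for r in records) / n),
        "pct_fictitious_citations": round(100 * sum(r["fictitious_citation"] for r in records) / n),
    }

print(summarise(marks))
```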

Our alliance partner issued its Allens AI Australian Law Benchmark in June 2025, which applies a very similar approach. The top score went to OpenAI’s GPT-4o (with o1 and DeepSeek R1 close runners-up). Most impressively, GPT-4o included a fictitious citation in only one answer.

There are criticisms that can be made of the LinksAI English law benchmark, such as its reliance on a one-shot approach (with no follow-up questions) and its rudimentary prompt engineering. The manual scoring process is also slow and subjective, which means the benchmarking exercise can only be carried out periodically.

However, the combination of a ‘fast’ automated process using our ML Ops platform and ‘slow’ manual process allows for a more rounded assessment of these tools. The ML Ops platform allows a rapid quality assessment of new AI tools, while the laborious manual process provides a gold-standard cross-check.

Differing concepts of quality

The approach to any quality assurance process must, of course, reflect the underlying concepts of quality. Some use cases have right and wrong answers, whereas for others the issue is much more subjective.

For example, assessing the answers to legal questions often involves a degree of subjective judgement – but there are answers that are very definitely wrong and potentially very damaging. A lawyer who relies on an AI-generated answer could find themselves facing an expensive negligence claim, disbarment or even contempt of court.

In other situations, the concept of quality may be more elusive. For example, it may be hard to objectively assess if AI-generated marketing collateral is “good” or “bad”, and the implications of getting it wrong are less well defined. This is less a question of incurring liability and more whether sending out endless “AI slop” damages your brand. This will require a more subjective assessment and data on the effectiveness of this content in practice.

Human counterfactual

It is also worth considering if you are going to use a human counterfactual in your quality assessment. After all, AI-generated legal advice is not perfect but human lawyers are slow and expensive, and can still get the answer wrong.

Do you expect the AI system to operate at a sub-human, human or super-human level? Should you benchmark against “human-level” performance? The answer depends on the context, but it is not uncommon for AI to be held to a higher standard. Driverless cars are generally safer than human-driven cars, but the safety expectations for automated vehicles are higher and crashes involving a robot chauffeur are more newsworthy. Similarly, a human lawyer may not always get the law right and could mix up a case citation, but the sanctions are likely to be worse for a lawyer who lets an AI-generated falsehood slip through the net.

Quality assurance in practice

Proper governance of AI should normally include some form of quality assurance, albeit one that reflects the context and industry in which the tool is used. For example:

  • Flexible v formal: The approach to quality assurance needs to reflect the risks associated with the AI output and the wider regulatory environment. For audit firms, the strict requirements of ISQM (UK) 1 point towards a formal, documented quality assurance process (as do the risks of a tool misfiring and systematically failing to identify problematic issues). For other uses, something more informal might be more suitable.
  • Fast v formal: The quality assurance process also needs to reflect the rapid development of this technology: it should be flexible enough to allow rapid testing of new tools and subsequent changes to the framework for their use.
  • Quality in context: These tools are never going to be perfect. Their complexity and sophistication almost inevitably mean they will get things wrong from time to time. The question is more ‘how often’ and ‘how badly’ wrong they are, and what that means for the controls and safeguards applied in practice. For example, even the very latest Large Language Models get legal questions wrong and hallucinate cases. This means that – for the time being at least – for legal advice there needs to be a “zero trust” model in which everything must be checked.
  • Employee literacy: Finally, the employees using these tools need to understand their limitations. The EU AI Act recently introduced AI literacy obligations, requiring organisations that provide or deploy AI systems to ensure their staff understand the risks associated with those systems and the steps they need to take to mitigate them. That is all founded on a clear understanding of the quality of the output from these tools.