The EU’s new risk-based framework aims to standardise compliance for AI developers, as a pioneering benchmarking suite emerges to evaluate large language models against legal standards.
The European Union (EU) has become a frontrunner in regulating artificial intelligence (AI) with the introduction of a risk-based framework earlier this year. This legislative move places the EU ahead of many countries still deliberating on how to govern AI technologies effectively. The law, which took effect in August, lays the groundwork for a comprehensive regime for managing AI applications, although full details are yet to be finalised, with key elements such as Codes of Practice still under development.
The law aims to impose tiered obligations on AI developers, creating a compliance structure that will gradually impact those working with AI applications and models. This has sparked a race to understand and evaluate how AI technologies meet these new legal requirements. Large language models (LLMs), seen as foundational components of most AI applications, are under significant scrutiny.
One notable initiative addressing this compliance evaluation comes from LatticeFlow AI, a company that emerged from ETH Zurich, the prestigious public research university. LatticeFlow has taken on the challenge of aligning technical evaluations with the EU's regulatory demands, and the company recently announced the creation of what it calls Compl-AI, an open-source LLM validation framework.
This framework, billed as the first regulation-oriented LLM benchmarking suite, is the product of a long-term collaboration between ETH Zurich (the Swiss Federal Institute of Technology) and Bulgaria's Institute for Computer Science, Artificial Intelligence and Technology (INSAIT). It allows AI model creators to request evaluations of their technology's compliance with the EU AI Act.
LatticeFlow's platform evaluates numerous mainstream large language models, including versions of Meta's Llama and OpenAI's GPT, and ranks them on their adherence to the EU's legal standards. A compliance leaderboard has also been published, assessing performance on various criteria against requirements set by the EU AI Act. Each model is scored on a scale from 0, indicating no compliance, to 1, indicating full compliance.
The framework comprises 27 benchmarks, covering areas such as responses to toxic content, resistance to harmful instructions, truthfulness, and reasoning ability. Results vary widely, with notable strengths and deficiencies spread across the different measures. For example, all models performed well at refusing harmful instructions and at avoiding prejudiced output. Fairness, however, measured through recommendation consistency, was universally weak, with every model evaluated scoring below the midpoint.
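To make the scoring idea concrete, the sketch below shows one plausible way a regulation-oriented leaderboard could roll individual benchmark results up into 0-to-1 scores per regulatory category. The benchmark names, category labels, and numbers here are invented for illustration; they are not taken from Compl-AI's actual benchmarks or published data, and the real framework's aggregation may differ.

```python
# Hypothetical sketch: aggregating per-benchmark scores into 0-1
# compliance scores per EU AI Act category. All names and values
# below are illustrative, not real Compl-AI benchmarks or results.

from collections import defaultdict

# Each benchmark result maps to (regulatory category, score in [0, 1]).
benchmark_results = {
    "toxicity_avoidance":          ("technical robustness and safety", 0.92),
    "harmful_instruction_refusal": ("technical robustness and safety", 0.88),
    "truthfulness_qa":             ("transparency", 0.71),
    "recommendation_consistency":  ("diversity, non-discrimination and fairness", 0.43),
}

def category_scores(results):
    """Average benchmark scores within each regulatory category."""
    buckets = defaultdict(list)
    for category, score in results.values():
        buckets[category].append(score)
    return {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}

if __name__ == "__main__":
    for category, score in sorted(category_scores(benchmark_results).items()):
        # 0 indicates no compliance, 1 indicates full compliance.
        print(f"{category}: {score:.2f}")
```

Under this toy aggregation, a category such as fairness that rests on a single weak benchmark would land below the midpoint, mirroring the pattern the leaderboard reports.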
Challenges remain: certain aspects, such as copyright and privacy, are proving difficult to assess accurately owing to limitations in current benchmarks. Smaller models, particularly those with fewer than 13 billion parameters, scored poorly in technical robustness and safety, as well as in diversity, non-discrimination, and fairness. The researchers suggest these deficiencies may stem from a disproportionate focus on improving model capabilities rather than on satisfying the new regulatory obligations.
LatticeFlow’s framework, while a work in progress, represents an initial step toward a comprehensive evaluation system tailored to the EU AI Act. The framework’s creators envision it as a dynamic tool that will evolve alongside updates to the Act.
Petar Tsankov, CEO of LatticeFlow, noted that current AI models are primarily optimised for functionality rather than compliance, and that performance varies substantially between models. Cybersecurity resilience and fairness emerged as critical areas for improvement, with many models struggling to exceed half marks in these categories.
Looking ahead, researchers such as Professor Martin Vechev of ETH Zurich, who is also the founder and scientific director of INSAIT, are encouraging the global AI research community to adopt and refine the open-source framework. Vechev envisions the tool being adapted to assess AI models against future regulatory measures worldwide, making it a valuable asset for organisations across varying jurisdictions.
Source: Noah Wire Services