As generative AI transforms health care, standardised evaluation frameworks are urgently needed to ensure the safety, efficacy, and trustworthiness of AI applications.
The intersection of generative artificial intelligence (AI) and health care is advancing rapidly, propelled by the pressing need for systems that can perform complex tasks such as generating clinical summaries and interpreting multimodal data, including videos, text, and images. This pace of development has introduced substantial challenges in validating the accuracy and efficacy of these AI models, a critical concern given the intricate nature of medical data and the potential repercussions for patient safety. While expert validation by surgeons remains the current gold standard, the process is unsustainable in the long term because of its immense demands on specialised human resources.
Forecasts indicate a significant expansion of the global generative AI market, from $1 billion in 2022 to a projected $22 billion by 2032. This growth underscores the urgent need to develop reliable metrics that keep pace and ensure the technology is both safe and effective. New methods, such as adapting the CLIP score to assess the congruence between textual and visual data, illustrate potential advances and call attention to the necessity of establishing standardised evaluation frameworks before AI solutions are integrated into clinical settings.
Evaluating generative AI, particularly within the health care domain, presents unique challenges. Integrating diverse data types such as text descriptions, medical imaging, and extensive surgical videos is a complex endeavour that makes consistent and accurate evaluation difficult. For example, accurately summarising surgical procedures captured on video is critical, as oversimplified narratives can overlook crucial details, thereby compromising the quality of post-operative reports or educational content. Such procedures often require real-time adjustments and employ various tools, demanding an evaluation process that can capture both visual and procedural nuances while accurately tracking transitions between surgical phases.
In robot-assisted surgeries, generative AI must precisely articulate how a surgeon manipulates instruments and correlate those movements with changes in patient anatomy. Current tools cannot adequately measure an AI model's ability to integrate dynamic data over extended periods, underscoring the need for robust evaluation frameworks that can reliably assess performance.
Several metrics have been developed for assessing generative AI models, but many fall short when addressing the complexities inherent to health care multimodal data. SPICE (Semantic Propositional Image Caption Evaluation) and BERTScore, for example, provide value in evaluating text- or image-based outputs but struggle with the integration of multimodal data and the intricacies of compositional reasoning required in surgical contexts.
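For illustration, the sketch below shows how a generated clinical summary might be scored against a reference summary using the open-source bert-score package. The example sentences are invented placeholders, not drawn from any clinical system, and this is a minimal sketch of typical usage rather than a validated clinical pipeline.

```python
# Minimal sketch: scoring a generated clinical summary against a
# reference with BERTScore (assumes `pip install bert-score`).
from bert_score import score

# Hypothetical example texts, purely for illustration.
candidates = ["The gallbladder was dissected and removed without complication."]
references = ["The surgeon dissected and excised the gallbladder; no complications occurred."]

# score() returns precision, recall, and F1 tensors, one value per pair.
precision, recall, f1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {f1.item():.3f}")
```

As the article notes, a text-only metric like this captures semantic overlap between sentences but has no view of the underlying video or images, which is precisely its limitation in multimodal surgical contexts.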
One promising evaluation method is the CLIP score, which assesses the alignment between text and images by computing the cosine similarity between embedding vectors of the input images or video frames and the output text. This approach offers an objective and standardised means of appraising the fidelity of AI-generated content to its visual inputs, thereby enhancing trust in AI applications within health care environments.
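As a concrete sketch, the snippet below computes a CLIP-style alignment score for a single image and caption using the Hugging Face transformers implementation of CLIP. The checkpoint name, file name, caption, and the 2.5-times-cosine rescaling (a common formulation of CLIPScore, not a detail given here) are all assumptions for illustration.

```python
# Minimal sketch: CLIP-style alignment score between an image and a
# generated caption (assumes `pip install torch transformers pillow`).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame.png")  # e.g. one frame from a surgical video
caption = "Graspers retract tissue while the dissector isolates the duct."

inputs = processor(text=[caption], images=image,
                   return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the image and text embeddings; one common
# formulation (CLIPScore) rescales this as 2.5 * max(cos, 0).
cos = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
clip_score = 2.5 * max(cos, 0.0)
print(f"cosine similarity: {cos:.3f}, CLIP-style score: {clip_score:.3f}")
```

A video-level score could be obtained by averaging such frame-level scores, though how best to aggregate alignment over long, multi-phase procedures is exactly the open question the article raises.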
As generative AI becomes further embedded in health care processes, the formulation of standardised evaluation frameworks becomes increasingly critical. The risks posed by inaccurate AI-generated outputs, whether in reporting medical summaries or interpreting surgical videos, necessitate consistent evaluation methodologies. Standardised criteria would ensure that AI outputs are not only precise and reliable but also transparent. Integrating AI explainability mechanisms within these frameworks would enable users to comprehend and trust the AI’s decision-making processes.
Entities such as HealthAI and the Coalition for Health AI are actively pursuing the development of validation mechanisms and assurance labs designed for health care AI. These efforts are pivotal, aiming to establish standardised frameworks that align with regulatory requirements like the EU AI Act and the U.S. Executive Order on AI.
Neeraj Mainkar, PhD, who serves as vice president of software engineering and advanced technology at Proprio, is a key figure in addressing these challenges. With over 25 years of experience in the regulated software industry and a background as a computational physicist, Dr Mainkar is a strong proponent of harnessing advanced digital technologies in medical settings to support enhanced surgical performance and improve patient outcomes. His insights highlight the need for developing sophisticated, objective standards that can adequately address the demands and intricacies of health care AI applications.
Source: Noah Wire Services