Description
Evaluating the performance of language models and AI agents can be challenging, especially across diverse tasks and domains. In this session, we'll introduce Unitxt, an open-source framework for unified text evaluation, and explore how it simplifies the process of benchmarking LLMs and agents using a standardized format.
We'll walk through the core ideas behind LLM evaluation (what to measure, how to measure it, and why it matters) and then dive into hands-on examples of evaluating LLMs for quality, reliability, safety, and more, as well as multi-modal evaluation and agentic tool invocation.
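To give a flavor of the hands-on portion, the sketch below shows roughly what a Unitxt evaluation run can look like. It assumes the load_dataset/evaluate entry points and the HFPipelineBasedInferenceEngine available in recent Unitxt releases; the card, template, model name, and result accessor are illustrative placeholders rather than the exact code we will present in the session.

from unitxt import load_dataset, evaluate
from unitxt.inference import HFPipelineBasedInferenceEngine

# Load a task from the Unitxt catalog; the card and template IDs here are illustrative.
dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
    loader_limit=20,
    split="test",
)

# Run a small Hugging Face model over the standardized, template-formatted instances.
model = HFPipelineBasedInferenceEngine(model_name="google/flan-t5-small", max_new_tokens=32)
predictions = model.infer(dataset)

# Score the predictions with the metrics attached to the task definition.
results = evaluate(predictions=predictions, data=dataset)
print(results.global_scores.summary)  # result structure may vary between Unitxt versions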
Whether you're just getting started with evaluation or looking for a powerful and flexible tool to streamline your workflows, this session will offer practical insights and code-based demos to help you get up and running.
Bring your questions, ideas, or examples—we’ll have time for discussion and Q&A at the end!
Speaker Bio
Elron Bandel (LinkedIn) works to redefine how language models are tested and used at scale. At IBM Research, he leads projects that help researchers evaluate and apply language models at transformative scale. Elron co-developed IBM's standard evaluation platform for large language models and spearheads the development of Unitxt, an open-source Python library for AI performance assessment. His academic research, supervised by Prof. Yoav Goldberg, included the development of AlephBERT and its evaluation suite, as well as work on robust language model testing.
About the AI Alliance
The AI Alliance is an international community of researchers, developers, and organizational leaders committed to supporting and enhancing open innovation across the AI technology landscape to accelerate progress, improve safety, security, and trust in AI, and maximize benefits to people and society everywhere. Members of the AI Alliance believe that open innovation is essential to developing safe and responsible AI that benefits society as a whole rather than a select few big players.