ICML 2025 (International Conference on Machine Learning) brought together leading minds from academia and industry to share ideas and research shaping the future of AI. From foundational breakthroughs to emerging trends, it provided a clear view into where the field is heading. Our very own Jordy Van Landeghem (Senior Software Engineer, Machine Learning) attended to present his work—and brought back insights from one of the field’s most influential gatherings.
In Part 1 of AI Insights from ICML 2025, we explored how AI systems still struggle with grounding — from brittle context retrieval to the limits of multimodal reasoning — challenges that are especially relevant in enterprise document workflows. In Part 2, we’ll discuss what comes next: decision-making, delegation, and trust. At ICML 2025, some of the most thought-provoking work centered on evaluating agent performance, improving reliability through reinforcement learning, and rethinking how we measure confidence. These aren’t just academic concerns — they’re essential questions for anyone deploying AI in the wild, especially in high-stakes enterprise environments. Let’s dive into what it takes to move from capable models to trustworthy systems.
Reinforcement learning takes the lead from instruction tuning
If ICML 2025 made one thing clear, it’s this: reinforcement learning (RL) is having a moment. While instruction tuning still has its place, RL is increasingly taking center stage — especially in settings where learning from trial-and-error better reflects real-world decision-making.
Reinforcement learning helps models learn to make decisions by interacting with their environment, using a reward signal to optimize outcomes over time. This matters in enterprise settings because many processes involve sequential decision-making, delayed outcomes, and dynamic environments. Unlike static models that simply map inputs to outputs, RL can learn policies that optimize for long-term objectives, handle feedback loops, and adjust strategies as workflows evolve. Take processes like client onboarding or claims handling: RL can help agents learn to optimize entire workflows rather than just individual actions.
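To make that contrast concrete, here's a minimal, purely illustrative sketch of tabular Q-learning on a toy three-step claims-handling workflow. The states, actions, and reward function are invented for the example; the point is simply that credit from a delayed outcome flows back through every step of the process.

```python
import random
from collections import defaultdict

# Hypothetical, simplified claims-handling MDP: states, actions, and rewards are
# invented for illustration; a real enterprise workflow would be far richer.
STATES = ["intake", "review", "decision", "done"]
ACTIONS = ["request_docs", "auto_approve", "escalate"]

def step(state, action):
    """Toy transition + reward: the meaningful reward only arrives at the end."""
    if state == "intake":
        return "review", 0.0
    if state == "review":
        return "decision", 0.0
    if state == "decision":
        # Escalating pays off reliably; auto-approving is a gamble.
        reward = 1.0 if action == "escalate" else random.choice([1.0, -2.0])
        return "done", reward
    return "done", 0.0

q = defaultdict(float)          # Q[(state, action)] -> estimated long-term value
alpha, gamma, eps = 0.1, 0.9, 0.2

for episode in range(5000):
    state = "intake"
    while state != "done":
        # Epsilon-greedy: explore occasionally, otherwise exploit the current policy.
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Q-learning update: delayed reward propagates back through the workflow.
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

print({k: round(v, 2) for k, v in q.items() if k[0] == "decision"})
```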
But with great reward comes great complexity: aligning reward functions with human values and improving observability in these systems is now more critical than ever.
This theme was front and center in a keynote on the slippery question of “what should we optimize for?” — spoiler: even humans get it wrong sometimes 🤖🙈. And when rewards aren’t perfectly aligned, well… let’s just say RL models get creative (see: reward hacking). Humans provide feedback that’s mostly right but often inconsistent — what we call “noisy rationality.” In real-world settings like insurance, education, or enterprise workflows, this means AI systems must learn not just from ideal examples, but from messy, diverse, and occasionally suboptimal behavior — much like humans do.
Approaches like Reinforcement Learning with Verifiable Rewards (RLVR) stood out as especially promising — particularly for domains where rewards can be expressed through objective, rule-based systems (coding, math, science) rather than fuzzy and subjective human feedback (language, documents, business). Curious? Check out VERL on GitHub — it’s gaining momentum as a framework for implementing RLVR.
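To show what "verifiable" means in practice, here's a hedged sketch of a rule-based reward function for a math-style task. It's not VERL's actual API, just a hypothetical reward callable of the kind an RLVR trainer could plug in.

```python
import re

def verifiable_reward(prompt: str, completion: str, expected_answer: str) -> float:
    """Rule-based reward in the spirit of RLVR: no human judge, no learned reward
    model, just a deterministic check against a known-correct answer.

    Hypothetical illustration only; not VERL's actual reward interface.
    """
    # Expect the model to end its reasoning with "Answer: <number>".
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        return 0.0                      # unparseable output earns nothing
    predicted = match.group(1)
    # Exact-match check; real setups might use symbolic equivalence or unit tests.
    return 1.0 if predicted == expected_answer else 0.0

# Example usage
print(verifiable_reward("What is 17 * 3?", "17 * 3 = 51. Answer: 51", "51"))  # 1.0
```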
Evaluating agents with better questions, not just better benchmarks
It’s one thing to prototype an AI agent — it’s another to bring it to production. Unlike a single LLM call, agents require tracking far more moving parts: tool selection, memory state, long-horizon planning, and coordination between components. Getting any of that wrong can cause inconsistency, bloated compute costs, or just brittle behavior.
That’s why strong evaluation isn’t a post-hoc step — it’s foundational. With the right error analysis and ablation strategies in place early, training better agents becomes significantly easier later. You can start making data-driven decisions on routing logic, tool selection sequences, or when and what to hand off between agents — all of which compound to stronger, more reliable performance. We’re also seeing reinforcement learning play a growing role in improving agent systems by simulating conversations and exploring the latent space of enterprise workflows.
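As a hypothetical taste of what that early error analysis can look like, the sketch below replays logged agent traces against the tool-call sequences a correct run should follow and reports where routing first deviates. The trace format and field names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """A single logged agent run; the fields here are invented for illustration."""
    task_id: str
    expected_tools: list[str]   # the tool sequence a correct run should follow
    actual_tools: list[str]     # what the agent actually called
    succeeded: bool

def analyze(traces: list[Trace]) -> dict:
    """Simple error analysis: where does the agent first deviate from the plan?"""
    first_deviation = {}
    for t in traces:
        for i, (exp, act) in enumerate(zip(t.expected_tools, t.actual_tools)):
            if exp != act:
                first_deviation[t.task_id] = (i, exp, act)
                break
    success_rate = sum(t.succeeded for t in traces) / max(len(traces), 1)
    return {"success_rate": success_rate, "first_deviation": first_deviation}

traces = [
    Trace("claim-001", ["fetch_doc", "extract_fields", "validate"],
                       ["fetch_doc", "validate", "extract_fields"], False),
    Trace("claim-002", ["fetch_doc", "extract_fields", "validate"],
                       ["fetch_doc", "extract_fields", "validate"], True),
]
print(analyze(traces))
```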
This emphasis on evaluation was echoed across multiple ICML papers this year. For example, the authors of Towards Enterprise-Ready Computer Using Generalist Agent achieved state-of-the-art results on WebArena and AppWorld — not by improving visual perception, but by focusing on multi-agent architecture and rigorous planning and state tracking. Their work validates the importance of strong internal frameworks over flashier capabilities.
Similarly, ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks introduces a comprehensive benchmarking framework to evaluate AI agents on real-world IT workflows across SRE, CISO, and FinOps domains. It reveals that even state-of-the-art agents often struggle, achieving success rates as low as 0-33%. This underscores the urgent need for real-world benchmarking and the rising importance of federated, domain-specific agents optimized for specialized problem spaces.
All of this reinforces a core belief at Instabase: that federated agent design is essential for reliable collaboration, especially when modeling the complex, multi-role interactions found in real enterprise environments.
What’s especially exciting is how research is converging with production-level concerns: memory compression, multi-agent communication costs, tool routing, and interpretability — all challenges we’ve seen directly at Instabase as we scale agentic enterprise AI systems. Better evaluation isn’t just academic — it’s a prerequisite for dependable automation.
Closing the loop: agents, RL, and the role of confidence
As AI systems grow more capable — and more deeply embedded in enterprise workflows — confidence estimation is no longer just a nice-to-have. It’s fundamental. A core question that surfaced repeatedly was: what does confidence even mean in the age of LLM-based agents?
This came into focus during the Uncertainty Estimation in LLM-Generated Content tutorial, followed by a memorable hallway chat with local legend Professor Geoff Pleiss of the University of British Columbia — whose early work on Expected Calibration Error was what first opened my eyes to just how overconfident neural networks can be, and how rarely we measure that properly.
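For readers who haven't met it, Expected Calibration Error bins predictions by stated confidence and compares each bin's average confidence to its actual accuracy. The snippet below is a standard equal-width-bin implementation, not tied to any particular paper's code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted average of |accuracy - confidence| per bin.

    `confidences` are predicted probabilities in [0, 1];
    `correct` are 0/1 indicators of whether each prediction was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return ece

# An overconfident model: high stated confidence, mediocre accuracy.
print(expected_calibration_error([0.95, 0.9, 0.92, 0.88], [1, 0, 0, 1]))
```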
Today, most large language models struggle to express calibrated uncertainty. There’s no universally accepted method for determining whether a model should be confident in its response. In production, this creates serious challenges — especially in enterprise automation, where incorrect outputs aren’t just bad answers but risks to business operations, compliance, and customer trust.
Teams today are relying on a mix of approaches:
- White-box scoring, like logit aggregation or P(True), which rely on access to the model's own token probabilities.
- Black-box methods, such as an LLM judge scoring answer groundedness, context relevance, and instruction fidelity, plus natural language inference models, paraphrase agreement, and multi-sample self-consistency (a minimal sketch follows below), all to better gauge when a model might be bluffing.
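To make the self-consistency idea concrete, here's a minimal sketch: sample several answers, cluster them with a crude normalized-string match (a real system would use NLI or embedding similarity), and treat the size of the largest cluster as a confidence proxy. The `sample_answer` callable is a hypothetical stand-in for your LLM call.

```python
import random
from collections import Counter

def normalize(answer: str) -> str:
    """Crude canonicalization; real systems would use NLI or embedding similarity."""
    return " ".join(answer.lower().strip().rstrip(".").split())

def self_consistency(sample_answer, question: str, n_samples: int = 8):
    """Black-box confidence proxy: sample several answers and measure agreement.

    `sample_answer` is a hypothetical callable wrapping your LLM,
    e.g. lambda q: llm.generate(q, temperature=0.8).
    """
    answers = [sample_answer(question) for _ in range(n_samples)]
    clusters = Counter(normalize(a) for a in answers)
    top_answer, top_count = clusters.most_common(1)[0]
    confidence = top_count / n_samples     # fraction agreeing with the majority
    return top_answer, confidence

# Toy usage with a fake sampler that wavers between two answers.
fake_llm = lambda q: random.choice(["Paris", "Paris", "Paris", "Lyon"])
print(self_consistency(fake_llm, "Capital of France?"))
```

Low agreement across samples is a useful signal to route the case to a human rather than letting the agent act on it.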
I even posed this question to the presenter of the expo talk on The Next Frontier in Enterprise AI: A Vision for Generalist Agents: how do we know when agents are confidently wrong? Their answer: by measuring for consistency. While consistency builds predictability — a prerequisite for trusting autonomy — we still need calibrated uncertainty to complete it and provide a foundation for trusting automation.
As someone who studied uncertainty estimation during my PhD — when models were still measured in millions, not trillions, of parameters — I’ve seen how improving uncertainty estimation almost always involves tradeoffs. For instance, Bayesian deep learning methods can double parameter counts to model distributions over possible values — which quickly breaks down at frontier scales like Kimi K2.
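To see where that doubling comes from, consider mean-field variational inference: every weight gets a mean and a variance parameter. The toy layer below is only meant to illustrate the parameter count, not to stand in for a production Bayesian method.

```python
import numpy as np

class MeanFieldLinear:
    """Toy mean-field Bayesian linear layer: every weight gets a mean *and* a
    (log-)standard deviation, so the parameter count doubles versus a standard
    deterministic layer. Illustration only; not a trainable implementation."""

    def __init__(self, n_in: int, n_out: int):
        self.w_mu = np.zeros((n_in, n_out))              # means, same size as a normal layer
        self.w_log_sigma = np.full((n_in, n_out), -3.0)  # per-weight uncertainty

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Sample one weight realization per forward pass (reparameterization trick).
        sigma = np.exp(self.w_log_sigma)
        w = self.w_mu + sigma * np.random.randn(*self.w_mu.shape)
        return x @ w

layer = MeanFieldLinear(768, 768)
deterministic_params = 768 * 768
bayesian_params = layer.w_mu.size + layer.w_log_sigma.size
print(bayesian_params / deterministic_params)   # -> 2.0
```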
Still, there are promising directions. Reinforcement Learning with Calibration Rewards is one such approach that stands out — combining uncertainty estimation with agent learning objectives. It suggests a future where agents not only learn what to do, but also when not to act. This is particularly powerful in enterprise settings, where saying “I don’t know” at the right time can prevent downstream failure.
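The exact formulation varies from paper to paper, but a hedged sketch of the idea looks like the reward below: task correctness combined with a proper scoring rule (here a Brier-style penalty) on the agent's stated confidence. The weighting is arbitrary and purely illustrative.

```python
def calibration_aware_reward(correct: bool, stated_confidence: float,
                             calibration_weight: float = 0.5) -> float:
    """Combine task success with a Brier-style calibration penalty.

    A confidently wrong answer is punished harder than an honest "I'm not sure";
    the specific weighting is arbitrary and for illustration only.
    """
    task_reward = 1.0 if correct else 0.0
    brier_penalty = (stated_confidence - task_reward) ** 2   # proper scoring rule
    return task_reward - calibration_weight * brier_penalty

print(calibration_aware_reward(correct=False, stated_confidence=0.95))  # heavily penalized
print(calibration_aware_reward(correct=False, stated_confidence=0.30))  # milder penalty
```

Under a reward shaped this way, hedging or abstaining becomes the rational move when the agent's evidence is weak.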
In many ways, this closes the loop — tying together automation, agents, and reinforcement learning around a central question: confidence. As systems become more autonomous, the ability to express uncertainty and defer action responsibly will define what makes automation not just intelligent, but trustworthy.
Final thoughts
Attending ICML 2025 was both energizing and grounding — a reminder of how far the field has come, and how many challenges still remain at the intersection of research and real-world deployment. One thing became abundantly clear: building truly capable, trustworthy AI agents takes more than clever prompting — it demands rigorous evaluation, calibrated decision-making, and careful design.
Much like hiring a new teammate, you want to evaluate an agent thoroughly before trusting it to make decisions on your behalf. What’s often presented in the market today as agentic intelligence is, in many cases, either a deterministic multi-step workflow or a basic ReAct loop dressed up as autonomy. We’re aiming higher — toward agents that can adapt, collaborate, and know when to ask for help.
Looking ahead, the big questions in ML aren’t just about accuracy or speed — they’re about connecting the dots between reasoning, calibration, and autonomy at scale. The next wave of innovation will likely hinge on better human-agent coordination, improved feedback loops, and fine-grained control over orchestration, granting agents the right level of autonomy only where the underlying automation has proven reliable — all of which is critical for applying AI in high-stakes domains like finance, law, and government. We’re continuing to push the boundaries of what’s possible with agentic systems — and we’re hiring. Check out our careers page if you’d like to join us for the ride.