Table: Mean performance of all evaluated models across nine base task difficulty levels in HeroBench. Columns show success rate (%), score (mean ± SD), and tokens (mean ± SD). SD is computed across ...