The evaluator was a painful (and long) lesson in how poor LLM agents can be
- Way, way too strict in requiring answers to match the reference verbatim. Incapable of recognising that code using different naming etc. is also a correct answer. Many examples, but two stand out:
#2 insists on creating a new visit_webpage tool when importing the built-in VisitWebpageTool is just as correct (see the sketch after this list).
#5 rejects the answer for referencing anthropic/claude-3.5-sonnet instead of anthropic/claude-3-sonnet.
- Consistently inconsistent: previously accepted answers later rejected.
- Reference files are incorrect: #3 only accepts an answer that is incorrect and inconsistent with the smolagents documentation. There is no smolagents import named E2BSandbox, and the correct example given in both the smolagents and E2B documentation was rejected.
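For #2, both of the following should presumably count as correct. This is a minimal sketch, not the official course answer: it assumes the built-in VisitWebpageTool import and a custom @tool-decorated function (simplified here to a bare requests call) are equally valid ways to give the agent a webpage-visiting tool.

from smolagents import CodeAgent, HfApiModel, VisitWebpageTool, tool
import requests

# Option A: import the built-in tool (the form the evaluator rejected)
agent_a = CodeAgent(model=HfApiModel(), tools=[VisitWebpageTool()])

# Option B: define an equivalent custom tool (the form the evaluator insists on)
@tool
def visit_webpage(url: str) -> str:
    """Fetches a webpage and returns its raw text content.

    Args:
        url: The URL of the webpage to visit.
    """
    return requests.get(url, timeout=10).text

agent_b = CodeAgent(model=HfApiModel(), tools=[visit_webpage])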
Same problem. I don't understand what E2BSandbox is. #3 seems unsolvable.
Same here, the matching should allow for more flexibility.
There are also error messages given to the candidate like "too strict requirement for tooling" in question #4. I don't even understand what that means...
E2BSandbox seems unsolvable: neither the HF blog nor the smolagents source code suggests an answer that would satisfy it...
The answer that I get heuristically is:
from smolagents import CodeAgent, E2BSandbox
agent = CodeAgent(
    tools=[],
    model=model,
    sandbox=E2BSandbox(),
    additional_authorized_imports=["numpy"]
)
This sandbox parameter isn't even mentioned in the documentation LOL.
I've also given up. The E2B sandbox seems to have a different API according to the docs, which doesn't pass the quiz:
https://huggingface.co/docs/smolagents/main/tutorials/secure_code_execution
from smolagents import HfApiModel, CodeAgent
agent = CodeAgent(model=HfApiModel(), tools=[], executor_type="e2b")
agent.run("Can you give me the 100th Fibonacci number?")