The evaluator was a painful (and long) lesson in how poor LLM agents can be
- Way, way too strict in requiring answers to match the reference verbatim. Incapable of recognising that code using different naming etc. is also a correct answer. Many examples, but two stand out:
#2 insists on creating a new visit_webpage tool when importing the built-in VisitWebpageTool is just as correct (see the sketch after this list).
#5 rejects the answer for referencing anthropic/claude-3.5-sonnet instead of anthropic/claude-3-sonnet.
- Consistently inconsistent: previously accepted answers later rejected.
- Reference files are incorrect: #3 only accepts an answer that is incorrect and inconsistent with the smolagents documentation. There is no smolagents import named E2BSandbox, and the correct example given in both the smolagents and E2B documentation was rejected.
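For #2, both of the following should presumably count as correct. This is a minimal sketch, not the official course answer: it assumes the built-in VisitWebpageTool import and a custom @tool-decorated function (simplified here to a bare requests call) are equally valid ways to give the agent a webpage-visiting tool.

from smolagents import CodeAgent, HfApiModel, VisitWebpageTool, tool
import requests

# Option A: import the built-in tool (the form the evaluator rejected)
agent_a = CodeAgent(model=HfApiModel(), tools=[VisitWebpageTool()])

# Option B: define an equivalent custom tool (the form the evaluator insists on)
@tool
def visit_webpage(url: str) -> str:
    """Fetches a webpage and returns its raw text content.

    Args:
        url: The URL of the webpage to visit.
    """
    return requests.get(url, timeout=10).text

agent_b = CodeAgent(model=HfApiModel(), tools=[visit_webpage])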
Same problem. I don't understand what E2BSandbox is. #3 seems unsolvable.
Same here, the matching should allow for more flexibility.
There are also error messages given to the candidate like "too strict requirement for tooling" in question #4. I don't even understand what that means...
E2BSandbox seems unsolvable: neither the HF blog nor the smolagents source code suggests an answer that would satisfy it...
The answer that I get heuristically is:
from smolagents import CodeAgent, E2BSandbox
agent = CodeAgent(
    tools=[],
    model=model,
    sandbox=E2BSandbox(),
    additional_authorized_imports=["numpy"]
)
This sandbox parameter isn't even mentioned in the documentation LOL.
I've also given up. The E2B sandbox seems to have a different API according to the docs, which doesn't pass the quiz:
https://huggingface.co/docs/smolagents/main/tutorials/secure_code_execution
from smolagents import HfApiModel, CodeAgent
agent = CodeAgent(model=HfApiModel(), tools=[], executor_type="e2b")
agent.run("Can you give me the 100th Fibonacci number?")