Failure Modes of OpenAI Operator
By Zengyi Qin from MIT. 01/23/2025
- Author's twitter: https://x.com/qinzytech
- Author's homepage: https://www.qinzy.tech
Background: Our MIT team has developed an internal benchmark for computer-use agents. We tested OpenAI Operator on it and show 5 cases here. We did not cherry-pick; Operator simply failed all 5 tasks. See below for details.
Key takeaways:
- Operator does very well at visual grounding.
- Operator does not fully understand interaction logic. Its computer-use ability is almost surely below that of a typical college student.
- The OpenAI Operator team seems to have devoted a lot of effort to post-training but not to pre-training: Operator lacks some basic web-use knowledge, which should not be a problem at all with sufficient pre-training.
BTW - Our MIT team is collaborating with data vendors to collect a hundred-billion-token-scale pre-training dataset for computer use. If you are interested in what we are doing, feel free to contact us.
Task 1
Get an image from Google. Open the image, then apply a 20% decrease in brightness and a 15% increase in contrast.
Failure reason: entered the wrong number.
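For reference, the numbers this task asks for can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the behavior of whichever editor Operator used: it assumes a 20% brightness decrease means scaling each channel by 0.80, and a 15% contrast increase means stretching values away from mid-gray (128) by 1.15. The helper names `clamp`, `adjust_pixel`, and `adjust_image` are hypothetical.

```python
# Assumptions (not from the original task): brightness is a plain
# multiplicative factor, contrast is a stretch around mid-gray 128.
# Real image editors may define these sliders differently.

def clamp(value):
    """Round a channel value and clamp it to the 0..255 range."""
    return max(0, min(255, round(value)))


def adjust_pixel(rgb, brightness=0.80, contrast=1.15, pivot=128):
    """Apply brightness first, then contrast, to one (r, g, b) pixel."""
    out = []
    for channel in rgb:
        dimmed = channel * brightness                   # 20% darker
        stretched = pivot + (dimmed - pivot) * contrast  # 15% more contrast
        out.append(clamp(stretched))
    return tuple(out)


def adjust_image(pixels, **kwargs):
    """Apply adjust_pixel to a 2D list of (r, g, b) tuples."""
    return [[adjust_pixel(p, **kwargs) for p in row] for row in pixels]


if __name__ == "__main__":
    tiny_image = [[(200, 100, 0), (255, 255, 255)]]
    print(adjust_image(tiny_image))
```

The point of the sketch is that the task reduces to two numbers, 0.80 and 1.15 (or -20 and +15 on percentage sliders), so entering the wrong value in either field is enough to fail it.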
Operator screen recording:
Task 2
Create a new solid color layer with #0000FF, then apply the Outer Glow effect with a 10px size.
Failure reason: does not know how to use online tools.
Operator screen recording:
Task 3
Solve advanced trig question #5 from https://tutorial.math.lamar.edu and confirm final angles or identities using an online trig solver.
Failure reason: cannot find the question at all.
Operator screen recording:
Task 4
Look for question #2063 in the book 3000 Solved Problems in Calculus and solve it
Failure reason: cannot find question #2063 at all.
Operator screen recording:
Task 5
Design a low-pass filter using a resistor and capacitor (R = 10kΩ, C = 1μF) in place of RL, and analyze its effect on the output waveform.
Failure reason: does not know how to use online tools.
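As a reference point for what a correct analysis would report, here is a minimal Python sketch of the filter's numbers. It assumes a standalone first-order RC low-pass stage (the task's surrounding circuit and its RL are not described here, so they are not modeled); the function names are hypothetical.

```python
import math

R = 10e3  # resistance in ohms (10 kΩ, from the task)
C = 1e-6  # capacitance in farads (1 μF, from the task)


def cutoff_hz(r, c):
    """-3 dB cutoff frequency of a first-order RC low-pass: 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r * c)


def gain(f, r, c):
    """Magnitude of Vout/Vin at frequency f (Hz)."""
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * f * r * c) ** 2)


def phase_deg(f, r, c):
    """Phase shift of Vout relative to Vin at frequency f, in degrees."""
    return -math.degrees(math.atan(2.0 * math.pi * f * r * c))


if __name__ == "__main__":
    fc = cutoff_hz(R, C)
    print(f"time constant: {R * C:.4f} s, cutoff: {fc:.2f} Hz")
    for f in (1.0, fc, 100.0, 1000.0):
        print(f"{f:8.2f} Hz  gain={gain(f, R, C):.4f}  phase={phase_deg(f, R, C):.1f} deg")
```

With these values the time constant is RC = 0.01 s and the cutoff is about 15.9 Hz, so components of the output waveform well above that frequency are strongly attenuated and the waveform is smoothed.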
Operator screen recording (Operator failed to generate a video, so I just put a screenshot placeholder here):