
OpenAI GPT-5.4 Release: The Battle for "Native Computer Use"

Duration: 5:23

Transcript

Guest: Thanks for having me, Alex. It’s a wild time to be looking at a monitor, I’ll tell you that much.

Host: It really is! So, let’s jump straight in. We’ve had "wrappers" and third-party tools trying to let AI control our computers for a while now, but OpenAI is calling this "Native Computer Use." What’s the actual difference? Is it just marketing, or has something fundamentally changed under the hood?

Guest: Oh, it’s definitely not just marketing. In the past, if you wanted an AI to use your computer, you were basically "gluing" things together. You’d take a screenshot, send it to the model, the model would guess a coordinate, and a script would try to click it. It was... well, it was brittle. If a notification popped up, the whole thing would break.

Host: And the native approach does away with all that gluing? That sounds a lot more stable. They’ve also split it into these "Thinking" and "Pro" variants, right? I saw that in the release notes. Why the split?

Guest: This is actually the most interesting part for me. The "Thinking" model is all about deep reasoning—what we call "internal chain-of-thought." If you tell it to, say, "update the quarterly budget in Excel based on three different Slack threads," the Thinking model doesn’t just start clicking wildly. It actually pauses—you can see the "thinking" state—and it builds a mental map. It parses the Slack threads, identifies the data points, plans the Excel navigation, and validates the math *before* it moves the mouse.

Host: And the "Pro" model? I keep hearing the term "context drift"—is that when the AI kind of "forgets" what it was doing halfway through a long task?

Guest: Exactly. It’s the "wait, why am I in this folder again?" moment. In older models, if an agent had to perform 50 steps to refactor a codebase, by step 30 it might lose the original goal. OpenAI claims the Pro model solves this by having a much tighter feedback loop with the OS. If a terminal command fails or a port is blocked, it sees that in real time and self-corrects immediately.

Host: Interesting!
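The old "gluing" workflow Marcus describes can be sketched roughly as the loop below. This is purely illustrative Python; `capture_screen`, `ask_model_for_click`, and `click_at` are hypothetical stand-ins for real screenshot/model/automation tooling, not actual APIs:

```python
import time

def capture_screen():
    # Hypothetical stand-in: a real script would grab a screenshot
    # (e.g. with an automation library) and return image bytes.
    return b"fake-png-bytes"

def ask_model_for_click(screenshot, goal):
    # Hypothetical stand-in: send the screenshot plus the goal to a vision
    # model and get back a *guessed* (x, y) coordinate to click.
    return (640, 480)

def click_at(x, y):
    # Hypothetical stand-in: synthesize a mouse click at (x, y).
    print(f"clicking at ({x}, {y})")

def brittle_agent_loop(goal, steps=3):
    # The pre-"native" pattern: screenshot -> model guess -> blind click.
    # Nothing verifies that the click hit the intended element, which is why
    # a surprise notification pop-up could derail the whole run.
    for _ in range(steps):
        shot = capture_screen()
        x, y = ask_model_for_click(shot, goal)
        click_at(x, y)
        time.sleep(0.1)  # hope the UI has settled before the next screenshot

brittle_agent_loop("open the settings panel")
```

The fragility is visible in the structure: the model only ever sees frozen snapshots, so any UI change between screenshot and click goes unnoticed.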
Host: It’s like it’s finally getting a sense of "object permanence" for our digital workspaces. I actually saw a conceptual snippet of the new API—it looked like you can just pass permissions like "terminal" or "browser" to the agent and let it rip. Does that scare you as a dev?

Guest: (Laughs) Honestly? A little bit! We’re talking about giving a model direct control over your environment. I actually ran a test yesterday where I asked the Pro agent to set up a local Docker environment and fix any port conflicts. It navigated my VS Code, opened the terminal, ran `docker-compose up`, saw a port conflict on 5432 with my local Postgres, killed the local process, and updated the config. It did it all in about 20 seconds.

Host: Wow. That usually takes me ten minutes and three searches on Stack Overflow.

Guest: Right! But the "aha moment" was when a system-update pop-up appeared right in the middle of it. Usually, that would blindside an agent. But 5.4 just... swiped it away and kept going. That’s "Native Computer Use" in action. It treats the OS as a primary input/output stream, not just a static image.

Host: It feels like a direct response to Anthropic’s Claude Code. I know a lot of our listeners are using Claude for coding right now. How does this stack up? Is OpenAI reclaiming the lead here?

Guest: It’s a real battle. Anthropic’s Claude Code is incredible for terminal-based tasks—it’s very "dev-centric." But OpenAI is aiming for something more holistic. They want to own the whole desktop. GPT-5.4 isn’t just for coding; it’s for moving between the IDE, Slack, Notion, and the browser. They’re moving toward this "Digital Employee" concept.

Host: Let me ask about another term I keep seeing: "hallucination of action." Tell me more about that.

Guest: So, we know LLMs can hallucinate facts. But an agentic LLM can hallucinate *actions*. It might try to click a "Submit" button that it *expects* to be there, but isn’t. Or it might try to run a command in a directory that doesn’t exist yet.
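The permission-passing idea the host mentions might look something like the following. This is a purely speculative sketch: the `Agent` class, the model name, and the permission strings are assumptions based only on the conversation, not a documented OpenAI SDK:

```python
from dataclasses import dataclass, field

# Purely speculative sketch: the Agent class, the model name, and the
# permission strings below are illustrative assumptions, not a real SDK.
@dataclass
class Agent:
    model: str
    permissions: list = field(default_factory=list)

    def run(self, task: str) -> str:
        # A real agent would plan, act on the OS, and self-correct here;
        # this stub only reports what it would be allowed to touch.
        scopes = ", ".join(self.permissions) or "nothing"
        return f"[{self.model}] would execute {task!r} using: {scopes}"

agent = Agent(
    model="gpt-5.4-pro",                  # hypothetical model name
    permissions=["terminal", "browser"],  # scoped access, as described
)
print(agent.run("set up a local Docker environment and fix port conflicts"))
```

The design point is the explicit allow-list: anything not granted up front simply isn’t reachable, which is the scoping model the hosts are reacting to.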
Guest: OpenAI is using the "Thinking" model to simulate these actions internally before executing them, which cuts down on those errors, but it’s still not 100% perfect. You still get those moments where it just... stares at the screen for a second, confused.

Host: (Laughs) So it really *is* becoming more human! But seriously, Marcus, the security implications here feel massive. If I give an AI permission to use my terminal and my browser, I’m essentially giving it the keys to the kingdom. How is OpenAI addressing the "Day 0" risk?

Guest: That’s the million-dollar question. They’ve implemented sandboxed execution environments, which is a start. But the real guardrail is "human-in-the-loop," or HITL. For high-privilege actions—like deleting a repo or sending a bank transfer—the model is supposed to trigger a checkpoint where a human has to click "Approve."

Host: It’s a total shift in how we define a "SaaS" product, too, right? If GPT-5.4 can navigate any UI, then the "moat" of having a pretty interface kind of disappears.

Guest: Absolutely. The value shifts from the UI to the data and the API. If the AI can just "use" the software for me, I don’t care how the dashboard looks. I just care that the task gets done. Software is going to start being built for AI to use as much as for humans.

Host: That is a wild thought to end on: software for AI users. Marcus, this has been fascinating. I feel like I need to go home and reorganize my desktop before the agents take over!

Guest: (Laughs) Just make sure you don’t have any sensitive passwords in Sticky Notes on your screen, Alex!

Host: Noted! Where can people find you and follow your work with AetherCode?

Guest: You can find me on X at @mthorne_dev or at AetherCode.io. We’re actually building some of those sandboxed environments I mentioned, trying to make this agentic future a bit safer for everyone.

Host: Awesome. Thanks again, Marcus.

Host: Wow. The "Native Computer Use" era.
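The "human-in-the-loop" checkpoint Marcus described can be sketched as a simple approval gate. This is an illustrative stub, not OpenAI's actual safeguard; the privilege labels and the `approve` callback are assumptions:

```python
# Illustrative only: the privilege labels and the approve callback are
# assumptions, not a real safeguard implementation.
HIGH_PRIVILEGE = {"delete_repo", "send_bank_transfer"}

def execute(action, approve=input):
    """Run an action, pausing for explicit human approval on risky ones.

    `approve` is any callable that returns "y" to proceed; it defaults to
    input() so a human has to type the confirmation at a terminal.
    """
    if action in HIGH_PRIVILEGE:
        answer = approve(f"Agent wants to run {action!r}. Approve? [y/N] ")
        if answer.strip().lower() != "y":
            return f"{action}: blocked at human checkpoint"
    return f"{action}: executed"

# Low-privilege actions pass straight through; risky ones hit the gate.
print(execute("list_files"))
print(execute("delete_repo", approve=lambda prompt: "n"))
```

Note the default-deny shape: anything other than an explicit "y" blocks the action, which is the conservative choice for a checkpoint like this.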
Host: It really feels like we’re moving away from the AI being a tab in our browser to it being a layer over our entire operating system. The takeaway for me is that we need to start thinking about our workflows not as a series of manual clicks, but as a set of outcomes that an agent can execute.