Vibium for AI Engineers and Agent Builders
Vibium for AI engineers and agent builders: a hands-on guide to driving a browser from LLM agents via MCP, semantic finds, and the accessibility tree.
Vibium is an AI-native browser automation tool that gives LLM agents a clean, reliable way to see and drive a real browser, and for AI engineers its headline feature is a built-in MCP server plus a structured, model-friendly view of any page. Created by Jason Huggins — co-creator of Selenium and Appium — Vibium ships as a single Go binary that auto-downloads Chrome, speaks the WebDriver BiDi standard, and exposes browser actions as MCP tools an agent can call with zero glue code. Instead of scraping raw HTML into your prompt, you hand the model a compact accessibility tree and semantic selectors (find by role, text, or label) that mirror how a human perceives a page. That combination — MCP for control, the a11y tree for perception, and deterministic BiDi actions underneath — is what makes Vibium a fit for building browsing agents, evaluation harnesses, and AI-driven test flows. This guide is an independent learning hub for that workflow.
Why does browser automation matter for AI engineers?
Browser automation matters for AI engineers because most useful agent tasks end at a web page — logging into a dashboard, filling a form, reading a result, verifying a UI. An LLM that can only produce text is stuck the moment a task requires clicking something. Giving the model a browser turns a chatbot into an agent that can act.
The hard part has never been "open a URL." It is giving the model a reliable interface: a way to perceive the page without drowning in markup, a way to act without brittle coordinate-clicking, and a control channel that does not require hand-writing a wrapper for every action. Vibium is built around exactly those three needs.
Traditional automation tools were designed for humans writing scripts. Vibium keeps that deterministic core but adds an agent-facing layer on top — the MCP server, semantic finds, and the accessibility tree — so the same engine serves both a QA engineer and a Claude or GPT agent.
What makes Vibium AI-native?
Vibium is AI-native because agent support is a first-class part of the product, not a plugin bolted on later. Three shipped capabilities carry that claim, and one more is on the roadmap.
- Built-in MCP server. Every Vibium install includes a Model Context Protocol server. Start it with `vibium mcp` (or `npx -y vibium mcp`) and an agent host can call browser actions as tools. No custom HTTP shim, no per-action wrapper.
- Semantic finds. One `find()` method takes either a CSS string or a semantic object, for example `find({ role: 'button', text: 'Submit' })`. Agents pick selectors the way a person describes an element, which survives DOM churn far better than deep CSS paths.
- Accessibility tree. `a11yTree()` returns a compact, structured snapshot (roles, names, states) instead of raw HTML. It is the ideal input for an LLM: small, semantic, and close to how a user reads the page.
- Roadmap: natural-language methods. `page.do("log in as admin")` and `page.check("cart is empty")` are designed but not in the current 26.2 release. Today the LLM does the planning and calls structured tools; those methods will fold the planning into Vibium itself.
For the bigger picture on what Vibium is, see what is Vibium. For an honest read on maturity, see is Vibium production-ready.
How does the agent-plus-Vibium loop work?
The loop is simple: the agent decides what to do, calls a Vibium tool, Vibium executes it deterministically in Chrome, and the result flows back to the model for the next step. The diagram near the top of this article shows the five stages.
Here is the same loop as a table, mapped to what each stage actually does:
| Stage | Who acts | What happens | Vibium's role |
|---|---|---|---|
| Prompt / goal | LLM | Reads the task and current page state | Provides page state (a11y tree, text) |
| Tool call | LLM | Emits an MCP tool call, e.g. browser_click | Exposes the tool schema |
| Execute | Vibium | Translates the call into WebDriver BiDi | Runs the action, auto-waits for actionability |
| Chrome action | Chrome | Clicks, types, navigates | Driven over BiDi WebSocket |
| Result | Vibium → LLM | Returns text, URL, or a screenshot | Serializes a clean, model-readable result |
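The five stages can be sketched as a plain loop. This is a minimal illustration, not Vibium code: `scripted_model`, `fake_navigate`, and `fake_get_text` are hypothetical stand-ins for the LLM and for the MCP tools, so the control flow can be read and run without a browser.

```python
from typing import Callable

# Hypothetical stand-ins for real tools; a real setup would call Vibium here.
def fake_navigate(args: dict) -> dict:
    return {"url": args["url"], "text": "Example Domain"}

def fake_get_text(args: dict) -> dict:
    return {"text": "Example Domain"}

TOOLS: dict[str, Callable[[dict], dict]] = {
    "browser_navigate": fake_navigate,
    "browser_get_text": fake_get_text,
}

def scripted_model(history: list[dict]) -> dict:
    """Stand-in for the LLM: chooses the next step from the history so far."""
    if not history:
        return {"tool": "browser_navigate", "args": {"url": "https://example.com"}}
    if len(history) == 1:
        return {"tool": "browser_get_text", "args": {}}
    return {"done": True, "answer": history[-1]["result"]["text"]}

def run_agent(model, tools, max_steps: int = 5) -> str:
    history: list[dict] = []
    for _ in range(max_steps):
        decision = model(history)                           # goal + tool call
        if decision.get("done"):
            return decision["answer"]
        result = tools[decision["tool"]](decision["args"])  # execute in the browser
        history.append({"call": decision, "result": result})  # result back to the model
    return "step budget exhausted"

answer = run_agent(scripted_model, TOOLS)
print(answer)
```

The `max_steps` budget is a design choice worth keeping in any real loop: it bounds cost when a model keeps calling tools without converging.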
The important detail for reliability: Vibium auto-waits. Before it clicks or types, it checks that the element is present, visible, stable, and enabled. Agents are prone to firing an action a beat too early; auto-waiting absorbs that, so an agent run hits fewer flaky "element not found" dead-ends. For the details, see how actionability works, or start from find element.
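Conceptually, auto-waiting is a poll-until-ready loop with a deadline. The sketch below shows that idea only; it is not Vibium's implementation, and `wait_until_actionable` plus the simulated `enabled` check are hypothetical names chosen for illustration.

```python
import time
from typing import Callable

def wait_until_actionable(checks: dict[str, Callable[[], bool]],
                          timeout: float = 2.0, interval: float = 0.05) -> None:
    """Poll until every check passes, or raise naming the checks that still fail."""
    deadline = time.monotonic() + timeout
    while True:
        failing = [name for name, check in checks.items() if not check()]
        if not failing:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not actionable, still failing: {failing}")
        time.sleep(interval)

# Simulated element that only becomes enabled on the third poll.
polls = {"count": 0}

def enabled() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3

wait_until_actionable({"visible": lambda: True, "enabled": enabled})
print(polls["count"])
```

An agent that clicks on the first poll would have failed here; waiting turns that early failure into a short delay.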
How do I connect Vibium to an LLM agent via MCP?
The fastest path is Vibium's built-in MCP server, which most agent hosts can register in one command. MCP (Model Context Protocol) is the open standard for exposing tools to an LLM; Vibium speaks it out of the box.
In Claude Code, add the server:
claude mcp add vibium -- npx -y vibium mcp
Verify it connected:
claude mcp list
# vibium: npx -y vibium mcp - ✓ Connected
Then just ask the agent to do browser work, for example "take a screenshot of example.com and tell me the headline", and it will call browser_launch, browser_navigate, browser_get_text, and browser_screenshot on its own. The same vibium mcp server works with Cursor, Gemini CLI, Claude Desktop, and Windsurf; each host has its own registration syntax. See Vibium MCP in Claude Code for the full walkthrough.
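Some hosts, Claude Desktop among them, register MCP servers through a JSON file with an `mcpServers` map rather than a CLI command. The sketch below builds that entry for the same `npx -y vibium mcp` command; the helper name `mcp_server_entry` is hypothetical, and the file's name and location depend on the host, so check the host's own documentation before copying it.

```python
import json

def mcp_server_entry(name: str, command: str, args: list[str]) -> dict:
    """Build the JSON structure many MCP hosts read to launch a stdio server."""
    return {"mcpServers": {name: {"command": command, "args": args}}}

config = mcp_server_entry("vibium", "npx", ["-y", "vibium", "mcp"])
print(json.dumps(config, indent=2))
```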
Vibium exposes a focused tool set an agent can plan against. The core ones:
| MCP tool | What the agent uses it for |
|---|---|
| browser_launch / browser_quit | Start and stop a session |
| browser_navigate | Go to a URL |
| browser_find / browser_find_all | Locate elements (returns tag, text, bounds) |
| browser_get_text / browser_get_html | Read page or element content |
| browser_click / browser_type | Act on elements by CSS selector |
| browser_screenshot | Capture the page for a vision model |
| browser_evaluate | Run JavaScript for anything custom |
| browser_new_tab / browser_switch_tab | Manage multiple tabs |
Because each tool call is a single request/response exchange, callback-style APIs (event listeners, request routing) are intentionally not exposed over MCP; an agent uses the higher-level actions instead.
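On the wire, an MCP tool call is a JSON-RPC 2.0 request with the method `tools/call`, and the result comes back as a list of content parts. The sketch below builds one request and reads one response. The envelope follows the MCP specification; the argument name `url` for browser_navigate is an assumption, since the real argument names come from the tool schema the server advertises.

```python
from itertools import count

_ids = count(1)

def tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for an MCP tool invocation."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

def text_of(response: dict) -> str:
    """Join the text parts of a tool result, raising if the tool reported an error."""
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool call failed")
    return "".join(p["text"] for p in result["content"] if p["type"] == "text")

# "url" is an assumed argument name, used here for illustration only.
request = tool_call("browser_navigate", {"url": "https://example.com"})

# A hand-written response in the shape the MCP specification describes.
sample_response = {
    "jsonrpc": "2.0",
    "id": request["id"],
    "result": {"content": [{"type": "text", "text": "Example Domain"}], "isError": False},
}
print(text_of(sample_response))
```

In practice the agent host builds and parses these messages for you; seeing the shape mainly helps when debugging a server that does not respond as expected.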
How do I drive Vibium directly from agent code?
When you want tighter control than MCP tools give — custom retry logic, your own tool schema, or Vibium as one tool among many — call the client library directly and wrap it as an agent tool. This is the common pattern for LangChain, custom Claude/GPT loops, and evaluation harnesses.
The JavaScript sync API is the quickest to reason about inside a tool function:
const { browser } = require('vibium/sync')
// A single tool your agent can call: "read a page's main text"
function readPage(url) {
  const bro = browser.launch({ headless: true })
  try {
    const vibe = bro.page()
    vibe.go(url)
    // Hand the model a compact, semantic view instead of raw HTML
    const tree = vibe.a11yTree()
    const heading = vibe.find({ role: 'heading' }).text()
    return { heading, tree }
  } finally {
    bro.close()
  }
}
console.log(readPage('https://example.com'))
The Python sync client is the natural fit for LangChain and most AI stacks:
from vibium import browser_sync as browser
def click_by_role(url: str, role: str, text: str) -> str:
    """An agent tool: navigate and click an element chosen the way a human would."""
    vibe = browser.launch(headless=True)
    try:
        vibe.go(url)
        vibe.find(role=role, text=text).click()  # auto-waits for actionability
        return f"clicked {role} '{text}', now at {vibe.url()}"
    finally:
        vibe.quit()
print(click_by_role("https://example.com", "link", "More information..."))
Two things make these safe as agent tools. Each call opens and closes its own session in a finally block, so a failed step never leaks a Chrome process. And find(role=..., text=...) uses a semantic selector, so the tool keeps working when the site's markup shifts under it. For a deeper LangChain integration, see Vibium with LangChain, and build an AI agent that browses.
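To expose a function like click_by_role under your own tool schema, a custom loop usually needs two pieces: a JSON Schema description the model sees, and a dispatcher that validates the model's arguments before calling the function. The sketch below assumes a generic function-calling shape; the exact field names vary by provider (some use `parameters`, others `input_schema`), and `stub_click_by_role` is a hypothetical stand-in so the example runs without a browser.

```python
import json
from typing import Callable

CLICK_BY_ROLE_SCHEMA = {
    "name": "click_by_role",
    "description": "Navigate to a URL and click the element with the given role and text.",
    "parameters": {
        "type": "object",
        "properties": {
            "url": {"type": "string"},
            "role": {"type": "string"},
            "text": {"type": "string"},
        },
        "required": ["url", "role", "text"],
    },
}

def dispatch(call: dict, registry: dict[str, tuple[dict, Callable[..., str]]]) -> str:
    """Validate required arguments, then run the registered tool function."""
    schema, fn = registry[call["name"]]
    args = json.loads(call["arguments"])  # assumes arguments arrive as a JSON string
    missing = [k for k in schema["parameters"]["required"] if k not in args]
    if missing:
        return f"error: missing arguments {missing}"
    return fn(**args)

# Stand-in for the real Vibium-backed function, so the sketch needs no browser.
def stub_click_by_role(url: str, role: str, text: str) -> str:
    return f"clicked {role} '{text}' on {url}"

registry = {"click_by_role": (CLICK_BY_ROLE_SCHEMA, stub_click_by_role)}
good = dispatch({"name": "click_by_role", "arguments": json.dumps(
    {"url": "https://example.com", "role": "link", "text": "More information..."})}, registry)
bad = dispatch({"name": "click_by_role", "arguments": "{}"}, registry)
print(good)
print(bad)
```

Returning the validation failure as a string, rather than raising, lets the model read the message and correct its next call.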
How does the accessibility tree help an agent perceive a page?
The accessibility tree helps because it is the smallest, most semantic representation of a page an agent can reason over. Feeding raw HTML into a prompt burns thousands of tokens on <div> soup and inline styles the model does not need; the a11y tree strips that down to what actually matters — what each element is and what it is called.
a11yTree() returns nodes like this for a login form:
{
  "role": "WebArea",
  "name": "Login",
  "children": [
    { "role": "heading", "level": 1 },
    { "role": "textbox", "name": "Username" },
    { "role": "textbox", "name": "Password" },
    { "role": "checkbox", "name": "Remember me", "checked": false },
    { "role": "button", "name": "Sign in" }
  ]
}
An agent reads that, decides "fill the Username textbox and click the Sign in button," then acts with matching semantic selectors:
const { browser } = require('vibium/sync')
const bro = browser.launch()
const vibe = bro.page()
vibe.go('https://example.com/login')
const tree = vibe.a11yTree() // compact perception for the model
// The textbox name comes from its label, so target it with label
vibe.find({ role: 'textbox', label: 'Username' }).type('alice')
// The button name comes from its visible text, so target it with text
vibe.find({ role: 'button', text: 'Sign in' }).click()
bro.close()
Note the mapping the model must respect: when a node's name comes from an aria-label or a `<label>`, target it with label; when the name comes from visible text (buttons, links), use text. Getting that right is the difference between a selector that holds and one that silently misses. The full rules live in the Vibium accessibility tree and the glossary.
Two scoping options keep the tree agent-sized on real pages. a11yTree({ root: 'nav' }) limits the snapshot to one section, and the default already hides generic containers so the model sees only meaningful nodes.
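Once the tree is in hand, two small helpers cover most of what an agent prompt needs: a compact text outline for the model, and a rule that turns a chosen node into a semantic selector. The sketch below works on a plain dict shaped like the login example above. Both helper names are hypothetical, and picking `text` versus `label` purely by role is a simplification of the mapping described earlier, since the tree alone does not say where a name came from.

```python
def outline(node: dict, depth: int = 0) -> list[str]:
    """Render an a11y-tree dict as one indented line per node."""
    line = "  " * depth + node["role"]
    if "name" in node:
        line += f' "{node["name"]}"'
    states = [f"{k}={v}" for k, v in node.items() if k not in ("role", "name", "children")]
    if states:
        line += " [" + ", ".join(states) + "]"
    lines = [line]
    for child in node.get("children", []):
        lines.extend(outline(child, depth + 1))
    return lines

# Simplifying assumption: these roles are named by their visible text.
TEXT_NAMED_ROLES = {"button", "link"}

def selector_for(node: dict) -> dict:
    """Pick text for roles named by visible text, label for everything else."""
    key = "text" if node["role"] in TEXT_NAMED_ROLES else "label"
    return {"role": node["role"], key: node["name"]}

tree = {
    "role": "WebArea", "name": "Login",
    "children": [
        {"role": "heading", "level": 1},
        {"role": "textbox", "name": "Username"},
        {"role": "textbox", "name": "Password"},
        {"role": "checkbox", "name": "Remember me", "checked": False},
        {"role": "button", "name": "Sign in"},
    ],
}
lines = outline(tree)
print("\n".join(lines))
print(selector_for(tree["children"][4]))
```

The outline costs one short line per node, which is usually far less than the JSON form and easier for a model to quote back when it names its target.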
When should an AI engineer choose Vibium over the alternatives?
Choose Vibium when your agent targets Chrome, you want a built-in MCP server, and a lean single-binary footprint matters more than breadth. It is not the answer for every stack, and being honest about that builds trust.
| Factor | Vibium | Playwright | Selenium |
|---|---|---|---|
| Built-in MCP server | Yes, ships in the binary | Yes (separate Playwright MCP) | No official MCP |
| Install footprint | One Go binary, auto-gets Chrome | Node package + browser binaries | Driver + language bindings |
| Browser coverage | Chrome only (today) | Chromium, Firefox, WebKit | All major browsers + Grid |
| Semantic finds | One find(), CSS or semantic | 7 getBy* methods, chainable | CSS/XPath, less semantic |
| Languages | Python, JS/TS | JS/TS, Python, Java, .NET | Java, C#, Python, Ruby, JS |
| Ecosystem maturity | New (v1 late 2025) | Large, mature | Largest, 20 years |
When to choose Vibium: you are building a Chrome-based browsing agent, an MCP-driven tool, or an AI test harness and you value auto-waiting plus a tiny setup. It is a genuinely strong default for "give my LLM a browser."
When to choose Playwright: you need cross-browser coverage today, Java or .NET clients, or its deep ecosystem of reporters and integrations. Its MCP server is capable too. Compare them head-to-head in Vibium vs Playwright and Vibium vs Playwright MCP.
When to choose Selenium: you have an existing Selenium suite, need Grid, or must run the widest browser matrix. See Vibium vs Selenium.
The fair verdict: Vibium wins on agent ergonomics and setup speed for Chrome; Playwright and Selenium win on breadth and maturity. For a new AI-agent project scoped to Chrome, Vibium's MCP-first design is often the least-friction path — but pair it with another tool the day you need Firefox, Safari, or Java.
How do I build a complete browsing tool an agent can call?
A complete tool follows one shape: take the agent's intent, do the minimum browser work to satisfy it, and return a small structured result the model can chain on. The example below is a self-contained "search and extract" tool — the kind of building block a research agent calls dozens of times.
from vibium import browser_sync as browser
def search_and_extract(query: str, result_index: int = 0) -> dict:
    """Agent tool: run a site search, open the Nth result, return its heading + text.
    Returns a compact dict the LLM can reason over, never raw HTML.
    """
    vibe = browser.launch(headless=True)
    try:
        vibe.go("https://example.com")
        # Act like a user: find the search box by its accessible role, not a brittle #id
        box = vibe.find(role="searchbox")
        box.type(query)
        box.press("Enter")
        # Grab result links semantically, then pick the one the agent asked for
        results = vibe.find_all(role="link")
        if result_index >= len(results):
            return {"ok": False, "reason": f"only {len(results)} results found"}
        results[result_index].click()
        heading = vibe.find(role="heading").text()
        body = vibe.find("main").text()[:2000]  # cap tokens
        return {
            "ok": True,
            "url": vibe.url(),
            "heading": heading,
            "excerpt": body,
        }
    except Exception as e:
        # Return the failure as data so the agent can retry or re-plan
        return {"ok": False, "reason": str(e)}
    finally:
        vibe.quit()
Three design choices make this production-grade for an agent, not just a demo:
- Errors are returned, not raised. An agent cannot catch a Python exception mid-plan, but it can read `{"ok": false, "reason": ...}` and decide to retry with a different index or query. Structured failure is what keeps a loop from dead-ending.
- Output is capped and semantic. Slicing `body` to 2,000 characters bounds token cost, and returning `heading` plus `url` gives the model anchors to reason about without re-sending the whole page.
- The session is scoped to the call. `launch()` and `quit()` bracket every invocation, so a hundred agent calls do not accumulate a hundred Chrome processes.
The JavaScript equivalent follows the same contract if your agent runtime is Node:
const { browser } = require('vibium/sync')
function extractHeadline(url) {
  const bro = browser.launch({ headless: true })
  try {
    const vibe = bro.page()
    vibe.go(url)
    return { ok: true, headline: vibe.find({ role: 'heading' }).text(), url: vibe.url() }
  } catch (e) {
    return { ok: false, reason: String(e) }
  } finally {
    bro.close()
  }
}
Register either function as a tool in your framework (a LangChain @tool, an MCP tool, or an entry in a custom function-calling schema) and the LLM can now research the live web with a reliable, low-token interface. For a full end-to-end agent, see build an AI agent that browses; for form-heavy flows, see let an LLM fill forms.
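If you write several tools, the "errors as data" and "capped output" parts of that contract can be factored into one wrapper instead of being repeated in every function. The decorator below is a sketch of that idea in plain Python; `agent_tool` and `fake_extract` are hypothetical names, and session handling is deliberately left inside each tool function.

```python
import functools

def agent_tool(max_chars: int = 2000, capped_fields: tuple[str, ...] = ("excerpt",)):
    """Wrap a dict-returning function so it never raises and caps long text fields."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs) -> dict:
            try:
                result = dict(fn(*args, **kwargs))
            except Exception as exc:
                # Failure becomes data the model can read and re-plan around
                return {"ok": False, "reason": f"{type(exc).__name__}: {exc}"}
            for field in capped_fields:
                if isinstance(result.get(field), str):
                    result[field] = result[field][:max_chars]  # bound token cost
            return {"ok": True, **result}
        return wrapper
    return decorate

# Stand-in for a browser-backed tool, so the sketch runs without Vibium.
@agent_tool(max_chars=10)
def fake_extract(url: str) -> dict:
    if not url.startswith("https://"):
        raise ValueError("only https URLs are allowed")
    return {"url": url, "excerpt": "x" * 50}

print(fake_extract("https://example.com"))
print(fake_extract("ftp://example.com"))
```

Capping only named fields keeps short anchors such as the URL intact while still bounding the large ones.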
Should agents use MCP tools or the client library?
Use MCP when you want the agent host to own the browser, and the client library when your code owns it. Both are valid; the choice is about where control and state live.
| Consideration | MCP server | Client library (as a tool) |
|---|---|---|
| Setup | One command (claude mcp add ...) | Write a tool function |
| Who plans actions | The host LLM, tool by tool | Your code, exposing coarse tools |
| State across calls | Host keeps the session alive | You decide (per-call or shared) |
| Custom retry / caching | Limited to tool granularity | Full control in your function |
| Best for | Interactive agents (Claude Code, Cursor) | Harnesses, LangChain, batch eval |
For an interactive coding or research agent, MCP is the least work and the model orchestrates naturally. For an evaluation harness that runs the same flow across a hundred URLs, wrapping the client library gives you the retry, caching, and concurrency control MCP's stateless tools do not. Many teams use both: MCP for exploration, the library for production runs.
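A harness of that kind might look like the sketch below: a retry wrapper around any tool that returns an `{"ok": ...}` dict, plus a thread pool to run many URLs at once. Everything here is illustrative. `flaky_tool` simulates a browser-backed tool, and whether concurrent Vibium sessions behave well at a given worker count is something to verify on your own machine.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def with_retry(tool: Callable[[str], dict], url: str, attempts: int = 3) -> dict:
    """Call the tool until it reports ok or the attempt budget runs out."""
    result, used = {"ok": False, "reason": "not attempted"}, 0
    for used in range(1, attempts + 1):
        result = tool(url)
        if result["ok"]:
            break
    return {**result, "url": url, "attempts": used}

def run_batch(tool: Callable[[str], dict], urls: list[str],
              workers: int = 4, attempts: int = 3) -> list[dict]:
    """Run the tool over every URL concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: with_retry(tool, u, attempts), urls))

# Simulated tool: the first call per URL fails, and "broken" URLs always fail.
_calls: dict[str, int] = {}
_lock = threading.Lock()

def flaky_tool(url: str) -> dict:
    with _lock:
        _calls[url] = _calls.get(url, 0) + 1
        n = _calls[url]
    if "broken" in url:
        return {"ok": False, "reason": "simulated failure"}
    if n == 1:
        return {"ok": False, "reason": "simulated timeout"}
    return {"ok": True, "heading": f"Heading for {url}"}

results = run_batch(flaky_tool, ["https://example.com/a", "https://example.com/broken"])
for row in results:
    print(row)
```

Recording `attempts` in each row makes flakiness visible in the eval output instead of hiding it behind a successful final result.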
What are the gotchas when building agents on Vibium?
The gotchas are the usual edges of a young, Chrome-only tool plus a few specific to agent workloads. Knowing them upfront saves debugging later.
- Chrome only. No Firefox, Safari, or Edge yet. If your agent must verify a page across browsers, Vibium alone will not do it.
- Two official clients. Python and JavaScript/TypeScript. There is no official Java or .NET client, so JVM-based agent stacks need a bridge.
- Natural-language methods are roadmap. Do not promise stakeholders `page.do()` or `page.check()` today. Build on structured tools and let the LLM plan; adopt those methods when they ship.
- Manage sessions per task. Agents can wander. Open and close the browser inside each tool call (a `finally` block), or an errored step can leak a Chrome process. Vibium's 26.2 release hardened process cleanup, but tidy tool code is still your job.
- Prefer semantic over CSS in agent tools. Deep CSS paths break when a site ships a redesign. `find({ role, text })` and the a11y tree keep tools resilient across those changes.
- Screenshots for vision, tree for reasoning. Send `screenshot()` to a vision model when layout matters; send `a11yTree()` when you want cheap, structured reasoning. Mixing them thoughtfully keeps token cost down.
None of these are correctness problems — they are scope and maturity trade-offs. Weigh them against your project, and Vibium is a capable base for browsing agents. To go deeper on structuring resilient automation, read the page object model and automate login with Vibium.
Next steps
- What is Vibium — the one-page overview
- Install Vibium — get set up in minutes
- Vibium MCP in Claude Code — full agent setup
- Build an AI agent that browses — end-to-end example
- find element and screenshot — core commands agents call
- Vibium vs Playwright — pick the right tool
- The 45-day roadmap and the course — structured learning paths
Frequently asked questions
What is Vibium for AI engineers?
Vibium is an AI-native browser automation tool built on WebDriver BiDi. For AI engineers it ships a built-in MCP server that exposes browser actions as tools an LLM agent can call, plus semantic finds and an accessibility tree that give models a clean, structured view of any page.
How do I give an LLM agent browser access with Vibium?
Run Vibium's built-in MCP server and register it with your agent host. In Claude Code the command is `claude mcp add vibium -- npx -y vibium mcp`. The agent then calls tools like browser_navigate, browser_find, and browser_click without you writing glue code for each action.
Is Vibium better than Playwright for AI agents?
For agent work Vibium's edge is a built-in MCP server, a single Go binary, and semantic finds in one method. Playwright has a larger ecosystem, cross-browser support, and its own MCP server too. Choose Vibium for lean Chrome-plus-MCP agents; Playwright when you need broad browser coverage.
Does Vibium have natural-language browser commands?
Shipped Vibium gives agents structured tools (find by role, text, label; click; type; screenshot) plus the accessibility tree, and the LLM does the planning. Fully natural-language methods like page.do() and page.check() are on Vibium's roadmap, not in the current 26.2 release.
What LLM frameworks work with Vibium?
Any framework that speaks MCP or can shell out works with Vibium. That includes Claude Code, Cursor, Gemini CLI, Claude Desktop, and Windsurf via MCP, plus LangChain and custom agents that call the Python or JavaScript client directly as tools.
Why use the accessibility tree instead of raw HTML for agents?
Raw HTML is huge and noisy, which wastes tokens and confuses models. Vibium's a11yTree() returns a compact structured view — roles, names, and states — that mirrors how a user perceives the page, so an agent can reason about it and pick reliable semantic selectors with far fewer tokens.
Vibium is created by Jason Huggins. This is an independent tutorial — see the official Vibium site and GitHub repo for canonical docs.