Vibium for AI Engineers and Agent Builders

Vibium for AI engineers and agent builders: a hands-on guide to driving a browser from LLM agents via MCP, semantic finds, and the accessibility tree.

By Pramod Dutta · 15 min read · Verified with Vibium 26.2
Animated overview (made with Remotion)

Vibium is an AI-native browser automation tool that gives LLM agents a clean, reliable way to see and drive a real browser, and for AI engineers its headline feature is a built-in MCP server plus a structured, model-friendly view of any page. Created by Jason Huggins — co-creator of Selenium and Appium — Vibium ships as a single Go binary that auto-downloads Chrome, speaks the WebDriver BiDi standard, and exposes browser actions as MCP tools an agent can call with zero glue code. Instead of scraping raw HTML into your prompt, you hand the model a compact accessibility tree and semantic selectors (find by role, text, or label) that mirror how a human perceives a page. That combination — MCP for control, the a11y tree for perception, and deterministic BiDi actions underneath — is what makes Vibium a fit for building browsing agents, evaluation harnesses, and AI-driven test flows. This guide is an independent learning hub for that workflow.

Why does browser automation matter for AI engineers?

Browser automation matters for AI engineers because most useful agent tasks end at a web page — logging into a dashboard, filling a form, reading a result, verifying a UI. An LLM that can only produce text is stuck the moment a task requires clicking something. Giving the model a browser turns a chatbot into an agent that can act.

The hard part has never been "open a URL." It is giving the model a reliable interface: a way to perceive the page without drowning in markup, a way to act without brittle coordinate-clicking, and a control channel that does not require hand-writing a wrapper for every action. Vibium is built around exactly those three needs.

Traditional automation tools were designed for humans writing scripts. Vibium keeps that deterministic core but adds an agent-facing layer on top — the MCP server, semantic finds, and the accessibility tree — so the same engine serves both a QA engineer and a Claude or GPT agent.

What makes Vibium AI-native?

Vibium is AI-native because agent support is a first-class part of the product, not a plugin bolted on later. Three shipped capabilities carry that claim, and one more is on the roadmap.

  • Built-in MCP server. Every Vibium install includes a Model Context Protocol server. Start it with vibium mcp (or npx -y vibium mcp) and an agent host can call browser actions as tools. No custom HTTP shim, no per-action wrapper.
  • Semantic finds. One find() method takes either a CSS string or a semantic object — find({ role: 'button', text: 'Submit' }). Agents pick selectors the way a person describes an element, which survives DOM churn far better than deep CSS paths.
  • Accessibility tree. a11yTree() returns a compact, structured snapshot — roles, names, states — instead of raw HTML. It is the ideal input for an LLM: small, semantic, and close to how a user reads the page.
  • Roadmap: natural-language methods. page.do("log in as admin") and page.check("cart is empty") are designed but not in the current 26.2 release. Today the LLM does the planning and calls structured tools; those methods will fold the planning into Vibium itself.

For the bigger picture on what Vibium is, see what is Vibium. For an honest read on maturity, see is Vibium production-ready.

How does the agent-plus-Vibium loop work?

The loop is simple: the agent decides what to do, calls a Vibium tool, Vibium executes it deterministically in Chrome, and the result flows back to the model for the next step. The diagram near the top of this article shows the five stages.

Here is the same loop as a table, mapped to what each stage actually does:

| Stage | Who acts | What happens | Vibium's role |
| --- | --- | --- | --- |
| Prompt / goal | LLM | Reads the task and current page state | Provides page state (a11y tree, text) |
| Tool call | LLM | Emits an MCP tool call, e.g. browser_click | Exposes the tool schema |
| Execute | Vibium | Translates the call into WebDriver BiDi | Runs the action, auto-waits for actionability |
| Chrome action | Chrome | Clicks, types, navigates | Driven over BiDi WebSocket |
| Result | Vibium → LLM | Returns text, URL, or a screenshot | Serializes a clean, model-readable result |
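
The five stages above reduce to a short loop. The sketch below is a minimal illustration, not Vibium code: fake_model and fake_execute are hypothetical stand-ins for an LLM and for a Vibium-backed tool executor, and the argument names are assumptions. Only the tool names browser_navigate and browser_get_text come from this guide.

```python
# Hypothetical stand-ins: a real setup would call an LLM and a Vibium-backed executor.
def fake_model(goal: str, observation: str) -> dict:
    """Pretend planner: navigate first, then read the text, then finish."""
    if observation == "":
        return {"tool": "browser_navigate", "args": {"url": "https://example.com"}}
    if observation.startswith("navigated"):
        return {"tool": "browser_get_text", "args": {}}
    return {"tool": "done", "args": {"answer": observation}}


def fake_execute(tool: str, args: dict) -> str:
    """Pretend executor: returns a short, model-readable result string."""
    if tool == "browser_navigate":
        return f"navigated to {args['url']}"
    if tool == "browser_get_text":
        return "Example Domain"
    raise ValueError(f"unknown tool: {tool}")


def run_agent(goal: str, model, execute, max_steps: int = 5) -> str:
    """Prompt -> tool call -> execute -> result, repeated until the model is done."""
    observation = ""
    for _ in range(max_steps):
        call = model(goal, observation)          # the model decides the next step
        if call["tool"] == "done":
            return call["args"]["answer"]
        observation = execute(call["tool"], call["args"])  # result flows back
    return "step budget exhausted"


result = run_agent("read the headline", fake_model, fake_execute)
print(result)
```

The step budget matters in practice: without max_steps, a confused planner can loop forever on the same tool call.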

The important detail for reliability: Vibium auto-waits. Before it clicks or types, it checks that the element is present, visible, stable, and enabled. Agents are prone to firing an action a beat too early; auto-waiting absorbs that, so an agent run hits fewer flaky "element not found" dead ends. Read how it works in how actionability works, or start from find element.
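
To make the idea concrete, here is a rough sketch of what an actionability wait can look like. It is not Vibium's implementation: probe is a hypothetical callable that reports an element's current state, and "stable" is approximated here as "the bounding box did not move between two consecutive polls".

```python
import time
from typing import Callable


def wait_until_actionable(probe: Callable[[], dict],
                          timeout: float = 2.0,
                          interval: float = 0.05) -> dict:
    """Poll until the element is present, visible, enabled, and no longer moving.

    `probe` is a hypothetical callable returning a state dict such as
    {"present": True, "visible": True, "enabled": True, "box": (x, y, w, h)}.
    """
    deadline = time.monotonic() + timeout
    previous_box = None
    while True:
        state = probe()
        ready = state.get("present") and state.get("visible") and state.get("enabled")
        box = state.get("box")
        stable = box is not None and box == previous_box
        if ready and stable:
            return state
        previous_box = box
        if time.monotonic() >= deadline:
            raise TimeoutError(f"element not actionable after {timeout}s: {state}")
        time.sleep(interval)


# Simulated element: hidden at first, then sliding into place, then settled.
states = iter([
    {"present": True, "visible": False, "enabled": True, "box": (0, 0, 10, 10)},
    {"present": True, "visible": True, "enabled": True, "box": (0, 5, 10, 10)},
    {"present": True, "visible": True, "enabled": True, "box": (0, 8, 10, 10)},
    {"present": True, "visible": True, "enabled": True, "box": (0, 8, 10, 10)},
])
polls = []


def probe() -> dict:
    state = next(states)
    polls.append(state)
    return state


final = wait_until_actionable(probe, interval=0.01)
print(len(polls), final["box"])
```

An agent that clicks on the second poll would hit an element that is still animating; waiting for two identical boxes is what absorbs that.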

How do I connect Vibium to an LLM agent via MCP?

The fastest path is Vibium's built-in MCP server, which most agent hosts can register in one command. MCP (Model Context Protocol) is the open standard for exposing tools to an LLM; Vibium speaks it out of the box.

In Claude Code, add the server:

claude mcp add vibium -- npx -y vibium mcp

Verify it connected:

claude mcp list
# vibium: npx -y vibium mcp - ✓ Connected

Then just ask the agent to do browser work — "take a screenshot of example.com and tell me the headline" — and it will call browser_launch, browser_navigate, browser_get_text, and browser_screenshot on its own. The same vibium mcp server works with Cursor, Gemini CLI, Claude Desktop, and Windsurf; each host has its own registration syntax. See Vibium MCP in Claude Code for the full walkthrough.
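
Several MCP hosts register servers through a JSON file with an mcpServers map rather than a CLI command. The sketch below builds such an entry for the same npx -y vibium mcp command. Treat the file name and location as host-specific details to confirm in your host's documentation; the helper name mcp_server_entry is made up for this example.

```python
import json


def mcp_server_entry(name: str, command: str, args: list[str]) -> dict:
    """Build an `mcpServers` config entry for a stdio MCP server."""
    return {"mcpServers": {name: {"command": command, "args": args}}}


# Same command the Claude Code example uses, expressed as JSON config.
config = mcp_server_entry("vibium", "npx", ["-y", "vibium", "mcp"])
config_text = json.dumps(config, indent=2)
print(config_text)
```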

Vibium exposes a focused tool set an agent can plan against. The core ones:

| MCP tool | What the agent uses it for |
| --- | --- |
| browser_launch / browser_quit | Start and stop a session |
| browser_navigate | Go to a URL |
| browser_find / browser_find_all | Locate elements (returns tag, text, bounds) |
| browser_get_text / browser_get_html | Read page or element content |
| browser_click / browser_type | Act on elements by CSS selector |
| browser_screenshot | Capture the page for a vision model |
| browser_evaluate | Run JavaScript for anything custom |
| browser_new_tab / browser_switch_tab | Manage multiple tabs |

Because each MCP tool call is a single request/response exchange, callback-style APIs (event listeners, request routing) are intentionally not exposed over MCP; an agent uses the higher-level actions instead.
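
On the wire, each tool invocation is an MCP tools/call request encoded as JSON-RPC 2.0. The sketch below shows the message shape a host would send. The tool names come from the table above, but the argument names (url, selector) are assumptions for illustration; the real ones are defined by the schema the server advertises.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique per connection


def tool_call_request(tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Argument names below are assumed, not taken from Vibium's tool schema.
nav = tool_call_request("browser_navigate", {"url": "https://example.com"})
click = tool_call_request("browser_click", {"selector": "a"})
print(nav)
print(click)
```

One request, one response: that shape is why listener-style APIs do not map cleanly onto MCP tools.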

How do I drive Vibium directly from agent code?

When you want tighter control than MCP tools give — custom retry logic, your own tool schema, or Vibium as one tool among many — call the client library directly and wrap it as an agent tool. This is the common pattern for LangChain, custom Claude/GPT loops, and evaluation harnesses.

The JavaScript sync API is the quickest to reason about inside a tool function:

const { browser } = require('vibium/sync')
 
// A single tool your agent can call: "read a page's main text"
function readPage(url) {
  const bro = browser.launch({ headless: true })
  try {
    const vibe = bro.page()
    vibe.go(url)
    // Hand the model a compact, semantic view instead of raw HTML
    const tree = vibe.a11yTree()
    const heading = vibe.find({ role: 'heading' }).text()
    return { heading, tree }
  } finally {
    bro.close()
  }
}
 
console.log(readPage('https://example.com'))

The Python sync client is the natural fit for LangChain and most AI stacks:

from vibium import browser_sync as browser
 
def click_by_role(url: str, role: str, text: str) -> str:
    """An agent tool: navigate and click an element chosen the way a human would."""
    vibe = browser.launch(headless=True)
    try:
        vibe.go(url)
        vibe.find(role=role, text=text).click()   # auto-waits for actionability
        return f"clicked {role} '{text}', now at {vibe.url()}"
    finally:
        vibe.quit()
 
print(click_by_role("https://example.com", "link", "More information..."))

Two things make these safe as agent tools. Each call opens and closes its own session in a finally block, so a failed step never leaks a Chrome process. And find(role=..., text=...) uses a semantic selector, so the tool keeps working when the site's markup shifts under it. For a deeper LangChain integration, see Vibium with LangChain, and build an AI agent that browses.
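
Before a function like click_by_role can be offered to a model, the agent framework needs a schema describing it. The sketch below shows one plausible JSON Schema definition plus a small argument check to run before dispatching. The key that holds the schema differs by framework (input_schema is used here as an assumption), and validate_arguments is a hypothetical helper, not part of Vibium or any framework.

```python
# Tool definition in JSON Schema form. The top-level key names vary by framework.
CLICK_BY_ROLE_TOOL = {
    "name": "click_by_role",
    "description": "Navigate to a URL and click the element with the given role and visible text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "url": {"type": "string"},
            "role": {"type": "string"},
            "text": {"type": "string"},
        },
        "required": ["url", "role", "text"],
    },
}


def validate_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is safe to dispatch."""
    schema = tool["input_schema"]
    properties = schema["properties"]
    problems = [f"missing argument: {key}" for key in schema["required"] if key not in arguments]
    problems += [f"unexpected argument: {key}" for key in arguments if key not in properties]
    problems += [
        f"argument {key} must be a string"
        for key, value in arguments.items()
        if key in properties and not isinstance(value, str)
    ]
    return problems


good = {"url": "https://example.com", "role": "link", "text": "More information..."}
bad = {"url": "https://example.com", "role": "link"}
print(validate_arguments(CLICK_BY_ROLE_TOOL, good))
print(validate_arguments(CLICK_BY_ROLE_TOOL, bad))
```

Checking arguments before launching a browser is cheap, and it lets you hand the model a precise complaint instead of a stack trace.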

How does the accessibility tree help an agent perceive a page?

The accessibility tree helps because it is the smallest, most semantic representation of a page an agent can reason over. Feeding raw HTML into a prompt burns thousands of tokens on <div> soup and inline styles the model does not need; the a11y tree strips that down to what actually matters — what each element is and what it is called.

a11yTree() returns nodes like this for a login form:

{
  "role": "WebArea",
  "name": "Login",
  "children": [
    { "role": "heading", "level": 1 },
    { "role": "textbox", "name": "Username" },
    { "role": "textbox", "name": "Password" },
    { "role": "checkbox", "name": "Remember me", "checked": false },
    { "role": "button", "name": "Sign in" }
  ]
}

An agent reads that, decides "fill the Username textbox and click the Sign in button," then acts with matching semantic selectors:

const { browser } = require('vibium/sync')
 
const bro = browser.launch()
const vibe = bro.page()
vibe.go('https://example.com/login')
 
const tree = vibe.a11yTree()          // compact perception for the model
vibe.find({ role: 'textbox', label: 'Username' }).type('alice')
vibe.find({ role: 'button', text: 'Sign in' }).click()   // button name comes from visible text
 
bro.close()

Note the mapping the model must respect: when a node's name comes from an aria-label or a <label>, target it with label; when the name comes from visible text (buttons, links), use text. Getting that right is the difference between a selector that holds and one that silently misses. The full rules live in the Vibium accessibility tree and the glossary.
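
The label-versus-text rule can be encoded as a small helper that turns a tree node into a semantic selector. This is a heuristic sketch only: the tree does not say where a name came from, so the helper guesses from the role, and the TEXT_NAMED_ROLES set is an assumption rather than Vibium's rule set. An icon-only button named by aria-label, for example, would still need label.

```python
# Roles whose accessible name usually comes from visible text.
# This set is an assumption for illustration, not Vibium's own rules.
TEXT_NAMED_ROLES = {"button", "link", "heading", "tab", "menuitem"}


def selector_for(node: dict) -> dict:
    """Map an a11y tree node to a semantic selector dict."""
    role = node["role"]
    name = node.get("name")
    if not name:
        return {"role": role}            # unnamed node: role is all we have
    key = "text" if role in TEXT_NAMED_ROLES else "label"
    return {"role": role, key: name}


print(selector_for({"role": "textbox", "name": "Username"}))
print(selector_for({"role": "button", "name": "Sign in"}))
print(selector_for({"role": "heading", "level": 1}))
```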

Two scoping options keep the tree agent-sized on real pages. a11yTree({ root: 'nav' }) limits the snapshot to one section, and the default already hides generic containers so the model sees only meaningful nodes.
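
Even a scoped tree is cheaper to send as an indented outline than as JSON. The sketch below flattens the login-form sample from earlier into one line per node. The flatten helper is hypothetical, and of the state keys it looks for, only level and checked appear in the sample; disabled and expanded are assumed.

```python
import json


def flatten(node: dict, depth: int = 0) -> list[str]:
    """Render an a11y tree as indented 'role "name" state=value' lines."""
    parts = [node["role"]]
    if node.get("name"):
        parts.append(f'"{node["name"]}"')
    # "level" and "checked" come from the sample; the other keys are assumed.
    for key in ("level", "checked", "disabled", "expanded"):
        if key in node:
            parts.append(f"{key}={node[key]}")
    lines = ["  " * depth + " ".join(parts)]
    for child in node.get("children", []):
        lines.extend(flatten(child, depth + 1))
    return lines


tree = {
    "role": "WebArea",
    "name": "Login",
    "children": [
        {"role": "heading", "level": 1},
        {"role": "textbox", "name": "Username"},
        {"role": "textbox", "name": "Password"},
        {"role": "checkbox", "name": "Remember me", "checked": False},
        {"role": "button", "name": "Sign in"},
    ],
}

lines = flatten(tree)
outline = "\n".join(lines)
print(outline)
print(len(outline), "characters as outline vs", len(json.dumps(tree)), "as JSON")
```

The outline keeps every role, name, and state while dropping the braces, quotes, and repeated keys that cost tokens without adding meaning.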

When should an AI engineer choose Vibium over the alternatives?

Choose Vibium when your agent targets Chrome, you want a built-in MCP server, and a lean single-binary footprint matters more than breadth. It is not the answer for every stack; the comparison below shows where Playwright and Selenium are stronger.

| Factor | Vibium | Playwright | Selenium |
| --- | --- | --- | --- |
| Built-in MCP server | Yes, ships in the binary | Yes (separate Playwright MCP) | No official MCP |
| Install footprint | One Go binary, auto-gets Chrome | Node package + browser binaries | Driver + language bindings |
| Browser coverage | Chrome only (today) | Chromium, Firefox, WebKit | All major browsers + Grid |
| Semantic finds | One find(), CSS or semantic | Several getBy* methods, chainable | CSS/XPath, less semantic |
| Languages | Python, JS/TS | JS/TS, Python, Java, .NET | Java, C#, Python, Ruby, JS |
| Ecosystem maturity | New (v1 late 2025) | Large, mature | Largest, 20 years |

When to choose Vibium: you are building a Chrome-based browsing agent, an MCP-driven tool, or an AI test harness and you value auto-waiting plus a tiny setup. It is a genuinely strong default for "give my LLM a browser."

When to choose Playwright: you need cross-browser coverage today, Java or .NET clients, or its deep ecosystem of reporters and integrations. Its MCP server is capable too. Compare them head-to-head in Vibium vs Playwright and Vibium vs Playwright MCP.

When to choose Selenium: you have an existing Selenium suite, need Grid, or must run the widest browser matrix. See Vibium vs Selenium.

The fair verdict: Vibium wins on agent ergonomics and setup speed for Chrome; Playwright and Selenium win on breadth and maturity. For a new AI-agent project scoped to Chrome, Vibium's MCP-first design is often the least-friction path — but pair it with another tool the day you need Firefox, Safari, or Java.

How do I build a complete browsing tool an agent can call?

A complete tool follows one shape: take the agent's intent, do the minimum browser work to satisfy it, and return a small structured result the model can chain on. The example below is a self-contained "search and extract" tool — the kind of building block a research agent calls dozens of times.

from vibium import browser_sync as browser
 
def search_and_extract(query: str, result_index: int = 0) -> dict:
    """Agent tool: run a site search, open the Nth result, return its heading + text.
 
    Returns a compact dict the LLM can reason over — never raw HTML.
    """
    vibe = browser.launch(headless=True)
    try:
        vibe.go("https://example.com")
 
        # Act like a user: find the search box by its accessible role, not a brittle #id
        box = vibe.find(role="searchbox")
        box.type(query)
        box.press("Enter")
 
        # Grab result links semantically, then pick the one the agent asked for
        results = vibe.find_all(role="link")
        if result_index >= len(results):
            return {"ok": False, "reason": f"only {len(results)} results found"}
 
        results[result_index].click()
 
        heading = vibe.find(role="heading").text()
        body = vibe.find("main").text()[:2000]   # cap tokens
 
        return {
            "ok": True,
            "url": vibe.url(),
            "heading": heading,
            "excerpt": body,
        }
    except Exception as e:
        # Return the failure as data so the agent can retry or re-plan
        return {"ok": False, "reason": str(e)}
    finally:
        vibe.quit()

Three design choices make this production-grade for an agent, not just a demo:

  1. Errors are returned, not raised. An agent cannot catch a Python exception mid-plan, but it can read {"ok": false, "reason": ...} and decide to retry with a different index or query. Structured failure is what keeps a loop from dead-ending.
  2. Output is capped and semantic. Slicing body to 2,000 characters bounds token cost, and returning heading plus url gives the model anchors to reason about without re-sending the whole page.
  3. The session is scoped to the call. launch() and quit() bracket every invocation, so a hundred agent calls do not accumulate a hundred Chrome processes.
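
The first design choice pays off in the caller. Because failures arrive as data, a thin policy layer can react to them without any exception handling. The sketch below is one possible policy, assuming a tool with the same (query, result_index) signature and {"ok": ...} contract as search_and_extract; stub_tool is a hypothetical stand-in so the example runs without a browser.

```python
def call_with_fallback(tool, query: str, max_index: int = 3) -> dict:
    """Try result_index 0, 1, 2, ... until the tool reports ok or the budget runs out."""
    last_reason = "no attempts made"
    attempts = 0
    for index in range(max_index):
        result = tool(query, index)
        attempts += 1
        if result["ok"]:
            return {**result, "attempts": attempts}
        last_reason = result["reason"]      # structured failure, readable by the agent
    return {"ok": False, "reason": last_reason, "attempts": attempts}


# Hypothetical stub standing in for search_and_extract: the first two results fail to load.
def stub_tool(query: str, result_index: int = 0) -> dict:
    if result_index < 2:
        return {"ok": False, "reason": f"result {result_index} failed to load"}
    return {"ok": True, "url": "https://example.com/page", "heading": "Example", "excerpt": "..."}


outcome = call_with_fallback(stub_tool, "vibium")
exhausted = call_with_fallback(stub_tool, "vibium", max_index=2)
print(outcome)
print(exhausted)
```

Whether the retry policy lives in code like this or in the model's own planning is a design choice; putting it in code makes runs cheaper and more repeatable.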

The JavaScript equivalent follows the same contract if your agent runtime is Node:

const { browser } = require('vibium/sync')
 
function extractHeadline(url) {
  const bro = browser.launch({ headless: true })
  try {
    const vibe = bro.page()
    vibe.go(url)
    return { ok: true, headline: vibe.find({ role: 'heading' }).text(), url: vibe.url() }
  } catch (e) {
    return { ok: false, reason: String(e) }
  } finally {
    bro.close()
  }
}

Register either function as a tool in your framework — a LangChain @tool, an MCP tool, or an entry in a custom function-calling schema — and the LLM can now research the live web with a reliable, low-token interface. For a full end-to-end agent, see build an AI agent that browses; for form-heavy flows, see let an LLM fill forms.

Should agents use MCP tools or the client library?

Use MCP when you want the agent host to own the browser, and the client library when your code owns it. Both are valid; the choice is about where control and state live.

| Consideration | MCP server | Client library (as a tool) |
| --- | --- | --- |
| Setup | One command (claude mcp add ...) | Write a tool function |
| Who plans actions | The host LLM, tool by tool | Your code, exposing coarse tools |
| State across calls | Host keeps the session alive | You decide (per-call or shared) |
| Custom retry / caching | Limited to tool granularity | Full control in your function |
| Best for | Interactive agents (Claude Code, Cursor) | Harnesses, LangChain, batch eval |

For an interactive coding or research agent, MCP is the least work and the model orchestrates naturally. For an evaluation harness that runs the same flow across a hundred URLs, wrapping the client library gives you the retry, caching, and concurrency control that MCP's tool-by-tool calls do not. Many teams use both: MCP for exploration, the library for production runs.
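
As a rough illustration of the harness side, the sketch below runs one tool across a list of URLs with bounded concurrency and a simple retry. It assumes a tool that takes a URL and returns the {"ok": ...} contract used in this guide; stub_extract is a hypothetical stand-in for a Vibium-backed function such as extractHeadline, and the worker and retry counts are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(tool, urls: list[str], max_workers: int = 4, retries: int = 1) -> list[dict]:
    """Run `tool(url)` for every URL in parallel, retrying failed calls."""

    def run_one(url: str) -> dict:
        result = {"ok": False, "reason": "not run"}
        attempts = 0
        for _ in range(retries + 1):
            attempts += 1
            result = tool(url)
            if result["ok"]:
                break
        return {"url": url, "attempts": attempts, **result}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, urls))   # map preserves input order


# Hypothetical stub so the sketch runs without a browser.
def stub_extract(url: str) -> dict:
    if url.endswith("/broken"):
        return {"ok": False, "reason": "navigation failed"}
    return {"ok": True, "headline": f"Headline for {url}"}


urls = [f"https://example.com/{i}" for i in range(5)] + ["https://example.com/broken"]
report = run_batch(stub_extract, urls)
passed = sum(1 for row in report if row["ok"])
print(f"{passed}/{len(report)} passed")
```

With a real browser behind the tool, max_workers also caps how many Chrome processes run at once.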

What are the gotchas when building agents on Vibium?

The gotchas are the usual edges of a young, Chrome-only tool plus a few specific to agent workloads. Knowing them upfront saves debugging later.

  • Chrome only. No Firefox, Safari, or Edge yet. If your agent must verify a page across browsers, Vibium alone will not do it.
  • Two official clients. Python and JavaScript/TypeScript. There is no official Java or .NET client, so JVM-based agent stacks need a bridge.
  • Natural-language methods are roadmap. Do not promise stakeholders page.do() or page.check() today. Build on structured tools and let the LLM plan; adopt those methods when they ship.
  • Manage sessions per task. Agents can wander. Open the browser inside each tool call and close it in a finally block; otherwise an errored step can leak a Chrome process. Vibium's 26.2 release hardened process cleanup, but tidy tool code is still your job.
  • Prefer semantic over CSS in agent tools. Deep CSS paths break when a site ships a redesign. find({ role, text }) and the a11y tree keep tools resilient across those changes.
  • Screenshots for vision, tree for reasoning. Send screenshot() to a vision model when layout matters; send a11yTree() when you want cheap, structured reasoning. Mixing them thoughtfully keeps token cost down.

None of these are correctness problems — they are scope and maturity trade-offs. Weigh them against your project, and Vibium is a capable base for browsing agents. To go deeper on structuring resilient automation, read the page object model and automate login with Vibium.
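
The session-per-task advice is easy to enforce once with a context manager instead of repeating try/finally in every tool. The sketch below shows the pattern with a FakeBrowser, a hypothetical stand-in that only counts open sessions; in real code the launch argument would be whatever call starts your Vibium session, and quit() is assumed to match the Python examples above.

```python
from contextlib import contextmanager


class FakeBrowser:
    """Hypothetical stand-in for a launched browser, so the sketch runs without Chrome."""
    open_count = 0

    def __init__(self) -> None:
        FakeBrowser.open_count += 1

    def go(self, url: str) -> None:
        if "bad" in url:
            raise RuntimeError(f"could not load {url}")

    def quit(self) -> None:
        FakeBrowser.open_count -= 1


@contextmanager
def session(launch):
    """Launch a browser and guarantee quit() runs, even if the body raises."""
    vibe = launch()
    try:
        yield vibe
    finally:
        vibe.quit()


def visit(url: str) -> dict:
    """Agent tool: the session lives exactly as long as this call."""
    try:
        with session(FakeBrowser) as vibe:
            vibe.go(url)
            return {"ok": True, "url": url}
    except Exception as error:
        return {"ok": False, "reason": str(error)}


results = [visit("https://example.com"), visit("https://example.com/bad")]
print(results)
print("open sessions:", FakeBrowser.open_count)
```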

Frequently asked questions

What is Vibium for AI engineers?

Vibium is an AI-native browser automation tool built on WebDriver BiDi. For AI engineers it ships a built-in MCP server that exposes browser actions as tools an LLM agent can call, plus semantic finds and an accessibility tree that give models a clean, structured view of any page.

How do I give an LLM agent browser access with Vibium?

Run Vibium's built-in MCP server and register it with your agent host. In Claude Code the command is `claude mcp add vibium -- npx -y vibium mcp`. The agent then calls tools like browser_navigate, browser_find, and browser_click without you writing glue code for each action.

Is Vibium better than Playwright for AI agents?

For agent work Vibium's edge is a built-in MCP server, a single Go binary, and semantic finds in one method. Playwright has a larger ecosystem, cross-browser support, and its own MCP server too. Choose Vibium for lean Chrome-plus-MCP agents; Playwright when you need broad browser coverage.

Does Vibium have natural-language browser commands?

Shipped Vibium gives agents structured tools (find by role, text, label; click; type; screenshot) plus the accessibility tree, and the LLM does the planning. Fully natural-language methods like page.do() and page.check() are on Vibium's roadmap, not in the current 26.2 release.

What LLM frameworks work with Vibium?

Any framework that speaks MCP or can shell out works with Vibium. That includes Claude Code, Cursor, Gemini CLI, Claude Desktop, and Windsurf via MCP, plus LangChain and custom agents that call the Python or JavaScript client directly as tools.

Why use the accessibility tree instead of raw HTML for agents?

Raw HTML is huge and noisy, which wastes tokens and confuses models. Vibium's a11yTree() returns a compact structured view — roles, names, and states — that mirrors how a user perceives the page, so an agent can reason about it and pick reliable semantic selectors with far fewer tokens.

Vibium is created by Jason Huggins. This is an independent tutorial — see the official Vibium site and GitHub repo for canonical docs.
