Bulk Data Extraction with Vibium
Bulk data extraction with Vibium: build a repeatable scrape pipeline over a URL list, extract with findAll(), and write clean JSON, CSV, or a database.
Bulk data extraction with Vibium means turning a whole list of pages into clean, structured records in one repeatable run — not clicking through them by hand. The pattern is always the same four stages: collect the URLs, load each page and extract its fields with find() and findAll(), normalise the raw strings into typed values, then write everything to JSON, CSV, or a database. Vibium is AI-native browser automation built on WebDriver BiDi and shipped as a single Go binary that auto-downloads Chrome for Testing, so a scraper you write on your laptop runs unchanged on a fresh server — pip install vibium or npm install vibium and you are ready, with no driver to match to a Chrome version. Because Vibium drives a real browser, it captures data rendered by JavaScript after load, and its auto-waiting means a lazy-loaded table will not race your script. Created by Jason Huggins, co-creator of Selenium and Appium, Vibium makes a large scrape a short, dependable program you can schedule, resume, and trust.
What does a Vibium bulk extraction pipeline look like?
A bulk extraction pipeline is four stages that run for every URL in your list: collect, extract, normalise, write. Keeping them separate is what makes a scrape maintainable — you can change how you read a field without touching how you save it, and you can swap the output format without rewriting the extraction.
Think of the extraction step as producing one dictionary per page — a record — and everything downstream just moves records around. Once you have a function that turns one URL into one clean record, scaling to thousands of pages is a loop and an output file.
Here is the smallest end-to-end version in Python, using the sync client that most Vibium scrapers use:
```python
import json
from vibium import browser_sync as browser

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
    # ... hundreds more
]

def extract(vibe, url):
    vibe.go(url)
    return {
        "url": url,
        "title": vibe.find("h1").text(),
        "price": vibe.find(".price").text(),
    }

vibe = browser.launch(headless=True)
try:
    with open("products.jsonl", "w", encoding="utf-8") as out:
        for url in urls:
            record = extract(vibe, url)
            out.write(json.dumps(record) + "\n")
            print(f"scraped {url}")
finally:
    vibe.quit()
```

This reuses one browser across the whole list, writes each record as its own line to products.jsonl, and guarantees the browser closes with try/finally. The next sections make each stage sturdier.
How does each stage of the pipeline work?
Each stage has one job, and the record dictionary is the contract between them. Understanding the four stages separately is what lets you debug a bulk scrape quickly when one page misbehaves.
- Collect: assemble the URL list. It might be hardcoded, read from a file, pulled from a sitemap, or discovered by scraping a paginated index (see paginate through results).
- Extract: for each URL, vibe.go(url) loads the page and find() / findAll() read the fields. find() returns the first match and auto-waits; findAll() returns every match as a plain list and resolves immediately.
- Normalise: turn raw strings into clean values by stripping whitespace, parsing "$1,299.00" into a number, resolving relative URLs, and dropping empties. Do this once, at extraction time, so downstream data is already clean.
- Write: append each record to your output. Appending as you go (not building one giant list in memory) is what keeps memory flat and protects a long run from a mid-way crash.
- Resume/retry: a safeguard wrapped around the four stages. Record which URLs succeeded so a re-run skips them, and capture failures as data instead of letting one bad page sink the batch.
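As one illustration of the collect stage, the sketch below pulls page URLs out of sitemap XML that has already been downloaded into a string. The function name urls_from_sitemap is hypothetical, and the code assumes the standard sitemaps.org namespace; a sitemap index that points at further sitemaps would need a second pass.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (assumed here).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text):
    """Return the <loc> URLs from sitemap XML, de-duplicated, in document order."""
    root = ET.fromstring(xml_text.strip())
    seen = set()
    urls = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

De-duplicating at collection time means the rest of the pipeline never has to fetch the same page twice.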
The golden rule of the extract stage: read exactly the fields you need with narrow selectors. A scrape that reads #product-title survives a page redesign far better than one that depends on deep div > div > span chains.
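The normalise stage described above can be sketched as a pure function that takes a raw record and returns a clean one. This is a minimal sketch, not a general-purpose parser: the names parse_price and normalise are illustrative, the field names "price" and "link" are assumptions, and the price parser assumes US-style formatting where the comma is a thousands separator.

```python
import re
from decimal import Decimal, InvalidOperation
from urllib.parse import urljoin

def parse_price(raw):
    """Turn a string such as "$1,299.00" into a Decimal, or None if it has no number."""
    cleaned = re.sub(r"[^\d.\-]", "", raw or "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

def normalise(record, base_url):
    """Strip whitespace, drop empty fields, type the price, resolve the link."""
    clean = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value in ("", None):
            continue  # drop empties so downstream code never sees them
        clean[key] = value
    if "price" in clean:
        clean["price"] = parse_price(clean["price"])
    if "link" in clean:
        clean["link"] = urljoin(base_url, clean["link"])
    return clean
```

Because the function never touches the browser, it can be unit-tested against saved strings without launching Chrome. Note that json.dumps does not serialise Decimal directly, so convert the price to str (or pass default=str) when writing JSON Lines.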
How do I extract many records with findAll()?
Use findAll() when a single page holds many rows, cards, or list items, and read each field with a selector scoped to one item. findAll() returns a normal list, so you iterate exactly as you would over any Python list or JavaScript array.
This example scrapes a listing page of repeated cards into one list of records:
```python
from vibium import browser_sync as browser

vibe = browser.launch(headless=True)
vibe.go("https://example.com/catalog")

# Wait for the first card so late-rendering content is in the DOM.
vibe.find(".card")
card_count = len(vibe.findAll(".card"))

records = []
for i in range(card_count):
    scope = f".card:nth-child({i + 1})"
    records.append({
        "name": vibe.find(f"{scope} .name").text(),
        "price": vibe.find(f"{scope} .price").text(),
        "link": vibe.find(f"{scope} a").attr("href"),
    })

print(f"extracted {len(records)} records")
vibe.quit()
```

The :nth-child selector scopes each field lookup to a single card, so row three's price never leaks into row four. Note that :nth-child counts every sibling element, not only those matching .card, so this pattern assumes the cards are the only children of their container. The single find(".card") before the loop matters: because findAll() resolves immediately and returns an empty list when nothing matches, waiting on one card first guarantees the grid has rendered before you count it.
Reading an attribute like href uses attr(); reading visible text uses text(). Those two cover the large majority of extraction fields.
The same shape in JavaScript, using Vibium's sync client:
```javascript
const fs = require('fs')
const { browser } = require('vibium/sync')

const bro = browser.launch({ headless: true })
const page = bro.page()
page.go('https://example.com/catalog')
page.find('.card') // wait for first card

const cards = page.findAll('.card')
const records = cards.map((card) => ({
  name: card.find('.name').text(),
  price: card.find('.price').text(),
  link: card.find('a').attr('href'),
}))

fs.writeFileSync('catalog.json', JSON.stringify(records, null, 2))
console.log(`extracted ${records.length} records`)
bro.close()
```

Here card.find(...) scopes each lookup to that card element, which is cleaner than building :nth-child selectors, because element-scoped find() searches only within the parent.
How do I scrape a whole list of URLs reliably?
Wrap the per-page scrape in a function, then loop over your URL list and append every record to disk as you go. The key to reliability is that a failure on one URL must not lose the records you already have.
The pattern below turns errors into data — a failed page becomes a record with an error field instead of an exception that halts the run:
```python
import json
from vibium import browser_sync as browser

def read_urls(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def extract(vibe, url):
    vibe.go(url)
    return {
        "url": url,
        "title": vibe.find("h1").text(),
        "price": vibe.find(".price").text(),
        "ok": True,
    }

urls = read_urls("urls.txt")
vibe = browser.launch(headless=True)
try:
    with open("out.jsonl", "w", encoding="utf-8") as out:
        for url in urls:
            try:
                record = extract(vibe, url)
            except Exception as e:
                record = {"url": url, "error": str(e), "ok": False}
            out.write(json.dumps(record) + "\n")
            out.flush()  # hand each line to the OS so a crash keeps earlier records
finally:
    vibe.quit()

# Later: inspect what failed and retry just those.
```

Writing JSON Lines (one JSON object per line) is ideal for bulk output. You append a single line per page, never re-read or re-parse the whole file, and you can start reading results while the scrape is still running. Because each line is flushed as it is written, if the process dies at URL 4,000, the first 3,999 records are already in the file.
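The "inspect what failed and retry just those" step can be sketched as a small reader over the JSON Lines output. This assumes the record shape used in this guide (a "url" field plus an "ok" flag); the function name failed_urls is illustrative. It keeps the latest status per URL, so a retry appended later in the file overrides an earlier failure, and it skips a half-written final line left by a crash.

```python
import json

def failed_urls(jsonl_path):
    """Return URLs whose most recent record in the file has ok == False."""
    status = {}  # url -> latest ok flag; dicts keep first-seen order
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # truncated line from an interrupted write
            url = record.get("url")
            if url:
                status[url] = bool(record.get("ok"))
    return [url for url, ok in status.items() if not ok]
```

Feeding the returned list back into the same scrape loop, with the output file opened in append mode ("a") rather than "w", retries only the failures without discarding earlier results.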
Should I scrape serially or in parallel?
Scrape serially when the list is small or the target site is fragile; scrape in parallel when you have thousands of pages and need throughput. Because a bulk scrape is dominated by network and browser I/O — waiting on page loads — parallelism buys large speedups, but it also multiplies memory use and load on the target.
| Approach | Speed | Memory | Best for | Trade-off |
|---|---|---|---|---|
| Serial, one browser | Baseline | Lowest (one Chrome) | Small lists, fragile or rate-limited sites | Slow on large lists |
| Thread pool, browser per worker | Near-linear to core count | ~300–400 MB per worker | Large lists, sturdy targets | Uses real RAM; can overwhelm a site |
| Multiple machines / queue | Highest | Distributed | Millions of pages | Operational overhead, coordination |
For parallel runs, the rule is one browser per worker — never share a single browser across threads, because one worker's go() would yank the page out from under another. Launch a fresh headless browser inside the worker, scrape, and quit() in a finally:
```python
from concurrent.futures import ThreadPoolExecutor
from vibium import browser_sync as browser

# urls: the list of page URLs built in the collect stage.

def scrape(url):
    vibe = browser.launch(headless=True)
    try:
        vibe.go(url)
        return {"url": url, "title": vibe.find("h1").text(), "ok": True}
    except Exception as e:
        return {"url": url, "error": str(e), "ok": False}
    finally:
        vibe.quit()

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape, urls))

failed = [r for r in results if not r["ok"]]
print(f"{len(results) - len(failed)} ok, {len(failed)} failed")
```

Start with roughly one worker per CPU core and tune down if the machine begins to swap. Memory, not CPU, is usually the ceiling, since each headless Chrome is a real process. For a deeper treatment of pool sizing and throttling, see how to parallelize scraping with Vibium.
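The pool example above gathers every result into a list before reporting, which gives up the append-as-you-go safety of the serial loop. A possible variation, sketched here with a hypothetical helper named run_parallel, writes each record to the JSON Lines output as soon as its worker finishes. The scrape argument stands for any function that takes a URL and returns a record dict, such as the one above; all writing happens in the main thread, so the file needs no lock.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_parallel(urls, scrape, out, max_workers=4):
    """Run scrape(url) across a thread pool and stream results to out as JSON Lines."""
    ok = failed = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                record = future.result()
            except Exception as e:
                # A worker that raised still becomes a record, not a halted run.
                record = {"url": url, "error": str(e), "ok": False}
            out.write(json.dumps(record) + "\n")
            out.flush()
            if record.get("ok"):
                ok += 1
            else:
                failed += 1
    return ok, failed
```

Records arrive in completion order rather than input order, which is fine for JSON Lines because every record carries its own URL.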
How do I write bulk data to CSV or a database?
Pick the output format by how the data is shaped and what you will do with it next. Flat tabular data belongs in CSV; large or irregular records belong in JSON Lines; anything you need to query, de-duplicate, or resume belongs in a database.
| Format | Use when | Why |
|---|---|---|
| JSON Lines (.jsonl) | Records are large, nested, or vary in shape | Append-only, streamable, crash-safe, no re-parse |
| CSV | Data is flat and headed for a spreadsheet | Universally openable; csv ships with Python |
| SQLite / Postgres | You need queries, de-dup, or resumable runs | Indexes, uniqueness constraints, restartable |
For a flat dataset, Python's built-in csv module needs no extra dependency. Write the header once, then one row per record:
```python
import csv
from vibium import browser_sync as browser

# urls: the list of page URLs built in the collect stage.
fields = ["url", "title", "price"]

vibe = browser.launch(headless=True)
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for url in urls:
        vibe.go(url)
        writer.writerow({
            "url": url,
            "title": vibe.find("h1").text(),
            "price": vibe.find(".price").text(),
        })
vibe.quit()
```

Always pass newline="" so the csv module does not add blank rows, and encoding="utf-8" so accented characters survive. See export a scraped table to CSV for the CSV-specific patterns in depth.
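If the scrape already wrote JSON Lines, a CSV can be derived afterwards instead of re-scraping. The sketch below is one way to do that with the standard library; the helper name jsonl_to_csv is illustrative. It relies on two csv.DictWriter options: restval="" fills columns a record lacks (for example, a failed page with no title), and extrasaction="ignore" drops keys that are not in the column list.

```python
import csv
import json

def jsonl_to_csv(lines, out, fields):
    """Write one CSV row per JSON Lines record; returns the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=fields, restval="", extrasaction="ignore")
    writer.writeheader()
    rows = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        writer.writerow(json.loads(line))
        rows += 1
    return rows
```

In practice, lines would be an open out.jsonl file and out a file opened with newline="" and encoding="utf-8", as in the example above.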
For a resumable bulk run, a database is the cleaner choice. Store scraped URLs under a uniqueness constraint (a PRIMARY KEY in the example below) and filter the URL list against what is already stored, so re-running the job skips pages already captured. Writing with an upsert means a repeated URL overwrites its row instead of raising an error:

```python
import sqlite3
from vibium import browser_sync as browser

# urls: the list of page URLs built in the collect stage.
db = sqlite3.connect("scrape.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS products (
        url TEXT PRIMARY KEY,
        title TEXT,
        price TEXT
    )
""")

# Only scrape URLs we do not already have.
done = {row[0] for row in db.execute("SELECT url FROM products")}
todo = [u for u in urls if u not in done]

vibe = browser.launch(headless=True)
try:
    for url in todo:
        vibe.go(url)
        db.execute(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?)",
            (url, vibe.find("h1").text(), vibe.find(".price").text()),
        )
        db.commit()  # commit per record so an interrupted run loses nothing
finally:
    vibe.quit()
    db.close()
```

Committing after each record means an interrupted run resumes exactly where it stopped, because the done set skips everything already stored. This is what turns an overnight scrape of a million pages from a gamble into a routine job.
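The table above stores only successes, so a page that failed simply stays in the to-do list. A possible extension, sketched here with a hypothetical pages table and helper names, also records failures with their error message. A re-run then retries failed URLs while still skipping successful ones, and the errors can be queried afterwards. The default path ":memory:" is only for experimenting; pass a file path for a real run.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the store; the schema here is an illustrative assumption."""
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url   TEXT PRIMARY KEY,
            ok    INTEGER NOT NULL,
            title TEXT,
            error TEXT
        )
    """)
    return db

def save(db, record):
    """Upsert one record; a later success overwrites an earlier failure."""
    db.execute(
        "INSERT OR REPLACE INTO pages (url, ok, title, error) VALUES (?, ?, ?, ?)",
        (record["url"], int(record["ok"]), record.get("title"), record.get("error")),
    )
    db.commit()

def pending(db, urls):
    """URLs that have never succeeded: unseen ones plus recorded failures."""
    done = {row[0] for row in db.execute("SELECT url FROM pages WHERE ok = 1")}
    return [u for u in urls if u not in done]
```

A query such as SELECT url, error FROM pages WHERE ok = 0 then gives a failure report without parsing any log files.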
How do I handle content that loads after the page?
Wait for the first meaningful element before you extract, because findAll() resolves immediately and would return an empty list if the content has not rendered yet. A single find() on one row or card is enough to confirm that the section has reached the DOM.

```python
vibe.go("https://example.com/catalog")
vibe.find(".card").wait_until("visible")  # block until the grid renders
cards = vibe.findAll(".card")
```

wait_until() polls until the element reaches the requested state (visible, hidden, attached, or detached) or the timeout is hit, so a slow API behind the page will not race your script. For pages that stream in more content as you scroll, extract in a loop, scrolling and re-reading until the count stops growing; the infinite-scroll guide covers that loop in detail.
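The scroll-and-re-read loop mentioned above can be sketched independently of any browser API. In this hypothetical helper, count_items and scroll_once are callables you supply: in a Vibium script count_items might wrap len(vibe.findAll(".card")), while how you scroll is left to the caller and is not assumed here. The loop stops once the count has stayed flat for settle_rounds consecutive scrolls, or after max_rounds as a safety cap.

```python
def read_until_stable(count_items, scroll_once, max_rounds=50, settle_rounds=2):
    """Scroll until the item count stops growing; return the final count."""
    last = count_items()
    unchanged = 0
    for _ in range(max_rounds):
        scroll_once()
        current = count_items()
        if current > last:
            last, unchanged = current, 0  # new content arrived, keep going
        else:
            unchanged += 1
            if unchanged >= settle_rounds:
                break  # count held steady long enough to call it done
    return last
```

Requiring more than one flat round guards against stopping early when a single scroll happens to land before the next batch has loaded.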
Because find() auto-waits on actionability, you rarely need fixed sleep() calls — waiting on the element is both faster and more reliable than guessing a delay. This auto-waiting is one of the ways Vibium keeps parity with modern tools; see Vibium vs Playwright for how the two compare on waiting and API design.
What are the rules for polite, unblockable bulk scraping?
Scrape in a way that does not harm the target or get you blocked: throttle your rate, cap concurrency, identify yourself honestly, and never re-fetch data you already have. Aggressive scraping is both an ethical problem and a practical one — a site that notices a flood of requests will start returning captchas or 429s.
- Add a delay between requests. A small pause (even 0.5–2 seconds) on a single domain dramatically lowers your footprint. When parallel, a handful of workers per domain is plenty.
- Set a real user agent. Launch with a genuine browser identity rather than a default automation string, so you look like a normal visitor.
- Cache what you fetch. Store raw pages or use the resumable-database pattern above so a re-run reads from disk, not the network, for URLs you already scraped.
- Respect robots.txt and terms of service. Check what a site permits before you scrape it, and stay within any rate limits it publishes. Only scrape data you have the right to collect.
- Back off on errors. If you start seeing 429 or 503 responses, slow down or pause — hammering a struggling server helps no one.
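Three of the rules above (respect robots.txt, delay between requests, back off on errors) can be sketched with the standard library alone. The names allowed, DomainThrottle, and backoff_delays are illustrative, the robots.txt text is assumed to have been fetched already, and the throttle is a single-threaded sketch that would need a lock if shared across workers.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that was fetched earlier."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

class DomainThrottle:
    """Enforce a minimum gap between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the previous request

    def wait(self, url):
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last.get(host)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last[host] = now

def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=5):
    """Exponentially growing pauses to use after 429 or 503 responses."""
    return [min(cap, base * factor ** i) for i in range(attempts)]
```

Calling throttle.wait(url) just before each page load keeps the per-domain rate steady, and backoff_delays() gives the sequence of pauses to walk through while a site keeps signalling overload.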
The single biggest lever is caching plus resumability: a scrape that never asks for the same page twice is faster for you and lighter on the target. Behind a login, the same politeness rules apply — see scrape behind a login for handling authenticated sessions cleanly.
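The caching idea can be sketched as a small disk cache keyed by a hash of the URL. Here cached_fetch is a hypothetical helper and fetch is whatever callable returns the page content as a string; this sketch does not assume a particular Vibium call for that. On a cache hit the network is never touched.

```python
import hashlib
from pathlib import Path

def cached_fetch(url, fetch, cache_dir="page_cache"):
    """Return the body for url, calling fetch(url) only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetch(url)
    path.write_text(body, encoding="utf-8")
    return body
```

Keeping raw pages on disk also lets you change the extraction or normalisation logic later and re-run it offline against the cache.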
Why is Vibium a good fit for bulk extraction?
Vibium suits bulk work because it removes the two things that usually break scrapers on a new machine: driver mismatches and headless plumbing. It ships as one Go binary that auto-downloads a matching Chrome for Testing, so the scraper you tested locally deploys to a bare server without a chromedriver version dance. Its auto-waiting means you write extraction logic, not timing hacks, and findAll() returning a plain list keeps the code close to ordinary Python or JavaScript.
| Concern | How Vibium helps |
|---|---|
| Deploying to a fresh server | Single binary, auto-downloads Chrome — nothing else to install |
| Driver / browser version drift | No separate driver; the binary manages the browser |
| Flaky timing on dynamic pages | Built-in auto-waiting on element actionability |
| Terse extraction code | find() / findAll() return elements and plain lists |
| Running many browsers at once | Chrome downloads once and is reused, so per-launch cost stays small |
None of this makes Vibium magic — for very large operations you still design for throttling, retries, and storage the same way you would with any tool. What it changes is the setup and maintenance tax, which is exactly where scraping projects tend to rot. For a broader comparison against the incumbent, see Vibium vs Selenium, and for driving a scrape from natural language, see the Vibium MCP with Claude Code guide.
Frequently asked questions
What is bulk data extraction with Vibium?
Bulk data extraction is scraping many pages or records in one repeatable run instead of one page by hand. With Vibium you build a small pipeline: load each URL, extract fields with find() and findAll(), normalise them, and write the whole set to JSON, CSV, or a database.
How do I extract data from hundreds of pages with Vibium?
Put your URLs in a list, wrap the per-page scrape in a function that returns a record, then loop or fan out with a thread pool. Give each worker its own headless browser, use try/finally to always quit(), and append every record to one output file so a crash never loses the batch.
Can Vibium scrape data rendered by JavaScript?
Yes. Vibium drives a real Chrome browser over WebDriver BiDi, so it sees content injected after load, just like a user. If rows appear late, wait for the first row with find() before calling findAll(), since findAll() resolves immediately and would miss late-rendering content.
What format should I export bulk-scraped data to?
Use JSON Lines for large, mixed-shape records because you append one object per line and never re-read the file. Use CSV for flat, tabular data headed to a spreadsheet. Load into SQLite or Postgres when you need to query, de-duplicate, or resume a partial run.
How do I avoid getting blocked during bulk extraction?
Scrape politely: add a small delay between requests, cap concurrency to a handful of workers, set a real user agent, and respect robots.txt and the site's terms. Cache pages you have already fetched so a re-run does not hammer the server twice for the same data.
Is Vibium good for bulk scraping compared to Playwright or Selenium?
Vibium ships as a single Go binary that auto-downloads Chrome, so there is no driver to match and no runner to configure — handy when you deploy a scraper to a fresh server. It auto-waits on elements like Playwright, and its findAll() plus plain lists keep extraction code short.
Vibium is created by Jason Huggins. This is an independent tutorial — see the official Vibium site and GitHub repo for canonical docs.