ondrej.

B2B lead pipeline: Bun + SQLite + Playwright (98,640 companies in a weekend)

The Krtek pipeline from scratch — pulling the CSO RES dump, NACE filtering, ARES enrichment, Playwright scraping. Bun + bun:sqlite as the batch engine. Code, schema, NACE strategy.

b2b · data-pipeline · bun · scraping

Most B2B lead-gen tools in the Czech market cost CZK 5,000–20,000 per month and hand you generic contacts that everyone else also has. Instead of paying, I built my own pipeline over a weekend. The result: 98,640 companies in a database, sharply filtered by NACE codes, enriched with ARES data, and with emails scraped from company websites.

The project is called Krtek — here is the full architecture.

Pipeline overview

CSO RES dump (1.5 GB CSV)
   ↓ NACE filter (692xx → accounting)
companies with IČO + address
   ↓ ARES API enrichment
+ region, tax ID, legal form
   ↓ web search heuristics
company URL
   ↓ Playwright / fetch
HTML pages
   ↓ contact extraction
emails + phones
   ↓
SQLite (bun:sqlite)

Five steps, each as a standalone Bun script. SQLite holds state and lets the pipeline resume.
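The resume behavior falls out of the schema: each step only touches rows where its output column is still NULL, so a crashed script can simply be restarted. A minimal sketch of that dispatch logic (the column names follow the schema below; the helper itself is illustrative — in SQL it is just `WHERE kraj IS NULL` and friends):

```typescript
type Firma = { ico: string; kraj: string | null; web: string | null; email: string | null };

// Decide which pipeline step a row still needs.
function nextStep(f: Firma): 'enrich' | 'find-web' | 'scrape' | 'done' {
  if (f.kraj === null) return 'enrich';   // step 3: ARES enrichment pending
  if (f.web === null) return 'find-web';  // step 4a: URL heuristics pending
  if (f.email === null) return 'scrape';  // step 4b: contact scraping pending
  return 'done';
}
```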

Step 1 — the CSO RES dump

The Czech Statistical Office publishes the Register of Economic Subjects as a public CSV dump (around 1.5 GB). It contains IČO identifiers (company registration numbers), names, addresses, NACE codes, legal forms, founding dates — everything you need for segmentation.

Download + parsing in Bun:

import { Database } from 'bun:sqlite';
 
const db = new Database('krtek.sqlite');
db.exec(`
  CREATE TABLE IF NOT EXISTS firmy (
    ico TEXT PRIMARY KEY,
    nazev TEXT NOT NULL,
    nace TEXT,
    kraj TEXT,
    dic TEXT,
    adresa TEXT,
    pf TEXT,
    web TEXT,
    email TEXT,
    telefon TEXT,
    discovered_at INTEGER DEFAULT (unixepoch())
  );
  CREATE INDEX IF NOT EXISTS idx_nace ON firmy(nace);
  CREATE INDEX IF NOT EXISTS idx_kraj ON firmy(kraj);
`);
 
const insert = db.prepare(
  'INSERT OR IGNORE INTO firmy (ico, nazev, nace, adresa, pf) VALUES (?, ?, ?, ?, ?)',
);
const tx = db.transaction((rows: string[][]) => {
  for (const r of rows) insert.run(...r);
});
 
// Stream the 1.5 GB dump without loading it into memory.
const decoder = new TextDecoder('windows-1250'); // CSO still ships in 1250
let tail = '';
let batch: string[][] = [];
for await (const chunk of Bun.file('res-dump.csv').stream()) {
  tail += decoder.decode(chunk, { stream: true });
  const lines = tail.split('\n');
  tail = lines.pop() ?? ''; // keep the partial last line for the next chunk
  for (const line of lines) {
    const cols = line.split(';'); // column order depends on the dump layout; skip the header row in a real run
    batch.push([cols[0], cols[1], cols[2], cols[3], cols[4]]);
    if (batch.length === 10_000) {
      tx(batch);
      batch = [];
    }
  }
}
if (batch.length) tx(batch);

Key detail: Bun.file().stream() chews through 1.5 GB without OOM. In Node you would wire up fs.createReadStream plus a line splitter yourself; Bun has it built in.

bun:sqlite exposes a sync API → batched inserts inside a transaction run an order of magnitude faster than going through Postgres + drizzle/prisma. For this use case (single machine, single writer, read-heavy), SQLite is the right choice.

Step 2 — NACE filter

NACE is the European classification system for economic activities. 6920 = "accounting, bookkeeping and auditing activities; tax consultancy". This was my first target, since DokladBot is aimed at accounting firms.

const accountingFirms = db
  .query("SELECT * FROM firmy WHERE nace LIKE '6920%'")
  .all();
console.log(`${accountingFirms.length} accounting firms`);
// → ~12,800

Planned expansion:

  • NACE 620 — programming, IT consulting (target audience for AI consulting services)
  • NACE 631 — data processing, hosting (target for infra products)

NACE is an underrated filter. Most tools give you fuzzy "industry tags". A 5-digit NACE code gives you a precise audience.
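The planned expansion generalizes to a small prefix-to-segment map. A sketch (the 6920/620/631 codes come from the plan above; the segment labels and helper are my own):

```typescript
// Map segment labels to NACE prefixes (labels are illustrative).
const SEGMENTS = {
  accounting: ['6920'],
  it: ['620'],
  hosting: ['631'],
} as const;

// Build a parameterized query for one segment; multiple prefixes OR together.
function naceQuery(segment: keyof typeof SEGMENTS): { sql: string; params: string[] } {
  const prefixes = SEGMENTS[segment];
  const where = prefixes.map(() => 'nace LIKE ?').join(' OR ');
  return {
    sql: `SELECT * FROM firmy WHERE ${where}`,
    params: prefixes.map((p) => `${p}%`),
  };
}
```

Keeping the prefixes in one place means adding a new audience is a one-line change, and the query stays parameterized rather than string-concatenated.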

Step 3 — ARES API enrichment

The CSO dump encodes location only as an address with a postal code. For region-level segmentation, I want a normalized region name. ARES (the Administrative Register of Economic Subjects) has a free REST endpoint:

type AresResponse = { sidlo: { kodKraje: string; nazevKraje: string }; dic?: string };
 
async function enrichFromAres(ico: string): Promise<AresResponse | null> {
  const url = `https://ares.gov.cz/ekonomicke-subjekty-v-be/rest/ekonomicke-subjekty/${ico}`;
  const res = await fetch(url, { headers: { Accept: 'application/json' } });
  if (!res.ok) return null;
  return (await res.json()) as AresResponse;
}
 
// rate-limited parallel enrichment
async function enrichBatch(icos: string[], concurrency = 10) {
  const queue = [...icos];
  const workers = Array.from({ length: concurrency }, async () => {
    while (queue.length) {
      const ico = queue.shift();
      if (!ico) break;
      const ares = await enrichFromAres(ico);
      if (ares) {
        db.run('UPDATE firmy SET kraj = ?, dic = ? WHERE ico = ?', [
          ares.sidlo.nazevKraje,
          ares.dic ?? null,
          ico,
        ]);
      }
    }
  });
  await Promise.all(workers);
}

ARES has no public rate limit, but I do not push 100 parallel requests. Concurrency 10 is the sweet spot — no 429s, full enrichment of 12,800 firms in around 25 minutes.
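I never saw a 429 at concurrency 10, but if ARES ever starts throttling, a backoff wrapper slots around enrichFromAres easily. A hedged sketch — the status-code policy here is my assumption, not documented ARES behavior, and requestFn is injected so the policy itself is testable:

```typescript
// Retry on 429 with exponential backoff; give up on other errors.
async function withBackoff<T>(
  requestFn: () => Promise<{ status: number; value?: T }>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T | null> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await requestFn();
    if (res.status === 200) return res.value ?? null;
    if (res.status !== 429) return null; // hard failure, retrying won't help
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  return null;
}
```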

Step 4 — Playwright scraping

I solve company-website scraping in two layers:

  1. Raw fetch for static HTML — 80% of sites, fast
  2. Playwright for JS-rendered pages (React/Vue) — 20%, slower but reliable

import * as cheerio from 'cheerio';
import { chromium } from 'playwright';
 
const browser = await chromium.launch();
 
async function scrapeContact(url: string): Promise<{ email?: string; phone?: string }> {
  const html = await fetchOrPlaywright(url);
  const $ = cheerio.load(html);
  const text = $('body').text();
 
  const emailMatch = text.match(/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i);
  const phoneMatch = text.match(/(?:\+420\s?)?\d{3}\s?\d{3}\s?\d{3}/);
 
  return {
    email: emailMatch?.[0],
    phone: phoneMatch?.[0]?.replace(/\s/g, ''),
  };
}
 
async function fetchOrPlaywright(url: string): Promise<string> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(5000) });
    const html = await res.text();
    if (html.length > 500 && !html.includes('app-root')) return html;
  } catch {
    // static fetch failed or timed out → fall back to Playwright
  }
  const page = await browser.newPage();
  try {
    await page.goto(url, { timeout: 15000 });
    return await page.content();
  } finally {
    await page.close(); // do not leak pages across thousands of URLs
  }
}

The heuristic html.length > 500 && !html.includes('app-root') catches 95% of SPA pages. Playwright is slow (3–5 s per page), but Bun + 8 parallel browsers keeps up.
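The "8 parallel browsers" part is the same worker-pool shape as the ARES step: a fixed number of workers draining one shared queue. A sketch with the per-URL scrape function injected (in the real pipeline it would wrap fetchOrPlaywright plus scrapeContact):

```typescript
// Fan URLs out to a fixed-size worker pool; per-URL failures are skipped
// so a later run can retry rows that still have a NULL email.
async function scrapeAll<T>(
  urls: string[],
  scrape: (url: string) => Promise<T>,
  concurrency = 8,
): Promise<Map<string, T>> {
  const queue = [...urls];
  const results = new Map<string, T>();
  const workers = Array.from({ length: concurrency }, async () => {
    for (let url = queue.shift(); url !== undefined; url = queue.shift()) {
      try {
        results.set(url, await scrape(url));
      } catch {
        // swallow the failure; the row keeps its NULL email
      }
    }
  });
  await Promise.all(workers);
  return results;
}
```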

Schema design

| Column | Type | Why |
| --- | --- | --- |
| ico | TEXT PRIMARY KEY | IČO (company registration number) is 8 digits, unique |
| nace | TEXT | full 5-digit code, indexed |
| kraj | TEXT | normalized name from ARES |
| dic | TEXT | VAT ID from ARES, nullable |
| web | TEXT | nullable — not every firm has a site |
| email, telefon | TEXT | nullable, scraping can fail |
| discovered_at | INTEGER | unix timestamp for freshness |

No JSON columns. Everything relational, indexable.

Output after one weekend

  • 98,640 companies across all NACE codes in the Czech Republic
  • ~12,800 in the primary target (accounting, NACE 692)
  • ~62% have a scraped email or phone (the rest = no website or unparseable)
  • SQLite DB ~180 MB, queryable directly from a dev terminal

$ bun -e "import {Database} from 'bun:sqlite'; \
  const db = new Database('krtek.sqlite'); \
  console.log(db.query('SELECT kraj, COUNT(*) c FROM firmy WHERE nace LIKE \"6920%\" GROUP BY kraj ORDER BY c DESC LIMIT 5').all())"
 
[
  { kraj: 'Prague', c: 4521 },
  { kraj: 'South Moravia', c: 1834 },
  { kraj: 'Moravian-Silesian', c: 1215 },
  ...
]

Where to next

If you need a similar pipeline for your own audience (different NACE, region, contact type), drop me a line. Building a new audience takes around two days.