B2B lead pipeline: Bun + SQLite + Playwright (98,640 companies in a weekend)
The Krtek pipeline from scratch — pulling the CSO RES dump, NACE filtering, ARES enrichment, Playwright scraping. Bun + bun:sqlite as the batch engine. Code, schema, NACE strategy.
Most B2B lead-gen tools in the Czech market cost CZK 5,000–20,000 per month and hand you generic contacts that everyone else also has. Instead of paying, I built my own pipeline over a weekend. The result: 98,640 companies in a database, sharply filtered by NACE codes, enriched with ARES data, and with emails scraped from company websites.
The project is called Krtek — here is the full architecture.
Pipeline overview
CSO RES dump (1.5 GB CSV)
↓ NACE filter (692xx → accounting)
companies with IČO + address
↓ ARES API enrichment
+ region, tax ID, legal form
↓ web search heuristics
company URL
↓ Playwright / fetch
HTML pages
↓ contact extraction
emails + phones
↓
SQLite (bun:sqlite)
Five steps, each as a standalone Bun script. SQLite holds state and lets the pipeline resume.
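The resume mechanism is nothing clever: every step selects only the rows the previous run left unfinished, so a crash just means rerunning the script. Sketched as a pure function over a hypothetical `Firm` shape (in the pipeline itself this is a SQL WHERE clause):

```typescript
// A firm row as the scraping step sees it. Hypothetical shape for illustration.
type Firm = { ico: string; web: string | null; email: string | null };

// The scraping step only touches firms that have a website but no email yet,
// equivalent to: SELECT ... WHERE web IS NOT NULL AND email IS NULL.
function pendingForScrape(firms: Firm[]): Firm[] {
  return firms.filter((f) => f.web !== null && f.email === null);
}
```

Because each step is idempotent over the remaining rows, there is no separate job queue: the database itself is the queue.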
Step 1 — the CSO RES dump
The Czech Statistical Office publishes the Register of Economic Subjects as a public CSV dump (around 1.5 GB). It contains company registration numbers (IČO), names, addresses, NACE codes, legal forms, founding dates — everything you need for segmentation.
Download + parsing in Bun:
import { Database } from 'bun:sqlite';
const db = new Database('krtek.sqlite');
db.exec(`
CREATE TABLE IF NOT EXISTS firmy (
ico TEXT PRIMARY KEY,
nazev TEXT NOT NULL,
nace TEXT,
kraj TEXT,
dic TEXT, -- VAT ID, filled by the ARES enrichment in step 3
adresa TEXT,
pf TEXT,
web TEXT,
email TEXT,
telefon TEXT,
discovered_at INTEGER DEFAULT (unixepoch())
);
CREATE INDEX IF NOT EXISTS idx_nace ON firmy(nace);
CREATE INDEX IF NOT EXISTS idx_kraj ON firmy(kraj);
`);
const file = Bun.file('res-dump.csv');
const stream = file.stream();
const decoder = new TextDecoder('windows-1250'); // CSO still ships in 1250
const insert = db.prepare(
'INSERT OR IGNORE INTO firmy (ico, nazev, nace, adresa, pf) VALUES (?, ?, ?, ?, ?)',
);
const tx = db.transaction((rows: string[][]) => {
for (const r of rows) insert.run(...r);
});

Key detail: Bun.file().stream() chews through 1.5 GB without OOM. Node would not give you that without an explicit streaming setup; Bun has it built in.
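The snippet above opens the stream and prepares the batched insert, but leaves out the chunk-to-row step. A minimal sketch of that gap, assuming a semicolon-delimited dump with no quoted fields (a simplification; the column positions must be checked against the real dump header):

```typescript
// Splits decoded text chunks into complete lines, carrying the trailing
// partial line over to the next chunk so rows never break mid-record.
function* toLines(chunks: Iterable<string>): Generator<string> {
  let rest = '';
  for (const chunk of chunks) {
    const parts = (rest + chunk).split('\n');
    rest = parts.pop() ?? '';
    for (const line of parts) if (line) yield line;
  }
  if (rest) yield rest;
}

// Maps one CSV line to the five columns the INSERT expects.
// Column positions are placeholders; verify against the dump header.
function toRow(line: string): string[] {
  const f = line.split(';');
  return [f[0], f[1], f[2], f[3], f[4]]; // ico, nazev, nace, adresa, pf
}
```

Wiring it up: decode each stream chunk with the TextDecoder, run the lines through toRow, and flush batches of a few thousand rows at a time into the tx transaction.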
bun:sqlite exposes a sync API → batched inserts inside a transaction run an order of magnitude faster than going through Postgres + drizzle/prisma. For this use case (single machine, single writer, read-heavy), SQLite is the right choice.
Step 2 — NACE filter
NACE is the European classification of economic activities. 6920 = "accounting, bookkeeping and auditing activities; tax consultancy". This was my first target, since DokladBot is positioned at accounting firms.
const accountingFirms = db
.query("SELECT * FROM firmy WHERE nace LIKE '6920%'")
.all();
console.log(`${accountingFirms.length} accounting firms`);
// → ~12,800

Planned expansion:
- NACE 620 — programming, IT consulting (target audience for AI consulting services)
- NACE 631 — data processing, hosting (target for infra products)
NACE is an underrated filter. Most tools give you fuzzy "industry tags". A 5-digit NACE code gives you a precise audience.
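One practical consequence is querying several NACE prefixes at once, e.g. the primary target plus the planned expansions. A small helper I would write for that (a sketch, not part of the original pipeline):

```typescript
// Builds a parameterized query matching any of the given NACE prefixes,
// e.g. ['6920', '620', '631']. Prefix LIKE matches mirror the '6920%'
// query used above.
function nacePredicate(prefixes: string[]): { sql: string; params: string[] } {
  const where = prefixes.map(() => 'nace LIKE ?').join(' OR ');
  return {
    sql: `SELECT * FROM firmy WHERE ${where}`,
    params: prefixes.map((p) => `${p}%`),
  };
}
```

Usage with bun:sqlite: `const q = nacePredicate(['6920', '620']); db.query(q.sql).all(...q.params);`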
Step 3 — ARES API enrichment
The CSO dump carries location only as a postal code inside the address. For region-level segmentation I want a normalized region name. ARES (the Administrative Register of Economic Subjects) has a free REST endpoint:
type AresResponse = { sidlo: { kodKraje: string; nazevKraje: string }; dic?: string };
async function enrichFromAres(ico: string): Promise<AresResponse | null> {
const url = `https://ares.gov.cz/ekonomicke-subjekty-v-be/rest/ekonomicke-subjekty/${ico}`;
const res = await fetch(url, { headers: { Accept: 'application/json' } });
if (!res.ok) return null;
return (await res.json()) as AresResponse;
}
// rate-limited parallel enrichment
async function enrichBatch(icos: string[], concurrency = 10) {
const queue = [...icos];
const workers = Array.from({ length: concurrency }, async () => {
while (queue.length) {
const ico = queue.shift();
if (!ico) break;
const ares = await enrichFromAres(ico);
if (ares) {
db.run('UPDATE firmy SET kraj = ?, dic = ? WHERE ico = ?', [
ares.sidlo.nazevKraje,
ares.dic ?? null,
ico,
]);
}
}
});
await Promise.all(workers);
}

ARES has no public rate limit, but I do not push 100 parallel requests. Concurrency 10 is the sweet spot — no 429s, full enrichment of 12,800 firms in around 25 minutes.
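For multi-hour runs I would additionally guard each ARES call with a retry, since transient 5xx responses still happen even at polite concurrency. A hedged sketch (the backoff parameters are illustrative, not tuned against ARES):

```typescript
// Retries a fetch-style call with exponential backoff. It retries both on
// thrown network errors and on null results (enrichFromAres returns null
// for any non-OK response).
async function withRetry<T>(
  fn: () => Promise<T | null>,
  attempts = 3,
  baseMs = 500,
): Promise<T | null> {
  for (let i = 0; i < attempts; i++) {
    try {
      const out = await fn();
      if (out !== null) return out;
    } catch {
      // swallow the error and retry after the backoff below
    }
    await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
  }
  return null;
}
```

In the worker loop above, the call would become `const ares = await withRetry(() => enrichFromAres(ico));`.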
Step 4 — Playwright scraping
I solve company-website scraping in two layers:
- Raw fetch for static HTML — 80% of sites, fast
- Playwright for JS-rendered pages (React/Vue) — 20%, slower but reliable
import * as cheerio from 'cheerio';
import { chromium } from 'playwright';
const browser = await chromium.launch({ headless: true });
async function scrapeContact(url: string): Promise<{ email?: string; phone?: string }> {
const html = await fetchOrPlaywright(url);
const $ = cheerio.load(html);
const text = $('body').text();
const emailMatch = text.match(/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i);
const phoneMatch = text.match(/(?:\+420\s?)?\d{3}\s?\d{3}\s?\d{3}/);
return {
email: emailMatch?.[0],
phone: phoneMatch?.[0]?.replace(/\s/g, ''),
};
}
async function fetchOrPlaywright(url: string): Promise<string> {
try {
const res = await fetch(url, { signal: AbortSignal.timeout(5000) });
const html = await res.text();
if (html.length > 500 && !html.includes('app-root')) return html;
} catch {}
// fall back to Playwright
const page = await browser.newPage();
try {
await page.goto(url, { timeout: 15000 });
return await page.content();
} finally {
await page.close(); // do not leak pages across thousands of URLs
}
}

The heuristic html.length > 500 && !html.includes('app-root') catches 95% of SPA pages. Playwright is slow (3–5 s per page), but Bun + 8 parallel browsers keeps up.
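One weakness of taking the first regex match: pages often list a noreply@ address or a web agency's contact before the real one. A ranking layer I would add on top, after switching the extractor to matchAll to collect all candidates (a sketch; the noise list is illustrative, not exhaustive):

```typescript
// Picks the most plausible email from all matches on a page: prefer an
// address on the company's own domain, drop obvious noise first.
function pickEmail(emails: string[], siteHost: string): string | undefined {
  const noise = /noreply|no-reply|example\.|sentry|wixpress/i;
  const clean = emails.filter((e) => !noise.test(e));
  const domain = siteHost.replace(/^www\./, '').toLowerCase();
  const own = clean.find((e) => e.toLowerCase().endsWith('@' + domain));
  return own ?? clean[0];
}
```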
Schema design
| Column | Type | Why |
|---|---|---|
| ico | TEXT PRIMARY KEY | Czech company ID (IČO) is 8 digits, unique |
| nace | TEXT | full 5-digit code, indexed |
| kraj | TEXT | normalized name from ARES |
| dic | TEXT | VAT ID from ARES, nullable |
| web | TEXT | nullable — not every firm has a site |
| email, telefon | TEXT | nullable, scraping can fail |
| discovered_at | INTEGER | unix timestamp for freshness |
No JSON columns. Everything relational, indexable.
Output after one weekend
- 98,640 companies across all NACE codes in the Czech Republic
- ~12,800 in the primary target (accounting, NACE 692)
- ~62% have a scraped email or phone (the rest = no website or unparseable)
- SQLite DB ~180 MB, queryable directly from a dev terminal
$ bun -e "import {Database} from 'bun:sqlite'; \
const db = new Database('krtek.sqlite'); \
console.log(db.query('SELECT kraj, COUNT(*) c FROM firmy WHERE nace LIKE \'6920%\' GROUP BY kraj ORDER BY c DESC LIMIT 5').all())"
[
{ kraj: 'Prague', c: 4521 },
{ kraj: 'South Moravia', c: 1834 },
{ kraj: 'Moravian-Silesian', c: 1215 },
...
]

Where to next
- Krtek case study → — more on the product around the database
- Claude Code workflow → — how Claude Code wrote 80% of this pipeline
- Multi-language Next.js → — another weekend project
If you need a similar pipeline for your own audience (different NACE, region, contact type), drop me a line. Building a new audience takes around two days.