Akce Ostrava - Event Aggregator with AI Enrichment

Auto-scrapers (TicketPortal, GoOut), bulk AI enrichment, personalized recommendations, admin dashboard.

Vite · Playwright · Cheerio · Anthropic SDK

Brief

Akce Ostrava is an event discovery aggregator - one place to find everything happening in Ostrava and the region, regardless of which ticketing vendor sells it. The problem space is trivial to describe and non-trivial to solve: the Czech cultural scene is fragmented across 6+ ticketing platforms (TicketPortal, GoOut, Goinout, Smsticket, NaVstupenky, Webticket), each with different UX, different notifications, and different data quality. A techno fan who wants to know what's on in Ostrava next weekend has to either watch 6 sites or live on Instagram and hope. Akce Ostrava fills that gap.

The intent was never to compete with the ticketing platforms themselves - nobody is going to replace GoOut in their segment - but to offer local-first discovery with better filtering, better memory (a returning user doesn't lose track of an event they bookmarked a month ago), and multi-language support for expats and foreign students (VŠB has ~3,000 international students who speak English, Ukrainian, or Polish).

The second motive was a pilot run for my DataForSEO workflow - Akce Ostrava was the first project where I generated a 574-keyword baseline and built the article generator. Whatever proved itself here later scaled up to DokladBot and Maruška.

Architecture: Vite SPA + scraper services + admin

I went with a Vite SPA (not Next.js) for concrete reasons:

  • The frontend is a read-only catalog - no server actions, no forms, just filtering and "buy ticket" deeplinks to vendors
  • Static CDN hosting is dramatically cheaper than Vercel serverless for ~10k MAU
  • An SPA is faster when a user scrolls 200+ events with 5 filters applied - no roundtrip, just client-side state
  • SEO is solved by a prerender pipeline (static HTML for each event detail page)

apps/
├── web/                ← Vite SPA, React, i18next (6 locales)
├── scraper/            ← Node service, Playwright + Cheerio
├── enricher/           ← bulk AI enrichment via Anthropic Batch API
└── admin/              ← internal dashboard, review queue
packages/
├── db/                 ← Prisma + Postgres schema
└── shared/             ← TS types, event schema

The frontend talks to Postgres via a minimal REST API (read-only aggregated event lists). Mutating operations (scrape, enrich, approve) live in the admin and scraper services.
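To make the read-only surface concrete, here is a minimal sketch of how the SPA might build and call a filtered event-list request. The `/api/events` path and the filter parameter names are illustrative assumptions, not the real API:

```typescript
// Hypothetical client helper for the read-only event list.
// Endpoint path and query params are illustrative assumptions.
export interface EventFilters {
  genre?: string;
  dateFrom?: string; // ISO date, e.g. "2026-06-01"
  locale?: string;
}

export function buildEventsUrl(base: string, filters: EventFilters): string {
  const url = new URL('/api/events', base);
  // Only set params that are present, so URLs stay canonical and cacheable.
  for (const [key, value] of Object.entries(filters)) {
    if (value) url.searchParams.set(key, value);
  }
  return url.toString();
}

export async function fetchEvents(base: string, filters: EventFilters) {
  const res = await fetch(buildEventsUrl(base, filters));
  if (!res.ok) throw new Error(`events API returned ${res.status}`);
  return res.json();
}
```

Because all filtering state lives client-side, this request only fires on initial load and pagination; the five-filter scrolling case never leaves the browser.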

Scraper layer: TicketPortal (HTML) vs GoOut (JSON API)

The hardest layer. Every vendor has a different strategy.

TicketPortal is traditional server-rendered HTML, no official API. I used cheerio for parsing:

// apps/scraper/src/adapters/ticketportal.ts
import { load } from 'cheerio';
import type { ScrapedEvent } from '@prace/shared';
import { parseCzechDate, parseCzkPrice } from './parse-utils'; // local date/price helpers
 
export async function scrapeTicketPortal(city: string): Promise<ScrapedEvent[]> {
  const url = `https://www.ticketportal.cz/category/Hudba?city=${city}`;
  const html = await fetch(url, {
    headers: { 'User-Agent': 'AkceOstravaBot/1.0 (+https://akce-ostrava.cz/bot)' },
  }).then((r) => r.text());
 
  const $ = load(html);
  const events: ScrapedEvent[] = [];
 
  $('.event-card').each((_, el) => {
    const $el = $(el);
    events.push({
      vendor: 'ticketportal',
      vendorId: $el.attr('data-event-id') ?? '',
      title: $el.find('.event-title').text().trim(),
      venue: $el.find('.venue-name').text().trim(),
      startsAt: parseCzechDate($el.find('.event-date').text()),
      priceCzk: parseCzkPrice($el.find('.price').text()),
      url: new URL($el.find('a').attr('href') ?? '', url).toString(),
      rawHtml: $el.html() ?? '', // for audit
    });
  });
 
  return events;
}
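The two helpers referenced above do the fiddly locale work. A sketch of what they might look like, assuming TicketPortal renders dates like "15. 3. 2026 19:00" and prices like "od 390 Kč" (both format assumptions on my part):

```typescript
// apps/scraper/src/adapters/parse-utils.ts (sketch; input formats are assumptions)

// Parses Czech-style dates like "15. 3. 2026 19:00" into a local-time Date.
export function parseCzechDate(text: string): Date | null {
  const m = text.trim().match(/(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})(?:\s+(\d{1,2}):(\d{2}))?/);
  if (!m) return null;
  const [, day, month, year, hour = '0', minute = '0'] = m;
  return new Date(Number(year), Number(month) - 1, Number(day), Number(hour), Number(minute));
}

// Extracts a CZK amount from strings like "od 390 Kč" or "1 290 Kč".
// Czech thousands separators are spaces, often non-breaking ones.
export function parseCzkPrice(text: string): number | null {
  const m = text.replace(/[\s\u00a0]/g, '').match(/(\d+)Kč/i);
  return m ? Number(m[1]) : null;
}
```

Returning `null` instead of throwing keeps one malformed card from killing a whole scrape run.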

GoOut, on the other hand, has an internal JSON API - I found it via the DevTools Network tab. There, a plain fetch plus JSON parsing is enough - no DOM work at all. Some events render their description client-side via JS (an interactive calendar), so for those detail pages I fall back to Playwright:

// apps/scraper/src/adapters/goout-detail.ts
import { chromium } from 'playwright';
 
export async function scrapeGoOutDetail(slug: string) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(`https://goout.net/cs/akce/${slug}/`, {
      waitUntil: 'networkidle',
    });
    const description = await page.$eval('[data-description]', (el) => el.textContent);
    return { description };
  } finally {
    await browser.close();
  }
}
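For the JSON list endpoint the interesting part is just mapping the feed's fields onto the shared event shape. The raw field names below are assumptions for illustration - the real internal API was reverse-engineered from the Network tab and can change without notice:

```typescript
// Sketch: mapping one item from a vendor's internal JSON feed onto the
// shared event shape. Raw-side field names are assumptions.
interface RawGoOutItem {
  id: number;
  name: string;
  venueName: string;
  startISO: string;
  url: string;
}

interface MappedEvent {
  vendor: 'goout';
  vendorId: string;
  title: string;
  venue: string;
  startsAt: Date;
  url: string;
}

export function mapGoOutItem(item: RawGoOutItem): MappedEvent {
  return {
    vendor: 'goout',
    vendorId: String(item.id), // feed IDs are numeric; store as string for the composite key
    title: item.name.trim(),
    venue: item.venueName.trim(),
    startsAt: new Date(item.startISO),
    url: item.url,
  };
}
```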

Cheerio is 50× faster for static HTML, so Playwright is reserved for the JS-rendered details. The cron runs every 4 hours, scrapes ~2,000 events per run, and dedupes on a (vendor, vendorId) composite key.
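The dedup step itself is simple. A sketch of the in-memory version - the database enforces the same invariant with a unique index on the pair:

```typescript
interface VendorKeyed {
  vendor: string;
  vendorId: string;
}

// Keeps the last-seen record per (vendor, vendorId) pair, mirroring an
// upsert against a composite unique key in the database.
export function dedupeByVendorKey<T extends VendorKeyed>(events: T[]): T[] {
  const byKey = new Map<string, T>();
  for (const e of events) {
    byKey.set(`${e.vendor}:${e.vendorId}`, e);
  }
  return [...byKey.values()];
}
```

Last-write-wins is deliberate: a re-scraped event should overwrite the stale copy, not be dropped.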

AI enrichment: bulk Claude calls

Raw vendor data is inconsistent. TicketPortal writes "Iva Bittová a host" as the title, GoOut shows "Iva Bittová & Special Guest", Smsticket "I.BITTOVÁ + HOST". For UX (filtering, related events, search) I need canonicalization.

Bulk enrichment calls Claude on 5–20 events at a time, with a structured (JSON-only) prompt that returns unified metadata:

// apps/enricher/src/mass-generate.ts
import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
import type { RawEvent } from '@prace/shared';
import { extractJson } from './extract-json'; // strips code fences / stray prose around the JSON

const enrichedSchema = z.object({
  canonicalTitle: z.string(),
  primaryArtist: z.string(),
  supportingArtists: z.array(z.string()),
  genre: z.enum(['rock', 'electronic', 'pop', 'classical', 'jazz', 'folk', 'metal', 'hip-hop', 'other']),
  audienceAge: z.tuple([z.number(), z.number()]),
  isFamilyFriendly: z.boolean(),
  language: z.string(),
});

// Spell the schema out for the prompt - calling .toString() on a Zod schema
// yields "[object Object]", not a usable description.
const schemaPrompt = `{ canonicalTitle: string, primaryArtist: string, supportingArtists: string[], genre: "rock"|"electronic"|"pop"|"classical"|"jazz"|"folk"|"metal"|"hip-hop"|"other", audienceAge: [number, number], isFamilyFriendly: boolean, language: string }`;

const client = new Anthropic();

export async function enrichBatch(events: RawEvent[]) {
  const res = await client.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 4000,
    system: `You normalize Czech event metadata. Return strictly a JSON array, one object per input event, in the exact input order. Each object matches: ${schemaPrompt}`,
    messages: [
      {
        role: 'user',
        content: JSON.stringify(events.map((e) => ({ id: e.id, title: e.title, raw: e.rawHtml }))),
      },
    ],
  });

  const block = res.content[0];
  if (block.type !== 'text') throw new Error('expected a text content block');

  const parsed = JSON.parse(extractJson(block.text));
  if (!Array.isArray(parsed) || parsed.length !== events.length) {
    throw new Error(`expected ${events.length} enriched objects, got ${parsed?.length}`);
  }
  return parsed.map((p: unknown, i: number) => ({
    eventId: events[i].id,
    enriched: enrichedSchema.parse(p), // throws on schema drift, so bad output never reaches the DB
  }));
}

Important optimizations:

  • Batch size 10. Larger batches save tokens, but Claude starts trimming detail beyond event #15; the sweet spot is 10.
  • Anthropic Batch API for nightly enrichment (50% discount, 24h SLA - fine, since the scraper runs every 4h anyway).
  • Cache enrichment keyed by raw HTML hash - if an event hasn't changed, no new API call.
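The cache key from the last bullet is just a content hash over the raw HTML. A sketch - where the key is stored (a Postgres column, Redis, whatever) is an implementation detail:

```typescript
import { createHash } from 'node:crypto';

// Stable cache key for an event's enrichment: if the raw HTML is unchanged
// since the last run, the stored enrichment is still valid and no API call fires.
export function enrichmentCacheKey(vendor: string, vendorId: string, rawHtml: string): string {
  const htmlHash = createHash('sha256').update(rawHtml).digest('hex');
  return `${vendor}:${vendorId}:${htmlHash}`;
}
```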

Cost: ~$0.40 per 1,000 enriched events. At the ~50k events scraped per year, that works out to roughly $20 of enrichment spend annually - the AI line item is noise.

Admin dashboard: review queue

AI classification is correct 92% of the time. The remaining 8% (genre misses, wrong primary artist for festivals) is handled in a review queue: every enriched event passes through an admin review before publishing. The queue supports bulk operations (e.g. changing the genre for all "Karneval"-titled events in one click), full keyboard navigation (j/k/x), and inline editing.
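The keyboard layer is what makes the throughput numbers hold up. A minimal sketch of the queue state machine behind j/k/x - names are mine, and I'm assuming x toggles approval on the current item:

```typescript
// Sketch of the review-queue keyboard model: j = next item, k = previous,
// x = toggle approval on the current item. Pure function, trivially testable.
export interface QueueState {
  cursor: number;
  approved: Set<number>; // indices of approved events
  total: number;
}

export function handleKey(state: QueueState, key: string): QueueState {
  switch (key) {
    case 'j':
      return { ...state, cursor: Math.min(state.cursor + 1, state.total - 1) };
    case 'k':
      return { ...state, cursor: Math.max(state.cursor - 1, 0) };
    case 'x': {
      const approved = new Set(state.approved);
      approved.has(state.cursor) ? approved.delete(state.cursor) : approved.add(state.cursor);
      return { ...state, approved };
    }
    default:
      return state;
  }
}
```

Keeping it a pure reducer means the same logic drives both the keyboard handler and the bulk operations.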

Concrete KPIs:

  • 50,000+ events scraped per year
  • 8,000+ live events indexed (after dedup and stale filtering)
  • Review throughput: 200+ events/h with keyboard shortcuts (vs 30/h clicking manually)

SEO play: DataForSEO baseline

This was the first project where I deployed the DataForSEO baseline. I generated a 574-keyword baseline for Ostrava, filtered it by volume × intent, and picked 30 target phrases:

| Keyword | Monthly volume | Position before | Position after (T+90) |
| --- | --- | --- | --- |
| akce ostrava | 8,100 | n/a | #3 |
| co dělat v ostravě | 2,400 | n/a | #5 |
| vstupenky zoo ostrava | 1,600 | n/a | #2 |
| ostrava akce dnes | 1,300 | n/a | #4 |
| koncerty ostrava 2026 | 880 | n/a | #6 |
| goout ostrava alternativa | 110 | n/a | #1 |

"alternativa"-style keywords are a gold mine: low volume, but 100% buying intent. I won them because nobody else was targeting them.

Lessons

  • Vendor scraping is fragile. TicketPortal changed DOM structure twice in a year, GoOut closed one API endpoint and opened another. I built a smoke-test runner that scrapes one known event every hour and alerts when parsing fails.
  • Anthropic Batch API is underrated. A 50% discount in exchange for a 24h SLA on nightly jobs is a no-brainer. I now use it in DokladBot too.
  • 6 languages is not 6× the work. i18next + JSON keys, sure, but AI translation of event descriptions (CS → EN/DE/PL/SK/UA) in a single Claude call is trivial. Real cost: 30 minutes of copy review per new event batch.
  • Multi-language SEO has its own rules. English-speaking users search "events Ostrava", Poles "wydarzenia Ostrawa". One URL prefix per locale (/en/events/..., /pl/wydarzenia/...), hreflang tags, and a separate sitemap per locale. Without that, Google only indexes the CS version.
  • Build the aggregator before content. Once you have 8,000 events in DB, automatic landing pages like "techno events Ostrava 2026" and "rooftop concerts June 2026" build themselves - long-tail SEO writes itself.
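The hreflang setup from the multi-language lesson is mechanical enough to generate. A sketch - the locale list and localized path segments below are illustrative, not the production routing table:

```typescript
// Generates alternate-language link tags for one event page.
// Locale prefixes and localized path segments are illustrative assumptions.
const localePaths: Record<string, string> = {
  cs: '/akce',
  en: '/en/events',
  pl: '/pl/wydarzenia',
};

export function hreflangTags(origin: string, slug: string): string[] {
  return Object.entries(localePaths).map(
    ([locale, prefix]) =>
      `<link rel="alternate" hreflang="${locale}" href="${origin}${prefix}/${slug}" />`,
  );
}
```

The same table drives the per-locale sitemaps, so routes and hreflang annotations can never drift apart.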