Files

10 KiB

Agent Context — willhaben-tracker

What It Is

A Telegram bot that monitors classified ads on willhaben.at and notifies users about new listings and price drops matching their keywords. Notifications include photos, postcode, published/modified dates. Deployed locally via Docker Compose with a Supabase-style Postgres stack.

Project Location

/home/lago/Documents/Projects/willhaben-tracker — git repo on dev branch.

Live Credentials (Do Not Commit)

  • Telegram bot token: in .env as TELEGRAM_BOT_TOKEN
  • Admin Telegram ID: 298181113 (seeded as admin user on first boot)
  • Postgres password: in .env, referenced by all services

Stack Overview

Service Image Host Port Purpose
db postgres:15-alpine 55632→5432 Primary datastore, migrations on boot
rest postgrest/postgrest:v12.2.0 internal Auto-generated REST API over schema
kong kong:2.8.1 55621→8000 Reverse proxy, JWT auth, CORS
studio supabase/studio 55630→3000 Web admin dashboard (http://localhost:55630)
meta supabase/postgres-meta:v0.84.2 internal Schema metadata service (feeds Studio)
worker custom (./worker/Dockerfile) none Telegram bot + scraping scheduler

Worker connects directly to Postgres via asyncpg, bypassing PostgREST/Kong entirely. The Kong chain exists for Studio and potential future API clients.

Docker Commands (Always from Project Root)

docker compose up -d --build   # build + start all services
docker compose logs -f worker  # tail worker logs
docker compose exec -T db psql -U postgres -d postgres -c "..."  # run SQL
docker compose restart worker  # restart just the worker

Database — Current Schema (Post Migration 03)

Tables

  • users — whitelisted Telegram users (telegram_id UNIQUE, is_admin, is_active)
  • keywords — global keyword tracking (keyword UNIQUE case-insensitive via index, interval_minutes DEFAULT 60, is_active, last_scraped_at, initial_loaded)
  • keyword_subscriptions — many-to-many: (keyword_id, user_id) PK
  • ads — deduplicated ad snapshots (wh_ad_id UNIQUE, raw_json JSONB, title, price NUMERIC, location, url, published_at TIMESTAMPTZ, first_seen_at, main_image_url TEXT, postcode TEXT, modified_at TIMESTAMPTZ)
  • price_history — price change audit log (ad_id FK→ads, old_price, new_price, changed_at, unique on all three)
  • notifications — sent message audit trail (user_id FK→users, ad_id FK→ads, message_id INT)
  • scrape_logs — per-run health logs (keyword_id FK→keywords, status IN ('success','error'), ads_found, new_ads, error_message, scraped_at)

Key Indexes

  • idx_keywords_unique_lower — unique index on LOWER(keyword) (used as conflict target via ON CONFLICT DO NOTHING)
  • idx_keyword_subscriptions_user — fast user subscription lookups
  • idx_price_history_ad_id — price timeline per ad

Migration Order

  1. 01-init.sql — creates users, ads, notifications, scrape_logs (original tables)
  2. 02-image-and-pricing.sql — adds main_image_url to ads, creates price_history
  3. 03-global-keywords.sql — refactors: new keywords + keyword_subscriptions, drops old search_queries + query_ads, adds postcode and modified_at to ads, renames scrape_logs.search_query_idkeyword_id
  4. post-boot.sql — creates supabase_admin role with grants, seeds admin user 298181113

Important: Postgres migrations only run on fresh volume init (first boot). To apply new migration files on a running stack, either:

  • Execute manually via docker compose exec -T db psql ...
  • Or delete data/db/ and restart (⚠️ loses all data)

Worker Code — Python 3.12 (worker/src/)

File Responsibility
main.py Entry point: dotenv, asyncpg pool, PTB v21 app + scheduler loop
bot.py Telegram command handlers: /start, /add <kw>, /list, /delete <kw_id>, /stats, admin commands
scraper.py Willhaben API client: httpx with 3 retries, attribute parser, field extractor
notifier.py Telegram message builder + sender: photo or text, HTML formatting, inline buttons
db.py asyncpg pool singleton (get_pool(), close_pool())

Scheduler Flow (main.py:scheduler_task)

  1. Poll keywords for due entries (is_active AND last_scraped_at stale vs interval)
  2. For each keyword: fetch all active subscribers from keyword_subscriptions JOIN users
  3. Scrape API once (30 newest ads, page 1, sort by recency)
  4. For each ad: global dedup on wh_ad_id → INSERT if new, UPDATE price if dropped
  5. Fan-out notifications to ALL subscribers: notify_new_ad(bot, tg_id, ad) and notify_price_drop(bot, tg_id, ad)
  6. Mark initial_loaded = true after first cycle (prevents baseline spam)
  7. Deactivate keyword if zero active subscribers

Scraper Details (scraper.py)

  • API endpoint: https://www.willhaben.at/webapi/ad-search/search/atz/seo/kaufen-und-verkaufen/marktplatz?keyword=...&rows=30&sort=1
  • Headers: must include x-wh-client: api@willhaben.at;responsive_web;server;1.0.0;desktop
  • Attribute format: [{"name": "KEY", "values": ["val"]}] (NOT {key, value})
  • Key attribute mappings:
    • HEADING → title (fallback to description)
    • PRICE/AMOUNT → price (strip commas, parse float)
    • LOCATION → display location string
    • POSTCODE → postal code string
    • SEO_URL → full URL = /iad/{seo_url}
    • PUBLISHED_String → published datetime (ISO 8601)
    • CHANGED_String → modified datetime (ISO 8601)
  • Image: from advertImageList.advertImage[0].referenceImageUrl — use this, NOT mainImageUrl

Notifier Details (notifier.py)

  • Sends photo via bot.send_photo() when main_image_url exists; falls back to bot.send_message()
  • Uses parse_mode="HTML" for bold titles
  • Message format: header emoji + title (bold) + price/location/postcode + published/modified dates + inline "View Ad →" button
  • Returns Telegram message_id (int | None)

Critical asyncpg Rule

Never pass ISO string to a TIMESTAMPTZ column. asyncpg expects native datetime objects. The scraper must return published_at and modified_at as Python datetime instances, not strings. (This was the main bug in v1.)

Telegram Bot — Commands

Command Access Description
/start Anyone Auto-registers user, shows help
/add <keyword> Registered Subscribe to keyword (shared across users)
/list Registered List subscriptions with subscriber counts
/delete <keyword_id> Registered Unsubscribe; deactivates keyword if last subscriber
/stats Registered Total keywords, ads, notifications
/adduser <tg_id> [admin] Admin only Add/promote user
/removeuser <tg_id> Admin only Remove user
/users Admin only List all users

No /pause or /resume — replaced by subscribe/unsubscribe model.

Known Quirks & Gotchas

  1. Postgres auth stale volume: The bind-mounted data/db/ keeps old pg_hba.conf. Fixed via POSTGRES_HOST_AUTH_METHOD=trust in docker-compose. If you see auth errors, delete data/db/ and restart (⚠️ data loss).
  2. Supabase Studio "unhealthy": Health check reports unhealthy but APIs work. Cosmetic — ignore unless tables don't load.
  3. supabase_admin role: Required for pg-meta (feeds Studio UI). Must exist as LOGIN role with grants. Created in post-boot.sql.
  4. Migration 03 conflict target: Uses index name (idx_keywords_unique_lower) via ON CONFLICT DO NOTHING. Cannot use ON CONFLICT ON CONSTRAINT ... because it's an index, not a constraint. This caused the initial migration failure — the fixed SQL uses plain ON CONFLICT DO NOTHING (Postgres auto-detects the unique index).
  5. Migration 03 data migration: PL/pgSQL block migrates from old search_queries → new tables. If running manually, ensure search_queries still exists before executing.
  6. Worker container rebuild required after code changes: docker compose up -d --build worker

File Map

├── .env                    # secrets (gitignored)
├── .env.example            # template
├── docker-compose.yml      # 6 services
├── README.md               # user-facing setup guide + deployment section
├── ARCHITECTURE.md         # technical deep-dive: schema, scheduler, notification engine
├── supabase/
│   ├── migrations/
│   │   ├── 00-run-init.sh
│   │   ├── 01-init.sql     # original schema (users, ads, notifications, etc.)
│   │   ├── 02-image-and-pricing.sql  # main_image_url + price_history table
│   │   ├── 03-global-keywords.sql    # keywords + subscriptions refactor
│   │   └── post-boot.sql   # supabase_admin role + admin user seed
│   ├── pg_hba.conf
│   └── kong.yml
├── worker/
│   ├── Dockerfile          # python:3.12-slim, pip install requirements.txt
│   ├── requirements.txt    # PTB 21.4, asyncpg 0.30, httpx 0.27, dotenv
│   └── src/
│       ├── main.py         # entry point + scheduler loop
│       ├── bot.py          # Telegram command handlers
│       ├── scraper.py      # willhaben API client
│       ├── notifier.py     # message builder + sender
│       └── db.py           # asyncpg pool singleton
└── data/
    └── db/                 # Postgres persistent volume (gitignored)

Design Decisions Recap

  • Single worker container — bot + scheduler share asyncio event loop and DB pool. Simpler ops at low scale. Crash takes down both, but Docker auto-restarts.
  • Global keyword dedup — one scrape feeds all subscribers via fan-out. Eliminates wasted API calls. Tradeoff: shared interval per keyword (first user sets it).
  • asyncpg direct to Postgres — no HTTP overhead on hot path. Full SQL flexibility. Worker bypasses Kong entirely.
  • Page 1 only — newest 30 ads sorted by recency. Trades completeness for speed + API politeness.
  • Best-effort notifications — failed sends to one subscriber don't block others. No retry queue.