10 KiB
Agent Context — willhaben-tracker
What It Is
A Telegram bot that monitors classified ads on willhaben.at and notifies users about new listings and price drops matching their keywords. Notifications include photos, postcode, published/modified dates. Deployed locally via Docker Compose with a Supabase-style Postgres stack.
Project Location
/home/lago/Documents/Projects/willhaben-tracker — git repo on dev branch.
Live Credentials (Do Not Commit)
- Telegram bot token: in
.envasTELEGRAM_BOT_TOKEN - Admin Telegram ID:
298181113(seeded as admin user on first boot) - Postgres password: in
.env, referenced by all services
Stack Overview
| Service | Image | Host Port | Purpose |
|---|---|---|---|
db |
postgres:15-alpine |
55632→5432 | Primary datastore, migrations on boot |
rest |
postgrest/postgrest:v12.2.0 |
internal | Auto-generated REST API over schema |
kong |
kong:2.8.1 |
55621→8000 | Reverse proxy, JWT auth, CORS |
studio |
supabase/studio |
55630→3000 | Web admin dashboard (http://localhost:55630) |
meta |
supabase/postgres-meta:v0.84.2 |
internal | Schema metadata service (feeds Studio) |
worker |
custom (./worker/Dockerfile) |
none | Telegram bot + scraping scheduler |
Worker connects directly to Postgres via asyncpg, bypassing PostgREST/Kong entirely. The Kong chain exists for Studio and potential future API clients.
Docker Commands (Always from Project Root)
docker compose up -d --build # build + start all services
docker compose logs -f worker # tail worker logs
docker compose exec -T db psql -U postgres -d postgres -c "..." # run SQL
docker compose restart worker # restart just the worker
Database — Current Schema (Post Migration 03)
Tables
users— whitelisted Telegram users (telegram_id UNIQUE,is_admin,is_active)keywords— global keyword tracking (keyword UNIQUE case-insensitive via index,interval_minutes DEFAULT 60,is_active,last_scraped_at,initial_loaded)keyword_subscriptions— many-to-many:(keyword_id, user_id) PKads— deduplicated ad snapshots (wh_ad_id UNIQUE,raw_json JSONB,title,price NUMERIC,location,url,published_at TIMESTAMPTZ,first_seen_at,main_image_url TEXT,postcode TEXT,modified_at TIMESTAMPTZ)price_history— price change audit log (ad_id FK→ads,old_price,new_price,changed_at, unique on all three)notifications— sent message audit trail (user_id FK→users,ad_id FK→ads,message_id INT)scrape_logs— per-run health logs (keyword_id FK→keywords,status IN ('success','error'),ads_found,new_ads,error_message,scraped_at)
Key Indexes
idx_keywords_unique_lower— unique index onLOWER(keyword)(used as conflict target viaON CONFLICT DO NOTHING)idx_keyword_subscriptions_user— fast user subscription lookupsidx_price_history_ad_id— price timeline per ad
Migration Order
01-init.sql— creates users, ads, notifications, scrape_logs (original tables)02-image-and-pricing.sql— addsmain_image_urlto ads, createsprice_history03-global-keywords.sql— refactors: newkeywords+keyword_subscriptions, drops oldsearch_queries+query_ads, addspostcodeandmodified_atto ads, renamesscrape_logs.search_query_id→keyword_idpost-boot.sql— createssupabase_adminrole with grants, seeds admin user298181113
Important: Postgres migrations only run on fresh volume init (first boot). To apply new migration files on a running stack, either:
- Execute manually via
docker compose exec -T db psql ... - Or delete
data/db/and restart (⚠️ loses all data)
Worker Code — Python 3.12 (worker/src/)
| File | Responsibility |
|---|---|
main.py |
Entry point: dotenv, asyncpg pool, PTB v21 app + scheduler loop |
bot.py |
Telegram command handlers: /start, /add <kw>, /list, /delete <kw_id>, /stats, admin commands |
scraper.py |
Willhaben API client: httpx with 3 retries, attribute parser, field extractor |
notifier.py |
Telegram message builder + sender: photo or text, HTML formatting, inline buttons |
db.py |
asyncpg pool singleton (get_pool(), close_pool()) |
Scheduler Flow (main.py:scheduler_task)
- Poll
keywordsfor due entries (is_active AND last_scraped_at stale vs interval) - For each keyword: fetch all active subscribers from
keyword_subscriptions JOIN users - Scrape API once (30 newest ads, page 1, sort by recency)
- For each ad: global dedup on
wh_ad_id→ INSERT if new, UPDATE price if dropped - Fan-out notifications to ALL subscribers:
notify_new_ad(bot, tg_id, ad)andnotify_price_drop(bot, tg_id, ad) - Mark
initial_loaded = trueafter first cycle (prevents baseline spam) - Deactivate keyword if zero active subscribers
Scraper Details (scraper.py)
- API endpoint:
https://www.willhaben.at/webapi/ad-search/search/atz/seo/kaufen-und-verkaufen/marktplatz?keyword=...&rows=30&sort=1 - Headers: must include
x-wh-client: api@willhaben.at;responsive_web;server;1.0.0;desktop - Attribute format:
[{"name": "KEY", "values": ["val"]}](NOT{key, value}) - Key attribute mappings:
HEADING→ title (fallback todescription)PRICE/AMOUNT→ price (strip commas, parse float)LOCATION→ display location stringPOSTCODE→ postal code stringSEO_URL→ full URL =/iad/{seo_url}PUBLISHED_String→ published datetime (ISO 8601)CHANGED_String→ modified datetime (ISO 8601)
- Image: from
advertImageList.advertImage[0].referenceImageUrl— use this, NOTmainImageUrl
Notifier Details (notifier.py)
- Sends photo via
bot.send_photo()whenmain_image_urlexists; falls back tobot.send_message() - Uses
parse_mode="HTML"for bold titles - Message format: header emoji + title (bold) + price/location/postcode + published/modified dates + inline "View Ad →" button
- Returns Telegram message_id (
int | None)
Critical asyncpg Rule
Never pass ISO string to a TIMESTAMPTZ column. asyncpg expects native datetime objects. The scraper must return published_at and modified_at as Python datetime instances, not strings. (This was the main bug in v1.)
Telegram Bot — Commands
| Command | Access | Description |
|---|---|---|
/start |
Anyone | Auto-registers user, shows help |
/add <keyword> |
Registered | Subscribe to keyword (shared across users) |
/list |
Registered | List subscriptions with subscriber counts |
/delete <keyword_id> |
Registered | Unsubscribe; deactivates keyword if last subscriber |
/stats |
Registered | Total keywords, ads, notifications |
/adduser <tg_id> [admin] |
Admin only | Add/promote user |
/removeuser <tg_id> |
Admin only | Remove user |
/users |
Admin only | List all users |
No /pause or /resume — replaced by subscribe/unsubscribe model.
Known Quirks & Gotchas
- Postgres auth stale volume: The bind-mounted
data/db/keeps oldpg_hba.conf. Fixed viaPOSTGRES_HOST_AUTH_METHOD=trustin docker-compose. If you see auth errors, deletedata/db/and restart (⚠️ data loss). - Supabase Studio "unhealthy": Health check reports unhealthy but APIs work. Cosmetic — ignore unless tables don't load.
supabase_adminrole: Required for pg-meta (feeds Studio UI). Must exist as LOGIN role with grants. Created inpost-boot.sql.- Migration 03 conflict target: Uses index name (
idx_keywords_unique_lower) viaON CONFLICT DO NOTHING. Cannot useON CONFLICT ON CONSTRAINT ...because it's an index, not a constraint. This caused the initial migration failure — the fixed SQL uses plainON CONFLICT DO NOTHING(Postgres auto-detects the unique index). - Migration 03 data migration: PL/pgSQL block migrates from old
search_queries→ new tables. If running manually, ensuresearch_queriesstill exists before executing. - Worker container rebuild required after code changes:
docker compose up -d --build worker
File Map
├── .env # secrets (gitignored)
├── .env.example # template
├── docker-compose.yml # 6 services
├── README.md # user-facing setup guide + deployment section
├── ARCHITECTURE.md # technical deep-dive: schema, scheduler, notification engine
├── supabase/
│ ├── migrations/
│ │ ├── 00-run-init.sh
│ │ ├── 01-init.sql # original schema (users, ads, notifications, etc.)
│ │ ├── 02-image-and-pricing.sql # main_image_url + price_history table
│ │ ├── 03-global-keywords.sql # keywords + subscriptions refactor
│ │ └── post-boot.sql # supabase_admin role + admin user seed
│ ├── pg_hba.conf
│ └── kong.yml
├── worker/
│ ├── Dockerfile # python:3.12-slim, pip install requirements.txt
│ ├── requirements.txt # PTB 21.4, asyncpg 0.30, httpx 0.27, dotenv
│ └── src/
│ ├── main.py # entry point + scheduler loop
│ ├── bot.py # Telegram command handlers
│ ├── scraper.py # willhaben API client
│ ├── notifier.py # message builder + sender
│ └── db.py # asyncpg pool singleton
└── data/
└── db/ # Postgres persistent volume (gitignored)
Design Decisions Recap
- Single worker container — bot + scheduler share asyncio event loop and DB pool. Simpler ops at low scale. Crash takes down both, but Docker auto-restarts.
- Global keyword dedup — one scrape feeds all subscribers via fan-out. Eliminates wasted API calls. Tradeoff: shared interval per keyword (first user sets it).
- asyncpg direct to Postgres — no HTTP overhead on hot path. Full SQL flexibility. Worker bypasses Kong entirely.
- Page 1 only — newest 30 ads sorted by recency. Trades completeness for speed + API politeness.
- Best-effort notifications — failed sends to one subscriber don't block others. No retry queue.