# Agent Context — willhaben-tracker ## What It Is A Telegram bot that monitors classified ads on [willhaben.at](https://www.willhaben.at) and notifies users about new listings and price drops matching their keywords. Notifications include photos, postcode, published/modified dates. Deployed locally via Docker Compose with a Supabase-style Postgres stack. ## Project Location `/home/lago/Documents/Projects/willhaben-tracker` — git repo on `dev` branch. ## Live Credentials (Do Not Commit) - **Telegram bot token**: in `.env` as `TELEGRAM_BOT_TOKEN` - **Admin Telegram ID**: `298181113` (seeded as admin user on first boot) - **Postgres password**: in `.env`, referenced by all services ## Stack Overview | Service | Image | Host Port | Purpose | |---------|-------|-----------|---------| | `db` | `postgres:15-alpine` | 55632→5432 | Primary datastore, migrations on boot | | `rest` | `postgrest/postgrest:v12.2.0` | internal | Auto-generated REST API over schema | | `kong` | `kong:2.8.1` | 55621→8000 | Reverse proxy, JWT auth, CORS | | `studio` | `supabase/studio` | 55630→3000 | Web admin dashboard (http://localhost:55630) | | `meta` | `supabase/postgres-meta:v0.84.2` | internal | Schema metadata service (feeds Studio) | | `worker` | custom (`./worker/Dockerfile`) | none | Telegram bot + scraping scheduler | Worker connects directly to Postgres via `asyncpg`, bypassing PostgREST/Kong entirely. The Kong chain exists for Studio and potential future API clients. ## Docker Commands (Always from Project Root) ```bash docker compose up -d --build # build + start all services docker compose logs -f worker # tail worker logs docker compose exec -T db psql -U postgres -d postgres -c "..." # run SQL docker compose restart worker # restart just the worker ``` ## Database — Current Schema (Post Migration 03) ### Tables - **`users`** — whitelisted Telegram users (`telegram_id UNIQUE`, `is_admin`, `is_active`) - **`keywords`** — global keyword tracking (`keyword UNIQUE case-insensitive via index`, `interval_minutes DEFAULT 60`, `is_active`, `last_scraped_at`, `initial_loaded`) - **`keyword_subscriptions`** — many-to-many: `(keyword_id, user_id) PK` - **`ads`** — deduplicated ad snapshots (`wh_ad_id UNIQUE`, `raw_json JSONB`, `title`, `price NUMERIC`, `location`, `url`, `published_at TIMESTAMPTZ`, `first_seen_at`, `main_image_url TEXT`, `postcode TEXT`, `modified_at TIMESTAMPTZ`) - **`price_history`** — price change audit log (`ad_id FK→ads`, `old_price`, `new_price`, `changed_at`, unique on all three) - **`notifications`** — sent message audit trail (`user_id FK→users`, `ad_id FK→ads`, `message_id INT`) - **`scrape_logs`** — per-run health logs (`keyword_id FK→keywords`, `status IN ('success','error')`, `ads_found`, `new_ads`, `error_message`, `scraped_at`) ### Key Indexes - `idx_keywords_unique_lower` — unique index on `LOWER(keyword)` (used as conflict target via `ON CONFLICT DO NOTHING`) - `idx_keyword_subscriptions_user` — fast user subscription lookups - `idx_price_history_ad_id` — price timeline per ad ### Migration Order 1. `01-init.sql` — creates users, ads, notifications, scrape_logs (original tables) 2. `02-image-and-pricing.sql` — adds `main_image_url` to ads, creates `price_history` 3. `03-global-keywords.sql` — refactors: new `keywords` + `keyword_subscriptions`, drops old `search_queries` + `query_ads`, adds `postcode` and `modified_at` to ads, renames `scrape_logs.search_query_id` → `keyword_id` 4. `post-boot.sql` — creates `supabase_admin` role with grants, seeds admin user `298181113` **Important**: Postgres migrations only run on fresh volume init (first boot). To apply new migration files on a running stack, either: - Execute manually via `docker compose exec -T db psql ...` - Or delete `data/db/` and restart (⚠️ loses all data) ## Worker Code — Python 3.12 (`worker/src/`) | File | Responsibility | |------|---------------| | `main.py` | Entry point: dotenv, asyncpg pool, PTB v21 app + scheduler loop | | `bot.py` | Telegram command handlers: `/start`, `/add `, `/list`, `/delete `, `/stats`, admin commands | | `scraper.py` | Willhaben API client: httpx with 3 retries, attribute parser, field extractor | | `notifier.py` | Telegram message builder + sender: photo or text, HTML formatting, inline buttons | | `db.py` | asyncpg pool singleton (`get_pool()`, `close_pool()`) | ### Scheduler Flow (`main.py:scheduler_task`) 1. Poll `keywords` for due entries (`is_active AND last_scraped_at stale vs interval`) 2. For each keyword: fetch all active subscribers from `keyword_subscriptions JOIN users` 3. Scrape API once (30 newest ads, page 1, sort by recency) 4. For each ad: global dedup on `wh_ad_id` → INSERT if new, UPDATE price if dropped 5. Fan-out notifications to ALL subscribers: `notify_new_ad(bot, tg_id, ad)` and `notify_price_drop(bot, tg_id, ad)` 6. Mark `initial_loaded = true` after first cycle (prevents baseline spam) 7. Deactivate keyword if zero active subscribers ### Scraper Details (`scraper.py`) - **API endpoint**: `https://www.willhaben.at/webapi/ad-search/search/atz/seo/kaufen-und-verkaufen/marktplatz?keyword=...&rows=30&sort=1` - **Headers**: must include `x-wh-client: api@willhaben.at;responsive_web;server;1.0.0;desktop` - **Attribute format**: `[{"name": "KEY", "values": ["val"]}]` (NOT `{key, value}`) - **Key attribute mappings**: - `HEADING` → title (fallback to `description`) - `PRICE/AMOUNT` → price (strip commas, parse float) - `LOCATION` → display location string - `POSTCODE` → postal code string - `SEO_URL` → full URL = `/iad/{seo_url}` - `PUBLISHED_String` → published datetime (ISO 8601) - `CHANGED_String` → modified datetime (ISO 8601) - **Image**: from `advertImageList.advertImage[0].referenceImageUrl` — use this, NOT `mainImageUrl` ### Notifier Details (`notifier.py`) - Sends photo via `bot.send_photo()` when `main_image_url` exists; falls back to `bot.send_message()` - Uses `parse_mode="HTML"` for bold titles - Message format: header emoji + title (bold) + price/location/postcode + published/modified dates + inline "View Ad →" button - Returns Telegram message_id (`int | None`) ### Critical asyncpg Rule **Never pass ISO string to a TIMESTAMPTZ column.** asyncpg expects native `datetime` objects. The scraper must return `published_at` and `modified_at` as Python `datetime` instances, not strings. (This was the main bug in v1.) ## Telegram Bot — Commands | Command | Access | Description | |---------|--------|-------------| | `/start` | Anyone | Auto-registers user, shows help | | `/add ` | Registered | Subscribe to keyword (shared across users) | | `/list` | Registered | List subscriptions with subscriber counts | | `/delete ` | Registered | Unsubscribe; deactivates keyword if last subscriber | | `/stats` | Registered | Total keywords, ads, notifications | | `/adduser [admin]` | Admin only | Add/promote user | | `/removeuser ` | Admin only | Remove user | | `/users` | Admin only | List all users | No `/pause` or `/resume` — replaced by subscribe/unsubscribe model. ## Known Quirks & Gotchas 1. **Postgres auth stale volume**: The bind-mounted `data/db/` keeps old `pg_hba.conf`. Fixed via `POSTGRES_HOST_AUTH_METHOD=trust` in docker-compose. If you see auth errors, delete `data/db/` and restart (⚠️ data loss). 2. **Supabase Studio "unhealthy"**: Health check reports unhealthy but APIs work. Cosmetic — ignore unless tables don't load. 3. **`supabase_admin` role**: Required for pg-meta (feeds Studio UI). Must exist as LOGIN role with grants. Created in `post-boot.sql`. 4. **Migration 03 conflict target**: Uses index name (`idx_keywords_unique_lower`) via `ON CONFLICT DO NOTHING`. Cannot use `ON CONFLICT ON CONSTRAINT ...` because it's an index, not a constraint. This caused the initial migration failure — the fixed SQL uses plain `ON CONFLICT DO NOTHING` (Postgres auto-detects the unique index). 5. **Migration 03 data migration**: PL/pgSQL block migrates from old `search_queries` → new tables. If running manually, ensure `search_queries` still exists before executing. 6. **Worker container rebuild required after code changes**: `docker compose up -d --build worker` ## File Map ``` ├── .env # secrets (gitignored) ├── .env.example # template ├── docker-compose.yml # 6 services ├── README.md # user-facing setup guide + deployment section ├── ARCHITECTURE.md # technical deep-dive: schema, scheduler, notification engine ├── supabase/ │ ├── migrations/ │ │ ├── 00-run-init.sh │ │ ├── 01-init.sql # original schema (users, ads, notifications, etc.) │ │ ├── 02-image-and-pricing.sql # main_image_url + price_history table │ │ ├── 03-global-keywords.sql # keywords + subscriptions refactor │ │ └── post-boot.sql # supabase_admin role + admin user seed │ ├── pg_hba.conf │ └── kong.yml ├── worker/ │ ├── Dockerfile # python:3.12-slim, pip install requirements.txt │ ├── requirements.txt # PTB 21.4, asyncpg 0.30, httpx 0.27, dotenv │ └── src/ │ ├── main.py # entry point + scheduler loop │ ├── bot.py # Telegram command handlers │ ├── scraper.py # willhaben API client │ ├── notifier.py # message builder + sender │ └── db.py # asyncpg pool singleton └── data/ └── db/ # Postgres persistent volume (gitignored) ``` ## Design Decisions Recap - **Single worker container** — bot + scheduler share asyncio event loop and DB pool. Simpler ops at low scale. Crash takes down both, but Docker auto-restarts. - **Global keyword dedup** — one scrape feeds all subscribers via fan-out. Eliminates wasted API calls. Tradeoff: shared interval per keyword (first user sets it). - **asyncpg direct to Postgres** — no HTTP overhead on hot path. Full SQL flexibility. Worker bypasses Kong entirely. - **Page 1 only** — newest 30 ads sorted by recency. Trades completeness for speed + API politeness. - **Best-effort notifications** — failed sends to one subscriber don't block others. No retry queue.