162 lines
10 KiB
Markdown
162 lines
10 KiB
Markdown
# Agent Context — willhaben-tracker
|
|
|
|
## What It Is
|
|
A Telegram bot that monitors classified ads on [willhaben.at](https://www.willhaben.at) and notifies users about new listings and price drops matching their keywords. Notifications include photos, postcode, published/modified dates. Deployed locally via Docker Compose with a Supabase-style Postgres stack.
|
|
|
|
## Project Location
|
|
`/home/lago/Documents/Projects/willhaben-tracker` — git repo on `dev` branch.
|
|
|
|
## Live Credentials (Do Not Commit)
|
|
- **Telegram bot token**: in `.env` as `TELEGRAM_BOT_TOKEN`
|
|
- **Admin Telegram ID**: `298181113` (seeded as admin user on first boot)
|
|
- **Postgres password**: in `.env`, referenced by all services
|
|
|
|
## Stack Overview
|
|
|
|
| Service | Image | Host Port | Purpose |
|
|
|---------|-------|-----------|---------|
|
|
| `db` | `postgres:15-alpine` | 55632→5432 | Primary datastore, migrations on boot |
|
|
| `rest` | `postgrest/postgrest:v12.2.0` | internal | Auto-generated REST API over schema |
|
|
| `kong` | `kong:2.8.1` | 55621→8000 | Reverse proxy, JWT auth, CORS |
|
|
| `studio` | `supabase/studio` | 55630→3000 | Web admin dashboard (http://localhost:55630) |
|
|
| `meta` | `supabase/postgres-meta:v0.84.2` | internal | Schema metadata service (feeds Studio) |
|
|
| `worker` | custom (`./worker/Dockerfile`) | none | Telegram bot + scraping scheduler |
|
|
|
|
Worker connects directly to Postgres via `asyncpg`, bypassing PostgREST/Kong entirely. The Kong chain exists for Studio and potential future API clients.
|
|
|
|
## Docker Commands (Always from Project Root)
|
|
```bash
|
|
docker compose up -d --build # build + start all services
|
|
docker compose logs -f worker # tail worker logs
|
|
docker compose exec -T db psql -U postgres -d postgres -c "..." # run SQL
|
|
docker compose restart worker # restart just the worker
|
|
```
|
|
|
|
## Database — Current Schema (Post Migration 03)
|
|
|
|
### Tables
|
|
- **`users`** — whitelisted Telegram users (`telegram_id UNIQUE`, `is_admin`, `is_active`)
|
|
- **`keywords`** — global keyword tracking (`keyword UNIQUE case-insensitive via index`, `interval_minutes DEFAULT 60`, `is_active`, `last_scraped_at`, `initial_loaded`)
|
|
- **`keyword_subscriptions`** — many-to-many: `(keyword_id, user_id) PK`
|
|
- **`ads`** — deduplicated ad snapshots (`wh_ad_id UNIQUE`, `raw_json JSONB`, `title`, `price NUMERIC`, `location`, `url`, `published_at TIMESTAMPTZ`, `first_seen_at`, `main_image_url TEXT`, `postcode TEXT`, `modified_at TIMESTAMPTZ`)
|
|
- **`price_history`** — price change audit log (`ad_id FK→ads`, `old_price`, `new_price`, `changed_at`, unique on all three)
|
|
- **`notifications`** — sent message audit trail (`user_id FK→users`, `ad_id FK→ads`, `message_id INT`)
|
|
- **`scrape_logs`** — per-run health logs (`keyword_id FK→keywords`, `status IN ('success','error')`, `ads_found`, `new_ads`, `error_message`, `scraped_at`)
|
|
|
|
### Key Indexes
|
|
- `idx_keywords_unique_lower` — unique index on `LOWER(keyword)` (used as conflict target via `ON CONFLICT DO NOTHING`)
|
|
- `idx_keyword_subscriptions_user` — fast user subscription lookups
|
|
- `idx_price_history_ad_id` — price timeline per ad
|
|
|
|
### Migration Order
|
|
1. `01-init.sql` — creates users, ads, notifications, scrape_logs (original tables)
|
|
2. `02-image-and-pricing.sql` — adds `main_image_url` to ads, creates `price_history`
|
|
3. `03-global-keywords.sql` — refactors: new `keywords` + `keyword_subscriptions`, drops old `search_queries` + `query_ads`, adds `postcode` and `modified_at` to ads, renames `scrape_logs.search_query_id` → `keyword_id`
|
|
4. `post-boot.sql` — creates `supabase_admin` role with grants, seeds admin user `298181113`
|
|
|
|
**Important**: Postgres migrations only run on fresh volume init (first boot). To apply new migration files on a running stack, either:
|
|
- Execute manually via `docker compose exec -T db psql ...`
|
|
- Or delete `data/db/` and restart (⚠️ loses all data)
|
|
|
|
## Worker Code — Python 3.12 (`worker/src/`)
|
|
|
|
| File | Responsibility |
|
|
|------|---------------|
|
|
| `main.py` | Entry point: dotenv, asyncpg pool, PTB v21 app + scheduler loop |
|
|
| `bot.py` | Telegram command handlers: `/start`, `/add <kw>`, `/list`, `/delete <kw_id>`, `/stats`, admin commands |
|
|
| `scraper.py` | Willhaben API client: httpx with 3 retries, attribute parser, field extractor |
|
|
| `notifier.py` | Telegram message builder + sender: photo or text, HTML formatting, inline buttons |
|
|
| `db.py` | asyncpg pool singleton (`get_pool()`, `close_pool()`) |
|
|
|
|
### Scheduler Flow (`main.py:scheduler_task`)
|
|
1. Poll `keywords` for due entries (`is_active AND last_scraped_at stale vs interval`)
|
|
2. For each keyword: fetch all active subscribers from `keyword_subscriptions JOIN users`
|
|
3. Scrape API once (30 newest ads, page 1, sort by recency)
|
|
4. For each ad: global dedup on `wh_ad_id` → INSERT if new, UPDATE price if dropped
|
|
5. Fan-out notifications to ALL subscribers: `notify_new_ad(bot, tg_id, ad)` and `notify_price_drop(bot, tg_id, ad)`
|
|
6. Mark `initial_loaded = true` after first cycle (prevents baseline spam)
|
|
7. Deactivate keyword if zero active subscribers
|
|
|
|
### Scraper Details (`scraper.py`)
|
|
- **API endpoint**: `https://www.willhaben.at/webapi/ad-search/search/atz/seo/kaufen-und-verkaufen/marktplatz?keyword=...&rows=30&sort=1`
|
|
- **Headers**: must include `x-wh-client: api@willhaben.at;responsive_web;server;1.0.0;desktop`
|
|
- **Attribute format**: `[{"name": "KEY", "values": ["val"]}]` (NOT `{key, value}`)
|
|
- **Key attribute mappings**:
|
|
- `HEADING` → title (fallback to `description`)
|
|
- `PRICE/AMOUNT` → price (strip commas, parse float)
|
|
- `LOCATION` → display location string
|
|
- `POSTCODE` → postal code string
|
|
- `SEO_URL` → full URL = `/iad/{seo_url}`
|
|
- `PUBLISHED_String` → published datetime (ISO 8601)
|
|
- `CHANGED_String` → modified datetime (ISO 8601)
|
|
- **Image**: from `advertImageList.advertImage[0].referenceImageUrl` — use this, NOT `mainImageUrl`
|
|
|
|
### Notifier Details (`notifier.py`)
|
|
- Sends photo via `bot.send_photo()` when `main_image_url` exists; falls back to `bot.send_message()`
|
|
- Uses `parse_mode="HTML"` for bold titles
|
|
- Message format: header emoji + title (bold) + price/location/postcode + published/modified dates + inline "View Ad →" button
|
|
- Returns Telegram message_id (`int | None`)
|
|
|
|
### Critical asyncpg Rule
|
|
**Never pass ISO string to a TIMESTAMPTZ column.** asyncpg expects native `datetime` objects. The scraper must return `published_at` and `modified_at` as Python `datetime` instances, not strings. (This was the main bug in v1.)
|
|
|
|
## Telegram Bot — Commands
|
|
|
|
| Command | Access | Description |
|
|
|---------|--------|-------------|
|
|
| `/start` | Anyone | Auto-registers user, shows help |
|
|
| `/add <keyword>` | Registered | Subscribe to keyword (shared across users) |
|
|
| `/list` | Registered | List subscriptions with subscriber counts |
|
|
| `/delete <keyword_id>` | Registered | Unsubscribe; deactivates keyword if last subscriber |
|
|
| `/stats` | Registered | Total keywords, ads, notifications |
|
|
| `/adduser <tg_id> [admin]` | Admin only | Add/promote user |
|
|
| `/removeuser <tg_id>` | Admin only | Remove user |
|
|
| `/users` | Admin only | List all users |
|
|
|
|
No `/pause` or `/resume` — replaced by subscribe/unsubscribe model.
|
|
|
|
## Known Quirks & Gotchas
|
|
|
|
1. **Postgres auth stale volume**: The bind-mounted `data/db/` keeps old `pg_hba.conf`. Fixed via `POSTGRES_HOST_AUTH_METHOD=trust` in docker-compose. If you see auth errors, delete `data/db/` and restart (⚠️ data loss).
|
|
2. **Supabase Studio "unhealthy"**: Health check reports unhealthy but APIs work. Cosmetic — ignore unless tables don't load.
|
|
3. **`supabase_admin` role**: Required for pg-meta (feeds Studio UI). Must exist as LOGIN role with grants. Created in `post-boot.sql`.
|
|
4. **Migration 03 conflict target**: Uses index name (`idx_keywords_unique_lower`) via `ON CONFLICT DO NOTHING`. Cannot use `ON CONFLICT ON CONSTRAINT ...` because it's an index, not a constraint. This caused the initial migration failure — the fixed SQL uses plain `ON CONFLICT DO NOTHING` (Postgres auto-detects the unique index).
|
|
5. **Migration 03 data migration**: PL/pgSQL block migrates from old `search_queries` → new tables. If running manually, ensure `search_queries` still exists before executing.
|
|
6. **Worker container rebuild required after code changes**: `docker compose up -d --build worker`
|
|
|
|
## File Map
|
|
```
|
|
├── .env # secrets (gitignored)
|
|
├── .env.example # template
|
|
├── docker-compose.yml # 6 services
|
|
├── README.md # user-facing setup guide + deployment section
|
|
├── ARCHITECTURE.md # technical deep-dive: schema, scheduler, notification engine
|
|
├── supabase/
|
|
│ ├── migrations/
|
|
│ │ ├── 00-run-init.sh
|
|
│ │ ├── 01-init.sql # original schema (users, ads, notifications, etc.)
|
|
│ │ ├── 02-image-and-pricing.sql # main_image_url + price_history table
|
|
│ │ ├── 03-global-keywords.sql # keywords + subscriptions refactor
|
|
│ │ └── post-boot.sql # supabase_admin role + admin user seed
|
|
│ ├── pg_hba.conf
|
|
│ └── kong.yml
|
|
├── worker/
|
|
│ ├── Dockerfile # python:3.12-slim, pip install requirements.txt
|
|
│ ├── requirements.txt # PTB 21.4, asyncpg 0.30, httpx 0.27, dotenv
|
|
│ └── src/
|
|
│ ├── main.py # entry point + scheduler loop
|
|
│ ├── bot.py # Telegram command handlers
|
|
│ ├── scraper.py # willhaben API client
|
|
│ ├── notifier.py # message builder + sender
|
|
│ └── db.py # asyncpg pool singleton
|
|
└── data/
|
|
└── db/ # Postgres persistent volume (gitignored)
|
|
```
|
|
|
|
## Design Decisions Recap
|
|
- **Single worker container** — bot + scheduler share asyncio event loop and DB pool. Simpler ops at low scale. Crash takes down both, but Docker auto-restarts.
|
|
- **Global keyword dedup** — one scrape feeds all subscribers via fan-out. Eliminates wasted API calls. Tradeoff: shared interval per keyword (first user sets it).
|
|
- **asyncpg direct to Postgres** — no HTTP overhead on hot path. Full SQL flexibility. Worker bypasses Kong entirely.
|
|
- **Page 1 only** — newest 30 ads sorted by recency. Trades completeness for speed + API politeness.
|
|
- **Best-effort notifications** — failed sends to one subscriber don't block others. No retry queue.
|