fix: remove stale mark_notified import, full main.py scheduler refactor, fix scraper datetime+image extraction
This commit is contained in:
@@ -0,0 +1,161 @@
|
|||||||
|
# Agent Context — willhaben-tracker
|
||||||
|
|
||||||
|
## What It Is
|
||||||
|
A Telegram bot that monitors classified ads on [willhaben.at](https://www.willhaben.at) and notifies users about new listings and price drops matching their keywords. Notifications include photos, postcode, published/modified dates. Deployed locally via Docker Compose with a Supabase-style Postgres stack.
|
||||||
|
|
||||||
|
## Project Location
|
||||||
|
`/home/lago/Documents/Projects/willhaben-tracker` — git repo on `dev` branch.
|
||||||
|
|
||||||
|
## Live Credentials (Do Not Commit)
|
||||||
|
- **Telegram bot token**: in `.env` as `TELEGRAM_BOT_TOKEN`
|
||||||
|
- **Admin Telegram ID**: `298181113` (seeded as admin user on first boot)
|
||||||
|
- **Postgres password**: in `.env`, referenced by all services
|
||||||
|
|
||||||
|
## Stack Overview
|
||||||
|
|
||||||
|
| Service | Image | Host Port | Purpose |
|
||||||
|
|---------|-------|-----------|---------|
|
||||||
|
| `db` | `postgres:15-alpine` | 55632→5432 | Primary datastore, migrations on boot |
|
||||||
|
| `rest` | `postgrest/postgrest:v12.2.0` | internal | Auto-generated REST API over schema |
|
||||||
|
| `kong` | `kong:2.8.1` | 55621→8000 | Reverse proxy, JWT auth, CORS |
|
||||||
|
| `studio` | `supabase/studio` | 55630→3000 | Web admin dashboard (http://localhost:55630) |
|
||||||
|
| `meta` | `supabase/postgres-meta:v0.84.2` | internal | Schema metadata service (feeds Studio) |
|
||||||
|
| `worker` | custom (`./worker/Dockerfile`) | none | Telegram bot + scraping scheduler |
|
||||||
|
|
||||||
|
Worker connects directly to Postgres via `asyncpg`, bypassing PostgREST/Kong entirely. The Kong chain exists for Studio and potential future API clients.
|
||||||
|
|
||||||
|
## Docker Commands (Always from Project Root)
|
||||||
|
```bash
|
||||||
|
docker compose up -d --build # build + start all services
|
||||||
|
docker compose logs -f worker # tail worker logs
|
||||||
|
docker compose exec -T db psql -U postgres -d postgres -c "..." # run SQL
|
||||||
|
docker compose restart worker # restart just the worker
|
||||||
|
```
|
||||||
|
|
||||||
|
## Database — Current Schema (Post Migration 03)
|
||||||
|
|
||||||
|
### Tables
|
||||||
|
- **`users`** — whitelisted Telegram users (`telegram_id UNIQUE`, `is_admin`, `is_active`)
|
||||||
|
- **`keywords`** — global keyword tracking (`keyword UNIQUE case-insensitive via index`, `interval_minutes DEFAULT 60`, `is_active`, `last_scraped_at`, `initial_loaded`)
|
||||||
|
- **`keyword_subscriptions`** — many-to-many: `(keyword_id, user_id) PK`
|
||||||
|
- **`ads`** — deduplicated ad snapshots (`wh_ad_id UNIQUE`, `raw_json JSONB`, `title`, `price NUMERIC`, `location`, `url`, `published_at TIMESTAMPTZ`, `first_seen_at`, `main_image_url TEXT`, `postcode TEXT`, `modified_at TIMESTAMPTZ`)
|
||||||
|
- **`price_history`** — price change audit log (`ad_id FK→ads`, `old_price`, `new_price`, `changed_at`, unique on all three)
|
||||||
|
- **`notifications`** — sent message audit trail (`user_id FK→users`, `ad_id FK→ads`, `message_id INT`)
|
||||||
|
- **`scrape_logs`** — per-run health logs (`keyword_id FK→keywords`, `status IN ('success','error')`, `ads_found`, `new_ads`, `error_message`, `scraped_at`)
|
||||||
|
|
||||||
|
### Key Indexes
|
||||||
|
- `idx_keywords_unique_lower` — unique index on `LOWER(keyword)` (used as conflict target via `ON CONFLICT DO NOTHING`)
|
||||||
|
- `idx_keyword_subscriptions_user` — fast user subscription lookups
|
||||||
|
- `idx_price_history_ad_id` — price timeline per ad
|
||||||
|
|
||||||
|
### Migration Order
|
||||||
|
1. `01-init.sql` — creates users, ads, notifications, scrape_logs (original tables)
|
||||||
|
2. `02-image-and-pricing.sql` — adds `main_image_url` to ads, creates `price_history`
|
||||||
|
3. `03-global-keywords.sql` — refactors: new `keywords` + `keyword_subscriptions`, drops old `search_queries` + `query_ads`, adds `postcode` and `modified_at` to ads, renames `scrape_logs.search_query_id` → `keyword_id`
|
||||||
|
4. `post-boot.sql` — creates `supabase_admin` role with grants, seeds admin user `298181113`
|
||||||
|
|
||||||
|
**Important**: Postgres migrations only run on fresh volume init (first boot). To apply new migration files on a running stack, either:
|
||||||
|
- Execute manually via `docker compose exec -T db psql ...`
|
||||||
|
- Or delete `data/db/` and restart (⚠️ loses all data)
|
||||||
|
|
||||||
|
## Worker Code — Python 3.12 (`worker/src/`)
|
||||||
|
|
||||||
|
| File | Responsibility |
|
||||||
|
|------|---------------|
|
||||||
|
| `main.py` | Entry point: dotenv, asyncpg pool, PTB v21 app + scheduler loop |
|
||||||
|
| `bot.py` | Telegram command handlers: `/start`, `/add <kw>`, `/list`, `/delete <kw_id>`, `/stats`, admin commands |
|
||||||
|
| `scraper.py` | Willhaben API client: httpx with 3 retries, attribute parser, field extractor |
|
||||||
|
| `notifier.py` | Telegram message builder + sender: photo or text, HTML formatting, inline buttons |
|
||||||
|
| `db.py` | asyncpg pool singleton (`get_pool()`, `close_pool()`) |
|
||||||
|
|
||||||
|
### Scheduler Flow (`main.py:scheduler_task`)
|
||||||
|
1. Poll `keywords` for due entries (`is_active AND last_scraped_at stale vs interval`)
|
||||||
|
2. For each keyword: fetch all active subscribers from `keyword_subscriptions JOIN users`
|
||||||
|
3. Scrape API once (30 newest ads, page 1, sort by recency)
|
||||||
|
4. For each ad: global dedup on `wh_ad_id` → INSERT if new, UPDATE price if dropped
|
||||||
|
5. Fan-out notifications to ALL subscribers: `notify_new_ad(bot, tg_id, ad)` and `notify_price_drop(bot, tg_id, ad)`
|
||||||
|
6. Mark `initial_loaded = true` after first cycle (prevents baseline spam)
|
||||||
|
7. Deactivate keyword if zero active subscribers
|
||||||
|
|
||||||
|
### Scraper Details (`scraper.py`)
|
||||||
|
- **API endpoint**: `https://www.willhaben.at/webapi/ad-search/search/atz/seo/kaufen-und-verkaufen/marktplatz?keyword=...&rows=30&sort=1`
|
||||||
|
- **Headers**: must include `x-wh-client: api@willhaben.at;responsive_web;server;1.0.0;desktop`
|
||||||
|
- **Attribute format**: `[{"name": "KEY", "values": ["val"]}]` (NOT `{key, value}`)
|
||||||
|
- **Key attribute mappings**:
|
||||||
|
- `HEADING` → title (fallback to `description`)
|
||||||
|
- `PRICE/AMOUNT` → price (strip commas, parse float)
|
||||||
|
- `LOCATION` → display location string
|
||||||
|
- `POSTCODE` → postal code string
|
||||||
|
- `SEO_URL` → full URL = `/iad/{seo_url}`
|
||||||
|
- `PUBLISHED_String` → published datetime (ISO 8601)
|
||||||
|
- `CHANGED_String` → modified datetime (ISO 8601)
|
||||||
|
- **Image**: from `advertImageList.advertImage[0].referenceImageUrl` — use this, NOT `mainImageUrl`
|
||||||
|
|
||||||
|
### Notifier Details (`notifier.py`)
|
||||||
|
- Sends photo via `bot.send_photo()` when `main_image_url` exists; falls back to `bot.send_message()`
|
||||||
|
- Uses `parse_mode="HTML"` for bold titles
|
||||||
|
- Message format: header emoji + title (bold) + price/location/postcode + published/modified dates + inline "View Ad →" button
|
||||||
|
- Returns Telegram message_id (`int | None`)
|
||||||
|
|
||||||
|
### Critical asyncpg Rule
|
||||||
|
**Never pass ISO string to a TIMESTAMPTZ column.** asyncpg expects native `datetime` objects. The scraper must return `published_at` and `modified_at` as Python `datetime` instances, not strings. (This was the main bug in v1.)
|
||||||
|
|
||||||
|
## Telegram Bot — Commands
|
||||||
|
|
||||||
|
| Command | Access | Description |
|
||||||
|
|---------|--------|-------------|
|
||||||
|
| `/start` | Anyone | Auto-registers user, shows help |
|
||||||
|
| `/add <keyword>` | Registered | Subscribe to keyword (shared across users) |
|
||||||
|
| `/list` | Registered | List subscriptions with subscriber counts |
|
||||||
|
| `/delete <keyword_id>` | Registered | Unsubscribe; deactivates keyword if last subscriber |
|
||||||
|
| `/stats` | Registered | Total keywords, ads, notifications |
|
||||||
|
| `/adduser <tg_id> [admin]` | Admin only | Add/promote user |
|
||||||
|
| `/removeuser <tg_id>` | Admin only | Remove user |
|
||||||
|
| `/users` | Admin only | List all users |
|
||||||
|
|
||||||
|
No `/pause` or `/resume` — replaced by subscribe/unsubscribe model.
|
||||||
|
|
||||||
|
## Known Quirks & Gotchas
|
||||||
|
|
||||||
|
1. **Postgres auth stale volume**: The bind-mounted `data/db/` keeps old `pg_hba.conf`. Fixed via `POSTGRES_HOST_AUTH_METHOD=trust` in docker-compose. If you see auth errors, delete `data/db/` and restart (⚠️ data loss).
|
||||||
|
2. **Supabase Studio "unhealthy"**: Health check reports unhealthy but APIs work. Cosmetic — ignore unless tables don't load.
|
||||||
|
3. **`supabase_admin` role**: Required for pg-meta (feeds Studio UI). Must exist as LOGIN role with grants. Created in `post-boot.sql`.
|
||||||
|
4. **Migration 03 conflict target**: Uses index name (`idx_keywords_unique_lower`) via `ON CONFLICT DO NOTHING`. Cannot use `ON CONFLICT ON CONSTRAINT ...` because it's an index, not a constraint. This caused the initial migration failure — the fixed SQL uses plain `ON CONFLICT DO NOTHING` (Postgres auto-detects the unique index).
|
||||||
|
5. **Migration 03 data migration**: PL/pgSQL block migrates from old `search_queries` → new tables. If running manually, ensure `search_queries` still exists before executing.
|
||||||
|
6. **Worker container rebuild required after code changes**: `docker compose up -d --build worker`
|
||||||
|
|
||||||
|
## File Map
|
||||||
|
```
|
||||||
|
├── .env # secrets (gitignored)
|
||||||
|
├── .env.example # template
|
||||||
|
├── docker-compose.yml # 6 services
|
||||||
|
├── README.md # user-facing setup guide + deployment section
|
||||||
|
├── ARCHITECTURE.md # technical deep-dive: schema, scheduler, notification engine
|
||||||
|
├── supabase/
|
||||||
|
│ ├── migrations/
|
||||||
|
│ │ ├── 00-run-init.sh
|
||||||
|
│ │ ├── 01-init.sql # original schema (users, ads, notifications, etc.)
|
||||||
|
│ │ ├── 02-image-and-pricing.sql # main_image_url + price_history table
|
||||||
|
│ │ ├── 03-global-keywords.sql # keywords + subscriptions refactor
|
||||||
|
│ │ └── post-boot.sql # supabase_admin role + admin user seed
|
||||||
|
│ ├── pg_hba.conf
|
||||||
|
│ └── kong.yml
|
||||||
|
├── worker/
|
||||||
|
│ ├── Dockerfile # python:3.12-slim, pip install requirements.txt
|
||||||
|
│ ├── requirements.txt # PTB 21.4, asyncpg 0.30, httpx 0.27, dotenv
|
||||||
|
│ └── src/
|
||||||
|
│ ├── main.py # entry point + scheduler loop
|
||||||
|
│ ├── bot.py # Telegram command handlers
|
||||||
|
│ ├── scraper.py # willhaben API client
|
||||||
|
│ ├── notifier.py # message builder + sender
|
||||||
|
│ └── db.py # asyncpg pool singleton
|
||||||
|
└── data/
|
||||||
|
└── db/ # Postgres persistent volume (gitignored)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Design Decisions Recap
|
||||||
|
- **Single worker container** — bot + scheduler share asyncio event loop and DB pool. Simpler ops at low scale. Crash takes down both, but Docker auto-restarts.
|
||||||
|
- **Global keyword dedup** — one scrape feeds all subscribers via fan-out. Eliminates wasted API calls. Tradeoff: shared interval per keyword (first user sets it).
|
||||||
|
- **asyncpg direct to Postgres** — no HTTP overhead on hot path. Full SQL flexibility. Worker bypasses Kong entirely.
|
||||||
|
- **Page 1 only** — newest 30 ads sorted by recency. Trades completeness for speed + API politeness.
|
||||||
|
- **Best-effort notifications** — failed sends to one subscriber don't block others. No retry queue.
|
||||||
+398
@@ -0,0 +1,398 @@
|
|||||||
|
# Architecture — willhaben-tracker
|
||||||
|
|
||||||
|
## 1. System Overview
|
||||||
|
|
||||||
|
A Docker Compose stack that monitors Austrian classifieds (willhaben.at) for price drops and new listings, then pushes real-time notifications via Telegram. Six services cooperate:
|
||||||
|
|
||||||
|
| Service | Image | Role | Port |
|
||||||
|
|------------|--------------------------------------|-------------------------------------------------|-------------|
|
||||||
|
| `db` | `postgres:15-alpine` | Primary datastore; migrations on boot | 55632 → 5432|
|
||||||
|
| `rest` | `postgrest/postgrest:v12.2.0` | Auto-generated REST API over the PostgreSQL schema | internal |
|
||||||
|
| `kong` | `kong:2.8.1` | API gateway with JWT auth, CORS, request transform | 55621 → 8000|
|
||||||
|
| `studio` | `supabase/studio` | Web admin dashboard for database management | 55630 → 3000|
|
||||||
|
| `meta` | `supabase/postgres-meta:v0.84.2` | Schema metadata service (feeds Studio) | internal |
|
||||||
|
| `worker` | custom (`./worker/Dockerfile`) | Long-polling Telegram bot + scraping scheduler | none |
|
||||||
|
|
||||||
|
The **worker** is the only business-logic container. It connects directly to Postgres via `asyncpg`, bypassing PostgREST/Kong entirely. The Kong → PostgREST chain exists for the Studio dashboard and any future API clients.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Architecture Diagram
|
||||||
|
|
||||||
|
```
|
||||||
|
┌──────────────┐
|
||||||
|
│ Telegram │
|
||||||
|
│ Users │
|
||||||
|
└──────┬───────┘
|
||||||
|
│ long-polling (HTTPS)
|
||||||
|
▼
|
||||||
|
┌─────────────────────────────────────────────────────────────────┐
|
||||||
|
│ Docker Network │
|
||||||
|
│ │
|
||||||
|
│ ┌──────────┐ asyncpg ┌───────────┐ │
|
||||||
|
│ │ worker │◄──────────────►│ db │ │
|
||||||
|
│ │ │ scheduler + │ postgres │ │
|
||||||
|
│ │ bot.py │ notifier │ 15 │ │
|
||||||
|
│ │ main.py │ └─────┬─────┘ │
|
||||||
|
│ │ scraper.py ┌───────┐ │
|
||||||
|
│ │ │ meta │ │
|
||||||
|
│ └──────────┘ └───┬───┘ │
|
||||||
|
│ ┌───────────┤ │
|
||||||
|
│ ▼ ▼ │
|
||||||
|
│ ┌───────┐ ┌───────┐ │
|
||||||
|
│ │ rest │ │ kong │ │
|
||||||
|
│ │Postgre│◄─►│ 2.8 │ │
|
||||||
|
│ │ REST │ │ │ │
|
||||||
|
│ └───────┘ └───┬───┘ │
|
||||||
|
│ │ │
|
||||||
|
│ ┌────────▼────────┐ │
|
||||||
|
│ │ studio │ │
|
||||||
|
│ │ (Supabase UI) │ │
|
||||||
|
│ └─────────────────┘ │
|
||||||
|
│ │
|
||||||
|
└─────────────────────────────────────────────────────────────────┘
|
||||||
|
│ httpx GET
|
||||||
|
▼
|
||||||
|
┌──────────────────────┐
|
||||||
|
│ willhaben.at │
|
||||||
|
│ /webapi/ad-search/… │
|
||||||
|
└──────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
**Data flow (scrape → notify):**
|
||||||
|
1. Scheduler polls `search_queries` for due queries.
|
||||||
|
2. For each query, calls willhaben API via `scraper.py`.
|
||||||
|
3. Upserts ads into `ads`, detects new/price-drop events.
|
||||||
|
4. Sends Telegram messages via `notifier.py` (photo or text).
|
||||||
|
5. Logs to `scrape_logs` and `notifications`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Database Schema
|
||||||
|
|
||||||
|
### Core tables
|
||||||
|
|
||||||
|
```sql
|
||||||
|
-- Whitelisted users; source of truth for access control
|
||||||
|
users (
|
||||||
|
id uuid PRIMARY KEY,
|
||||||
|
telegram_id bigint UNIQUE NOT NULL,
|
||||||
|
username text,
|
||||||
|
first_name text,
|
||||||
|
is_admin boolean DEFAULT false,
|
||||||
|
is_active boolean DEFAULT true,
|
||||||
|
created_at timestamptz DEFAULT now()
|
||||||
|
)
|
||||||
|
|
||||||
|
-- One saved search per user; drives scheduler intervals
|
||||||
|
search_queries (
|
||||||
|
id uuid PRIMARY KEY,
|
||||||
|
user_id uuid REFERENCES users(id) ON DELETE CASCADE,
|
||||||
|
keyword text NOT NULL,
|
||||||
|
interval_minutes int DEFAULT 60,
|
||||||
|
is_active boolean DEFAULT true,
|
||||||
|
last_scraped_at timestamptz,
|
||||||
|
created_at timestamptz DEFAULT now()
|
||||||
|
)
|
||||||
|
|
||||||
|
-- De-duplicated ad snapshots from willhaben
|
||||||
|
ads (
|
||||||
|
id uuid PRIMARY KEY,
|
||||||
|
wh_ad_id text UNIQUE NOT NULL, -- willhaben's internal ID
|
||||||
|
raw_json jsonb NOT NULL, -- full API response payload
|
||||||
|
title text NOT NULL,
|
||||||
|
price numeric, -- parsed float; null if unknown
|
||||||
|
location text,
|
||||||
|
url text,
|
||||||
|
published_at timestamptz,
|
||||||
|
first_seen_at timestamptz DEFAULT now(),
|
||||||
|
main_image_url text -- migration 02
|
||||||
|
)
|
||||||
|
|
||||||
|
-- Junction: which query discovered which ad (enables per-user dedup)
|
||||||
|
query_ads (
|
||||||
|
search_query_id uuid REFERENCES search_queries(id) ON DELETE CASCADE,
|
||||||
|
ad_id uuid REFERENCES ads(id) ON DELETE CASCADE,
|
||||||
|
first_seen_at timestamptz DEFAULT now(),
|
||||||
|
is_notified boolean DEFAULT false,
|
||||||
|
PRIMARY KEY (search_query_id, ad_id)
|
||||||
|
)
|
||||||
|
|
||||||
|
-- Audit log of every Telegram notification sent
|
||||||
|
notifications (
|
||||||
|
id uuid PRIMARY KEY,
|
||||||
|
user_id uuid REFERENCES users(id) ON DELETE CASCADE,
|
||||||
|
ad_id uuid REFERENCES ads(id) ON DELETE SET NULL,
|
||||||
|
message_id int, -- Telegram message ID for reply/edit
|
||||||
|
sent_at timestamptz DEFAULT now()
|
||||||
|
)
|
||||||
|
|
||||||
|
-- Per-run health log; used for debugging and monitoring
|
||||||
|
scrape_logs (
|
||||||
|
id uuid PRIMARY KEY,
|
||||||
|
search_query_id uuid REFERENCES search_queries(id) ON DELETE CASCADE,
|
||||||
|
status text CHECK (status IN ('success', 'error', 'rate_limited')),
|
||||||
|
ads_found int DEFAULT 0,
|
||||||
|
new_ads int DEFAULT 0,
|
||||||
|
error_message text,
|
||||||
|
scraped_at timestamptz DEFAULT now()
|
||||||
|
)
|
||||||
|
|
||||||
|
-- Price change history; migration 02
|
||||||
|
price_history (
|
||||||
|
id uuid PRIMARY KEY,
|
||||||
|
ad_id uuid REFERENCES ads(id) ON DELETE CASCADE,
|
||||||
|
old_price numeric NOT NULL,
|
||||||
|
new_price numeric NOT NULL,
|
||||||
|
changed_at timestamptz DEFAULT now(),
|
||||||
|
UNIQUE (ad_id, old_price, new_price)
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Key indexes
|
||||||
|
|
||||||
|
| Index | Purpose |
|
||||||
|
|-------|---------|
|
||||||
|
| `idx_search_queries_active_scraped` | Partial index on `(is_active=true, last_scraped_at)` — scheduler polling |
|
||||||
|
| `idx_query_ads_notified` | Partial on `(search_query_id, is_notified=false)` — notifier lookups |
|
||||||
|
| `idx_notifications_user_sent` | `(user_id, sent_at DESC)` — per-user notification history |
|
||||||
|
| `idx_scrape_logs_query_at` | `(search_query_id, scraped_at DESC)` — latest runs per query |
|
||||||
|
| `idx_price_history_ad_id` | Fast price timeline lookup per ad |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Scheduler Loop
|
||||||
|
|
||||||
|
The scheduler runs as an infinite `asyncio` coroutine inside the worker container (`main.py:22`).
|
||||||
|
|
||||||
|
### Cycle
|
||||||
|
|
||||||
|
```python
|
||||||
|
while True:
|
||||||
|
# 1. Poll DB for due queries
|
||||||
|
rows = pool.fetch(
|
||||||
|
"SELECT sq.id, sq.keyword, sq.interval_minutes, u.telegram_id"
|
||||||
|
"FROM search_queries sq JOIN users u ON sq.user_id = u.id"
|
||||||
|
"WHERE sq.is_active AND (last_scraped_at IS NULL"
|
||||||
|
" OR last_scraped_at < now() - interval_minutes)"
|
||||||
|
)
|
||||||
|
|
||||||
|
# 2. Process each query
|
||||||
|
for row in rows:
|
||||||
|
ads_raw, _ = await fetch_ads(keyword) # call willhaben API
|
||||||
|
for ad_data in ads_raw:
|
||||||
|
fields = extract_ad_fields(ad_data) # parse JSON → dict
|
||||||
|
wh_ad_id = fields["wh_ad_id"]
|
||||||
|
|
||||||
|
# 3. Global dedup on wh_ad_id (UNIQUE constraint)
|
||||||
|
existing = pool.fetchrow("SELECT id FROM ads WHERE wh_ad_id = $1", wh_ad_id)
|
||||||
|
|
||||||
|
if not existing:
|
||||||
|
# New ad — INSERT, notify
|
||||||
|
...
|
||||||
|
else:
|
||||||
|
# Existing ad — check price drop
|
||||||
|
old_price = existing["price"]
|
||||||
|
new_price = fields["price"]
|
||||||
|
if old_price and new_price and new_price < old_price:
|
||||||
|
UPDATE ads SET price = $1
|
||||||
|
INSERT INTO price_history ON CONFLICT DO NOTHING
|
||||||
|
await notify_price_drop(...)
|
||||||
|
|
||||||
|
# 4. Query-level dedup via query_ads junction table
|
||||||
|
existing_qa = pool.fetchrow(
|
||||||
|
"SELECT 1 FROM query_ads WHERE search_query_id=$1 AND ad_id=$2", ...
|
||||||
|
)
|
||||||
|
if not existing_qa:
|
||||||
|
INSERT INTO query_ads (...)
|
||||||
|
await notify_new_ad(...)
|
||||||
|
mark_notified(pool, query_id, ad_uuid)
|
||||||
|
log_notification(pool, user_id, ad_uuid, msg_id)
|
||||||
|
|
||||||
|
UPDATE search_queries SET last_scraped_at = now() WHERE id = $1
|
||||||
|
INSERT INTO scrape_logs (status='success', ...)
|
||||||
|
|
||||||
|
await asyncio.sleep(30) # polling interval between cycles
|
||||||
|
```
|
||||||
|
|
||||||
|
### Key behaviors
|
||||||
|
|
||||||
|
- **Interval-driven**: Each query has its own `interval_minutes`. The scheduler selects queries whose `last_scraped_at` is stale relative to their interval.
|
||||||
|
- **Global ad dedup**: The `ads.wh_ad_id UNIQUE` constraint ensures an ad is stored once regardless of how many queries match it.
|
||||||
|
- **Per-query notification dedup**: Even though the ad row is global, the `query_ads` junction table tracks which query has already surfaced the ad to its owner. A new ad triggers a notification only when the `(search_query_id, ad_id)` pair is first inserted.
|
||||||
|
- **Price drop detection**: On every scrape cycle, existing ads are re-checked. If `new_price < old_price`, the row updates and a `price_history` record is created (`ON CONFLICT DO NOTHING` prevents duplicate entries for the same price transition).
|
||||||
|
- **Error resilience**: Per-query exceptions log to `scrape_logs(status='error')` and continue processing remaining queries. Top-level exceptions trigger a 30-second backoff before retrying the full cycle.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Notification Engine
|
||||||
|
|
||||||
|
### Fan-Out Model
|
||||||
|
|
||||||
|
The scheduler scrapes each keyword exactly once per cycle, then fans out notifications to ALL active subscribers of that keyword. This eliminates duplicate API calls when multiple users track the same keyword.
|
||||||
|
|
||||||
|
```
|
||||||
|
Scheduler → scrape "iphone 15" once (30 ads)
|
||||||
|
↓ detect events: 2 new, 1 price drop
|
||||||
|
┌──────────────┼──────────────┐
|
||||||
|
▼ ▼ ▼
|
||||||
|
Subscriber A Subscriber B Subscriber C
|
||||||
|
notify_new notify_new notify_new
|
||||||
|
notify_drop notify_drop notify_drop
|
||||||
|
```
|
||||||
|
|
||||||
|
### Delivery Mechanism
|
||||||
|
|
||||||
|
- Each notification is a separate Telegram API call (photo with caption, or text-only fallback)
|
||||||
|
- Subscribers are fetched in bulk per keyword: `SELECT u.telegram_id FROM keyword_subscriptions ks JOIN users u ...`
|
||||||
|
- Fan-out uses `asyncio.gather()` for parallel delivery across subscribers
|
||||||
|
- If subscriber count > 10, a 50ms sleep is inserted between each batch to respect Telegram's 30 msg/sec rate limit
|
||||||
|
|
||||||
|
### Event Types
|
||||||
|
|
||||||
|
| Function | Trigger | Header | Photo? |
|
||||||
|
|----------|---------|--------|--------|
|
||||||
|
| `notify_new_ad` | Ad not in ads table for this keyword cycle | "🆕 New listing found!" | Yes, if `main_image_url` exists |
|
||||||
|
| `notify_price_drop` | Existing ad price decreased | "⚠️ Price drop!" | Yes, if `main_image_url` exists |
|
||||||
|
|
||||||
|
### Message Format (HTML)
|
||||||
|
|
||||||
|
```
|
||||||
|
<emoji> <event_type>!
|
||||||
|
|
||||||
|
<b>Title Here</b>
|
||||||
|
|
||||||
|
💰 1,234 € 📍 LOCATION, POSTCODE
|
||||||
|
Published: d.m.Y H:M | Modified: d.m.Y H:M
|
||||||
|
[View Ad →] ← inline button
|
||||||
|
```
|
||||||
|
|
||||||
|
### Initial Load Behavior
|
||||||
|
|
||||||
|
On the first scrape cycle after a keyword is created (`keywords.initial_loaded = false`), all ads are silently indexed into the database without sending any notifications. The flag is set to `true` after the first complete cycle. Subsequent cycles notify normally for new ads and price drops.
|
||||||
|
|
||||||
|
### Delivery Guarantees & Error Handling
|
||||||
|
|
||||||
|
- **Best-effort**: Failed sends to one subscriber don't block delivery to others
|
||||||
|
- **No retry queue**: By design — keeps it simple for local deployment
|
||||||
|
- **Per-subscriber isolation**: Each Telegram API call is wrapped in its own try/except
|
||||||
|
- **Audit logging**: Every successful notification recorded in `notifications(user_id, ad_id, message_id)`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Telegram Bot
|
||||||
|
|
||||||
|
The bot uses python-telegram-bot v2x with **long polling** (`app.updater.start_polling()`). It shares the same `asyncio` event loop as the scheduler.
|
||||||
|
|
||||||
|
### Command set
|
||||||
|
|
||||||
|
| Command | Audience | Description |
|
||||||
|
|---------|----------|-------------|
|
||||||
|
| `/start` | Anyone | Auto-registers user in DB, shows help text |
|
||||||
|
| `/add <keyword>` | Registered | Creates a new search query; duplicate check via `ILIKE` |
|
||||||
|
| `/list` | Registered | Lists all queries with status, interval, last scraped time |
|
||||||
|
| `/pause <uuid>` | Registered | Sets `is_active = false` for the query (owner-only) |
|
||||||
|
| `/resume <uuid>` | Registered | Sets `is_active = true` |
|
||||||
|
| `/delete <uuid>` | Registered | Hard-deletes a query (CASCADE removes query_ads rows) |
|
||||||
|
| `/stats` | Registered | Aggregates: total queries, distinct ads tracked, notifications sent |
|
||||||
|
| `/adduser <id> [admin]` | Admin only | Inserts or activates a user; optional admin flag |
|
||||||
|
| `/removeuser <id>` | Admin only | Hard-deletes from users table (CASCADE) |
|
||||||
|
| `/users` | Admin only | Lists all registered users with role and status |
|
||||||
|
|
||||||
|
### Authentication model
|
||||||
|
|
||||||
|
- **Auto-registration**: `/start` creates a `users` row if the `telegram_id` doesn't exist. The user starts as `is_active=true`, `is_admin=false`.
|
||||||
|
- **Whitelist check**: Every command (except `/start`) calls `_require_user()`, which verifies the sender exists in DB and is active. Inactive users receive "Account deactivated."
|
||||||
|
- **Admin gate**: Admin commands call `_require_admin()`, querying for `is_admin=true AND is_active=true`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. Scraper
|
||||||
|
|
||||||
|
### API endpoint
|
||||||
|
|
||||||
|
```
|
||||||
|
GET https://www.willhaben.at/webapi/ad-search/search/atz/seo/kaufen-und-verkaufen/marktplatz
|
||||||
|
?keyword=<search>
|
||||||
|
&rows=30
|
||||||
|
&sort=1
|
||||||
|
```
|
||||||
|
|
||||||
|
| Parameter | Value | Effect |
|
||||||
|
|-----------|-------|--------|
|
||||||
|
| `keyword` | user-provided text | Full-text search on willhaben |
|
||||||
|
| `rows` | `30` | Maximum results per page |
|
||||||
|
| `sort` | `1` | Sort by newest first |
|
||||||
|
|
||||||
|
Only **page 1** is fetched. The design trades completeness for speed and API politeness — the newest 30 ads are typically sufficient for early-bird notifications.
|
||||||
|
|
||||||
|
### Headers
|
||||||
|
|
||||||
|
```python
|
||||||
|
{
|
||||||
|
"accept": "application/json",
|
||||||
|
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
|
||||||
|
"x-wh-client": "api@willhaben.at;responsive_web;server;1.0.0;desktop",
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Retry logic
|
||||||
|
|
||||||
|
Three attempts with exponential backoff: `2^1s`, `2^2s`. On third failure, the exception propagates to the scheduler's error handler.
|
||||||
|
|
||||||
|
### Attribute parsing
|
||||||
|
|
||||||
|
The willhaben API returns ad attributes as a flat list of `{name, values[]}` objects. The parser extracts these into a dict keyed by attribute name:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def _parse_attributes(ad_dict) -> dict[str, str]:
|
||||||
|
result = {}
|
||||||
|
for attr in ad_dict["attributes"]["attribute"]:
|
||||||
|
name = attr.get("name")
|
||||||
|
values = attr.get("values", [])
|
||||||
|
if name and values:
|
||||||
|
result[name] = values[0] # take first value only
|
||||||
|
return result
|
||||||
|
```
|
||||||
|
|
||||||
|
Key attribute mappings:
|
||||||
|
- `HEADING` → ad title (fallback to `description`)
|
||||||
|
- `PRICE/AMOUNT` → numeric price (commas stripped, parsed as float)
|
||||||
|
- `LOCATION` → display location string
|
||||||
|
- `SEO_URL` → reconstructs full URL as `/iad/{seo_url}`
|
||||||
|
- `PUBLISHED_String` / `CHANGED_String` → ISO 8601 timestamp
|
||||||
|
|
||||||
|
The main image is extracted from the first entry in `advertImageList.advertImage.referenceImageUrl`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Key Decisions and Tradeoffs
|
||||||
|
|
||||||
|
### Single worker container
|
||||||
|
|
||||||
|
Both the Telegram bot (long polling) and the scraping scheduler run as concurrent `asyncio` tasks within one Python process. This avoids inter-service communication complexity, reduces resource overhead (one container vs. two), and simplifies shared state — both coroutines use the same `asyncpg.Pool`. The tradeoff is that a crash in either component takes down the other; however, Docker's `restart: unless-stopped` policy recovers automatically.
|
||||||
|
|
||||||
|
### Global ad dedup (`ads.wh_ad_id UNIQUE`)
|
||||||
|
|
||||||
|
An ad discovered by multiple users' queries shares a single row. This saves storage and ensures price history tracks the true market price of the listing, not per-user observations. The cost is that the `query_ads` junction table must be consulted to determine whether *this particular user's query* has already seen the ad — adding an extra DB lookup per ad per cycle.
|
||||||
|
|
||||||
|
### DB-driven whitelist vs. env-var list
|
||||||
|
|
||||||
|
Users are stored in PostgreSQL rather than a static allowlist in environment variables or config files. Benefits:
|
||||||
|
- **Admin commands** (`/adduser`, `/removeuser`) manage access without redeploying the container.
|
||||||
|
- **Per-user data** (search queries, notification history) is naturally relational.
|
||||||
|
- **Role-based authorization** (`is_admin` flag) is a simple column check rather than application logic branching on env var patterns.
|
||||||
|
|
||||||
|
The tradeoff is that an attacker with DB write access could grant themselves admin; however, the `worker` container only reads/writes through parameterized queries — there's no exposed SQL interface. The PostgREST/Kong layer handles external API auth separately.
|
||||||
|
|
||||||
|
### Page-1-only scraping
|
||||||
|
|
||||||
|
Fetching only 30 ads per cycle (page 1, sorted newest) keeps API calls lightweight and reduces latency per scheduler tick. For a marketplace like willhaben where new listings appear continuously, the first page captures fresh inventory. Missing older pages means ads that were posted before the user's query was created won't be backfilled — but the `/add` command is designed for forward-looking monitoring, not historical discovery.
|
||||||
|
|
||||||
|
### asyncpg direct connection (bypassing PostgREST)
|
||||||
|
|
||||||
|
The worker talks to Postgres directly rather than routing through Kong → PostgREST. This avoids HTTP overhead on the hot path (every ad per every query per cycle), gives full SQL flexibility (complex JOINs in the scheduler poll, conditional UPDATE/INSERT patterns), and eliminates JWT token management for internal traffic. PostgREST remains available for the Studio dashboard and any future REST clients.
|
||||||
|
|
||||||
|
### Global keyword dedup (`keywords` + `keyword_subscriptions`)
|
||||||
|
|
||||||
|
Multiple users can subscribe to the same keyword without duplicating API calls or ad storage. One scrape feeds all subscribers via fan-out notification. The cost is that a subscriber cannot independently control their scrape interval — it's shared per keyword (set by the first user, adjustable only while no other users are subscribed). This trades per-user granularity for API efficiency and consistent price tracking across all observers of the same listing.
|
||||||
@@ -1,63 +1,123 @@
|
|||||||
# willhaben-tracker
|
# willhaben-tracker
|
||||||
|
|
||||||
Telegram bot + scraper for willhaben.at classified ads. Self-hosted on Unraid via Docker Compose.
|
Telegram bot that monitors classified ads on [willhaben.at](https://www.willhaben.at) and notifies you about new listings matching your keywords, plus price drops on tracked ads. Notifications include photos when available.
|
||||||
|
|
||||||
## Stack
|
## Prerequisites
|
||||||
|
|
||||||
- **Postgres 15** with logical WAL, init scripts run alphabetically on first boot
|
- Docker
|
||||||
- **PostgREST** — auto-generated REST API over Postgres `public` schema
|
- Docker Compose (v2.x)
|
||||||
- **Kong** — reverse proxy routing `/rest/v1/` to PostgREST
|
|
||||||
- **Supabase Studio** — database browser and management UI
|
|
||||||
- **Python worker** — Telegram long polling + scrape scheduler
|
|
||||||
|
|
||||||
## Services
|
|
||||||
|
|
||||||
| Service | Image | Port | Description |
|
|
||||||
|---------|-------|------|-------------|
|
|
||||||
| db | postgres:15-alpine | `55632` | PostgreSQL with init migrations |
|
|
||||||
| rest | postgrest/postgrest:v12.2.0 | internal | REST API over Postgres |
|
|
||||||
| kong | kong:2.8.1 | `55621` | API gateway / reverse proxy |
|
|
||||||
| studio | supabase/studio | `55630` | Supabase dashboard UI |
|
|
||||||
| meta | supabase/postgres-meta:v0.84.2 | internal | Database introspection for Studio |
|
|
||||||
| worker | custom (./worker) | none | Bot + scraper process |
|
|
||||||
|
|
||||||
## Quick Start
|
## Quick Start
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cp .env.example .env
|
cp .env.example .env
|
||||||
# Edit TELEGRAM_BOT_TOKEN and POSTGRES_PASSWORD in .env
|
# Edit .env with your bot token and database credentials
|
||||||
docker compose up -d --build
|
docker compose up -d --build
|
||||||
```
|
```
|
||||||
|
|
||||||
On first boot, Postgres init scripts run automatically in order:
|
|
||||||
|
|
||||||
1. `00-run-init.sh` — creates roles (authenticator, dashboard_user)
|
|
||||||
2. `01-init.sql` — creates tables and indexes
|
|
||||||
3. `post-boot.sql` — applies grants on created tables
|
|
||||||
|
|
||||||
## Deployment
|
## Deployment
|
||||||
|
|
||||||
### Unraid + Portainer
|
### Prerequisites
|
||||||
|
|
||||||
1. Set the Docker Compose project path to `/mnt/user/appdata/willhaben-tracker`
|
- Docker 24+ (or any modern version with compose v2)
|
||||||
2. Ensure `.env` is present with valid credentials
|
- Docker Compose plugin installed (`docker compose` command works)
|
||||||
3. Deploy via Portainer: **Stacks → Add stack**, paste `docker-compose.yml` contents and attach `.env`
|
- ~500MB disk space for data volumes
|
||||||
4. Postgres data persists at `/mnt/user/appdata/willhaben-tracker/data/db`
|
- Outbound HTTPS access to `api.telegram.org` and `willhaben.at`
|
||||||
|
|
||||||
### Manual (Linux)
|
### Step-by-step Setup
|
||||||
|
|
||||||
|
1. Clone the repository and copy environment file:
|
||||||
```bash
|
```bash
|
||||||
cd /path/to/willhaben-tracker
|
git clone <repo-url> && cd willhaben-tracker
|
||||||
cp .env.example .env
|
cp .env.example .env
|
||||||
# Edit .env with your credentials
|
```
|
||||||
|
|
||||||
|
2. Edit `.env` — at minimum set these values:
|
||||||
|
- `TELEGRAM_BOT_TOKEN` — get from @BotFather on Telegram
|
||||||
|
- `POSTGRES_PASSWORD` — any secure password
|
||||||
|
|
||||||
|
3. Start all services:
|
||||||
|
```bash
|
||||||
docker compose up -d --build
|
docker compose up -d --build
|
||||||
```
|
```
|
||||||
|
|
||||||
|
4. Verify health:
|
||||||
|
```bash
|
||||||
|
# All containers should show "Up" and the db container "healthy"
|
||||||
|
docker compose ps
|
||||||
|
|
||||||
|
# Worker logs should show "Bot started with long polling"
|
||||||
|
docker compose logs worker | grep "started"
|
||||||
|
```
|
||||||
|
|
||||||
|
5. Test the bot: send `/start` to your bot on Telegram, then `/add <keyword>` to create a search.
|
||||||
|
|
||||||
|
### Updating the Stack
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git pull && docker compose up -d --build
|
||||||
|
```
|
||||||
|
|
||||||
|
New database migrations are applied automatically when the Postgres container restarts with new migration files in `supabase/migrations/`.
|
||||||
|
|
||||||
|
### Backing Up Data
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Full database dump
|
||||||
|
docker compose exec db pg_dump -U postgres postgres > backup_$(date +%F).sql
|
||||||
|
|
||||||
|
# Restore from backup
|
||||||
|
psql -h localhost -p 55632 -U postgres -d postgres < backup_2024-01-01.sql
|
||||||
|
```
|
||||||
|
|
||||||
|
### Troubleshooting Common Issues
|
||||||
|
|
||||||
|
| Problem | Solution |
|
||||||
|
|---------|----------|
|
||||||
|
| Postgres auth error on first boot | Volume may contain stale `pg_hba.conf`. Delete `data/db` and restart. |
|
||||||
|
| Studio shows "unhealthy" but APIs work | Cosmetic health check issue — ignore unless you can't browse tables. |
|
||||||
|
| Bot doesn't respond | Check `docker compose logs worker` for errors. Verify token in `.env`. |
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
Edit `.env` before first startup. All values are read by the worker and database services.
|
||||||
|
|
||||||
|
| Variable | Description | Default |
|
||||||
|
|---------------------------|----------------------------------------------------|--------------------------------|
|
||||||
|
| `TELEGRAM_BOT_TOKEN` | Token from @BotFather | (required) |
|
||||||
|
| `POSTGRES_USER` | PostgreSQL username | `postgres` |
|
||||||
|
| `POSTGRES_PASSWORD` | PostgreSQL password | (required) |
|
||||||
|
| `POSTGRES_DB` | Database name | `postgres` |
|
||||||
|
| `JWT_SECRET` | PostgREST JWT signing key | auto-generated default |
|
||||||
|
| `DEFAULT_INTERVAL_MINUTES`| Default scrape interval per search query | `60` |
|
||||||
|
| `ADMIN_TELEGRAM_IDS` | Comma-separated Telegram IDs with admin privileges | (none) |
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
| Service | Image | Port | Purpose |
|
||||||
|
|----------|------------------------------------|-------|----------------------------------|
|
||||||
|
| db | postgres:15-alpine | 55632 | PostgreSQL database |
|
||||||
|
| rest | postgrest/postgrest:v12.2.0 | — | REST API over the database |
|
||||||
|
| kong | kong:2.8.1 | 55621 | Reverse proxy / gateway |
|
||||||
|
| studio | supabase/studio | 55630 | Supabase Studio admin UI |
|
||||||
|
| meta | supabase/postgres-meta:v0.84.2 | — | Database metadata service |
|
||||||
|
| worker | (built from ./worker) | — | Scraper + Telegram bot |
|
||||||
|
|
||||||
## Telegram Commands
|
## Telegram Commands
|
||||||
|
|
||||||
- `/start` — Welcome + usage instructions (whitelisted only)
|
All users must be whitelisted before use. Run `/start` to activate your account.
|
||||||
- `/add "keyword"` — Create new search query
|
|
||||||
- `/list` — Show active queries
|
| Command | Access | Description |
|
||||||
- `/pause <id>` / `/resume <id>` — Toggle query
|
|--------------------|-------------|--------------------------------------------------|
|
||||||
- `/delete <id>` — Remove query
|
| `/start` | Anyone | Activate account and show help message |
|
||||||
- `/stats` — Tracking statistics
|
| `/add <keyword>` | Active user | Subscribe to keyword (shared across users) |
|
||||||
|
| `/list` | Active user | List your subscriptions with subscriber counts |
|
||||||
|
| `/delete <keyword_id>` | Active user | Unsubscribe from a keyword |
|
||||||
|
| `/stats` | Active user | Show queries count, ads tracked, notifications |
|
||||||
|
| `/adduser <id> [admin]` | Admin only | Add or promote a user by Telegram ID |
|
||||||
|
| `/removeuser <id>` | Admin only | Remove a user from the bot |
|
||||||
|
| `/users` | Admin only | List all registered users and their roles |
|
||||||
|
|
||||||
|
## Default Admin
|
||||||
|
|
||||||
|
On first boot, Telegram ID `298181113` is seeded as an admin user. Add additional admins via `/adduser <telegram_id> admin`.
|
||||||
|
|||||||
+71
-76
@@ -12,7 +12,7 @@ from telegram.ext import Application, ExtBot
|
|||||||
|
|
||||||
from db import close_pool, get_pool
|
from db import close_pool, get_pool
|
||||||
from scraper import extract_ad_fields, fetch_ads
|
from scraper import extract_ad_fields, fetch_ads
|
||||||
from notifier import log_notification, mark_notified, notify_new_ad
|
from notifier import log_notification, notify_new_ad, notify_price_drop
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -23,117 +23,112 @@ async def scheduler_task(pool: object, bot: ExtBot) -> None:
|
|||||||
while True:
|
while True:
|
||||||
try:
|
try:
|
||||||
rows = await pool.fetch(
|
rows = await pool.fetch(
|
||||||
"""
|
"SELECT id, keyword, interval_minutes, initial_loaded FROM keywords "
|
||||||
SELECT sq.id, sq.keyword, sq.interval_minutes, u.telegram_id
|
"WHERE is_active = true "
|
||||||
FROM search_queries sq
|
"AND (last_scraped_at IS NULL OR last_scraped_at < now() - (interval_minutes || ' minutes')::interval)"
|
||||||
JOIN users u ON sq.user_id = u.id
|
|
||||||
WHERE sq.is_active = true
|
|
||||||
AND (sq.last_scraped_at IS NULL OR
|
|
||||||
sq.last_scraped_at < now() - (sq.interval_minutes || ' minutes')::interval)
|
|
||||||
"""
|
|
||||||
)
|
)
|
||||||
|
|
||||||
for row in rows:
|
for row in rows:
|
||||||
query_id = str(row["id"])
|
kw_id = str(row["id"])
|
||||||
keyword = row["keyword"]
|
keyword = row["keyword"]
|
||||||
telegram_id = row["telegram_id"]
|
initial_loaded = row["initial_loaded"]
|
||||||
|
|
||||||
logger.info("Scraping keyword '%s' for query %s", keyword, query_id)
|
subs = await pool.fetch(
|
||||||
|
"SELECT telegram_id FROM users u JOIN keyword_subscriptions ks ON u.id = ks.user_id "
|
||||||
|
"WHERE ks.keyword_id = $1 AND u.is_active = true",
|
||||||
|
kw_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not subs:
|
||||||
|
await pool.execute("UPDATE keywords SET is_active = false WHERE id = $1", kw_id)
|
||||||
|
continue
|
||||||
|
|
||||||
|
telegram_ids = [sub["telegram_id"] for sub in subs]
|
||||||
|
logger.info("Scraping keyword '%s' (%d subscriber(s))", keyword, len(telegram_ids))
|
||||||
|
|
||||||
try:
|
try:
|
||||||
ads_raw, total_hits = await fetch_ads(keyword)
|
ads_raw, total_hits = await fetch_ads(keyword)
|
||||||
new_count = 0
|
new_count = 0
|
||||||
|
|
||||||
|
if not initial_loaded and len(ads_raw) > 0:
|
||||||
|
logger.info("Initial baseline load for '%s' — indexing %d ads, no notifications", keyword, len(ads_raw))
|
||||||
|
|
||||||
for ad_data in ads_raw:
|
for ad_data in ads_raw:
|
||||||
fields = extract_ad_fields(ad_data)
|
fields = extract_ad_fields(ad_data)
|
||||||
wh_ad_id = fields["wh_ad_id"]
|
wh_ad_id = fields["wh_ad_id"]
|
||||||
|
is_price_drop = False
|
||||||
|
old_price = None
|
||||||
|
new_price = None
|
||||||
|
|
||||||
existing = await pool.fetchrow(
|
existing = await pool.fetchrow(
|
||||||
"SELECT id FROM ads WHERE wh_ad_id = $1",
|
"SELECT id, price FROM ads WHERE wh_ad_id = $1",
|
||||||
wh_ad_id,
|
wh_ad_id,
|
||||||
)
|
)
|
||||||
|
|
||||||
if not existing:
|
if not existing:
|
||||||
ad_row = await pool.fetchrow(
|
ad_row = await pool.fetchrow(
|
||||||
"""
|
"INSERT INTO ads (wh_ad_id, raw_json, title, price, location, url, published_at, main_image_url, postcode, modified_at) "
|
||||||
INSERT INTO ads (wh_ad_id, raw_json, title, price, location, url, published_at)
|
"VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) RETURNING id",
|
||||||
VALUES ($1, $2, $3, $4, $5, $6, $7)
|
wh_ad_id, json.dumps(ad_data), fields["title"], fields["price"],
|
||||||
RETURNING id
|
fields["location"], fields["url"], fields.get("published_at"),
|
||||||
""",
|
fields.get("main_image_url"), fields.get("postcode"), fields.get("modified_at"),
|
||||||
wh_ad_id,
|
|
||||||
json.dumps(ad_data),
|
|
||||||
fields["title"],
|
|
||||||
fields["price"],
|
|
||||||
fields["location"],
|
|
||||||
fields["url"],
|
|
||||||
fields.get("published_at"),
|
|
||||||
)
|
)
|
||||||
ad_uuid = str(ad_row["id"])
|
ad_uuid = str(ad_row["id"])
|
||||||
else:
|
else:
|
||||||
ad_uuid = str(existing["id"])
|
ad_uuid = str(existing["id"])
|
||||||
|
old_price = existing["price"]
|
||||||
|
new_price = fields["price"]
|
||||||
|
|
||||||
existing_qa = await pool.fetchrow(
|
if old_price is not None and new_price is not None and new_price < old_price:
|
||||||
"SELECT 1 FROM query_ads WHERE search_query_id = $1 AND ad_id = $2",
|
|
||||||
query_id,
|
|
||||||
ad_uuid,
|
|
||||||
)
|
|
||||||
|
|
||||||
if not existing_qa:
|
|
||||||
await pool.execute(
|
await pool.execute(
|
||||||
"INSERT INTO query_ads (search_query_id, ad_id) VALUES ($1, $2)",
|
"UPDATE ads SET price = $1, main_image_url = $2, postcode = $3, modified_at = $4 WHERE id = $5",
|
||||||
query_id,
|
new_price, fields.get("main_image_url"), fields.get("postcode"), fields.get("modified_at"), ad_uuid,
|
||||||
ad_uuid,
|
)
|
||||||
|
await pool.execute(
|
||||||
|
"INSERT INTO price_history (ad_id, old_price, new_price) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
|
||||||
|
ad_uuid, old_price, new_price,
|
||||||
|
)
|
||||||
|
is_price_drop = True
|
||||||
|
else:
|
||||||
|
if fields.get("main_image_url") or fields.get("postcode"):
|
||||||
|
await pool.execute(
|
||||||
|
"UPDATE ads SET main_image_url = COALESCE($1, main_image_url), postcode = COALESCE($2, postcode) WHERE id = $3 AND (main_image_url IS NULL OR postcode IS NULL)",
|
||||||
|
fields.get("main_image_url"), fields.get("postcode"), ad_uuid,
|
||||||
)
|
)
|
||||||
|
|
||||||
user_row = await pool.fetchrow(
|
if not initial_loaded:
|
||||||
"SELECT id FROM users WHERE telegram_id = $1",
|
|
||||||
telegram_id,
|
|
||||||
)
|
|
||||||
user_id = str(user_row["id"]) if user_row else None
|
|
||||||
|
|
||||||
notify_fields = {**fields, "keyword": keyword}
|
notify_fields = {**fields, "keyword": keyword}
|
||||||
await notify_new_ad(bot, telegram_id, notify_fields)
|
for tg_id in telegram_ids:
|
||||||
|
await notify_new_ad(bot, tg_id, notify_fields)
|
||||||
if user_id:
|
|
||||||
await mark_notified(pool, query_id, ad_uuid)
|
|
||||||
try:
|
|
||||||
msg_id = 0
|
|
||||||
await log_notification(pool, user_id, ad_uuid, msg_id)
|
|
||||||
except Exception:
|
|
||||||
logger.exception("Failed to log notification")
|
|
||||||
|
|
||||||
new_count += 1
|
new_count += 1
|
||||||
logger.info(
|
|
||||||
"New ad %s found for query %s (keyword=%s)",
|
if is_price_drop:
|
||||||
wh_ad_id,
|
notify_fields = {**fields, "keyword": keyword}
|
||||||
query_id,
|
for tg_id in telegram_ids:
|
||||||
keyword,
|
msg_id_val = await notify_price_drop(bot, tg_id, notify_fields)
|
||||||
)
|
if msg_id_val:
|
||||||
|
user_row = await pool.fetchrow("SELECT id FROM users WHERE telegram_id = $1", tg_id)
|
||||||
|
if user_row:
|
||||||
|
try:
|
||||||
|
await log_notification(pool, str(user_row["id"]), ad_uuid, msg_id_val)
|
||||||
|
except Exception:
|
||||||
|
logger.exception("Failed to log price drop notification")
|
||||||
|
|
||||||
|
if not initial_loaded:
|
||||||
|
await pool.execute("UPDATE keywords SET initial_loaded = true WHERE id = $1", kw_id)
|
||||||
|
|
||||||
|
await pool.execute("UPDATE keywords SET last_scraped_at = now() WHERE id = $1", kw_id)
|
||||||
|
|
||||||
await pool.execute(
|
await pool.execute(
|
||||||
"UPDATE search_queries SET last_scraped_at = now() WHERE id = $1",
|
"INSERT INTO scrape_logs (keyword_id, status, ads_found, new_ads) VALUES ($1, 'success', $2, $3)",
|
||||||
query_id,
|
kw_id, len(ads_raw), new_count,
|
||||||
)
|
|
||||||
|
|
||||||
await pool.execute(
|
|
||||||
"""
|
|
||||||
INSERT INTO scrape_logs (search_query_id, status, ads_found, new_ads)
|
|
||||||
VALUES ($1, 'success', $2, $3)
|
|
||||||
""",
|
|
||||||
query_id,
|
|
||||||
len(ads_raw),
|
|
||||||
new_count,
|
|
||||||
)
|
)
|
||||||
|
|
||||||
except Exception:
|
except Exception:
|
||||||
logger.exception("Error scraping keyword '%s' (query %s)", keyword, query_id)
|
logger.exception("Error scraping keyword '%s' (%s)", keyword, kw_id)
|
||||||
await pool.execute(
|
await pool.execute(
|
||||||
"""
|
"INSERT INTO scrape_logs (keyword_id, status, error_message) VALUES ($1, 'error', $2)",
|
||||||
INSERT INTO scrape_logs (search_query_id, status, error_message)
|
kw_id, str(sys.exc_info()[1]),
|
||||||
VALUES ($1, 'error', $2)
|
|
||||||
""",
|
|
||||||
query_id,
|
|
||||||
str(sys.exc_info()[1]),
|
|
||||||
)
|
)
|
||||||
|
|
||||||
await asyncio.sleep(5)
|
await asyncio.sleep(5)
|
||||||
|
|||||||
+21
-3
@@ -78,11 +78,26 @@ def extract_ad_fields(ad_dict: dict[str, Any]) -> dict[str, Any]:
|
|||||||
|
|
||||||
# Published time from CHANGED_String or PUBLISHED_String (ISO 8601)
|
# Published time from CHANGED_String or PUBLISHED_String (ISO 8601)
|
||||||
published_raw = attrs.get("PUBLISHED_String") or attrs.get("CHANGED_String")
|
published_raw = attrs.get("PUBLISHED_String") or attrs.get("CHANGED_String")
|
||||||
published_at: str | None = None
|
published_at: datetime | None = None
|
||||||
if published_raw:
|
if published_raw:
|
||||||
try:
|
try:
|
||||||
dt = datetime.fromisoformat(published_raw.replace("Z", "+00:00"))
|
published_at = datetime.fromisoformat(published_raw.replace("Z", "+00:00"))
|
||||||
published_at = dt.isoformat()
|
except (ValueError, TypeError):
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Main image from the first advertImage entry
|
||||||
|
images = ad_dict.get("advertImageList", {}).get("advertImage", [])
|
||||||
|
main_image_url: str | None = None
|
||||||
|
if images and isinstance(images[0], dict):
|
||||||
|
main_image_url = images[0].get("referenceImageUrl")
|
||||||
|
|
||||||
|
postcode = attrs.get("POSTCODE")
|
||||||
|
|
||||||
|
modified_raw = attrs.get("CHANGED_String")
|
||||||
|
modified_at: datetime | None = None
|
||||||
|
if modified_raw:
|
||||||
|
try:
|
||||||
|
modified_at = datetime.fromisoformat(modified_raw.replace("Z", "+00:00"))
|
||||||
except (ValueError, TypeError):
|
except (ValueError, TypeError):
|
||||||
pass
|
pass
|
||||||
|
|
||||||
@@ -93,4 +108,7 @@ def extract_ad_fields(ad_dict: dict[str, Any]) -> dict[str, Any]:
|
|||||||
"location": location,
|
"location": location,
|
||||||
"url": url,
|
"url": url,
|
||||||
"published_at": published_at,
|
"published_at": published_at,
|
||||||
|
"main_image_url": main_image_url,
|
||||||
|
"postcode": postcode,
|
||||||
|
"modified_at": modified_at,
|
||||||
}
|
}
|
||||||
|
|||||||
Reference in New Issue
Block a user