Notes & Decision Log

Format: YYYY-MM-DD — context — decision/finding.

A note on dates: the project page is dated 2025-09-15 to match the narrative arc of the public blog post that introduced the project. The deploy to the cron-host VM happened on 2026-05-18; the 2-day sprint is captured below as Day 1 and Day 2.

Decisions

Day 1 AM — Started with the obvious 4-class classifier (P0/P1/P2/Archive) over the most recent 568 messages. Result: 70% landed in P2 — same triage problem as before. Reframed to NOISE-as-default; classifier must justify any promotion. Same prompt structure, dramatically different surface.
Day 1 AM — Picked Postgres 16 over SQLite. The split into 3 systemd timers + 1 API process = concurrent writers; pg handles that without LSAT-level WAL tuning.
Day 1 PM — Added cross-channel verify (Jira REST + Sent-folder staleness). P0 candidate “Server is down, please reply ASAP” — Jira ticket already Done, replied 14 min ago from phone. Real class: NOISE. Adding 2 extra signals cut false-positive P0 from 7–9/day → 2–3/day.
Day 1 PM — Decision: hard rules run before the LLM, not as a post-filter. The share-notification (share-noti) regex skips the LLM entirely → 22% of daily volume is free.
Day 2 AM — Thread-latest-only classification. Old replies bias the model toward stale states (“project on hold” 3 weeks ago when it’s now active). One row per thread, classifier sees only the most recent message. +12% accuracy on a 200-row eval.
Day 2 AM — Action-verb row title. The row shows “Reply to vendor re: SLA breach”, not “Re: [URGENT][FW: FW:] SLA”. Constrained in the prompt as verb-first, max 60 chars.
Day 2 AM — Killed the Snooze button. “Maybe Later” is anxiety as a feature — surfaces the same row tomorrow with no new context. Only Done and Archive remain.
Day 2 midday — Native SwiftUI over Electron. This is a daily-driver tool; NSPanel + Liquid Glass aesthetic, instant launch, no Chromium overhead. No telemetry, no crash reporter, no analytics.
Day 2 midday — 3 collapsible sections (P0 / P1 / P2). NOISE never surfaces. Keyboard shortcuts: D (Done), A (Archive), J/K (next/prev).
Day 2 PM — Killed the planned Slack digest + iOS push. Multi-surface attention is the actual problem this product solves; adding more surfaces undoes the win.
Day 2 PM — Action sync propagates to Gmail (archive INBOX + apply Mail-Assistant/Done label) so the inbox stays clean if I ever check it elsewhere.
Day 2 PM — Deployed to the cron-host VM via systemd. Cloudflare named tunnel + bearer; optional CF Access if I want to lock it to a Google identity.
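
A minimal sketch of the "hard rules before LLM" ordering combined with NOISE-as-default, under stated assumptions: the regex, the function names, and the llm_classify callable are all hypothetical stand-ins, since the real share-noti pattern and prompt are not recorded in these notes.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern: the real share-notification regex is not in these notes.
SHARE_NOTIFICATION = re.compile(
    r"\b(shared (a|an|the) \w+ with you|invited you to (edit|view|comment))\b",
    re.IGNORECASE,
)

PROMOTED_LABELS = {"P0", "P1", "P2"}


def hard_rule_label(subject: str) -> Optional[str]:
    """Return a label when a deterministic rule matches, else None."""
    if SHARE_NOTIFICATION.search(subject):
        return "NOISE"
    return None


def classify(subject: str, body: str,
             llm_classify: Callable[[str, str], str]) -> str:
    """Run hard rules first; call the LLM only when no rule fires."""
    label = hard_rule_label(subject)
    if label is not None:
        return label  # the LLM is never called for this message
    llm_label = llm_classify(subject, body)
    # NOISE-as-default: anything that is not an explicit promotion stays NOISE.
    return llm_label if llm_label in PROMOTED_LABELS else "NOISE"
```

Putting the rule check ahead of the model call, rather than filtering its output afterwards, is what makes the matched share of volume cost nothing.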

Gotchas

Day 1 — IMAP thread resolution: References: header chains break across providers (some strip them). Fell back to a normalized-subject + 24h-window heuristic for the gaps. ~3% of threads still mis-resolve; the impact is one extra row, not a missed P0.
Day 1 — imaplib default timeout = none. A hung connection blocked the timer indefinitely. Fix: explicit socket.setdefaulttimeout(30) + UID-resume on next fire.
Day 1 — Postgres UNIQUE (thread_id, classified_for_latest_uid) was the right idempotency key. An earlier draft keyed on thread_id alone, which meant a new reply was never re-classified. Caught by a test fixture where a new reply landed and the surface didn’t update.
Day 2 AM — Future-tense gate: “the meeting will happen Friday” was P0-classified because of “will”. Added a date-extraction rule capping at P2 when the target date is more than 7 days out.
Day 2 AM — Local model bake-off (Llama 3.2 3B via Ollama): still produced false positives on share-notis even after prompt tuning. Haiku 4.5 got share-noti + ambiguity right out of the box at $0.04/day. Skipped local.
Day 2 midday — SwiftUI NSPanel doesn’t capture keyboard input by default. Needed becomeKey = true + canBecomeKeyWindow override.
Day 2 midday — Liquid Glass aesthetic requires NSVisualEffectView behind every List row, otherwise the blur shows the menu bar. Wrapped the whole surface in a single BackgroundBlurView.
Day 2 PM — Gmail OAuth: gmail.modify scope is enough for label + archive; gmail.send was explicitly NOT requested. Avoids any “Mail-Assistant could send mail as you” prompt.
Day 2 PM — Cloudflare bearer + tunnel: the tunnel terminates TLS at the CF edge, but the bearer header is forwarded intact. Verified end-to-end with curl -H "Authorization: Bearer ...".
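
The idempotency key from the Day 1 Postgres entry above can be sketched as follows. This is an illustration, not the project's schema: it uses an in-memory SQLite database as a stand-in for Postgres so it runs without a server, and the table name, the label column, and the helper names are assumptions. Only the UNIQUE (thread_id, classified_for_latest_uid) pair comes from the notes.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table keyed on (thread_id, classified_for_latest_uid)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE classification (
            thread_id                 TEXT    NOT NULL,
            classified_for_latest_uid INTEGER NOT NULL,
            label                     TEXT    NOT NULL,
            UNIQUE (thread_id, classified_for_latest_uid)
        )
        """
    )
    return conn


def record(conn: sqlite3.Connection, thread_id: str,
           latest_uid: int, label: str) -> bool:
    """Insert a classification; return False if this (thread, uid) was already stored."""
    cur = conn.execute(
        "INSERT INTO classification (thread_id, classified_for_latest_uid, label) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (thread_id, classified_for_latest_uid) DO NOTHING",
        (thread_id, latest_uid, label),
    )
    conn.commit()
    return cur.rowcount == 1
```

A timer that fires twice on the same latest UID writes once, while a new reply (a higher UID on the same thread) produces a fresh row. That second case is the one the thread_id-only key suppressed.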

The reclassify cost trap

The single most expensive operational lesson of the 2-day sprint:

Every prompt pivot tempted a full-corpus re-sweep on all 568 messages to validate. Two days of that burned ~$30–40 in tokens before the lesson stuck.

Fix: always test prompt changes on a 30-row stratified subset (proportional NOISE/P0/P1/P2 mix from prior labels) first. Only sweep the full corpus once the subset metrics are stable. This trims a typical experiment cycle from ~$2 to ~$0.10.
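
One way to draw that subset, as a sketch rather than the script actually used: rows are assumed to be (message_id, prior_label) tuples, and the function name, the fixed seed, and the largest-remainder rounding are illustrative choices.

```python
import random
from collections import defaultdict


def stratified_subset(rows, size=30, seed=0):
    """Sample `size` rows, keeping each prior label's share of the corpus.

    rows: list of (message_id, prior_label) tuples (an assumed shape).
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    total = len(rows)
    # Largest-remainder allocation so the quotas sum to exactly `size`.
    exact = {label: size * len(items) / total for label, items in by_label.items()}
    quota = {label: int(share) for label, share in exact.items()}
    leftover = size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda lbl: exact[lbl] - quota[lbl], reverse=True)
    for label in by_remainder[:leftover]:
        quota[label] += 1

    subset = []
    for label, items in by_label.items():
        subset.extend(rng.sample(items, min(quota[label], len(items))))
    return subset
```

Scaling the ~$2 full sweep by 30/568 gives roughly $0.11, in line with the ~$0.10 figure. Note that purely proportional allocation can hand a very rare class a quota of zero; if P0 is that rare, a floor of one row per class may be worth adding.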

The deeper rule: classification cost in personal AI tooling is real and additive. A throwaway script run twice a day on 568 messages with a 1¢/run model still comes to about $7.30/year (2 × 365 × $0.01), roughly half of the $14.40/year that the $1.20/month target allows. Subset-testing isn’t pedantic frugality; it’s the discipline that lets the project hit that target.
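
The annualized arithmetic is easy to check in a few lines. This sketch works in integer cents to avoid floating-point drift; the function names are illustrative, and the inputs are the figures quoted above (two runs a day, 1¢ per run, a $1.20/month target).

```python
def annual_cost_cents(runs_per_day: int, cents_per_run: int, days: int = 365) -> int:
    """Yearly cost of a recurring job, in cents."""
    return runs_per_day * cents_per_run * days


def annual_budget_cents(monthly_target_cents: int) -> int:
    """Yearly budget implied by a monthly target, in cents."""
    return monthly_target_cents * 12


script = annual_cost_cents(runs_per_day=2, cents_per_run=1)
budget = annual_budget_cents(120)
print(f"script: ${script / 100:.2f}/year, "
      f"budget: ${budget / 100:.2f}/year, "
      f"share: {script / budget:.0%}")
# prints: script: $7.30/year, budget: $14.40/year, share: 51%
```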

Reference links

Blog post (public PM lens): Mail-Assistant: how I cut inbox triage from 90 min to 8 min
Anthropic Python SDK: https://github.com/anthropics/anthropic-sdk-python
IMAP RFC 3501: https://datatracker.ietf.org/doc/html/rfc3501
RFC 5322 (Internet Message Format): https://datatracker.ietf.org/doc/html/rfc5322
Gmail API users.messages.modify: https://developers.google.com/gmail/api/reference/rest/v1/users.messages/modify
Jira REST v3 issue: https://developer.atlassian.com/cloud/jira/platform/rest/v3/api-group-issues/
Cloudflare Tunnel docs: https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/

Working-session log