● Shipped P0 Size M Vertical app

Mail-Assistant — Notes

Chronological decision log, gotchas, and the reclassify cost trap.

Notes & Decision Log

Format: YYYY-MM-DD — context — decision/finding.

A note on dates: the project page is dated 2025-09-15 to match the narrative arc of the public blog post that introduced the project. The deploy to the cron-host VM happened on 2026-05-18. The 2-day sprint is logged below by sprint day (Day 1, Day 2) rather than by calendar date.

Decisions

  • Day 1 AM — Started with the obvious 4-class classifier (P0/P1/P2/Archive) over the most recent 568 messages. Result: 70% landed in P2 — same triage problem as before. Reframed to NOISE-as-default; classifier must justify any promotion. Same prompt structure, dramatically different surface.
  • Day 1 AM — Picked Postgres 16 over SQLite. The split into 3 systemd timers + 1 API process = concurrent writers; pg handles that without LSAT-level WAL tuning.
  • Day 1 PM — Added cross-channel verify (Jira REST + Sent-folder staleness). P0 candidate “Server is down, please reply ASAP” — Jira ticket already Done, replied 14 min ago from phone. Real class: NOISE. Adding 2 extra signals cut false-positive P0 from 7–9/day → 2–3/day.
  • Day 1 PM — Decision: hard-rules before LLM, not as a post-filter. Share-noti regex skips the LLM entirely → 22% of daily volume is free.
  • Day 2 AM — Thread-latest-only classification. Old replies bias the model toward stale states (“project on hold” 3 weeks ago when it’s now active). One row per thread, classifier sees only the most recent message. +12% accuracy on a 200-row eval.
  • Day 2 AM — Action-verb row title. The row shows “Reply to vendor re: SLA breach”, not “Re: [URGENT][FW: FW:] SLA”. Constrained in the prompt as verb-first, max 60 chars.
  • Day 2 AM — Killed the Snooze button. “Maybe Later” is anxiety as a feature — surfaces the same row tomorrow with no new context. Only Done and Archive remain.
  • Day 2 midday — Native SwiftUI over Electron. This is a daily-driver tool; NSPanel + Liquid Glass aesthetic, instant launch, no Chromium overhead. No telemetry, no crash reporter, no analytics.
  • Day 2 midday — 3 collapsible sections (P0 / P1 / P2). NOISE never surfaces. Keyboard shortcuts: D (Done), A (Archive), J/K (next/prev).
  • Day 2 PM — Killed the planned Slack digest + iOS push. Multi-surface attention is the actual problem this product solves; adding more surfaces undoes the win.
  • Day 2 PM — Action sync propagates to Gmail (archive INBOX + apply Mail-Assistant/Done label) so the inbox stays clean if I ever check it elsewhere.
  • Day 2 PM — Deployed to the cron-host VM via systemd. Cloudflare named tunnel + bearer; optional CF Access if I want to lock it to a Google identity.
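
The two ordering decisions above (hard rules run before the LLM, and NOISE is the default unless a promotion is justified) can be sketched as follows. This is a minimal illustration, not the project's actual code: the regex, the `llm` callable, and the `label`/`justification` response shape are all hypothetical stand-ins, since the notes do not show the real rules or prompt.

```python
import re
from typing import Callable, Optional

# Hypothetical share-notification pattern; the real hard rules are not in the notes.
SHARE_NOTI = re.compile(
    r"\b(shared (a|an|the) \w+ with you|invited you to (edit|view|comment))\b",
    re.IGNORECASE,
)

PROMOTIONS = {"P0", "P1", "P2"}


def hard_rule(subject: str) -> Optional[str]:
    """Return a label if a hard rule decides the message, else None."""
    return "NOISE" if SHARE_NOTI.search(subject) else None


def classify(subject: str, llm: Callable[[str], dict]) -> str:
    """Hard rules first; the LLM is only called for messages no rule decided."""
    ruled = hard_rule(subject)
    if ruled is not None:
        return ruled  # no LLM call, so this message costs nothing

    verdict = llm(subject)  # assumed shape: {"label": ..., "justification": ...}
    label = verdict.get("label")
    justification = (verdict.get("justification") or "").strip()

    # NOISE-as-default: a promotion only counts if it comes with a reason.
    if label in PROMOTIONS and justification:
        return label
    return "NOISE"
```

Running the rule as a pre-filter rather than a post-filter is what makes the share-notification share of volume free: the model is never invoked for those messages.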

Gotchas

  • Day 1 — IMAP thread resolution: References: header chains break across providers (some strip them). Fell back to a normalized-subject + 24h-window heuristic for the gaps. ~3% of threads still mis-resolve; the impact is one extra row, not a missed P0.
  • Day 1 — imaplib default timeout = None. A hung connection blocked the timer indefinitely. Fix: explicit socket.setdefaulttimeout(30) + UID-resume on next fire.
  • Day 1 — Postgres UNIQUE (thread_id, classified_for_latest_uid) was the right idempotency key. Earlier draft keyed on thread_id alone — meant a new reply never re-classified. Caught by a test fixture where a new reply landed and the surface didn’t update.
  • Day 2 AM — Future-tense gate: “the meeting will happen Friday” was P0-classified because of “will”. Added a date-extraction rule capping at P2 when the target date is more than 7 days out.
  • Day 2 AM — Local model bake-off (Llama 3.2 3B via Ollama) — false-positive on share-notis even after prompt tuning. Haiku 4.5 got share-noti + ambiguity right out of the box at $0.04/day. Skipped local.
  • Day 2 midday — SwiftUI NSPanel doesn’t capture keyboard input by default. Needed becomeKey = true + canBecomeKeyWindow override.
  • Day 2 midday — Liquid Glass aesthetic requires NSVisualEffectView behind every List row, otherwise the blur shows the menu bar. Wrapped the whole surface in a single BackgroundBlurView.
  • Day 2 PM — Gmail OAuth: gmail.modify scope is enough for label + archive; gmail.send was explicitly NOT requested. Avoids any “Mail-Assistant could send mail as you” prompt.
  • Day 2 PM — Cloudflare bearer + tunnel: the tunnel terminates TLS at the CF edge, but the bearer header is forwarded intact. Verified end-to-end with curl -H "Authorization: Bearer ...".
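
The idempotency-key gotcha above is easier to see in a small sketch. The project uses Postgres 16; the example below substitutes Python's built-in sqlite3 purely so it is self-contained. The table name `classifications` and the `label` column are hypothetical; only the `(thread_id, classified_for_latest_uid)` key comes from the notes.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the Postgres table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE classifications (
            thread_id TEXT NOT NULL,
            classified_for_latest_uid INTEGER NOT NULL,
            label TEXT NOT NULL,
            UNIQUE (thread_id, classified_for_latest_uid)
        )
        """
    )
    return conn


def record_classification(
    conn: sqlite3.Connection, thread_id: str, latest_uid: int, label: str
) -> bool:
    """Write one row per (thread, latest UID); return True if a row was added.

    A timer that re-fires on the same latest message hits the UNIQUE
    constraint and is ignored. A new reply has a new UID, so it inserts
    and the thread is re-classified.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO classifications VALUES (?, ?, ?)",
        (thread_id, latest_uid, label),
    )
    return cur.rowcount == 1
```

With a key on `thread_id` alone, the second insert for a thread would always be ignored, which is the "new reply never re-classified" bug. In Postgres the equivalent of `INSERT OR IGNORE` would be `INSERT ... ON CONFLICT DO NOTHING`.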

The reclassify cost trap

The single most expensive operational lesson of the 2-day sprint:

Every prompt pivot tempted a full-corpus re-sweep on all 568 messages to validate. Two days of that burned ~$30–40 in tokens before the lesson stuck.

Fix: always test prompt changes on a 30-row stratified subset (proportional NOISE/P0/P1/P2 mix from prior labels) first. Only sweep the full corpus once the subset metrics are stable. This trims a typical experiment cycle from ~$2 to ~$0.10.
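
One way to draw such a subset is sketched below. It assumes prior labels are available as `(message_id, label)` pairs, which is a hypothetical shape. It also makes one choice the notes do not specify: every class present in the corpus gets at least one row, so that rare classes such as P0 are not rounded out of a 30-row sample.

```python
import random
from collections import defaultdict


def stratified_subset(rows, size=30, seed=0):
    """Sample `size` rows, roughly proportional to each label's share.

    rows: list of (message_id, prior_label) pairs (assumed shape).
    """
    total = len(rows)
    if size >= total:
        return list(rows)

    rng = random.Random(seed)  # fixed seed keeps the subset stable across runs
    by_label = defaultdict(list)
    for message_id, label in rows:
        by_label[label].append(message_id)

    # Floored proportional quota, but never zero for a class that exists.
    alloc = {
        label: max(1, size * len(ids) // total) for label, ids in by_label.items()
    }

    # Correct rounding drift until the quotas sum to exactly `size`.
    while sum(alloc.values()) != size:
        if sum(alloc.values()) < size:
            # Give the extra row to the class with the most unsampled rows.
            label = max(by_label, key=lambda l: len(by_label[l]) - alloc[l])
            alloc[label] += 1
        else:
            # Take a row back from the class with the largest quota.
            label = max(alloc, key=alloc.get)
            alloc[label] -= 1

    subset = []
    for label, ids in by_label.items():
        subset.extend((mid, label) for mid in rng.sample(ids, alloc[label]))
    return subset
```

Keeping the seed fixed means successive prompt variants are scored against the same 30 rows, so metric changes reflect the prompt rather than the sample.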

The deeper rule: classification cost in personal AI tooling is real and additive. A throwaway script run twice a day on 568 messages with a 1¢/run model is still $7.30/year (2 runs × 365 days × $0.01), about half of the $14.40/year that the $1.20/month target allows. Subset-testing isn’t pedantic frugality; it’s the discipline that lets the project hit that target.

Working-session log

Date | Hours | What | Outcome
Day 1 AM | ~3 h | IMAP poller, Postgres schema, naive 4-class classifier | First 568 messages classified, 70% landed P2
Day 1 PM | ~3 h | NOISE-default reframe + cross-channel verify (Jira + Sent) | False-positive P0 9 → 3
Day 2 AM | ~3 h | Thread-latest-only + share-noti hard-rule + action-verb summary | LLM volume drops 22%, accuracy +12%
Day 2 midday | ~3 h | SwiftUI app — 3 sections, Done/Archive, shortcuts | Surface complete
Day 2 PM | ~2 h | Gmail action sync + tunnel + bearer | End-to-end works
Day 2 eve | ~1 h | Deploy to cron-host VM, smoke on backlog | 7 P0/P1 surfaced from 568 messages
Total | ~15 hours | 2-day sprint | Shipped, hit all DoD metrics