TL;DR: I get 60–120 emails a day across 3 accounts (work + 2 personal). Roughly 8% deserve a human reply within 24h; the rest are share-notifications, calendar invites, marketing, and bot pings. I built Mail-Assistant, a local AI triage layer that reads my inbox via IMAP, classifies each thread as P0/P1/P2/NOISE with a cross-channel check (Jira + Sent + staleness), and renders a 3-section desktop UI. Triage time dropped from ~90 min/day to ~8 min. The PM learnings below are more useful than the code.
JTBD
When I open my laptop at 8am and see 60+ unread threads, I want to know which 3–7 I actually need to reply to today, so that I don’t burn 90 minutes triaging before any real work happens.
The pain isn’t “too many emails.” The pain is decision fatigue per row: every single message asks me to classify, prioritize, draft a reply, file it, snooze it. ADHD brain treats that as 60 simultaneous open loops.
Existing alternatives — and why they fail this JTBD
| Tool | Why it didn’t work for me |
|---|---|
| Gmail Priority Inbox | Optimizes for click-through, not “needs reply.” False-positive rate ~40%. |
| SaneBox / Superhuman | Subscription, vendor lock-in, no Jira/Slack cross-check. Same recall/precision tradeoff. |
| Manual labels | Maintenance > value. Stale within 2 weeks. |
| “Just declare email bankruptcy” | Lost 2 client opportunities the first month. |
The gap: none of them know that the Jira ticket the email refers to is already Done, or that I already replied 6 minutes ago from my phone. Without cross-channel context, every classifier defaults to false-positive.
Five product decisions that killed 87% of the noise
1. NOISE is the default class, not P2
First version had 4 classes (P0/P1/P2/Archive). 70% landed in P2 (“maybe later”) — same problem as before.
Reframed: default = NOISE, classifier must justify promotion to P0/P1/P2. Same 568 messages → 7 P0/P1, the rest collapsed.
Lesson: when in doubt, the default should be the lowest-cost outcome. For triage, that’s “don’t surface.”
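A minimal sketch of the "NOISE unless justified" rule, assuming the classifier returns a label plus a free-text justification. The names `resolve_label`, `Verdict`, and `PROMOTABLE` are hypothetical and not taken from Mail-Assistant itself.

```python
from dataclasses import dataclass
from typing import Optional

# Labels the classifier is allowed to promote a thread to.
PROMOTABLE = {"P0", "P1", "P2"}


@dataclass
class Verdict:
    label: str
    reason: str


def resolve_label(raw_label: Optional[str], justification: Optional[str]) -> Verdict:
    """Keep a promotion only if the label is known AND a reason was given."""
    label = (raw_label or "").strip().upper()
    reason = (justification or "").strip()
    if label in PROMOTABLE and reason:
        return Verdict(label, reason)
    # Unknown label, empty justification, or no output at all: do not surface.
    return Verdict("NOISE", "no justified promotion")
```

The point of the shape: every failure mode (malformed output, missing reason, an old class like Archive) lands on the cheapest outcome instead of on a "maybe later" pile.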
2. Cross-channel verify > prompt tuning
P0 candidate: “Server is down, please reply ASAP.”
Naive classifier: P0. Cross-channel check: the Jira ticket linked in the body is already Done, and I sent a reply 14 minutes ago from my phone. Real class: NOISE.
Adding 2 extra signals (Jira status + Sent timestamp) cut false-positive P0s from 7–9 to 2–3 per day. Same prompt, more context.
PM lesson: most “AI accuracy problems” are really input-context problems. Tune retrieval before tuning the model.
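Roughly, the cross-channel check can be sketched as a post-filter on the classifier's output. This assumes the Jira status and the latest Sent-folder timestamp for the thread have already been fetched elsewhere and are passed in as plain values. The function name `verify_p0`, the `DONE_STATUSES` set, and the rule that either signal alone is enough to demote are my assumptions for illustration; the real combination logic may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: only a ticket in this status counts as resolved.
DONE_STATUSES = {"done"}


def verify_p0(
    candidate: str,
    jira_status: Optional[str],
    last_sent_at: Optional[datetime],
    received_at: datetime,
) -> str:
    """Demote a P0 candidate when another channel shows it is already handled."""
    if candidate != "P0":
        return candidate
    # Signal 1: the linked Jira ticket is already closed.
    if jira_status is not None and jira_status.strip().lower() in DONE_STATUSES:
        return "NOISE"
    # Signal 2: I already replied in this thread after the message arrived.
    if last_sent_at is not None and last_sent_at >= received_at:
        return "NOISE"
    return "P0"
```

Nothing here touches the prompt. The model's answer stays the same; the extra inputs decide whether it survives.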
3. Thread-latest only
Email threads contain dozens of messages, but the only one that matters for classification is the latest reply. Older messages bias the classifier toward old states (“project on hold” from 3 weeks ago when it’s now active).
One row per thread, classifier sees only the most recent message. Same prompt, 12% accuracy lift.
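Collapsing to one row per thread is a small grouping step. A sketch, assuming each message carries a thread identifier and a timestamp (the `Message` fields and `latest_per_thread` are hypothetical names):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable


@dataclass(frozen=True)
class Message:
    thread_id: str
    sent_at: datetime
    body: str


def latest_per_thread(messages: Iterable[Message]) -> Dict[str, Message]:
    """Keep only the most recent message of each thread."""
    latest: Dict[str, Message] = {}
    for msg in messages:
        current = latest.get(msg.thread_id)
        if current is None or msg.sent_at > current.sent_at:
            latest[msg.thread_id] = msg
    return latest
```

Only the values of the returned dict go to the classifier, so a three-week-old "project on hold" never reaches the prompt.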
4. Share-notifications are always NOISE
Emails matching (via Google Sheets|Notion|Dropbox|Figma|...) with body containing “shared with you” → hard-coded NOISE. No LLM call.
This isn’t a model improvement, it’s a product decision: those notifications never need a reply. Skipping them saves cost and removes 22% of daily volume from classification entirely.
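The rule is cheap to express as a pre-filter that runs before any model call. In this sketch the pattern covers only the four services named above (the real list is longer and is not reproduced here), and matching it against the sender display name is an assumption. `is_share_notification`, `triage`, and the `classify` callback are hypothetical names.

```python
import re
from typing import Callable

# Only the services named in the post; the full list is longer.
SHARE_SOURCE = re.compile(r"via Google Sheets|Notion|Dropbox|Figma", re.IGNORECASE)
SHARE_PHRASE = "shared with you"


def is_share_notification(sender: str, body: str) -> bool:
    """True when both the source pattern and the share phrase are present."""
    return bool(SHARE_SOURCE.search(sender)) and SHARE_PHRASE in body.lower()


def triage(sender: str, body: str, classify: Callable[[str], str]) -> str:
    """Short-circuit share-notifications; only everything else hits the model."""
    if is_share_notification(sender, body):
        return "NOISE"  # hard-coded, no LLM call
    return classify(body)
```

Because the check runs first, the classifier callback is never invoked for these emails, which is where the cost saving comes from.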
5. One surface, not many
Early temptation: also ship a Slack digest, mobile push notifications, a daily email summary. Killed all of them.
The desktop app is the only surface. Every additional channel = more maintenance, more sync bugs, more attention surface. Less surface = less noise.
Architecture (short version)
IMAP pull (every 5 min)
↓
Classifier (Claude Haiku 4.5)
+ Jira status check (REST API)
+ Sent-folder staleness check
↓
Postgres (thread state + classification + action_summary)
↓
Desktop UI (3 collapsible sections: P0 / P1 / P2)
↓
Action buttons: Done / Archive → propagates to Gmail (archive INBOX + apply label)
No snooze button. No “maybe later” state. Done or Archive — that’s it. Less state = less decision fatigue.
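The two-button rule can be enforced in the type itself. A sketch, assuming each action maps to "remove INBOX, add one label"; the label names and the `(operation, label)` tuple format are hypothetical, and the actual Gmail call is left out.

```python
from enum import Enum
from typing import List, Tuple


class Action(Enum):
    DONE = "done"
    ARCHIVE = "archive"
    # Deliberately no SNOOZE and no MAYBE_LATER member.


# Hypothetical label names, one per action.
LABELS = {
    Action.DONE: "MailAssistant/Done",
    Action.ARCHIVE: "MailAssistant/Archived",
}


def gmail_ops(action: Action) -> List[Tuple[str, str]]:
    """Translate a button press into mailbox operations to propagate."""
    return [("remove_label", "INBOX"), ("add_label", LABELS[action])]
```

With an enum, `Action("snooze")` raises `ValueError`, so a third state cannot sneak in through the UI or the database without a deliberate code change.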
Numbers from week 4
| Metric | Before | After |
|---|---|---|
| Daily triage time | ~90 min | ~8 min |
| False-positive P0/day | 7–9 | 2–3 |
| Threads requiring manual classification | 60–120 | 7 |
| Cost/day (Haiku tokens) | — | ~$0.04 |
What I’d tell a PM building something similar
- Default to “do not surface.” Every notification is an interruption tax.
- Cross-channel context beats prompt engineering. Add 1 extra signal before tuning the model.
- Drop “Maybe Later.” It’s a snooze button for your future self’s anxiety, not a product feature.
- Pick one surface, stay there. Multi-surface attention is the actual problem you’re solving.
- Action-first row design. The row title should be the action you’d take (“Reply to vendor re: SLA”), not the email subject.
Cost: ~$1.20/month. Replaces 80 minutes/day of unpaid attention labor. ROI is not in dollars.