Users ask natural language questions about their saved captures — “What restaurants did I save?”, “Summarize the AI articles from this week” — and get contextual answers with inline citations linking back to the originals. The challenge: give the LLM enough context to answer accurately without blowing token budgets or drowning it in irrelevant noise.

Each capture summary averages 200–400 tokens depending on language and enrichment level. Free and Basic plans have a 30K input token budget:

30,000 tok   Free/Basic input budget
 -1,500 tok   system prompt + history
────────
28,500 tok   available for captures

28,500 ÷ ~150 tok  ≈ 190 captures  (optimistic)
28,500 ÷ ~250 tok  ≈ 114 captures  (enriched)
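
The same arithmetic as a small Python sketch. The constant and function names are illustrative and not taken from the Stash codebase; the token figures are the ones quoted above.

```python
FREE_BASIC_INPUT_BUDGET = 30_000   # tokens, Free/Basic input budget
OVERHEAD_TOKENS = 1_500            # system prompt + history


def max_captures(input_budget: int, tokens_per_capture: int,
                 overhead: int = OVERHEAD_TOKENS) -> int:
    """Return how many captures of a given average size fit in the budget."""
    available = input_budget - overhead
    if available <= 0 or tokens_per_capture <= 0:
        return 0
    return available // tokens_per_capture


optimistic = max_captures(FREE_BASIC_INPUT_BUDGET, 150)  # 190
enriched = max_captures(FREE_BASIC_INPUT_BUDGET, 250)    # 114
```

Passing a different `input_budget` gives the equivalent ceiling for another plan.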

Roughly 200 captures is where Long Context hits its ceiling. Below that, the full library fits without trimming. Above it, you either trim (and risk losing the captures the user is asking about) or switch to selective retrieval. Pro gets 150K tokens, but even that runs out at 500+ captures, and irrelevant context hurts answer quality regardless of budget.
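
One way trimming could work, assuming captures are kept newest-first until the budget is spent. The source does not say which captures are dropped, so the recency ordering, the `Capture` type, and the per-capture `tokens` field are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Capture:
    id: str
    created_at: float   # Unix timestamp (hypothetical field)
    tokens: int         # precomputed token count of the summary (hypothetical)


def trim_to_budget(captures: list[Capture], available_tokens: int) -> list[Capture]:
    """Keep the newest captures whose summaries fit in available_tokens."""
    kept: list[Capture] = []
    used = 0
    for capture in sorted(captures, key=lambda c: c.created_at, reverse=True):
        if used + capture.tokens > available_tokens:
            break  # everything older is dropped and invisible to the LLM
        kept.append(capture)
        used += capture.tokens
    return kept
```

The `break` makes the cost of trimming concrete: once the budget is spent, older captures never reach the model, even if the question is about them.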

Three retrieval paths, decided server-side per request:

User question
     │
     ├─ attachedIds? ─ yes ─▶  FOCUSED
     │                         fetch selected only
     │
     ├─ ≤ 200 captures? ─ yes ─▶  LONG CONTEXT
     │                             fetch all, inject as text
     │
     ├─ Pro plan? ─ no ─▶  LONG CONTEXT + TRIM
     │                     fetch all, trim to budget
     │
     ├─ > 20% without embeddings?
     │   ├─ yes ─▶  LONG CONTEXT  (fallback)
     │   │
     │   └─ no ─▶  SEMANTIC SEARCH
     │              embed query → pgvector top-40
     │              + 10 most recent → dedupe → LLM

The embedding coverage check catches legacy accounts where old captures were saved before the embedding feature existed. If more than 20% lack embeddings, semantic search would silently miss relevant content, so it falls back to Long Context and logs the reason as a visible step.
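
The decision tree above can be sketched as a single routing function. This is a minimal illustration, not the actual server code: the `Request` fields, the `Path` enum, and the reason strings are hypothetical, while the thresholds (200 captures, 20% missing embeddings) come from the diagram.

```python
from dataclasses import dataclass, field
from enum import Enum


class Path(Enum):
    FOCUSED = "focused"
    LONG_CONTEXT = "long_context"
    LONG_CONTEXT_TRIM = "long_context_trim"
    SEMANTIC_SEARCH = "semantic_search"


@dataclass
class Request:
    capture_count: int            # total captures in the user's library
    missing_embeddings: int       # captures whose embedding is NULL
    is_pro: bool
    attached_ids: list[str] = field(default_factory=list)


LONG_CONTEXT_MAX = 200        # captures
MAX_MISSING_RATIO = 0.20      # embedding coverage fallback


def choose_path(req: Request) -> tuple[Path, str]:
    """Return the retrieval path plus a reason string for logging."""
    if req.attached_ids:
        return Path.FOCUSED, "user attached specific captures"
    if req.capture_count <= LONG_CONTEXT_MAX:
        return Path.LONG_CONTEXT, "library fits in context"
    if not req.is_pro:
        return Path.LONG_CONTEXT_TRIM, "non-Pro plan, trimming to budget"
    # capture_count is above 200 here, so the division is safe.
    if req.missing_embeddings / req.capture_count > MAX_MISSING_RATIO:
        return Path.LONG_CONTEXT, "fallback: more than 20% lack embeddings"
    return Path.SEMANTIC_SEARCH, "embed query, top-40 plus 10 recent"
```

Returning the reason alongside the path is one way to support the visible logging step; the checks run in the same order as the branches in the diagram.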

Top-40 semantic results — At ~250 tok/capture, 40 captures consume ~10K tokens, well within Pro's 150K budget. Higher risks diluting relevance; lower risks missing broad queries.

10 most recent captures — Vector search has a temporal blind spot. “What did I save this week?” needs recency, not similarity. The trade-off: specific queries get a few irrelevant items mixed in. Acceptable because temporal queries are common in a personal library context.
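
A sketch of the merge step, assuming semantic hits are listed first and recent captures are appended only if not already present. The source specifies the dedupe but not the ordering, so that part is an assumption. The SQL string is also illustrative: the `captures` table and its columns are hypothetical, and `<=>` is pgvector's cosine-distance operator, which may not be the metric actually used.

```python
# Hypothetical query shape; table and column names are assumptions.
TOP_K_SQL = """
SELECT id
FROM captures
WHERE user_id = %s AND embedding IS NOT NULL
ORDER BY embedding <=> %s
LIMIT 40
"""


def merge_results(semantic_ids: list[str], recent_ids: list[str],
                  top_k: int = 40, recent_n: int = 10) -> list[str]:
    """Combine semantic hits with recent captures, dropping duplicates.

    Order is preserved: semantic hits first (best match first), then any
    recent captures that were not already among them.
    """
    merged: list[str] = []
    seen: set[str] = set()
    for capture_id in semantic_ids[:top_k] + recent_ids[:recent_n]:
        if capture_id not in seen:
            seen.add(capture_id)
            merged.append(capture_id)
    return merged
```

The merged list holds at most 50 captures, and fewer whenever a recent capture is also a semantic hit.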

                  Long Context   Semantic Search
Input tokens      ~30K           ~15K
Cost / request    ~$0.013        ~$0.007
Pro 200 req/mo    ~$2.60         ~$1.40

Based on GPT-4.1-mini at $0.40 / 1M input, $1.60 / 1M output. Semantic Search input = ~10K (40 captures) + ~3K (10 recent) + ~2K (system prompt + history).
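
The cost figures can be approximately reproduced with the sketch below. The source does not state the output length per request; the ~600 output tokens used here is an assumption, chosen because it roughly matches the quoted per-request costs.

```python
INPUT_PRICE_PER_M = 0.40    # USD per 1M input tokens (GPT-4.1-mini, as quoted)
OUTPUT_PRICE_PER_M = 1.60   # USD per 1M output tokens (as quoted)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted prices."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


ASSUMED_OUTPUT_TOKENS = 600  # assumption: not stated in the source

long_context = request_cost(30_000, ASSUMED_OUTPUT_TOKENS)   # about $0.013
semantic = request_cost(15_000, ASSUMED_OUTPUT_TOKENS)       # about $0.007

monthly_long_context = long_context * 200   # about $2.60 at 200 req/mo
monthly_semantic = semantic * 200           # about $1.40 at 200 req/mo
```

Under this assumption input tokens account for most of each request's cost, so halving the input from ~30K to ~15K roughly halves the bill.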

  • No hybrid search — Captures have structured fields (entity_type, category, tags). Pre-filtering by metadata before vector search would outperform pure similarity for categorical queries.
  • No reranking — Top-40 goes straight to the LLM. A cross-encoder cutting 40 to 10 would improve both relevance and cost at scale.
  • No HyDE — Short queries produce noisy embeddings. Generating a hypothetical answer first would improve recall for vague questions.
  • Threshold 0.15 is undertested — Set permissively without analyzing actual similarity distributions. Nearly everything passes.
  • No embedding backfill — Pre-existing captures have NULL embeddings. The 20% fallback handles this, but a background job would eliminate the need.