Stash AI Chat is purpose-built for one thing: helping users explore and understand the content they've saved — their personal “knowledge collage.” I wanted every conversation to stay within that collage. If you saved 200 articles about AI, you should be able to ask “summarize what I saved about GPT this week” and get a great answer. But you shouldn't be able to ask “write me a Python script” or “explain quantum computing” — that's not what this tool is for.

The catch: Stash pays for every LLM call. Not the user. That's a deliberate product decision — consistent quality, no API key setup friction. But it means anyone who treats the chat as a free ChatGPT replacement is spending my money on something that generates zero product value.

The specific threats I needed to block:

  • Off-topic abuse — Using AI Chat for general questions unrelated to saved captures. Every off-topic response costs ~$0.013 and undermines the product's purpose.
  • Script attacks — A loop hitting the API 100 times per minute racks up $78/hour in LLM costs before anyone notices.
  • Context window bloat — A 200-turn conversation where history alone consumes 40K+ tokens, making each response increasingly expensive.
  • Quota burn — A Pro user burning all 200 monthly requests in a 10-minute burst, then complaining about being locked out.
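
The script-attack figure is simple arithmetic. A minimal sketch, assuming every call costs the ~$0.013 upper bound quoted above; `hourlyCostUsd` and its `hourlyCap` parameter are illustrative names, not part of the Stash codebase.

```typescript
// Assumption: every call costs the upper-bound figure from the text.
const COST_PER_CALL_USD = 0.013;

// Hourly LLM spend for a client sending `requestsPerMinute`.
// `hourlyCap` models a per-hour rate limit; Infinity means no limit at all.
function hourlyCostUsd(requestsPerMinute: number, hourlyCap: number = Infinity): number {
  const requests = Math.min(requestsPerMinute * 60, hourlyCap);
  return requests * COST_PER_CALL_USD;
}

console.log(hourlyCostUsd(100).toFixed(2));     // "78.00": unthrottled script
console.log(hourlyCostUsd(100, 60).toFixed(2)); // "0.78": same script under a 60/hour cap
```

The second call shows why a hard hourly cap matters: the attacker's loop runs just as fast, but the spend stops growing once the cap is reached.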

The constraint: block all of this without making legitimate use feel restricted. A user asking 5 genuine questions about their captures should never feel friction.

Each layer runs sequentially in the Edge Function. If any layer rejects, the request never reaches the LLM, so no LLM cost is incurred. Every pre-check takes a few milliseconds at most, far less than the LLM call it guards:

Request arrives
    │
    ├─ Layer 1: Rate limit ──── reject 429
    │   (DB count query, ~5ms)
    │
    ├─ Layer 2: Quota ──── reject 403
    │   (read ai_chat_usage row, ~3ms)
    │
    ├─ Layer 3: Per-request ──── reject 400
    │   (parse body + token estimate, ~1ms)
    │
    ├─ Layer 4: Scope ──── LLM refuses off-topic
    │   (system prompt, $0 pre-check)
    │
    └─ ✓ Call LLM (~$0.007–0.013)
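
The flow in the diagram can be sketched as a list of checks that short-circuits on the first rejection. This is an illustration only: the `ChatRequest`, `Verdict`, and `Layer` types and the stub layers are assumptions, and the real checks read from the database rather than returning fixed values. Layer 4 is not in the list because it lives in the system prompt, not in a pre-check.

```typescript
interface ChatRequest { userId: string; message: string; }

type Verdict = { ok: true } | { ok: false; status: number; reason: string };
type Layer = (req: ChatRequest) => Verdict;

// Run the layers in order; the first rejection wins and later layers never run.
function runLayers(req: ChatRequest, layers: Layer[]): Verdict {
  for (const layer of layers) {
    const verdict = layer(req);
    if (!verdict.ok) return verdict;
  }
  return { ok: true };
}

// Hypothetical stubs standing in for the real checks.
const rateLimit: Layer = () => ({ ok: true });
const quota: Layer = () => ({ ok: false, status: 403, reason: "quota exhausted" });
const perRequest: Layer = (req) =>
  req.message.length > 0 ? { ok: true } : { ok: false, status: 400, reason: "empty message" };

const verdict = runLayers(
  { userId: "u1", message: "What did I save about GPT this week?" },
  [rateLimit, quota, perRequest],
);
console.log(verdict); // rejected by the quota stub with status 403
```

Only a request that passes every layer goes on to the LLM call, so a rejection at any point costs nothing beyond the check itself.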

Layer 1 checks three sliding windows against the ai_usage_log table:

Window       Free   Basic   Pro
Per minute   5      10      10
Per hour     15     60      60
Per day      15     100     200

Each window runs a SELECT count(*) with a created_at ≥ now() - interval filter. Three queries per request, but they hit an indexed column and return in ~5ms total. The per-minute window catches script attacks; per-hour and per-day prevent sustained abuse.
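
The window logic can be sketched in memory. The production version runs count(*) queries against ai_usage_log; here an array of timestamps stands in for the table, and the limits come from the table above. The function name and the rule "reject when the count has already reached the limit" are assumptions.

```typescript
type Plan = "free" | "basic" | "pro";

const WINDOWS_MS = { minute: 60 * 1000, hour: 60 * 60 * 1000, day: 24 * 60 * 60 * 1000 };
type WindowName = keyof typeof WINDOWS_MS;

const LIMITS: Record<Plan, Record<WindowName, number>> = {
  free:  { minute: 5,  hour: 15, day: 15 },
  basic: { minute: 10, hour: 60, day: 100 },
  pro:   { minute: 10, hour: 60, day: 200 },
};

// Returns the first window whose limit is already used up, or null if the
// request may proceed. `timestamps` are epoch milliseconds of past requests.
function exceededWindow(timestamps: number[], plan: Plan, now: number): WindowName | null {
  for (const name of Object.keys(WINDOWS_MS) as WindowName[]) {
    const since = now - WINDOWS_MS[name];
    // Equivalent of: SELECT count(*) ... WHERE created_at >= now() - interval
    const count = timestamps.filter((t) => t >= since && t <= now).length;
    if (count >= LIMITS[plan][name]) return name;
  }
  return null;
}

const now = Date.now();
const burst = [1, 2, 3, 4, 5].map((i) => now - i * 1000); // 5 requests in 5 seconds
console.log(exceededWindow(burst, "free", now));  // "minute"
console.log(exceededWindow(burst, "basic", now)); // null
```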

Layer 2 enforces monthly or lifetime usage caps via a single counter row in ai_chat_usage:

  • Free: 3 lifetime (not monthly; no reset, ever)
  • Basic: 15 / month
  • Pro: 200 / month

The counter increments after a successful LLM response, not before. If the LLM call fails (timeout, API error), the user doesn't lose a credit. This matters for trust — nothing is more frustrating than being charged for a failed request.
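
A minimal sketch of the increment-after-success rule. Everything here is hypothetical: the `UsageRow` shape, the `chatWithQuota` name, and the 502 status for a failed LLM call. The LLM call is modelled as a synchronous function to keep the example short; a real call would be asynchronous.

```typescript
interface UsageRow { used: number; limit: number; }
interface ChatResult { status: number; body: string; }

function chatWithQuota(usage: UsageRow, callLlm: () => string): ChatResult {
  // Reject before spending anything if the quota is already exhausted.
  if (usage.used >= usage.limit) {
    return { status: 403, body: "quota exhausted" };
  }
  let answer: string;
  try {
    answer = callLlm();
  } catch (_err) {
    // Failed call: return early WITHOUT touching the counter.
    return { status: 502, body: "LLM call failed; no credit used" };
  }
  usage.used += 1; // count only successful responses
  return { status: 200, body: answer };
}

const row: UsageRow = { used: 0, limit: 3 };
chatWithQuota(row, () => { throw new Error("timeout"); });
console.log(row.used); // 0: the failed call was free
chatWithQuota(row, () => "Here is what you saved about GPT...");
console.log(row.used); // 1
```

One trade-off of checking first and incrementing later is that two concurrent requests can both pass the check before either increments. The sketch ignores that case.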

Layer 3 applies two per-request checks that prevent any single conversation from consuming disproportionate resources:

  • Turn count: Max 5 / 10 / 50 turns per conversation (Free/Basic/Pro). After the limit, the user must start a new conversation. This caps context window growth, since each turn adds ~200 tokens of history.
  • Input token budget: 30K / 30K / 150K tokens (Free/Basic/Pro). Estimated before the call using a simple heuristic: text length ÷ 4 for English, ÷ 2 for Korean. If the total (system prompt + conversation history + capture context) exceeds the budget, the request is rejected with a “message too long” error.
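
Both checks can be sketched as follows. The limits are the ones listed above. How the heuristic treats mixed text is an assumption: this version counts Hangul characters at ÷ 2 and everything else at ÷ 4. The function names and the "turn limit reached" message are illustrative.

```typescript
type Plan = "free" | "basic" | "pro";

const MAX_TURNS: Record<Plan, number> = { free: 5, basic: 10, pro: 50 };
const TOKEN_BUDGET: Record<Plan, number> = { free: 30000, basic: 30000, pro: 150000 };

// Rough estimate: Hangul characters / 2, all other characters / 4.
function estimateTokens(text: string): number {
  const hangul = (text.match(/[\u1100-\u11FF\u3130-\u318F\uAC00-\uD7A3]/g) ?? []).length;
  const other = text.length - hangul;
  return Math.ceil(hangul / 2 + other / 4);
}

// `turnCount` is the number of turns already completed in this conversation.
// `parts` holds the system prompt, the conversation history, and the capture context.
function checkRequest(plan: Plan, turnCount: number, parts: string[]): string | null {
  if (turnCount >= MAX_TURNS[plan]) return "turn limit reached";
  const total = parts.reduce((sum, part) => sum + estimateTokens(part), 0);
  if (total > TOKEN_BUDGET[plan]) return "message too long";
  return null; // request may proceed
}

console.log(estimateTokens("a".repeat(400))); // 100
console.log(estimateTokens("안녕하세요"));      // 3
console.log(checkRequest("free", 5, ["hi"])); // "turn limit reached"
```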

Layer 4 is the system prompt, which constrains the LLM to answer only questions about the user's captures. This is the cheapest layer ($0 pre-check) but the hardest to get right.

The first version was too aggressive — it refused legitimate questions like “What happened with OpenAI?” because it looked like a general knowledge question. But the user had OpenAI-related articles saved. The fix: the prompt now says “always assume the user is asking about their captures, even if the question sounds general” and only refuses clearly impossible requests (code writing, homework, creative writing).

This is inherently fuzzy — LLM-based scope enforcement will always have edge cases. A determined user can jailbreak it. But combined with Layers 1–3, the worst case is a few off-topic responses within a capped quota — annoying but not financially dangerous.
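
To make the scope rule concrete, here is a sketch of how such a prompt might be assembled. Only the quoted "always assume" sentence and the three refusal categories come from the description above. The rest of the wording, the function names, and the `{ role, content }` message shape are assumptions, not the actual Stash prompt.

```typescript
interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }

// Hypothetical prompt text built around the rule quoted in the article.
function buildSystemPrompt(captureContext: string): string {
  return [
    "You answer questions about the user's saved captures, provided below.",
    "Always assume the user is asking about their captures, even if the question sounds general.",
    "Refuse only clearly impossible requests: code writing, homework, creative writing.",
    "",
    "Captures:",
    captureContext,
  ].join("\n");
}

// The system prompt goes first, then prior turns, then the new question.
function buildMessages(captureContext: string, history: ChatMessage[], question: string): ChatMessage[] {
  return [
    { role: "system", content: buildSystemPrompt(captureContext) },
    ...history,
    { role: "user", content: question },
  ];
}

const messages = buildMessages("- Article: OpenAI announcement (saved Tuesday)", [], "What happened with OpenAI?");
console.log(messages.length); // 2
```

Putting the capture context in the same prompt as the rule is what lets a general-sounding question like the OpenAI one be answered from saved material instead of being refused.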

Worst-case monthly cost per user without any defense: unlimited requests × $0.013 = unbounded. With all 4 layers active:

Plan    Max requests    Max cost/month
Free    3 (lifetime)    $0.04
Basic   15              $0.20
Pro     200             $1.40–$2.60

Every cost is bounded and predictable. No user can generate a surprise bill regardless of how they use the feature.
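
The bounds in the table are the request cap multiplied by the per-call cost. A short sketch, assuming the $0.007 to $0.013 range from the diagram applies to every plan; the table shows only the upper bound for Free and Basic. `maxCostUsd` is an illustrative name.

```typescript
type Plan = "free" | "basic" | "pro";

const MAX_REQUESTS: Record<Plan, number> = { free: 3, basic: 15, pro: 200 };
const COST_RANGE_USD = { low: 0.007, high: 0.013 }; // per LLM call

// Worst-case spend for one user: request cap times cost per call.
function maxCostUsd(plan: Plan): { low: number; high: number } {
  const requests = MAX_REQUESTS[plan];
  return { low: requests * COST_RANGE_USD.low, high: requests * COST_RANGE_USD.high };
}

for (const plan of ["free", "basic", "pro"] as Plan[]) {
  const { low, high } = maxCostUsd(plan);
  console.log(`${plan}: $${low.toFixed(3)} to $${high.toFixed(3)}`);
}
```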