
Case Study: Building an AI Interview Prep Tool (RAG in Production)

Agensphere Team · January 11, 2025 · 8 min read

Last quarter, we partnered with a Y Combinator company building an AI-powered interview prep platform. Their pitch: "LeetCode, but with an AI tutor that gives you personalized hints."

The challenge wasn't building a chatbot—it was building a system that actually helps users learn.

Here's what we built, what we learned, and the metrics that mattered.

The Problem: Generic Hints Don't Help

Initial user feedback (before AI):

  • "The hints are either too obvious or too cryptic."
  • "I wish the platform knew which topics I struggle with."
  • "I'd pay more if hints referenced my past mistakes."

The opportunity: Most interview prep platforms treat every user the same. But learning is personal. A hint that helps a beginner might insult an expert, and vice versa.

The vision: Build an AI that understands:

  • What the user already knows (based on past problems solved)
  • Where the user struggles (topics with low success rates)
  • How to hint without spoiling the solution

The Architecture: RAG + Personalization

We designed a Retrieval-Augmented Generation (RAG) system with three layers:

Layer One: Problem Knowledge Base

  • Content: 1,500 coding problems with solutions, hints, and difficulty ratings
  • Structure: Each problem chunked into:
    • Problem description (with examples)
    • Conceptual hints (no code)
    • Approach hints (algorithm strategy)
    • Code-level hints (specific implementation details)
    • Full solution with explanation
  • Embedding model: text-embedding-3-small (low cost, high quality)
  • Vector database: Pinecone (easy scaling)
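
To make the chunking structure above concrete, here's a minimal sketch of how a single problem chunk might be embedded and upserted, assuming the OpenAI and Pinecone Node SDKs. The index name, chunk types, and helper function are illustrative, not the production pipeline.

  import OpenAI from "openai";
  import { Pinecone } from "@pinecone-database/pinecone";

  const openai = new OpenAI();
  const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
  const index = pinecone.index("problems"); // hypothetical index name

  // One record per chunk, so retrieval can target a specific hint granularity.
  type ChunkType = "description" | "conceptual_hint" | "approach_hint" | "code_hint" | "solution";

  async function indexProblemChunk(
    problemId: string,
    type: ChunkType,
    text: string,
    topic: string,
    difficulty: number,
  ) {
    const { data } = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: text,
    });
    await index.upsert([
      {
        id: `${problemId}:${type}`,
        values: data[0].embedding,
        metadata: { problemId, type, topic, difficulty, text },
      },
    ]);
  }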

Layer Two: User Context Database

  • Content: User-specific data
    • Past problems solved (with timestamps and time taken)
    • Topics mastered vs. struggling
    • Hint request patterns (do they ask early or late?)
  • Storage: Postgres with pg_vector extension (structured + semantic search)
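
A rough sketch of the user-context storage, assuming Postgres with the pgvector extension (installed as "vector"). Table and column names are illustrative; the per-user embedding is the pre-computed summary of recent activity mentioned later under latency.

  import { Pool } from "pg";

  const pool = new Pool({ connectionString: process.env.DATABASE_URL });

  // Illustrative schema: attempts feed the proficiency stats, and a per-user
  // embedding (updated after each solved problem) supports semantic matching.
  async function migrate() {
    await pool.query(`
      CREATE EXTENSION IF NOT EXISTS vector;

      CREATE TABLE IF NOT EXISTS attempts (
        user_id      uuid        NOT NULL,
        problem_id   text        NOT NULL,
        topic        text        NOT NULL,
        solved       boolean     NOT NULL,
        time_taken_s integer,
        hints_used   integer     DEFAULT 0,
        created_at   timestamptz DEFAULT now()
      );

      CREATE TABLE IF NOT EXISTS user_profiles (
        user_id          uuid PRIMARY KEY,
        recent_embedding vector(1536)  -- text-embedding-3-small dimension
      );
    `);
  }

  migrate().catch(console.error);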

Layer Three: Intelligent Hint Generation

When a user requests a hint:

Step One: Retrieve relevant context

  • Semantic search: Find similar problems the user has solved
  • Filter by topic: Pull problems in the same category (e.g., "Dynamic Programming")
  • Rank by recency: Weight recent problems higher
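
In practice the retrieval step can be pushed into a single pgvector query. The sketch below assumes problem embeddings are mirrored into a problem_embeddings table in Postgres; the cosine-distance operator (<=>) is real pgvector syntax, but the table names, 30-day decay constant, and scoring formula are simplifications.

  import { Pool } from "pg";

  const pool = new Pool({ connectionString: process.env.DATABASE_URL });

  // Find the user's most relevant solved problems: same topic, semantically close
  // to the current problem, with more recent attempts weighted higher.
  async function retrieveUserContext(
    userId: string,
    topic: string,
    problemEmbedding: number[],
    k = 5,
  ) {
    const { rows } = await pool.query(
      `SELECT a.problem_id,
              a.time_taken_s,
              1 - (p.embedding <=> $3::vector) AS similarity,
              (1 - (p.embedding <=> $3::vector))
                * exp(-extract(epoch FROM now() - a.created_at) / 2592000) AS score
         FROM attempts a
         JOIN problem_embeddings p ON p.problem_id = a.problem_id
        WHERE a.user_id = $1 AND a.topic = $2 AND a.solved
        ORDER BY score DESC
        LIMIT $4`,
      [userId, topic, `[${problemEmbedding.join(",")}]`, k],
    );
    return rows;
  }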

Step Two: Analyze user proficiency

  • If user solved 8 out of 10 DP problems → Advanced hints
  • If user solved 2 out of 10 DP problems → Beginner-friendly hints
  • If user repeatedly asks for hints on similar problems → Flag concept gap
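
The proficiency analysis itself is plain application code, not AI. A minimal sketch, with thresholds that mirror the 2-of-10 and 8-of-10 examples above (the exact cut-offs are illustrative):

  type HintLevel = "beginner" | "intermediate" | "advanced";

  function proficiencyFor(solved: number, attempted: number): HintLevel {
    if (attempted < 3) return "beginner"; // not enough signal yet (cold start)
    const successRate = solved / attempted;
    if (successRate >= 0.8) return "advanced";
    if (successRate <= 0.3) return "beginner";
    return "intermediate";
  }

  // Repeated hint requests across similar problems suggest a concept gap worth flagging.
  function hasConceptGap(hintRequests: number, attempts: number): boolean {
    return attempts >= 3 && hintRequests / attempts > 1.5;
  }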

Step Three: Generate personalized hint

  • Prompt GPT-4 with:
    • Current problem
    • Similar problems user solved (with success metrics)
    • User proficiency level
    • Instruction: "Give a hint that bridges their existing knowledge to this problem"
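
A rough sketch of that prompt assembly, assuming the OpenAI Node SDK; the wording and the shape of the solved-problem records are illustrative, not the production prompt.

  import OpenAI from "openai";

  const openai = new OpenAI();

  interface SolvedExample {
    id: string;
    minutes: number;
    summary: string;
  }

  async function generateHint(problem: string, level: string, solved: SolvedExample[]) {
    const context = solved
      .map((p) => `Problem ${p.id} (solved in ${p.minutes} min): ${p.summary}`)
      .join("\n");

    // Streamed so the UI can render partial hints while generation continues.
    return openai.chat.completions.create({
      model: "gpt-4",
      stream: true,
      messages: [
        {
          role: "system",
          content:
            `You are a coding interview tutor. The user is at ${level} level. ` +
            `Give a hint that bridges their existing knowledge to this problem. ` +
            `Never reveal the full solution.`,
        },
        {
          role: "user",
          content: `Current problem:\n${problem}\n\nProblems this user already solved:\n${context}`,
        },
      ],
    });
  }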

Step Four: Citation and transparency

  • Every hint includes: "This is similar to Problem 47 (which you solved in 12 minutes)..."
  • Users can click to review past solutions
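
The citation itself is just formatting over the retrieved metadata. A tiny sketch (the route path is hypothetical):

  interface CitedExample {
    problemId: string;
    title: string;
    minutes: number;
  }

  function citationLine(ex: CitedExample): string {
    // Link back to the user's own past solution so the hint is verifiable.
    return (
      `This is similar to ${ex.title} (which you solved in ${ex.minutes} minutes). ` +
      `Review your solution at /problems/${ex.problemId}/solution.` // hypothetical route
    );
  }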

The Tech Stack

Backend:

  • Next.js 14 App Router (API routes for hint generation)
  • TypeScript for type safety
  • Postgres with pg_vector (user data + vector search)
  • Pinecone (problem knowledge base)

AI Layer:

  • OpenAI text-embedding-3-small (embeddings)
  • GPT-4 (hint generation)
  • LangChain for orchestration
  • LangSmith for monitoring (essential for debugging RAG)

Frontend:

  • React with Monaco Editor (code editor)
  • Real-time hint rendering (streamed responses)
  • Progress dashboards (topics mastered, success rates)

Hosting:

  • Vercel (frontend and API)
  • Supabase (Postgres hosting)
  • Pinecone (managed vector DB)

Key Metrics: What Actually Worked

Before AI (Static Hints)

  • Hint usage: 35% of users requested hints
  • Satisfaction: 2.8 out of 5 stars for hint quality
  • Problem completion rate: 58%
  • Churn rate: 42% after first month

After AI (Personalized RAG Hints)

  • Hint usage: 67% of users requested hints (up 32 points)
  • Satisfaction: 4.3 out of 5 stars
  • Problem completion rate: 74% (16-point improvement)
  • Churn rate: 28% (14-point improvement)

ROI:

  • Development cost: $12K (4 weeks of engineering)
  • Monthly operating cost: $380 (LLM API calls + Pinecone)
  • Retained users: 140 additional users per month at $30 per month subscription
  • Additional revenue: $4,200 per month
  • Payback period: 3 months

Cost Breakdown (Per 1,000 Hint Requests)

  • Embedding queries (user query + context retrieval): $0.04
  • GPT-4 API calls (hint generation): $8.50
  • Vector database queries: $0.30
  • Total: $8.84 per 1,000 hints

At average usage of 4.2 hints per user per month:

  • 1,000 active users = 4,200 hints per month
  • Cost: $37 per month for 1,000 users
  • Revenue: $30,000 per month (1,000 users at $30 per month)
  • AI cost as percentage of revenue: 0.12%

Compare to alternatives:

  • Human tutors: $15,000+ per month for the same volume

What We Learned: Five Key Takeaways

First: Retrieval Quality Over Model Size

We tested GPT-4 vs. GPT-3.5 with identical retrieval logic.

Result: GPT-3.5 with great retrieval outperformed GPT-4 with poor retrieval.

Lesson: Spend 80% of effort on retrieval logic, 20% on prompt engineering.

Second: User Context is Everything

Early version: Generic hints based only on the current problem.

User feedback: "This feels like reading a textbook."

Fix: Personalized hints referencing past problems solved.

Result: Engagement jumped 40%, and hints were rated 2× more helpful.

Third: Chunk Overlap Matters

First attempt: Clean chunks with no overlap (problem description, hints, solution).

Problem: Retrieval missed context at chunk boundaries.

Fix: 20% overlap between chunks (the last 50 tokens of one chunk are repeated as the first 50 of the next).

Result: Retrieval accuracy improved 25%.
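
For reference, the overlap logic is only a few lines. A minimal sketch that splits on words rather than model tokens (a real pipeline would use a tokenizer; the 250/50 sizes are illustrative):

  // Sliding-window chunking: each chunk shares `overlap` words with the next,
  // so context at chunk boundaries is never lost to retrieval.
  function chunkWithOverlap(words: string[], chunkSize = 250, overlap = 50): string[] {
    const chunks: string[] = [];
    for (let start = 0; start < words.length; start += chunkSize - overlap) {
      chunks.push(words.slice(start, start + chunkSize).join(" "));
      if (start + chunkSize >= words.length) break; // last chunk reached
    }
    return chunks;
  }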

Fourth: Citations Build Trust

Users didn't trust generic hints. But when we added citations:

"This is similar to problem 23 (which you solved in 8 minutes)..."

Engagement increased. Why? Transparency. Users could verify the hint made sense.

Fifth: Monitor Retrieval, Not Just Generation

We built dashboards tracking:

  • Which problems had low retrieval scores (fix chunking)
  • Which concepts were never retrieved (improve vectorization)
  • Which users got stuck despite hints (flag for human review)

Surprising finding: 15% of "bad hints" were actually bad retrievals, not bad prompts.
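
In production we leaned on LangSmith for this, but the core idea is simple enough to sketch standalone: log every retrieval with its top similarity score and flag the low ones for chunking or embedding review rather than prompt tweaks. Thresholds and field names below are illustrative.

  interface RetrievalLog {
    userId: string;
    problemId: string;
    retrievedChunkIds: string[];
    topScore: number;             // best similarity score in this retrieval
    hintAccepted: boolean | null; // filled in later from user feedback
  }

  const LOW_SCORE_THRESHOLD = 0.55; // illustrative cut-off

  async function logRetrieval(log: RetrievalLog, save: (l: RetrievalLog) => Promise<void>) {
    await save(log);
    if (log.topScore < LOW_SCORE_THRESHOLD) {
      // These are the "bad retrieval, not bad prompt" cases worth reviewing.
      console.warn(`Low retrieval score for problem ${log.problemId}: ${log.topScore.toFixed(2)}`);
    }
  }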

What's Next: From MVP to Platform

The system is live and processing 12K+ hint requests per month. Here's the roadmap:

Phase Two: Multi-Modal Hints (Q2 2025)

  • Generate diagrams (e.g., tree structures for recursion problems)
  • Visualize algorithm steps
  • Support whiteboard-style explanations

Phase Three: Adaptive Difficulty (Q3 2025)

  • Suggest next problem based on mastery level
  • Auto-adjust problem difficulty if user is stuck
  • Create personalized learning paths

Phase Four: Peer Comparison (Q4 2025)

  • "Users who solved this problem also struggled with..."
  • Community-driven hint improvements
  • Collaborative learning features

The Hard Parts (What We'd Do Differently)

Challenge One: Cold Start Problem

New users have no history, so hints were generic at first.

Fix: Use an onboarding quiz to gauge proficiency level and seed initial context.

Challenge Two: Overfitting to User History

Some users got stuck in a "comfort zone" (only saw hints related to easy problems they'd solved).

Fix: Inject diverse retrieval results (10% from unfamiliar topics).
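
A minimal sketch of that injection step; the 10% ratio comes from the fix above, everything else (names, at-least-one rule) is illustrative:

  // Keep most results from normal retrieval and swap in at least one chunk from
  // topics the user hasn't touched, so hints don't overfit to their comfort zone.
  function diversify<T>(ranked: T[], unfamiliar: T[], k = 5, exploreRatio = 0.1): T[] {
    const exploreCount = Math.max(1, Math.round(k * exploreRatio));
    return [...ranked.slice(0, k - exploreCount), ...unfamiliar.slice(0, exploreCount)];
  }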

Challenge Three: Latency

The first version took 4-6 seconds to generate a hint (poor UX).

Fix:

  • Pre-compute user embeddings (updated after each problem)
  • Cache common hint patterns
  • Stream responses (show partial hints while generating)

Result: Latency dropped to 1.5-2 seconds.
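
The streaming piece of that fix is worth showing. A minimal sketch of a Next.js App Router route handler that forwards GPT-4 tokens to the client as they arrive, assuming the OpenAI Node SDK; the request shape is illustrative.

  import OpenAI from "openai";

  const openai = new OpenAI();

  export async function POST(req: Request) {
    const { prompt } = await req.json(); // illustrative request shape

    const completion = await openai.chat.completions.create({
      model: "gpt-4",
      stream: true,
      messages: [{ role: "user", content: prompt }],
    });

    // Re-emit tokens as they arrive so the user sees a partial hint immediately.
    const encoder = new TextEncoder();
    const stream = new ReadableStream({
      async start(controller) {
        for await (const chunk of completion) {
          controller.enqueue(encoder.encode(chunk.choices[0]?.delta?.content ?? ""));
        }
        controller.close();
      },
    });

    return new Response(stream, {
      headers: { "Content-Type": "text/plain; charset=utf-8" },
    });
  }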

Challenge Four: Cost Spikes

Early testing: costs hit $400 for the first 500 users (unsustainable).

Fix:

  • Switched from GPT-4 to GPT-3.5 Turbo for most hints (quality difference minimal)
  • Added caching for repeated queries
  • Rate-limited hint requests (max 10 per problem)
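
A rough sketch of the caching and rate-limiting combination, with in-memory maps standing in for whatever store is actually used; the keys, limits, and the idea of caching by problem, proficiency level, and hint number are assumptions rather than the production design.

  const hintCache = new Map<string, string>();
  const hintCounts = new Map<string, number>();
  const MAX_HINTS_PER_PROBLEM = 10;

  async function getHint(
    userId: string,
    problemId: string,
    level: string,
    generate: () => Promise<string>,
  ): Promise<string> {
    const countKey = `${userId}:${problemId}`;
    const used = hintCounts.get(countKey) ?? 0;
    if (used >= MAX_HINTS_PER_PROBLEM) {
      throw new Error("Hint limit reached for this problem");
    }
    hintCounts.set(countKey, used + 1);

    // Users at the same level asking for the same hint number can share a response.
    const cacheKey = `${problemId}:${level}:${used}`;
    const cached = hintCache.get(cacheKey);
    if (cached) return cached;

    const hint = await generate();
    hintCache.set(cacheKey, hint);
    return hint;
  }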

Should You Build Something Similar?

Yes, if:

  • You have structured knowledge (problems, lessons, documentation)
  • Personalization drives clear value (education, recommendations, support)
  • You can measure success (completion rates, satisfaction, retention)

No, if:

  • Your knowledge base is constantly changing (RAG requires stable chunking)
  • You can't tolerate 1-2 second latency
  • You don't have usage data to personalize on

Open Questions We're Still Exploring

How much context is too much?

  • We pass top 5 similar problems to GPT-4
  • Tried 10, saw no quality improvement but 2× cost
  • Sweet spot seems to be 3-5 chunks

When should AI refuse to hint?

  • If user asks for hints immediately (less than 30 seconds), should we encourage independent thinking first?
  • If user requests 5+ hints on same problem, should we suggest an easier problem?

How do we prevent hint gaming?

  • Some users request all hints immediately to "learn faster"
  • Considering: unlock hints progressively based on time spent

The Bottom Line

Building production RAG systems is less about AI magic and more about information architecture.

The LLM is the easy part. The hard part is:

  • Chunking your knowledge base correctly
  • Retrieving the right context
  • Personalizing based on user behavior
  • Monitoring what's working (and what's not)

If you're building something similar, spend 80% of your time on retrieval logic. The LLM will handle the rest.


Want to see the system in action? The platform is live at [withheld]. We're also open-sourcing our RAG evaluation framework in Q2 2025.

Building a RAG system for your product? We've learned a lot. Let's chat.

Need help building production-ready AI systems?

From architecture design to production deployment, we build intelligent systems that scale.