🚀 Tech Stack Advisor

An AI-powered multi-agent system that provides intelligent, personalized technology stack recommendations for software projects

🤖 5 Specialized AI Agents ⚡ LangGraph Orchestration 🔒 JWT Authentication ☁️ Production Ready

📋 Project Overview

Tech Stack Advisor is a production-ready web application that leverages modern AI technology to help developers and architects make informed decisions about their technology stack. The system uses five specialized AI agents working in concert to analyze project requirements and deliver comprehensive recommendations covering databases, infrastructure, cost optimization, and security. With intelligent multi-turn conversations, long-term memory powered by Qdrant vector search, and semantic query history, the system delivers personalized, context-aware recommendations.

  • **~3,400** lines of code
  • **5** AI agents
  • **2-4s** response time
  • **$0.0015** cost per query

🎬 Try It Live

🌐 Live Application

Try the production deployment

Launch App →

📖 API Documentation

Interactive Swagger UI

View Docs →

💻 Source Code

Private repository (ranjanarajendran)

View Repository →

No access? Request here

📊 Technical Documentation

Comprehensive docs in repository

View Docs →

No access? Request here

🎯 What Problem Does It Solve?

The Challenge

Choosing the right technology stack is one of the most critical decisions in software development. It requires expertise across databases, infrastructure, cost modeling, and security, plus the time to weigh those trade-offs against scale, budget, and compliance requirements.

The Solution

Tech Stack Advisor automates this complex decision-making process by:

💬 Intelligent Conversations

Engages in multi-turn dialogues to gather project requirements intelligently, asking targeted follow-up questions with structured choices.

🗄️ Database Recommendations

Analyzes data type, scale, consistency requirements, and recommends optimal database solutions with scaling strategies.

☁️ Infrastructure Planning

Suggests cloud providers, architecture patterns, and deployment strategies based on workload characteristics.

💰 Cost Optimization

Provides multi-provider cost comparisons and optimization recommendations to maximize budget efficiency.

🔒 Security Analysis

Performs threat modeling, checks compliance requirements, and recommends security measures.

🏗️ System Architecture

🎨 Modern Web UI

  • HTML/CSS/JavaScript (Vanilla)
  • User Authentication (Local + Google OAuth)
  • Responsive Design
  • Real-time API Integration with JWT
  • Admin Dashboard
  • Download JSON Results
HTTP REST + JWT Auth

⚡ FastAPI Backend (Port 8000)

  • Serves static files (HTML/CSS/JS)
  • POST /recommend - Main recommendation endpoint
  • Authentication endpoints (register/login/OAuth)
  • GET /health - Health monitoring
  • GET /metrics - Usage & cost tracking
  • Rate limiting & JWT authentication
  • Auto-generated Swagger docs

🔄 LangGraph Orchestrator

  • Query Parser (NLP-based context extraction)
  • Sequential agent coordination
  • State management with TypedDict
  • Correlation IDs for tracing
Agents (coordinated by the orchestrator):

  • 💬 Conversation Manager
  • 🗄️ Database Agent
  • ☁️ Infrastructure Agent
  • 💰 Cost Agent
  • 🔒 Security Agent

Supporting services:

  • 📚 Qdrant Vector Store (34 docs)
  • 🤖 Claude AI, Haiku model (LLM)
  • 💵 Pricing Data (real-time)

Architecture Highlights

📐 Detailed Architecture Diagram

```mermaid
flowchart LR
    subgraph Frontend["🎨 Frontend Layer"]
        direction TB
        UI["Web UI<br/>HTML/CSS/JS"]
        Auth["Auth Pages"]
        Admin["Admin Dashboard"]
    end
    subgraph API["⚡ API Gateway"]
        direction TB
        FastAPI["FastAPI :8000<br/>Rate Limit: 50-100/hr<br/>JWT Auth"]
        Middleware["CORS + Rate Limiter"]
    end
    subgraph AuthSys["🔐 Authentication"]
        direction TB
        JWT["JWT Tokens<br/>HS256, 24h"]
        OAuth["Google OAuth 2.0"]
        Sessions["Session Store<br/>30min timeout"]
    end
    subgraph Orchestration["🔄 LangGraph Orchestrator"]
        direction TB
        Parser["Query Parser<br/>NLP Extraction"]
        subgraph Agents["5 Specialized Agents"]
            direction LR
            DB["🗄️ Database"] --> Infra["☁️ Infrastructure"]
            Infra --> Cost["💰 Cost"]
            Cost --> Sec["🔒 Security"]
        end
        Conv["💬 Conversation<br/>Manager"]
        Synth["Result Synthesizer"]
        Parser --> Agents
        Agents --> Synth
    end
    subgraph RAG["📚 RAG System"]
        direction TB
        Embed["Embedding Model<br/>all-MiniLM-L6-v2<br/>384-dim"]
        VectorDB[("Qdrant Vector DB<br/>34 docs")]
        Embed --> VectorDB
    end
    subgraph LLM["🤖 AI Engine"]
        direction TB
        Claude["Claude 3 Haiku<br/>$0.0015/query"]
        TokenTrack["Token Tracking"]
        Claude -.-> TokenTrack
    end
    subgraph Memory["💾 Memory Systems"]
        direction TB
        LongMem[("Long-Term<br/>Qdrant")]
        ShortMem[("Short-Term<br/>In-Memory")]
    end
    subgraph Monitor["📊 Monitoring"]
        direction TB
        Logger["Structured Logs<br/>Correlation IDs"]
        Prom["Prometheus<br/>Metrics"]
    end
    %% Main Flow
    Frontend --> API
    API --> Middleware
    Middleware --> AuthSys
    AuthSys --> Orchestration
    Orchestration --> Conv
    Parser -.-> RAG
    Agents -.-> RAG
    Agents -.-> LLM
    Synth --> Memory
    API -.-> Monitor
    %% Styling
    classDef frontend fill:#667eea,stroke:#333,stroke-width:2px,color:#fff
    classDef api fill:#48bb78,stroke:#333,stroke-width:2px,color:#fff
    classDef auth fill:#f59e0b,stroke:#333,stroke-width:2px,color:#fff
    classDef agent fill:#ed8936,stroke:#333,stroke-width:2px,color:#fff
    classDef storage fill:#4299e1,stroke:#333,stroke-width:2px,color:#fff
    classDef llm fill:#9f7aea,stroke:#333,stroke-width:2px,color:#fff
    classDef monitor fill:#38b2ac,stroke:#333,stroke-width:2px,color:#fff
    class Frontend,UI,Auth,Admin frontend
    class API,FastAPI,Middleware api
    class AuthSys,JWT,OAuth,Sessions auth
    class Orchestration,Parser,Agents,DB,Infra,Cost,Sec,Conv,Synth agent
    class RAG,Embed,VectorDB,Memory,LongMem,ShortMem storage
    class LLM,Claude,TokenTrack llm
    class Monitor,Logger,Prom monitor
```

🧮 Core Algorithms & Techniques

1. Natural Language Processing (NLP) - Context Extraction

Location: backend/src/orchestration/workflow.py

Techniques Used:

  • Regex Pattern Matching: Extracts DAU (Daily Active Users) and budget from natural language queries
  • Keyword-Based Classification: Detects scale ("small", "medium", "large", "enterprise"), workload type ("real-time", "api", "batch"), and data sensitivity
  • Heuristic-Based Estimation: QPS = DAU × 0.015 / 60, Data Volume = DAU / 100 GB

Performance: 1-5ms per query
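As a rough illustration of this extraction approach (a sketch, not the actual `workflow.py` code; the regex patterns and field names here are invented):

```python
import re

_MULT = {"": 1, "k": 1_000, "m": 1_000_000}

def parse_query(query: str) -> dict:
    """Extract DAU and budget from free text, then apply the doc's heuristics."""
    ctx = {}
    # e.g. "100k daily active users", "2M DAU"
    m = re.search(r"(\d+(?:\.\d+)?)\s*([kKmM]?)\s*(?:daily\s+)?(?:active\s+)?(?:users|dau)",
                  query, re.I)
    if m:
        dau = int(float(m.group(1)) * _MULT[m.group(2).lower()])
        ctx["dau"] = dau
        ctx["estimated_qps"] = dau * 0.015 / 60   # QPS heuristic from above
        ctx["estimated_data_gb"] = dau / 100      # data-volume heuristic from above
    # e.g. "$1,500 monthly budget"
    m = re.search(r"\$\s*(\d+(?:,\d{3})*)", query)
    if m:
        ctx["budget_usd"] = int(m.group(1).replace(",", ""))
    return ctx
```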

2. Retrieval-Augmented Generation (RAG) - Semantic Search

Location: backend/src/rag/vectorstore.py

Algorithm: Vector Similarity Search with Cosine Distance

User Query → Text Embedding (384-dim) → Qdrant Vector Search → Cosine Similarity → Metadata Filtering → Top-K Retrieval

Model: sentence-transformers/all-MiniLM-L6-v2

Performance: ~30ms per RAG retrieval (embedding + search)
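At its core the retrieval step is cosine-similarity top-k over document vectors. A dependency-free sketch of that operation (the real system delegates embedding to sentence-transformers and search to Qdrant):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, docs, k=3):
    """docs: list of (doc_id, vector). Returns the k best-matching ids."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```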

3. Database Scale Estimation - Tier-Based Heuristics

Location: backend/src/agents/database.py

Algorithm: Multi-factor tier classification

```
if   DAU ≥ 1M   or QPS ≥ 10K → Enterprise
elif DAU ≥ 100K or QPS ≥ 1K  → Large
elif DAU ≥ 10K  or QPS ≥ 100 → Medium
else                         → Small
```

Outputs: Tier, estimated connections, cache recommendation, sharding strategy, replication approach

Complexity: O(1) - constant time decision tree
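The tier decision tree translates directly to code; a minimal sketch assuming exactly the thresholds above (function name illustrative):

```python
def classify_scale(dau: int, qps: float) -> str:
    """O(1) tier classification from DAU and QPS thresholds."""
    if dau >= 1_000_000 or qps >= 10_000:
        return "enterprise"
    if dau >= 100_000 or qps >= 1_000:
        return "large"
    if dau >= 10_000 or qps >= 100:
        return "medium"
    return "small"
```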

4. Multi-Cloud Cost Estimation - Component-Based Pricing

Location: backend/src/agents/cost.py

Algorithm: Linear cost model with provider-specific multipliers

```
Total Monthly Cost = Compute + Storage + Database + Bandwidth
Provider Multipliers: AWS (1.0x), GCP (1.09x), Azure (1.05x)
```

Complexity: O(1) - lookup table + simple arithmetic
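A sketch of the linear cost model (component costs in USD/month; function and table names are illustrative, the multipliers are the ones listed above):

```python
# Provider multipliers relative to an AWS baseline, as listed above.
PROVIDER_MULTIPLIER = {"aws": 1.00, "gcp": 1.09, "azure": 1.05}

def monthly_cost(compute: float, storage: float, database: float,
                 bandwidth: float, provider: str = "aws") -> float:
    """Linear component model: sum the parts, scale by provider multiplier."""
    base = compute + storage + database + bandwidth
    return round(base * PROVIDER_MULTIPLIER[provider], 2)
```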

5. Security Risk Assessment - Multi-Factor Scoring

Location: backend/src/agents/security.py

Algorithm: Weighted risk scoring with priority matrix

```
Risk Score = Data Sensitivity (40%) + Public Exposure (25%) + Compliance (20%) + Architecture Complexity (15%)
Risk Level: ≥70 = CRITICAL, ≥50 = HIGH, ≥30 = MEDIUM, else LOW
```

Threat Prioritization: CRITICAL (SQL injection, data exposure), HIGH (CSRF, XSS), MEDIUM (DDoS), LOW (info disclosure)

Complexity: O(n) where n = number of compliance requirements (typically < 5)
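The weighted scoring can be sketched as follows (factor values assumed normalized to 0-100; names are illustrative, weights and thresholds are the ones above):

```python
RISK_WEIGHTS = {
    "data_sensitivity": 0.40,
    "public_exposure": 0.25,
    "compliance": 0.20,
    "architecture_complexity": 0.15,
}

def assess_risk(factors: dict) -> tuple:
    """factors: each factor scored 0-100. Returns (weighted score, level)."""
    score = sum(factors.get(name, 0) * w for name, w in RISK_WEIGHTS.items())
    if score >= 70:
        level = "CRITICAL"
    elif score >= 50:
        level = "HIGH"
    elif score >= 30:
        level = "MEDIUM"
    else:
        level = "LOW"
    return score, level
```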

6. Conversation Management - Iterative Context Accumulation

Location: backend/src/agents/conversation.py

Algorithm: Contextual completion tracking with structured questioning

```
Completion = app_type (40%) + dau/scale (40%) + key_features (20%)
Ready when completion ≥ 80%
```

Strategy: Prioritize missing critical fields → Provide structured options → Extract entities → Avoid redundancy

Complexity: O(k) where k = number of conversation turns (typically 2-5)
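A minimal sketch of the completion tracking (field weights taken from the formula above; not the actual `conversation.py` code):

```python
def completion_pct(ctx: dict) -> int:
    """Weighted completion: app_type 40%, dau/scale 40%, key_features 20%."""
    pct = 0
    if ctx.get("app_type"):
        pct += 40
    if ctx.get("dau") or ctx.get("scale"):
        pct += 40
    if ctx.get("key_features"):
        pct += 20
    return pct

def ready_for_recommendation(ctx: dict) -> bool:
    # Recommendation fires once at least 80% of the context is gathered
    return completion_pct(ctx) >= 80
```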

7. LangGraph Sequential Workflow - State Machine

Location: backend/src/orchestration/workflow.py

Pattern: Directed Acyclic Graph (DAG) with shared state

parse_query → database_agent → infrastructure_agent → cost_agent → security_agent → synthesize

State Management: Shared TypedDict state accumulates results from each agent

Complexity: O(n) where n = number of agents (currently 5, sequential execution)
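A dependency-free stand-in for the sequential pipeline (the real system wires these as LangGraph nodes over a TypedDict state; the agent outputs here are placeholders, not real recommendations):

```python
# Each node receives the shared state dict and returns it extended
# with that agent's result, mirroring the accumulating TypedDict state.
def parse_query(state):          return {**state, "context": {"dau": 100_000}}
def database_agent(state):       return {**state, "database": "PostgreSQL"}
def infrastructure_agent(state): return {**state, "infrastructure": "managed containers"}
def cost_agent(state):           return {**state, "cost_usd_month": 475}
def security_agent(state):       return {**state, "security": ["JWT", "TLS", "rate limiting"]}
def synthesize(state):           return {**state, "report_keys": sorted(state)}

PIPELINE = [parse_query, database_agent, infrastructure_agent,
            cost_agent, security_agent, synthesize]

def run_workflow(query: str) -> dict:
    state = {"query": query}
    for node in PIPELINE:  # sequential execution, O(n) in agent count
        state = node(state)
    return state
```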

8. Token & Cost Tracking - Real-Time Metrics

Location: backend/src/core/logging.py

Algorithm: Cumulative metrics with daily aggregation

```
Cost = (input_tokens × $0.00025 + output_tokens × $0.00125) / 1000
Average: ~$0.0015 per query (~6,250 tokens)
```

Metrics: Prometheus-format counters and gauges for tokens, cost, queries (per agent and daily totals)
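The per-query cost arithmetic, as a sketch using the per-1K-token rates from the formula above (constant names are illustrative):

```python
HAIKU_INPUT_PER_1K = 0.00025   # USD per 1K input tokens
HAIKU_OUTPUT_PER_1K = 0.00125  # USD per 1K output tokens

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single query at Claude Haiku rates."""
    return (input_tokens * HAIKU_INPUT_PER_1K
            + output_tokens * HAIKU_OUTPUT_PER_1K) / 1000
```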

💡 Key Technical Decisions

Decision 1: Vanilla JavaScript vs. Streamlit/React

Chosen: Vanilla HTML/CSS/JavaScript

Primary Reason: Streamlit has known deployment issues on Railway, requiring complex WebSocket configuration and separate service management.

✅ Why Vanilla JS Won:
  • Single service deployment (no WebSocket complexity)
  • Served directly from FastAPI (port 8000 only)
  • No Streamlit-specific Railway configuration
  • Zero build step, instant deployment
  • Smaller bundle size (~15KB vs ~2MB)
❌ Rejected Alternatives:
  • Streamlit: Deployment complexity on Railway, WebSocket issues, required separate service
  • React: Overkill for UI complexity, build step overhead
  • Vue: Added dependency overhead

Key Learning: We initially built with Streamlit but encountered deployment issues on Railway. Rewriting in vanilla JavaScript reduced architecture complexity from 2 services to 1, eliminated WebSocket configuration headaches, and made deployment trivial.

Decision 2: LLM Provider Selection

Chosen: Anthropic Claude (Haiku model)

Why:

  • Best cost/performance ratio: $0.25 per 1M input tokens (vs GPT-4: $30)
  • Long context windows (200K tokens)
  • Strong instruction following
  • Built-in safety features
  • Lower latency than GPT-4

Cost Comparison (per 1,000 queries):

| Model | Cost | Decision |
|-------|------|----------|
| Claude Haiku | $1.50 | ✅ Selected |
| Claude Sonnet | $15.00 | ❌ 10x more expensive, exceeds 1GB Railway limit |
| GPT-3.5-Turbo | $2.00 | ❌ More expensive, lower quality |
| GPT-4-Turbo | $30.00 | ❌ 20x more expensive |
| Gemini Pro | $0.50 | ❌ Inconsistent API, less mature |

Initial Consideration: Claude Sonnet

We initially considered upgrading to Claude Sonnet for its ability to generate larger responses (8,192 output tokens vs Haiku's 4,096), which would be useful for comprehensive infrastructure diagrams and detailed recommendations.

Why We Stayed with Haiku:

  • Cost: Sonnet is 10x more expensive ($3 vs $0.30 per 1M tokens)
  • Memory Footprint: Sonnet model exceeded Railway's 1GB free tier memory limit, requiring paid plan upgrade
  • Architectural Workaround: Instead of upgrading, we split the infrastructure diagram generation into a separate, smaller task outside the main Infrastructure Agent, keeping responses within Haiku's token limits
  • Performance: Haiku's faster response times (2-4s) better suited our real-time recommendation use case

Key Learning: Architectural refactoring (task decomposition) can be more cost-effective than upgrading to larger models. By splitting complex outputs into focused sub-tasks, we maintained quality while achieving 10x cost savings and staying within infrastructure constraints.

Decision 3: Multi-Agent Architecture

Chosen: 5 specialized agents with LangGraph orchestration

Agents:

  1. Conversation Manager: Intelligent multi-turn dialogues to gather requirements
  2. Database Agent: Database technology recommendations
  3. Infrastructure Agent: Cloud architecture and deployment strategies
  4. Cost Agent: Multi-provider cost comparisons
  5. Security Agent: Threat modeling and compliance checks

Why:

  • Separation of Concerns: Each agent has focused expertise
  • Better Prompt Engineering: Smaller, targeted prompts vs one giant prompt
  • Conversational UX: Conversation Manager guides users through complex requirements
  • Parallel Future Optimization: Can parallelize agents for 3.7× speedup
  • Maintainability: Easy to update individual agents
  • Testability: Each agent can be tested in isolation

Decision 4: Deployment Platform

Chosen: Railway (Hobby plan $5/month)

✅ Pros:
  • GitHub auto-deploy
  • Predictable pricing
  • Zero-downtime deploys
  • Built-in SSL
  • Simple environment management
❌ Rejected Alternatives:
  • AWS: Complex, time-consuming setup
  • Heroku: More expensive, being sunset
  • Vercel: Serverless cold starts
  • Railway Free: 500 hours/month limit

🚀 Implementation Journey

Week 1: Agent Development

Built 4 specialized agents (Database, Infrastructure, Cost, Security) with base class architecture and LLM integration. Implemented 8 tools for knowledge retrieval and computation.

LOC: ~1,000 | Key Tech: Python, Anthropic Claude, Protocol-based tools

Week 2: LangGraph Orchestration

Designed sequential workflow pipeline with state management. Implemented query parser for extracting DAU, compliance, and budget from natural language.

LOC: ~500 | Key Tech: LangGraph, TypedDict state, Correlation IDs

Week 3: REST API Development

Built production FastAPI with rate limiting, cost controls, and comprehensive error handling. Added Swagger/ReDoc documentation.

LOC: ~400 | Key Tech: FastAPI, slowapi, Pydantic, CORS

Week 4: RAG System

Implemented vector search with Qdrant. Curated 34 technical documents covering databases, infrastructure, and security. Used sentence-transformers for embeddings.

LOC: ~500 | Key Tech: Qdrant, sentence-transformers, 384-d vectors

Week 5: Authentication & Frontend

Built the modern web UI in vanilla JavaScript. Implemented JWT authentication, Google OAuth 2.0, and an admin dashboard. Replaced Streamlit for simpler deployment.

LOC: ~400 | Key Tech: HTML/CSS/JS, JWT, bcrypt, Google OAuth

Week 6: Deployment & Polish

Deployed to Railway. Fixed NumPy compatibility issues. Switched from free tier to Hobby plan ($5/month) due to 500 hour limit. Added comprehensive documentation.

Status: ✅ Production-ready | Platform: Railway

⚡ Challenges & Solutions

Challenge 1: sentence-transformers Compatibility

Problem: ImportError with NumPy 2.0 breaking sentence-transformers

```
ImportError: cannot import name 'cached_download' from 'huggingface_hub'
# Root cause: sentence-transformers 2.x incompatible with NumPy 2.0
```

✅ Solution

Pinned NumPy to version <2.0.0 in pyproject.toml:

```toml
[project]
dependencies = [
    "sentence-transformers>=2.2.2,<3.0.0",
    "numpy>=1.21.0,<2.0.0",  # Pin to NumPy 1.x
    "transformers>=4.30.0",
]
```

Lesson Learned: Always pin major versions in production dependencies

Challenge 2: Railway Free Tier Exceeded

Problem: App went down with "exceeded usage limit" error

Investigation:

  • Free tier: 500 hours/month
  • Our usage: 24/7 × 30 days = 720 hours/month
  • Overage: 220 hours → app suspended

✅ Solution

Upgraded to Railway Hobby plan ($5/month) for unlimited execution hours

Why this was the right choice:

  • Predictable costs vs pay-per-use
  • Continuous availability
  • Still cheaper than AWS (when factoring in setup time)

Challenge 3: Streamlit Deployment Complexity

Problem: Streamlit required separate service, WebSocket configuration, and complex CORS setup

✅ Solution

Rewrote frontend in vanilla HTML/CSS/JavaScript served directly from FastAPI

Benefits:

  • Single service deployment (simplified architecture)
  • No WebSocket issues
  • Faster page loads (no framework overhead)
  • Single port (8000) instead of two

Challenge 4: Cost Control at Scale

Problem: Needed to prevent runaway API costs from abuse or bugs

✅ Solution

Implemented multi-layer protection:

  • Rate limiting: 5 req/hour (demo), 50 req/hour (authenticated)
  • Daily budget cap: $2.00 default, configurable
  • Token tracking: Monitor per-request costs
  • Query validation: Limit input length (10-1000 chars)
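The budget-cap and validation layers reduce to a small guard object; a sketch under the defaults listed above (class and method names are invented for illustration):

```python
class BudgetGuard:
    """Multi-layer cost protection: input validation plus a daily spend cap."""

    def __init__(self, daily_cap_usd: float = 2.00):
        self.daily_cap_usd = daily_cap_usd
        self.spent_today_usd = 0.0

    def validate_query(self, query: str) -> None:
        # Reject inputs outside the allowed 10-1000 character window
        if not 10 <= len(query) <= 1000:
            raise ValueError("query must be 10-1000 characters")

    def charge(self, cost_usd: float) -> None:
        # Refuse the request once the daily budget would be exceeded
        if self.spent_today_usd + cost_usd > self.daily_cap_usd:
            raise RuntimeError("daily budget cap reached")
        self.spent_today_usd += cost_usd
```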

🧠 Memory Management & Conversation Design

Short-Term Memory (Request Scope)

Each request gets a unique correlation ID that tracks the request through all agents:

```python
import uuid
from contextvars import ContextVar

correlation_id_var: ContextVar[str] = ContextVar("correlation_id")

@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    correlation_id = str(uuid.uuid4())
    correlation_id_var.set(correlation_id)
    logger.info("request_start", correlation_id=correlation_id)
    response = await call_next(request)
    return response
```

Purpose: Debug issues, trace requests, performance analysis

Long-Term Memory (Implemented with Qdrant)

Persistent storage using Qdrant vector database with semantic search capabilities:

Three Qdrant Collections:

  1. users: Authentication data, user profiles, usage statistics (total_queries, total_cost_usd)
  2. user_queries: Query history with 384-dimensional semantic embeddings for similarity search
  3. user_feedback: User feedback on recommendations for continuous improvement

Semantic Search Implementation:

```python
import uuid

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, Filter, FieldCondition, MatchValue

class UserMemoryStore:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim
        self.client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

    def store_query(self, user_id, query, recommendations, tokens_used, cost_usd):
        # Generate semantic embedding
        query_embedding = self.embedding_model.encode(query).tolist()
        # Store with vector for similarity search
        self.client.upsert(
            collection_name="user_queries",
            points=[PointStruct(
                id=str(uuid.uuid4()),
                vector=query_embedding,
                payload={
                    "user_id": user_id,
                    "query": query,
                    "recommendations": recommendations,
                    "tokens_used": tokens_used,
                    "cost_usd": cost_usd,
                },
            )],
        )

    def search_similar_queries(self, user_id, query, limit=5):
        # Find semantically similar past queries, scoped to this user
        query_embedding = self.embedding_model.encode(query).tolist()
        results = self.client.search(
            collection_name="user_queries",
            query_vector=query_embedding,
            query_filter=Filter(must=[
                FieldCondition(key="user_id", match=MatchValue(value=user_id)),
            ]),
            limit=limit,
        )
        return results  # Scored points, most similar first
```

Enabled Features:

Multi-Turn Conversations (Implemented)

Conversation Manager agent enables intelligent multi-turn dialogues with session-based memory:

SessionStore Implementation:

```python
class SessionStore:
    """In-memory short-term conversation memory (30-minute timeout)"""

    @staticmethod
    def create_session(user_id: str) -> str:
        session_id = str(uuid.uuid4())
        _sessions[session_id] = {
            "user_id": user_id,
            "conversation_history": [],   # All messages in conversation
            "extracted_context": {},      # Accumulated project requirements
            "completion_percentage": 0,   # How much info gathered
            "ready_for_recommendation": False,
        }
        return session_id

    @staticmethod
    def add_message(session_id: str, role: str, content: str):
        _sessions[session_id]["conversation_history"].append({
            "role": role,
            "content": content,
            "timestamp": time.time(),
        })
```

Conversation Flow:

  1. User starts conversation: "I need a tech stack for my project"
  2. Agent asks follow-up: "How many daily active users do you expect?"
  3. User responds: "Around 100K users"
  4. Agent continues: "What type of data will you be storing?"
  5. Context accumulates: extracted_context = {"dau": 100000, "data_type": "..."}
  6. Completion tracked: completion_percentage increases from 0% → 100%
  7. Ready signal: When ready_for_recommendation = True, system generates full recommendation

Enabled Multi-Turn Queries:

Note: Production systems should migrate from in-memory SessionStore to Redis for persistence across server restarts and multi-instance deployments.

🔐 Authentication & Security

Why Authentication Was Necessary

  1. Cost Control: Prevent abuse of expensive LLM API calls (~$0.0015/query)
  2. Rate Limiting: Enforce per-user limits instead of per-IP
  3. Audit Trail: Track who makes what requests for debugging
  4. Feature Access: Enable user profiles, query history, saved recommendations
  5. Admin Features: Manage users, view feedback, monitor system health

Authentication Implementation

JWT Tokens

Stateless authentication with 1-hour expiration. Tokens include user email and role (user/admin).
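Conceptually, HS256 token issuance and verification look like this stdlib-only sketch (the app uses PyJWT; this is illustrative, with the 1-hour TTL from above and invented function names):

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(email: str, role: str, secret: str, ttl_s: int = 3600) -> str:
    """Build header.payload.signature with an HMAC-SHA256 signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": email, "role": role,
                                  "exp": int(time.time()) + ttl_s}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str, secret: str) -> dict:
    """Check the signature and expiry, then return the claims."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```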

Password Security

bcrypt hashing with salt rounds. Passwords never stored in plain text or logged.

Google OAuth 2.0

Social login with state parameter for CSRF protection. User passwords stay at Google.

Rate Limiting

Per-user limits: 50 req/hour authenticated vs 5 req/hour demo mode.

Security Measures

Rate Limiting Implementation (SlowAPI)

The system implements comprehensive rate limiting using SlowAPI, a FastAPI extension built on a token bucket algorithm with in-memory storage. This protects against abuse and controls API costs.

Architecture

```python
# backend/src/api/main.py
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

# Initialize limiter with IP-based tracking
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter

# Register exception handler for HTTP 429 responses
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
```

Configuration

```python
# backend/src/core/config.py
class Settings(BaseSettings):
    rate_limit_demo: str = "50/hour"            # Demo/unauthenticated users
    rate_limit_authenticated: str = "100/hour"  # Authenticated users
    daily_query_cap: int = 100                  # Daily limit per user
```

Applied to Endpoints

```python
@app.post("/recommend")
@limiter.limit(settings.rate_limit_demo)  # 50 requests/hour by IP
async def get_recommendation(request: Request, req: RecommendationRequest):
    # Endpoint logic
    pass

@app.post("/generate-diagram")
@limiter.limit(settings.rate_limit_demo)
async def generate_architecture_diagram(request: Request, req: dict):
    pass

@app.post("/conversation/start")
@limiter.limit(settings.rate_limit_demo)
async def start_conversation(request: Request):
    pass
```

How It Works

Benefits

💰 Cost Control

Prevents LLM API cost spiral from excessive requests

🛡️ Abuse Prevention

Protects against denial-of-service attempts

⚖️ Fair Resource Allocation

Ensures equitable access among all users

🚀 Production-Ready

Battle-tested library with minimal overhead

⚙️ Configurable

Different limits for demo vs authenticated users (50/hour vs 100/hour)

Limitations & Future Enhancements

📈 Performance & Scalability

Current Performance Metrics

| Metric | Value | Notes |
|--------|-------|-------|
| Total Response Time | 2-4 seconds | Includes all agents + parsing |
| LLM Latency | ~3.3 seconds | 99.5% of total time |
| RAG Search | ~30ms | Vector search across 34 docs |
| Query Parsing | 1-5ms | NLP extraction |
| Tokens Per Query | ~6,250 | Across all agents |
| Cost Per Query | $0.0015 | Claude Haiku pricing |

Bottleneck Analysis

Current Architecture (Sequential):

Optimized Architecture (Parallel - Future):

Scalability Analysis

| Load Level | Requests/Day | Monthly Cost | Infrastructure |
|------------|--------------|--------------|----------------|
| Demo | 100 | $4.50 API + $5 hosting = $9.50 | Single Railway instance |
| Small Business | 1,000 | $45 API + $5 hosting = $50 | Single Railway instance |
| Growing Startup | 10,000 | $450 API + $25 hosting = $475 | 2-3 Railway instances + load balancer |
| Enterprise | 100,000 | $4,500 API + $500 infrastructure = $5,000 | Kubernetes cluster, Redis cache |

📊 Monitoring & Observability

The Tech Stack Advisor includes comprehensive monitoring capabilities with Prometheus-format metrics, structured logging, and Grafana Cloud integration for production-grade observability.

Prometheus Metrics Endpoint

The system exposes metrics at /metrics/prometheus in Prometheus format for seamless integration with monitoring systems:

```bash
# Access Prometheus metrics (requires JWT authentication)
curl http://localhost:8000/metrics/prometheus \
  -H "Authorization: Bearer <your-jwt-token>"
```

Available Metrics

HTTP Request Metrics

LLM Usage & Cost Tracking

Application Metrics

Grafana Cloud Integration

The application integrates seamlessly with Grafana Cloud for real-time monitoring dashboards and alerting. The free tier provides:

📊 Metrics Storage

10,000 metric series with 14-day retention

📈 Real-time Dashboards

Customizable dashboards for HTTP, LLM, and application metrics

🔔 Alerting

Alert on cost thresholds, error rates, and latency spikes

💰 Cost

$0/month for free tier (suitable for demo/small projects)

Setup Guide: See GRAFANA_CLOUD_SETUP.md for complete configuration instructions. (Private repo - request access if needed)

Example PromQL Queries

Common queries for monitoring the application in Grafana:

```
# Request rate (requests per second)
rate(http_requests_total[5m])

# P95 latency across all endpoints
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Daily LLM cost tracking
llm_daily_cost_usd

# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Token usage by agent
sum by (agent) (llm_tokens_total)

# Active sessions gauge
active_conversation_sessions
```

Structured Logging

All logs are emitted in structured JSON format using structlog with correlation IDs for request tracing:

```json
{
  "event": "recommendation_generated",
  "correlation_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "user_id": "user@example.com",
  "tokens_used": 6250,
  "cost_usd": 0.0015,
  "duration_ms": 3245,
  "timestamp": "2024-01-15T10:30:45.123Z"
}
```

Benefits:

🛠️ Technology Stack

Backend

Python 3.11+ FastAPI Pydantic LangChain LangGraph Anthropic Claude sentence-transformers Qdrant structlog slowapi bcrypt PyJWT

Frontend

HTML5 CSS3 JavaScript (ES6+) JWT localStorage

Development & Testing

pytest mypy ruff uvicorn

Infrastructure

Railway GitHub Auto-deploy SSL/HTTPS SQLite

📚 Lessons Learned

1. Simplicity Wins

Vanilla JavaScript over React saved weeks of complexity. No build step means faster iteration and simpler deployment.

2. Cost-Conscious Architecture

Choosing Claude Haiku over GPT-4 saved 95% on API costs without sacrificing quality. Always benchmark cheaper alternatives.

3. Dependency Hell is Real

The NumPy 2.0 breaking change taught us to pin major versions and test upgrades carefully.

4. Platform Matters

Railway's $5/month hobby plan is worth it vs fighting with free tier limits. Developer time is expensive.

5. Multi-Agent Design

Specialized agents with focused prompts outperform monolithic prompts for complex tasks.

6. Authentication is Non-Negotiable

Even for "free" services, authentication prevents abuse and enables valuable features like personalization.

7. Monitor Everything

Correlation IDs, structured logging, and cost tracking saved countless debugging hours.

Ready to Learn More?

This project showcases production-ready AI engineering, modern web development, and cloud deployment expertise.

Try Live Demo View on GitHub

Private repository - Request access if needed