📋 Project Overview
Tech Stack Advisor is a production-ready web application that uses AI to help developers and architects make informed technology-stack decisions. The system uses five specialized AI agents working in concert to analyze project requirements and deliver comprehensive recommendations covering databases, infrastructure, cost optimization, and security. With intelligent multi-turn conversations, long-term memory powered by Qdrant vector search, and semantic query history, the system provides personalized, context-aware recommendations.
🎬 Try It Live
🌐 Live Application
Try the production deployment
Launch App →
🎯 What Problem Does It Solve?
The Challenge
Choosing the right technology stack for a project is one of the most critical decisions in software development. It requires:
- Deep expertise across multiple domains (databases, infrastructure, security, cost optimization)
- Understanding of scale requirements and how different technologies perform at various scales
- Knowledge of compliance requirements (GDPR, HIPAA, PCI-DSS, SOC 2)
- Cost-benefit analysis across different cloud providers and deployment options
- Security threat modeling and mitigation strategies
The Solution
Tech Stack Advisor automates this complex decision-making process by:
💬 Intelligent Conversations
Engages in multi-turn dialogues to gather project requirements intelligently, asking targeted follow-up questions with structured choices.
🗄️ Database Recommendations
Analyzes data type, scale, consistency requirements, and recommends optimal database solutions with scaling strategies.
☁️ Infrastructure Planning
Suggests cloud providers, architecture patterns, and deployment strategies based on workload characteristics.
💰 Cost Optimization
Provides multi-provider cost comparisons and optimization recommendations to maximize budget efficiency.
🔒 Security Analysis
Performs threat modeling, checks compliance requirements, and recommends security measures.
🏗️ System Architecture
🎨 Modern Web UI
- HTML/CSS/JavaScript (Vanilla)
- User Authentication (Local + Google OAuth)
- Responsive Design
- Real-time API Integration with JWT
- Admin Dashboard
- Download JSON Results
↓
HTTP REST + JWT Auth
↓
⚡ FastAPI Backend (Port 8000)
- Serves static files (HTML/CSS/JS)
- POST /recommend - Main recommendation endpoint
- Authentication endpoints (register/login/OAuth)
- GET /health - Health monitoring
- GET /metrics - Usage & cost tracking
- Rate limiting & JWT authentication
- Auto-generated Swagger docs
↓
🔄 LangGraph Orchestrator
- Query Parser (NLP-based context extraction)
- Sequential agent coordination
- State management with TypedDict
- Correlation IDs for tracing
↓
💬 Conversation Manager · 🗄️ Database Agent · ☁️ Infrastructure Agent · 💰 Cost Agent · 🔒 Security Agent
↓
📚 Qdrant Vector Store (34 docs) · 🤖 Claude AI, Haiku model (LLM) · 💵 Pricing Data (real-time)
Architecture Highlights
- Single-Service Design: Unified FastAPI backend serving both API and web UI on port 8000
- Agent Orchestration: LangGraph manages sequential workflow through all agents
- RAG System: Qdrant vector database with 34 curated technical documents
- Authentication: JWT tokens + Google OAuth 2.0 for user management
- State Management: Correlation IDs track requests through entire pipeline
📐 Detailed Architecture Diagram
flowchart LR
    subgraph Frontend["🎨 Frontend Layer"]
        direction TB
        UI[Web UI<br/>HTML/CSS/JS]
        Auth[Auth Pages]
        Admin[Admin Dashboard]
    end
    subgraph API["⚡ API Gateway"]
        direction TB
        FastAPI[FastAPI :8000<br/>Rate Limit: 50-100/hr<br/>JWT Auth]
        Middleware[CORS + Rate Limiter]
    end
    subgraph AuthSys["🔐 Authentication"]
        direction TB
        JWT[JWT Tokens<br/>HS256, 24h]
        OAuth[Google OAuth 2.0]
        Sessions[Session Store<br/>30min timeout]
    end
    subgraph Orchestration["🔄 LangGraph Orchestrator"]
        direction TB
        Parser[Query Parser<br/>NLP Extraction]
        subgraph Agents["5 Specialized Agents"]
            direction LR
            DB[🗄️ Database] --> Infra[☁️ Infrastructure]
            Infra --> Cost[💰 Cost]
            Cost --> Sec[🔒 Security]
        end
        Conv[💬 Conversation<br/>Manager]
        Synth[Result Synthesizer]
        Parser --> Agents
        Agents --> Synth
    end
    subgraph RAG["📚 RAG System"]
        direction TB
        Embed[Embedding Model<br/>all-MiniLM-L6-v2<br/>384-dim]
        VectorDB[(Qdrant Vector DB<br/>34 docs)]
        Embed --> VectorDB
    end
    subgraph LLM["🤖 AI Engine"]
        direction TB
        Claude[Claude 3 Haiku<br/>$0.0015/query]
        TokenTrack[Token Tracking]
        Claude -.-> TokenTrack
    end
    subgraph Memory["💾 Memory Systems"]
        direction TB
        LongMem[(Long-Term<br/>Qdrant)]
        ShortMem[(Short-Term<br/>In-Memory)]
    end
    subgraph Monitor["📊 Monitoring"]
        direction TB
        Logger[Structured Logs<br/>Correlation IDs]
        Prom[Prometheus<br/>Metrics]
    end
    %% Main Flow
    Frontend --> API
    API --> Middleware
    Middleware --> AuthSys
    AuthSys --> Orchestration
    Orchestration --> Conv
    Parser -.-> RAG
    Agents -.-> RAG
    Agents -.-> LLM
    Synth --> Memory
    API -.-> Monitor
    %% Styling
    classDef frontend fill:#667eea,stroke:#333,stroke-width:2px,color:#fff
    classDef api fill:#48bb78,stroke:#333,stroke-width:2px,color:#fff
    classDef auth fill:#f59e0b,stroke:#333,stroke-width:2px,color:#fff
    classDef agent fill:#ed8936,stroke:#333,stroke-width:2px,color:#fff
    classDef storage fill:#4299e1,stroke:#333,stroke-width:2px,color:#fff
    classDef llm fill:#9f7aea,stroke:#333,stroke-width:2px,color:#fff
    classDef monitor fill:#38b2ac,stroke:#333,stroke-width:2px,color:#fff
    class Frontend,UI,Auth,Admin frontend
    class API,FastAPI,Middleware api
    class AuthSys,JWT,OAuth,Sessions auth
    class Orchestration,Parser,Agents,DB,Infra,Cost,Sec,Conv,Synth agent
    class RAG,Embed,VectorDB,Memory,LongMem,ShortMem storage
    class LLM,Claude,TokenTrack llm
    class Monitor,Logger,Prom monitor
🧮 Core Algorithms & Techniques
1. Natural Language Processing (NLP) - Context Extraction
Location: backend/src/orchestration/workflow.py
Techniques Used:
- Regex Pattern Matching: Extracts DAU (Daily Active Users) and budget from natural language queries
- Keyword-Based Classification: Detects scale ("small", "medium", "large", "enterprise"), workload type ("real-time", "api", "batch"), and data sensitivity
- Heuristic-Based Estimation: QPS = DAU × 0.015 / 60, Data Volume = DAU / 100 GB
Performance: 1-5ms per query
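The extraction step can be sketched as follows. Only the QPS and data-volume formulas come from the heuristics above; the specific regex patterns and unit suffixes here are illustrative assumptions, not the project's exact rules:

```python
import re

def extract_context(query: str) -> dict:
    """Sketch of regex + heuristic context extraction (patterns are illustrative)."""
    context = {}

    # DAU: match forms like "100k users" or "1M daily active users"
    m = re.search(r"(\d+(?:\.\d+)?)\s*([km]?)\s*(?:daily active )?users", query, re.I)
    if m:
        scale = {"k": 1_000, "m": 1_000_000}.get(m.group(2).lower(), 1)
        context["dau"] = int(float(m.group(1)) * scale)

    # Budget: match dollar amounts like "$500" or "$1,000"
    m = re.search(r"\$\s*(\d+(?:,\d{3})*)", query)
    if m:
        context["budget_usd"] = int(m.group(1).replace(",", ""))

    # Heuristic estimates from the formulas above
    if "dau" in context:
        context["qps"] = context["dau"] * 0.015 / 60
        context["data_volume_gb"] = context["dau"] / 100

    return context
```
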
2. Retrieval-Augmented Generation (RAG) - Semantic Search
Location: backend/src/rag/vectorstore.py
Algorithm: Vector Similarity Search with Cosine Distance
User Query → Text Embedding (384-dim) → Qdrant Vector Search → Cosine Similarity → Metadata Filtering → Top-K Retrieval
Model: sentence-transformers/all-MiniLM-L6-v2
Performance: ~30ms per RAG retrieval (embedding + search)
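The core ranking step of the pipeline above is plain cosine similarity. A minimal sketch over toy 2-dimensional vectors (the real system uses 384-dim all-MiniLM-L6-v2 embeddings and delegates search to Qdrant):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, docs, k=2):
    """docs: list of (doc_id, vector) pairs; returns ids ranked by similarity."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```
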
3. Database Scale Estimation - Tier-Based Heuristics
Location: backend/src/agents/database.py
Algorithm: Multi-factor tier classification
if DAU ≥ 1M or QPS ≥ 10K → Enterprise
elif DAU ≥ 100K or QPS ≥ 1K → Large
elif DAU ≥ 10K or QPS ≥ 100 → Medium
else → Small
Outputs: Tier, estimated connections, cache recommendation, sharding strategy, replication approach
Complexity: O(1) - constant time decision tree
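The decision tree above transcribes directly into a constant-time function:

```python
def estimate_tier(dau: int, qps: float) -> str:
    """Tier classification per the thresholds above (DAU or QPS triggers a tier)."""
    if dau >= 1_000_000 or qps >= 10_000:
        return "enterprise"
    if dau >= 100_000 or qps >= 1_000:
        return "large"
    if dau >= 10_000 or qps >= 100:
        return "medium"
    return "small"
```
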
4. Multi-Cloud Cost Estimation - Component-Based Pricing
Location: backend/src/agents/cost.py
Algorithm: Linear cost model with provider-specific multipliers
Total Monthly Cost = Compute + Storage + Database + Bandwidth
Provider Multipliers: AWS (1.0x), GCP (1.09x), Azure (1.05x)
Complexity: O(1) - lookup table + simple arithmetic
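A minimal sketch of the linear cost model; the base component rates here are hypothetical placeholders (the real lookup tables live in `cost.py`), while the provider multipliers come from the figures above:

```python
# Hypothetical per-component base rates (USD/month) for illustration only
BASE_RATES = {"compute": 200.0, "storage": 50.0, "database": 120.0, "bandwidth": 30.0}
PROVIDER_MULTIPLIERS = {"aws": 1.0, "gcp": 1.09, "azure": 1.05}

def monthly_cost(provider: str, scale_factor: float = 1.0) -> float:
    """Total = (Compute + Storage + Database + Bandwidth) x provider multiplier."""
    base = sum(BASE_RATES.values()) * scale_factor
    return round(base * PROVIDER_MULTIPLIERS[provider], 2)
```
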
5. Security Risk Assessment - Multi-Factor Scoring
Location: backend/src/agents/security.py
Algorithm: Weighted risk scoring with priority matrix
Risk Score = Data Sensitivity (40%) + Public Exposure (25%) + Compliance (20%) + Architecture Complexity (15%)
Risk Level: ≥70 = CRITICAL, ≥50 = HIGH, ≥30 = MEDIUM, else LOW
Threat Prioritization: CRITICAL (SQL injection, data exposure), HIGH (CSRF, XSS), MEDIUM (DDoS), LOW (info disclosure)
Complexity: O(n) where n = number of compliance requirements (typically < 5)
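The weighted scoring and thresholds above can be sketched as (factor inputs assumed to be 0-100 scores):

```python
# Weights from the risk formula above
WEIGHTS = {"data_sensitivity": 0.40, "public_exposure": 0.25,
           "compliance": 0.20, "architecture_complexity": 0.15}

def risk_level(factors: dict) -> tuple[float, str]:
    """factors: each value scored 0-100; returns (weighted score, level)."""
    score = sum(factors.get(name, 0) * w for name, w in WEIGHTS.items())
    if score >= 70:
        level = "CRITICAL"
    elif score >= 50:
        level = "HIGH"
    elif score >= 30:
        level = "MEDIUM"
    else:
        level = "LOW"
    return score, level
```
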
6. Conversation Management - Iterative Context Accumulation
Location: backend/src/agents/conversation.py
Algorithm: Contextual completion tracking with structured questioning
Completion = app_type (40%) + dau/scale (40%) + key_features (20%)
Ready when completion ≥ 80%
Strategy: Prioritize missing critical fields → Provide structured options → Extract entities → Avoid redundancy
Complexity: O(k) where k = number of conversation turns (typically 2-5)
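The completion formula above is a weighted presence check over the extracted context:

```python
# Weights from the completion formula above
FIELD_WEIGHTS = {"app_type": 40, "dau": 40, "key_features": 20}

def completion_percentage(context: dict) -> int:
    """Sum the weight of every critical field already present in the context."""
    return sum(w for field, w in FIELD_WEIGHTS.items() if context.get(field))

def ready_for_recommendation(context: dict) -> bool:
    return completion_percentage(context) >= 80
```
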
7. LangGraph Sequential Workflow - State Machine
Location: backend/src/orchestration/workflow.py
Pattern: Directed Acyclic Graph (DAG) with shared state
parse_query → database_agent → infrastructure_agent → cost_agent → security_agent → synthesize
State Management: Shared TypedDict state accumulates results from each agent
Complexity: O(n) where n = number of agents (currently 5, sequential execution)
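The real workflow is built with LangGraph's StateGraph; a dependency-free sketch of the same pattern — a shared TypedDict state threaded sequentially through each node — looks like this (node bodies are placeholders, not the project's agents):

```python
from typing import Callable, TypedDict

class WorkflowState(TypedDict, total=False):
    query: str
    context: dict
    results: dict

def parse_query(state: WorkflowState) -> WorkflowState:
    state["context"] = {"raw": state["query"]}
    return state

def make_agent(name: str) -> Callable[[WorkflowState], WorkflowState]:
    def agent(state: WorkflowState) -> WorkflowState:
        # Each agent appends its output to the shared state
        state.setdefault("results", {})[name] = f"{name} analysis of {state['context']['raw']}"
        return state
    return agent

PIPELINE = [parse_query,
            make_agent("database"), make_agent("infrastructure"),
            make_agent("cost"), make_agent("security")]

def run(query: str) -> WorkflowState:
    state: WorkflowState = {"query": query}
    for step in PIPELINE:          # sequential execution, O(n) in agent count
        state = step(state)
    return state
```
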
8. Token & Cost Tracking - Real-Time Metrics
Location: backend/src/core/logging.py
Algorithm: Cumulative metrics with daily aggregation
Cost = (input_tokens × $0.00025 + output_tokens × $0.00125) / 1000
Average: ~$0.0015 per query (~6,250 tokens)
Metrics: Prometheus-format counters and gauges for tokens, cost, queries (per agent and daily totals)
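A minimal tracker implementing the cost formula above (rates are per 1K tokens, per the formula):

```python
INPUT_RATE = 0.00025   # USD per 1K input tokens (Claude Haiku)
OUTPUT_RATE = 0.00125  # USD per 1K output tokens

class CostTracker:
    """Cumulative daily token/cost/query counters, per the formula above."""
    def __init__(self):
        self.daily = {"tokens": 0, "cost_usd": 0.0, "queries": 0}

    def record(self, input_tokens: int, output_tokens: int) -> float:
        cost = (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1000
        self.daily["tokens"] += input_tokens + output_tokens
        self.daily["cost_usd"] += cost
        self.daily["queries"] += 1
        return cost
```
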
💡 Key Technical Decisions
Decision 1: Vanilla JavaScript vs. Streamlit/React
Chosen: Vanilla HTML/CSS/JavaScript
Primary Reason: Streamlit has known deployment issues on Railway, requiring complex WebSocket configuration and separate service management.
✅ Why Vanilla JS Won:
- Single service deployment (no WebSocket complexity)
- Served directly from FastAPI (port 8000 only)
- No Streamlit-specific Railway configuration
- Zero build step, instant deployment
- Smaller bundle size (~15KB vs ~2MB)
❌ Rejected Alternatives:
- Streamlit: Deployment complexity on Railway, WebSocket issues, required separate service
- React: Overkill for UI complexity, build step overhead
- Vue: Added dependency overhead
Key Learning: We initially built with Streamlit but encountered deployment issues on Railway. Rewriting in vanilla JavaScript reduced architecture complexity from 2 services to 1, eliminated WebSocket configuration headaches, and made deployment trivial.
Decision 2: LLM Provider Selection
Chosen: Anthropic Claude (Haiku model)
Why:
- Best cost/performance ratio: $0.25 per 1M input tokens (vs GPT-4: $30)
- Long context windows (200K tokens)
- Strong instruction following
- Built-in safety features
- Lower latency than GPT-4
Cost Comparison (per 1,000 queries):
| Model | Cost | Decision |
|---|---|---|
| Claude Haiku | $1.50 | ✅ Selected |
| Claude Sonnet | $15.00 | ❌ 10x more expensive, exceeds 1GB Railway limit |
| GPT-3.5-Turbo | $2.00 | ❌ More expensive, lower quality |
| GPT-4-Turbo | $30.00 | ❌ 20x more expensive |
| Gemini Pro | $0.50 | ❌ Inconsistent API, less mature |
Initial Consideration: Claude Sonnet
We initially considered upgrading to Claude Sonnet for its ability to generate larger responses (8,192 output tokens vs Haiku's 4,096), which would be useful for comprehensive infrastructure diagrams and detailed recommendations.
Why We Stayed with Haiku:
- Cost: Sonnet is roughly 10x more expensive ($3.00 vs $0.25 per 1M input tokens)
- Memory Footprint: Sonnet model exceeded Railway's 1GB free tier memory limit, requiring paid plan upgrade
- Architectural Workaround: Instead of upgrading, we split the infrastructure diagram generation into a separate, smaller task outside the main Infrastructure Agent, keeping responses within Haiku's token limits
- Performance: Haiku's faster response times (2-4s) better suited our real-time recommendation use case
Key Learning: Architectural refactoring (task decomposition) can be more cost-effective than upgrading to larger models. By splitting complex outputs into focused sub-tasks, we maintained quality while achieving 10x cost savings and staying within infrastructure constraints.
Decision 3: Multi-Agent Architecture
Chosen: 5 specialized agents with LangGraph orchestration
Agents:
- Conversation Manager: Intelligent multi-turn dialogues to gather requirements
- Database Agent: Database technology recommendations
- Infrastructure Agent: Cloud architecture and deployment strategies
- Cost Agent: Multi-provider cost comparisons
- Security Agent: Threat modeling and compliance checks
Why:
- Separation of Concerns: Each agent has focused expertise
- Better Prompt Engineering: Smaller, targeted prompts vs one giant prompt
- Conversational UX: Conversation Manager guides users through complex requirements
- Parallel Future Optimization: Can parallelize agents for 3.7× speedup
- Maintainability: Easy to update individual agents
- Testability: Each agent can be tested in isolation
Decision 4: Deployment Platform
Chosen: Railway (Hobby plan $5/month)
✅ Pros:
- GitHub auto-deploy
- Predictable pricing
- Zero-downtime deploys
- Built-in SSL
- Simple environment management
❌ Rejected Alternatives:
- AWS: Complex, time-consuming setup
- Heroku: More expensive, being sunset
- Vercel: Serverless cold starts
- Railway Free: 500 hours/month limit
🚀 Implementation Journey
Week 1: Agent Development
Built 4 specialized agents (Database, Infrastructure, Cost, Security) with base class architecture and LLM integration. Implemented 8 tools for knowledge retrieval and computation.
LOC: ~1,000 | Key Tech: Python, Anthropic Claude, Protocol-based tools
Week 2: LangGraph Orchestration
Designed sequential workflow pipeline with state management. Implemented query parser for extracting DAU, compliance, and budget from natural language.
LOC: ~500 | Key Tech: LangGraph, TypedDict state, Correlation IDs
Week 3: REST API Development
Built production FastAPI with rate limiting, cost controls, and comprehensive error handling. Added Swagger/ReDoc documentation.
LOC: ~400 | Key Tech: FastAPI, slowapi, Pydantic, CORS
Week 4: RAG System
Implemented vector search with Qdrant. Curated 34 technical documents covering databases, infrastructure, and security. Used sentence-transformers for embeddings.
LOC: ~500 | Key Tech: Qdrant, sentence-transformers, 384-d vectors
Week 5: Authentication & Frontend
Built Modern Web UI with vanilla JavaScript. Implemented JWT authentication, Google OAuth 2.0, and admin dashboard. Replaced Streamlit for simpler deployment.
LOC: ~400 | Key Tech: HTML/CSS/JS, JWT, bcrypt, Google OAuth
Week 6: Deployment & Polish
Deployed to Railway. Fixed NumPy compatibility issues. Switched from free tier to Hobby plan ($5/month) due to 500 hour limit. Added comprehensive documentation.
Status: ✅ Production-ready | Platform: Railway
⚡ Challenges & Solutions
Challenge 1: sentence-transformers Compatibility
Problem: ImportError with NumPy 2.0 breaking sentence-transformers
ImportError: cannot import name 'cached_download' from 'huggingface_hub'
# Root cause: sentence-transformers 2.x incompatible with NumPy 2.0
✅ Solution
Pinned NumPy to version <2.0.0 in pyproject.toml:
[project]
dependencies = [
    "sentence-transformers>=2.2.2,<3.0.0",
    "numpy>=1.21.0,<2.0.0",  # Pin to NumPy 1.x
    "transformers>=4.30.0",
]
Lesson Learned: Always pin major versions in production dependencies
Challenge 2: Railway Free Tier Exceeded
Problem: App went down with "exceeded usage limit" error
Investigation:
- Free tier: 500 hours/month
- Our usage: 24/7 × 30 days = 720 hours/month
- Overage: 220 hours → app suspended
✅ Solution
Upgraded to Railway Hobby plan ($5/month) for unlimited execution hours
Why this was the right choice:
- Predictable costs vs pay-per-use
- Continuous availability
- Still cheaper than AWS (when factoring in setup time)
Challenge 3: Streamlit Deployment Complexity
Problem: Streamlit required separate service, WebSocket configuration, and complex CORS setup
✅ Solution
Rewrote frontend in vanilla HTML/CSS/JavaScript served directly from FastAPI
Benefits:
- Single service deployment (simplified architecture)
- No WebSocket issues
- Faster page loads (no framework overhead)
- Single port (8000) instead of two
Challenge 4: Cost Control at Scale
Problem: Needed to prevent runaway API costs from abuse or bugs
✅ Solution
Implemented multi-layer protection:
- Rate limiting: 50 req/hour (demo), 100 req/hour (authenticated)
- Daily budget cap: $2.00 default, configurable
- Token tracking: Monitor per-request costs
- Query validation: Limit input length (10-1000 chars)
🧠 Memory Management & Conversation Design
Short-Term Memory (Request Scope)
Each request gets a unique correlation ID that tracks the request through all agents:
import uuid
from contextvars import ContextVar

correlation_id_var: ContextVar[str] = ContextVar('correlation_id')

@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    correlation_id = str(uuid.uuid4())
    correlation_id_var.set(correlation_id)
    logger.info("request_start", correlation_id=correlation_id)
    response = await call_next(request)
    return response
Purpose: Debug issues, trace requests, performance analysis
Long-Term Memory (Implemented with Qdrant)
Persistent storage using Qdrant vector database with semantic search capabilities:
Three Qdrant Collections:
- users: Authentication data, user profiles, usage statistics (total_queries, total_cost_usd)
- user_queries: Query history with 384-dimensional semantic embeddings for similarity search
- user_feedback: User feedback on recommendations for continuous improvement
Semantic Search Implementation:
import uuid

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct

class UserMemoryStore:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim
        self.client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

    def store_query(self, user_id, query, recommendations, tokens_used, cost_usd):
        # Generate semantic embedding
        query_embedding = self.embedding_model.encode(query).tolist()
        # Store with vector for similarity search
        self.client.upsert(
            collection_name="user_queries",
            points=[PointStruct(
                id=str(uuid.uuid4()),  # each point needs a unique ID
                vector=query_embedding,
                payload={
                    "user_id": user_id,
                    "query": query,
                    "recommendations": recommendations,
                    "tokens_used": tokens_used,
                    "cost_usd": cost_usd,
                }
            )]
        )

    def search_similar_queries(self, user_id, query, limit=5):
        # Find semantically similar past queries for this user
        query_embedding = self.embedding_model.encode(query).tolist()
        results = self.client.search(
            collection_name="user_queries",
            query_vector=query_embedding,
            query_filter=Filter(must=[
                FieldCondition(key="user_id", match=MatchValue(value=user_id))
            ]),
            limit=limit
        )
        return results  # Scored points with similarity scores
Enabled Features:
- Query History: "You asked something similar 2 days ago for a chat app"
- Semantic Search: Find related queries even with different wording
- User Statistics: Track total queries, cumulative cost per user
- Feedback Loop: Store and analyze user feedback on recommendations
- Cost Tracking: Monitor per-user API costs for budget controls
Multi-Turn Conversations (Implemented)
Conversation Manager agent enables intelligent multi-turn dialogues with session-based memory:
SessionStore Implementation:
import time
import uuid

_sessions: dict[str, dict] = {}

class SessionStore:
    """In-memory short-term conversation memory (30-minute timeout)"""

    @staticmethod
    def create_session(user_id: str) -> str:
        session_id = str(uuid.uuid4())
        _sessions[session_id] = {
            "user_id": user_id,
            "conversation_history": [],   # All messages in conversation
            "extracted_context": {},      # Accumulated project requirements
            "completion_percentage": 0,   # How much info gathered
            "ready_for_recommendation": False
        }
        return session_id

    @staticmethod
    def add_message(session_id: str, role: str, content: str):
        session = _sessions[session_id]  # Look up the session first
        session["conversation_history"].append({
            "role": role,
            "content": content,
            "timestamp": time.time()
        })
Conversation Flow:
- User starts conversation: "I need a tech stack for my project"
- Agent asks follow-up: "How many daily active users do you expect?"
- User responds: "Around 100K users"
- Agent continues: "What type of data will you be storing?"
- Context accumulates: extracted_context = {"dau": 100000, "data_type": "..."}
- Completion tracked: completion_percentage increases from 0% → 100%
- Ready signal: When ready_for_recommendation = True, system generates full recommendation
Enabled Multi-Turn Queries:
- "What if I increase the budget to $1000?" → Updates context, regenerates recommendations
- "Can you recommend alternatives to PostgreSQL?" → Refines database recommendations
- "How would this change for 1M users instead?" → Re-runs all agents with new scale
Note: Production systems should migrate from in-memory SessionStore to Redis for persistence across server restarts and multi-instance deployments.
🔐 Authentication & Security
Why Authentication Was Necessary
- Cost Control: Prevent abuse of expensive LLM API calls (~$0.0015/query)
- Rate Limiting: Enforce per-user limits instead of per-IP
- Audit Trail: Track who makes what requests for debugging
- Feature Access: Enable user profiles, query history, saved recommendations
- Admin Features: Manage users, view feedback, monitor system health
Authentication Implementation
JWT Tokens
Stateless authentication with 24-hour expiration. Tokens include user email and role (user/admin).
Password Security
bcrypt hashing with salt rounds. Passwords never stored in plain text or logged.
Google OAuth 2.0
Social login with state parameter for CSRF protection. User passwords stay at Google.
Rate Limiting
Per-user limits: 100 req/hour authenticated vs 50 req/hour demo mode.
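The project uses PyJWT for token handling; as an illustration of what HS256 signing and verification do under the hood, here is a stdlib-only sketch (not the production code):

```python
import base64
import hashlib
import hmac
import json
import time

def _b64(data: bytes) -> str:
    # URL-safe base64 without padding, as JWT requires
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict, secret: str) -> str:
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64(json.dumps(payload).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64(sig)}"

def verify_jwt(token: str, secret: str):
    """Return the claims dict, or None if the signature or expiry is invalid."""
    header, body, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64(expected), sig):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))
    if payload.get("exp", float("inf")) < time.time():
        return None
    return payload
```
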
Security Measures
- Input Validation: Pydantic schemas validate all inputs
- XSS Prevention: Content-Security-Policy headers
- CSRF Protection: JWT tokens + OAuth state parameter
- CORS Configuration: Restrict origins in production
- SQL Injection: Parameterized queries with SQLAlchemy
Rate Limiting Implementation (SlowAPI)
The system implements comprehensive rate limiting using SlowAPI, a FastAPI rate-limiting extension that tracks requests in a sliding time window with in-memory storage by default. This protects against abuse and controls API costs.
Architecture
# backend/src/api/main.py
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
# Initialize limiter with IP-based tracking
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# Register exception handler for HTTP 429 responses
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
Configuration
# backend/src/core/config.py
class Settings(BaseSettings):
    rate_limit_demo: str = "50/hour"            # Demo/unauthenticated users
    rate_limit_authenticated: str = "100/hour"  # Authenticated users
    daily_query_cap: int = 100                  # Daily limit per user
Applied to Endpoints
@app.post("/recommend")
@limiter.limit(settings.rate_limit_demo)  # 50 requests/hour by IP
async def get_recommendation(request: Request, req: RecommendationRequest):
    # Endpoint logic
    pass

@app.post("/generate-diagram")
@limiter.limit(settings.rate_limit_demo)
async def generate_architecture_diagram(request: Request, req: dict):
    pass

@app.post("/conversation/start")
@limiter.limit(settings.rate_limit_demo)
async def start_conversation(request: Request):
    pass
How It Works
- IP-Based Tracking: get_remote_address extracts the client IP from request headers
- Sliding Window Algorithm: Tracks requests per IP in a time window (e.g., the last hour)
- Automatic Enforcement: Returns HTTP 429 (Too Many Requests) with a Retry-After header when the limit is exceeded
- Per-Endpoint Limits: Each decorated endpoint maintains independent rate limits
- In-Memory Storage: Fast lookup with minimal latency (suitable for single-instance deployments)
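Conceptually, the sliding-window check can be sketched as follows — a toy in-process limiter illustrating the idea, not SlowAPI's actual implementation:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Toy per-key sliding-window limiter: keep recent hit timestamps per key."""
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)

    def allow(self, key: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        while q and now - q[0] >= self.window:  # drop hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False  # caller would return HTTP 429 with Retry-After
        q.append(now)
        return True
```
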
Benefits
💰 Cost Control
Prevents LLM API cost spiral from excessive requests
🛡️ Abuse Prevention
Protects against denial-of-service attempts
⚖️ Fair Resource Allocation
Ensures equitable access among all users
🚀 Production-Ready
Battle-tested library with minimal overhead
⚙️ Configurable
Different limits for demo vs authenticated users (50/hour vs 100/hour)
Limitations & Future Enhancements
- In-Memory Storage: Limits reset on server restart; consider Redis backend for production clusters
- IP-Based Only: Sophisticated users can bypass with IP rotation; consider user-based limits
- No Distributed Sync: Multi-instance deployments need shared state (Redis/Memcached)
📈 Performance & Scalability
Current Performance Metrics
| Metric | Value | Notes |
|---|---|---|
| Total Response Time | 2-4 seconds | Includes all agents + parsing |
| LLM Latency | ~3.3 seconds | 99.5% of total time |
| RAG Search | ~30ms | Vector search across 34 docs |
| Query Parsing | 1-5ms | NLP extraction |
| Tokens Per Query | ~6,250 | Across all agents |
| Cost Per Query | $0.0015 | Claude Haiku pricing |
Bottleneck Analysis
Current Architecture (Sequential):
- Parse Query: 5ms
- Database Agent: 800ms
- Infrastructure Agent: 900ms
- Cost Agent: 850ms
- Security Agent: 700ms
- Total: 3,255ms
Optimized Architecture (Parallel - Future):
- Parse Query: 5ms
- All Agents (Parallel): 900ms (slowest agent)
- Total: 905ms (3.7× faster!)
Scalability Analysis
| Load Level | Requests/Day | Monthly Cost | Infrastructure |
|---|---|---|---|
| Demo | 100 | $4.50 API + $5 hosting = $9.50 | Single Railway instance |
| Small Business | 1,000 | $45 API + $5 hosting = $50 | Single Railway instance |
| Growing Startup | 10,000 | $450 API + $25 hosting = $475 | 2-3 Railway instances + load balancer |
| Enterprise | 100,000 | $4,500 API + $500 infrastructure = $5,000 | Kubernetes cluster, Redis cache |
📊 Monitoring & Observability
The Tech Stack Advisor includes comprehensive monitoring capabilities with Prometheus-format metrics, structured logging, and Grafana Cloud integration for production-grade observability.
Prometheus Metrics Endpoint
The system exposes metrics at /metrics/prometheus in Prometheus format for seamless integration with monitoring systems:
# Access Prometheus metrics (requires JWT authentication)
curl http://localhost:8000/metrics/prometheus \
-H "Authorization: Bearer <your-jwt-token>"
Available Metrics
HTTP Request Metrics
http_requests_total{method, endpoint, status_code} - Total HTTP requests counter with labels for method, endpoint, and status code
http_request_duration_seconds{method, endpoint} - HTTP request duration histogram for calculating p50, p95, p99 latencies
LLM Usage & Cost Tracking
llm_tokens_total{agent, token_type} - Token usage by agent (input/output tokens)
llm_cost_usd_total{agent} - Cumulative API cost per agent in USD
llm_requests_total{agent, status} - LLM request count by agent and status (success/error)
llm_daily_tokens - Daily token usage gauge (resets at midnight UTC)
llm_daily_cost_usd - Daily cost in USD gauge
llm_daily_queries - Daily query count gauge
Application Metrics
active_conversation_sessions - Number of active conversation sessions
user_registrations_total{oauth_provider} - Total user registrations by OAuth provider (local/google)
user_logins_total{oauth_provider} - Total user logins by provider
recommendations_total{status, authenticated} - Total recommendations generated with status and auth labels
Grafana Cloud Integration
The application integrates seamlessly with Grafana Cloud for real-time monitoring dashboards and alerting. The free tier provides:
📊 Metrics Storage
10,000 metric series with 14-day retention
📈 Real-time Dashboards
Customizable dashboards for HTTP, LLM, and application metrics
🔔 Alerting
Alert on cost thresholds, error rates, and latency spikes
💰 Cost
$0/month for free tier (suitable for demo/small projects)
Setup Guide: See GRAFANA_CLOUD_SETUP.md for complete configuration instructions. (Private repo - request access if needed)
Example PromQL Queries
Common queries for monitoring the application in Grafana:
# Request rate (requests per second)
rate(http_requests_total[5m])
# P95 latency across all endpoints
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Daily LLM cost tracking
llm_daily_cost_usd
# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Token usage by agent
sum by (agent) (llm_tokens_total)
# Active sessions gauge
active_conversation_sessions
Structured Logging
All logs are emitted in structured JSON format using structlog with correlation IDs for request tracing:
{
  "event": "recommendation_generated",
  "correlation_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "user_id": "user@example.com",
  "tokens_used": 6250,
  "cost_usd": 0.0015,
  "duration_ms": 3245,
  "timestamp": "2024-01-15T10:30:45.123Z"
}
Benefits:
- Request Tracing: Correlation IDs track requests through all agents and services
- Debugging: Structured logs enable powerful filtering and aggregation (e.g., "show all errors for correlation_id X")
- Performance Analysis: Track duration and cost for individual requests
- Cost Control: Monitor per-user API costs and daily spending
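The project emits these records via structlog; a stdlib-only sketch of producing the same record shape (field names mirror the example log line above):

```python
import json
import time

def make_log_record(event: str, correlation_id: str, **fields) -> str:
    """Emit one structured JSON log line with event, correlation ID, and timestamp."""
    record = {
        "event": event,
        "correlation_id": correlation_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        **fields,
    }
    return json.dumps(record)
```
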
🛠️ Technology Stack
Backend
Python 3.11+
FastAPI
Pydantic
LangChain
LangGraph
Anthropic Claude
sentence-transformers
Qdrant
structlog
slowapi
bcrypt
PyJWT
Frontend
HTML5
CSS3
JavaScript (ES6+)
JWT localStorage
Development & Testing
pytest
mypy
ruff
uvicorn
Infrastructure
Railway
GitHub Auto-deploy
SSL/HTTPS
SQLite
📚 Lessons Learned
1. Simplicity Wins
Vanilla JavaScript over React saved weeks of complexity. No build step means faster iteration and simpler deployment.
2. Cost-Conscious Architecture
Choosing Claude Haiku over GPT-4 saved 95% on API costs without sacrificing quality. Always benchmark cheaper alternatives.
3. Dependency Hell is Real
The NumPy 2.0 breaking change taught us to pin major versions and test upgrades carefully.
4. Platform Matters
Railway's $5/month hobby plan is worth it vs fighting with free tier limits. Developer time is expensive.
5. Multi-Agent Design
Specialized agents with focused prompts outperform monolithic prompts for complex tasks.
6. Authentication is Non-Negotiable
Even for "free" services, authentication prevents abuse and enables valuable features like personalization.
7. Monitor Everything
Correlation IDs, structured logging, and cost tracking saved countless debugging hours.