📋 Project Overview
Tech Stack Advisor is a production-ready web application that uses AI to help developers and architects make informed technology-stack decisions. The system uses five specialized AI agents working in concert to analyze project requirements and deliver comprehensive recommendations covering databases, infrastructure, cost optimization, and security. With intelligent multi-turn conversations, long-term memory powered by Qdrant vector search, and semantic query history, the system provides personalized, context-aware recommendations.
🎬 Try It Live
🌐 Live Application
Try the production deployment
Launch App →
🎯 What Problem Does It Solve?
The Challenge
Choosing the right technology stack for a project is one of the most critical decisions in software development. It requires:
- Deep expertise across multiple domains (databases, infrastructure, security, cost optimization)
- Understanding of scale requirements and how different technologies perform at various scales
- Knowledge of compliance requirements (GDPR, HIPAA, PCI-DSS, SOC 2)
- Cost-benefit analysis across different cloud providers and deployment options
- Security threat modeling and mitigation strategies
The Solution
Tech Stack Advisor automates this complex decision-making process by:
💬 Intelligent Conversations
Engages in multi-turn dialogues to gather project requirements intelligently, asking targeted follow-up questions with structured choices.
🗄️ Database Recommendations
Analyzes data type, scale, consistency requirements, and recommends optimal database solutions with scaling strategies.
☁️ Infrastructure Planning
Suggests cloud providers, architecture patterns, and deployment strategies based on workload characteristics.
💰 Cost Optimization
Provides multi-provider cost comparisons and optimization recommendations to maximize budget efficiency.
🔒 Security Analysis
Performs threat modeling, checks compliance requirements, and recommends security measures.
🏗️ System Architecture
🎨 Modern Web UI
- HTML/CSS/JavaScript (Vanilla)
- User Authentication (Local + Google OAuth)
- Responsive Design
- Real-time API Integration with JWT
- Admin Dashboard
- Download JSON Results
↓
HTTP REST + JWT Auth
↓
⚡ FastAPI Backend (Port 8000)
- Serves static files (HTML/CSS/JS)
- POST /recommend - Main recommendation endpoint
- Authentication endpoints (register/login/OAuth)
- GET /health - Health monitoring
- GET /metrics - Usage & cost tracking
- Rate limiting & JWT authentication
- Auto-generated Swagger docs
↓
🔄 LangGraph Orchestrator
- Query Parser (NLP-based context extraction)
- Sequential agent coordination
- State management with TypedDict
- Correlation IDs for tracing
↓
💬 Conversation Manager · 🗄️ Database Agent · ☁️ Infrastructure Agent · 💰 Cost Agent · 🔒 Security Agent
↓
📚 Qdrant Vector Store (34 docs) · 🤖 Claude AI, Haiku model (LLM) · 💵 Pricing Data (real-time)
Architecture Highlights
- Single-Service Design: Unified FastAPI backend serving both API and web UI on port 8000
- Agent Orchestration: LangGraph manages sequential workflow through all agents
- RAG System: Qdrant vector database with 34 curated technical documents
- Authentication: JWT tokens + Google OAuth 2.0 for user management
- State Management: Correlation IDs track requests through entire pipeline
📐 Detailed Architecture Diagram
flowchart LR
    subgraph Frontend["🎨 Frontend Layer"]
        direction TB
        UI[Web UI<br/>HTML/CSS/JS]
        Auth[Auth Pages]
        Admin[Admin Dashboard]
    end
    subgraph API["⚡ API Gateway"]
        direction TB
        FastAPI[FastAPI :8000<br/>Rate Limit: 50-100/hr<br/>JWT Auth]
        Middleware[CORS + Rate Limiter]
    end
    subgraph AuthSys["🔐 Authentication"]
        direction TB
        JWT[JWT Tokens<br/>HS256, 24h]
        OAuth[Google OAuth 2.0]
        Sessions[Session Store<br/>30min timeout]
    end
    subgraph Orchestration["🔄 LangGraph Orchestrator"]
        direction TB
        Parser[Query Parser<br/>NLP Extraction]
        subgraph Agents["5 Specialized Agents"]
            direction LR
            DB[🗄️ Database] --> Infra[☁️ Infrastructure]
            Infra --> Cost[💰 Cost]
            Cost --> Sec[🔒 Security]
        end
        Conv[💬 Conversation<br/>Manager]
        Synth[Result Synthesizer]
        Parser --> Agents
        Agents --> Synth
    end
    subgraph RAG["📚 RAG System"]
        direction TB
        Embed[Embedding Model<br/>all-MiniLM-L6-v2<br/>384-dim]
        VectorDB[(Qdrant Vector DB<br/>34 docs)]
        Embed --> VectorDB
    end
    subgraph LLM["🤖 AI Engine"]
        direction TB
        Claude[Claude 3 Haiku<br/>$0.0015/query]
        TokenTrack[Token Tracking]
        Claude -.-> TokenTrack
    end
    subgraph Memory["💾 Memory Systems"]
        direction TB
        LongMem[(Long-Term<br/>Qdrant)]
        ShortMem[(Short-Term<br/>In-Memory)]
    end
    subgraph Monitor["📊 Monitoring"]
        direction TB
        Logger[Structured Logs<br/>Correlation IDs]
        Prom[Prometheus<br/>Metrics]
    end
    %% Main Flow
    Frontend --> API
    API --> Middleware
    Middleware --> AuthSys
    AuthSys --> Orchestration
    Orchestration --> Conv
    Parser -.-> RAG
    Agents -.-> RAG
    Agents -.-> LLM
    Synth --> Memory
    API -.-> Monitor
    %% Styling
    classDef frontend fill:#667eea,stroke:#333,stroke-width:2px,color:#fff
    classDef api fill:#48bb78,stroke:#333,stroke-width:2px,color:#fff
    classDef auth fill:#f59e0b,stroke:#333,stroke-width:2px,color:#fff
    classDef agent fill:#ed8936,stroke:#333,stroke-width:2px,color:#fff
    classDef storage fill:#4299e1,stroke:#333,stroke-width:2px,color:#fff
    classDef llm fill:#9f7aea,stroke:#333,stroke-width:2px,color:#fff
    classDef monitor fill:#38b2ac,stroke:#333,stroke-width:2px,color:#fff
    class Frontend,UI,Auth,Admin frontend
    class API,FastAPI,Middleware api
    class AuthSys,JWT,OAuth,Sessions auth
    class Orchestration,Parser,Agents,DB,Infra,Cost,Sec,Conv,Synth agent
    class RAG,Embed,VectorDB,Memory,LongMem,ShortMem storage
    class LLM,Claude,TokenTrack llm
    class Monitor,Logger,Prom monitor
🧮 Core Algorithms & Techniques
1. Natural Language Processing (NLP) - Context Extraction
Location: backend/src/orchestration/workflow.py
Techniques Used:
- Regex Pattern Matching: Extracts DAU (Daily Active Users) and budget from natural language queries
- Keyword-Based Classification: Detects scale ("small", "medium", "large", "enterprise"), workload type ("real-time", "api", "batch"), and data sensitivity
- Heuristic-Based Estimation: QPS = DAU × 0.015 / 60, Data Volume = DAU / 100 GB
Performance: 1-5ms per query
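The extraction step can be sketched as follows. Only the QPS and data-volume formulas come from the heuristics above; the specific regex patterns and unit suffixes here are illustrative assumptions, not the project's exact rules:

```python
import re

def extract_context(query: str) -> dict:
    """Sketch of regex + heuristic context extraction (patterns are illustrative)."""
    context = {}

    # DAU: match forms like "100k users" or "1M daily active users"
    m = re.search(r"(\d+(?:\.\d+)?)\s*([km]?)\s*(?:daily active )?users", query, re.I)
    if m:
        scale = {"k": 1_000, "m": 1_000_000}.get(m.group(2).lower(), 1)
        context["dau"] = int(float(m.group(1)) * scale)

    # Budget: match dollar amounts like "$500" or "$1,000"
    m = re.search(r"\$\s*(\d+(?:,\d{3})*)", query)
    if m:
        context["budget_usd"] = int(m.group(1).replace(",", ""))

    # Heuristic estimates from the formulas above
    if "dau" in context:
        context["qps"] = context["dau"] * 0.015 / 60
        context["data_volume_gb"] = context["dau"] / 100

    return context
```
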
2. Retrieval-Augmented Generation (RAG) - Semantic Search
Location: backend/src/rag/vectorstore.py
Algorithm: Vector Similarity Search with Cosine Distance
User Query → Text Embedding (384-dim) → Qdrant Vector Search → Cosine Similarity → Metadata Filtering → Top-K Retrieval
Model: sentence-transformers/all-MiniLM-L6-v2
Performance: ~30ms per RAG retrieval (embedding + search)
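The core ranking step of the pipeline above is plain cosine similarity. A minimal sketch over toy 2-dimensional vectors (the real system uses 384-dim all-MiniLM-L6-v2 embeddings and delegates search to Qdrant):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, docs, k=2):
    """docs: list of (doc_id, vector) pairs; returns ids ranked by similarity."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```
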
3. Database Scale Estimation - Tier-Based Heuristics
Location: backend/src/agents/database.py
Algorithm: Multi-factor tier classification
if DAU ≥ 1M or QPS ≥ 10K → Enterprise
elif DAU ≥ 100K or QPS ≥ 1K → Large
elif DAU ≥ 10K or QPS ≥ 100 → Medium
else → Small
Outputs: Tier, estimated connections, cache recommendation, sharding strategy, replication approach
Complexity: O(1) - constant time decision tree
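The decision tree above transcribes directly into a constant-time function:

```python
def estimate_tier(dau: int, qps: float) -> str:
    """Tier classification per the thresholds above (DAU or QPS triggers a tier)."""
    if dau >= 1_000_000 or qps >= 10_000:
        return "enterprise"
    if dau >= 100_000 or qps >= 1_000:
        return "large"
    if dau >= 10_000 or qps >= 100:
        return "medium"
    return "small"
```
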
4. Multi-Cloud Cost Estimation - Component-Based Pricing
Location: backend/src/agents/cost.py
Algorithm: Linear cost model with provider-specific multipliers
Total Monthly Cost = Compute + Storage + Database + Bandwidth
Provider Multipliers: AWS (1.0x), GCP (1.09x), Azure (1.05x)
Complexity: O(1) - lookup table + simple arithmetic
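A minimal sketch of the linear cost model; the base component rates here are hypothetical placeholders (the real lookup tables live in `cost.py`), while the provider multipliers come from the figures above:

```python
# Hypothetical per-component base rates (USD/month) for illustration only
BASE_RATES = {"compute": 200.0, "storage": 50.0, "database": 120.0, "bandwidth": 30.0}
PROVIDER_MULTIPLIERS = {"aws": 1.0, "gcp": 1.09, "azure": 1.05}

def monthly_cost(provider: str, scale_factor: float = 1.0) -> float:
    """Total = (Compute + Storage + Database + Bandwidth) x provider multiplier."""
    base = sum(BASE_RATES.values()) * scale_factor
    return round(base * PROVIDER_MULTIPLIERS[provider], 2)
```
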
5. Security Risk Assessment - Multi-Factor Scoring
Location: backend/src/agents/security.py
Algorithm: Weighted risk scoring with priority matrix
Risk Score = Data Sensitivity (40%) + Public Exposure (25%) + Compliance (20%) + Architecture Complexity (15%)
Risk Level: ≥70 = CRITICAL, ≥50 = HIGH, ≥30 = MEDIUM, else LOW
Threat Prioritization: CRITICAL (SQL injection, data exposure), HIGH (CSRF, XSS), MEDIUM (DDoS), LOW (info disclosure)
Complexity: O(n) where n = number of compliance requirements (typically < 5)
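The weighted scoring and thresholds above can be sketched as (factor inputs assumed to be 0-100 scores):

```python
# Weights from the risk formula above
WEIGHTS = {"data_sensitivity": 0.40, "public_exposure": 0.25,
           "compliance": 0.20, "architecture_complexity": 0.15}

def risk_level(factors: dict) -> tuple[float, str]:
    """factors: each value scored 0-100; returns (weighted score, level)."""
    score = sum(factors.get(name, 0) * w for name, w in WEIGHTS.items())
    if score >= 70:
        level = "CRITICAL"
    elif score >= 50:
        level = "HIGH"
    elif score >= 30:
        level = "MEDIUM"
    else:
        level = "LOW"
    return score, level
```
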
6. Conversation Management - Iterative Context Accumulation
Location: backend/src/agents/conversation.py
Algorithm: Contextual completion tracking with structured questioning
Completion = app_type (40%) + dau/scale (40%) + key_features (20%)
Ready when completion ≥ 80%
Strategy: Prioritize missing critical fields → Provide structured options → Extract entities → Avoid redundancy
Complexity: O(k) where k = number of conversation turns (typically 2-5)
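The completion formula above is a weighted presence check over the extracted context:

```python
# Weights from the completion formula above
FIELD_WEIGHTS = {"app_type": 40, "dau": 40, "key_features": 20}

def completion_percentage(context: dict) -> int:
    """Sum the weight of every critical field already present in the context."""
    return sum(w for field, w in FIELD_WEIGHTS.items() if context.get(field))

def ready_for_recommendation(context: dict) -> bool:
    return completion_percentage(context) >= 80
```
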
7. LangGraph Sequential Workflow - State Machine
Location: backend/src/orchestration/workflow.py
Pattern: Directed Acyclic Graph (DAG) with shared state
parse_query → database_agent → infrastructure_agent → cost_agent → security_agent → synthesize
State Management: Shared TypedDict state accumulates results from each agent
Complexity: O(n) where n = number of agents (currently 5, sequential execution)
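The real workflow is built with LangGraph's StateGraph; a dependency-free sketch of the same pattern — a shared TypedDict state threaded sequentially through each node — looks like this (node bodies are placeholders, not the project's agents):

```python
from typing import Callable, TypedDict

class WorkflowState(TypedDict, total=False):
    query: str
    context: dict
    results: dict

def parse_query(state: WorkflowState) -> WorkflowState:
    state["context"] = {"raw": state["query"]}
    return state

def make_agent(name: str) -> Callable[[WorkflowState], WorkflowState]:
    def agent(state: WorkflowState) -> WorkflowState:
        # Each agent appends its output to the shared state
        state.setdefault("results", {})[name] = f"{name} analysis of {state['context']['raw']}"
        return state
    return agent

PIPELINE = [parse_query,
            make_agent("database"), make_agent("infrastructure"),
            make_agent("cost"), make_agent("security")]

def run(query: str) -> WorkflowState:
    state: WorkflowState = {"query": query}
    for step in PIPELINE:          # sequential execution, O(n) in agent count
        state = step(state)
    return state
```
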
8. Token & Cost Tracking - Real-Time Metrics
Location: backend/src/core/logging.py
Algorithm: Cumulative metrics with daily aggregation
Cost = (input_tokens × $0.00025 + output_tokens × $0.00125) / 1000
Average: ~$0.0015 per query (~6,250 tokens)
Metrics: Prometheus-format counters and gauges for tokens, cost, queries (per agent and daily totals)
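A minimal tracker implementing the cost formula above (rates are per 1K tokens, per the formula):

```python
INPUT_RATE = 0.00025   # USD per 1K input tokens (Claude Haiku)
OUTPUT_RATE = 0.00125  # USD per 1K output tokens

class CostTracker:
    """Cumulative daily token/cost/query counters, per the formula above."""
    def __init__(self):
        self.daily = {"tokens": 0, "cost_usd": 0.0, "queries": 0}

    def record(self, input_tokens: int, output_tokens: int) -> float:
        cost = (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1000
        self.daily["tokens"] += input_tokens + output_tokens
        self.daily["cost_usd"] += cost
        self.daily["queries"] += 1
        return cost
```
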
💡 Key Technical Decisions
Decision 1: Vanilla JavaScript vs. Streamlit/React
Chosen: Vanilla HTML/CSS/JavaScript
Primary Reason: Streamlit has known deployment issues on Railway, requiring complex WebSocket configuration and separate service management.
✅ Why Vanilla JS Won:
- Single service deployment (no WebSocket complexity)
- Served directly from FastAPI (port 8000 only)
- No Streamlit-specific Railway configuration
- Zero build step, instant deployment
- Smaller bundle size (~15KB vs ~2MB)
❌ Rejected Alternatives:
- Streamlit: Deployment complexity on Railway, WebSocket issues, required separate service
- React: Overkill for UI complexity, build step overhead
- Vue: Added dependency overhead
Key Learning: We initially built with Streamlit but encountered deployment issues on Railway. Rewriting in vanilla JavaScript reduced architecture complexity from 2 services to 1, eliminated WebSocket configuration headaches, and made deployment trivial.
Decision 2: LLM Provider Selection
Chosen: Anthropic Claude (Haiku model)
Why:
- Best cost/performance ratio: $0.25 per 1M input tokens (vs GPT-4: $30)
- Long context windows (200K tokens)
- Strong instruction following
- Built-in safety features
- Lower latency than GPT-4
Cost Comparison (per 1,000 queries):
| Model | Cost | Decision |
|---|---|---|
| Claude Haiku | $1.50 | ✅ Selected |
| Claude Sonnet | $15.00 | ❌ 10x more expensive, exceeds 1GB Railway limit |
| GPT-3.5-Turbo | $2.00 | ❌ More expensive, lower quality |
| GPT-4-Turbo | $30.00 | ❌ 20x more expensive |
| Gemini Pro | $0.50 | ❌ Inconsistent API, less mature |
Initial Consideration: Claude Sonnet
We initially considered upgrading to Claude Sonnet for its ability to generate larger responses (8,192 output tokens vs Haiku's 4,096), which would be useful for comprehensive infrastructure diagrams and detailed recommendations.
Why We Stayed with Haiku:
- Cost: Sonnet is roughly 10x more expensive ($3.00 vs $0.25 per 1M input tokens)
- Memory Footprint: Sonnet model exceeded Railway's 1GB free tier memory limit, requiring paid plan upgrade
- Architectural Workaround: Instead of upgrading, we split the infrastructure diagram generation into a separate, smaller task outside the main Infrastructure Agent, keeping responses within Haiku's token limits
- Performance: Haiku's faster response times (2-4s) better suited our real-time recommendation use case
Key Learning: Architectural refactoring (task decomposition) can be more cost-effective than upgrading to larger models. By splitting complex outputs into focused sub-tasks, we maintained quality while achieving 10x cost savings and staying within infrastructure constraints.
Decision 3: Multi-Agent Architecture
Chosen: 5 specialized agents with LangGraph orchestration
Agents:
- Conversation Manager: Intelligent multi-turn dialogues to gather requirements
- Database Agent: Database technology recommendations
- Infrastructure Agent: Cloud architecture and deployment strategies
- Cost Agent: Multi-provider cost comparisons
- Security Agent: Threat modeling and compliance checks
Why:
- Separation of Concerns: Each agent has focused expertise
- Better Prompt Engineering: Smaller, targeted prompts vs one giant prompt
- Conversational UX: Conversation Manager guides users through complex requirements
- Parallel Future Optimization: Can parallelize agents for 3.7× speedup
- Maintainability: Easy to update individual agents
- Testability: Each agent can be tested in isolation
Decision 4: Deployment Platform
Chosen: Railway (Hobby plan $5/month)
✅ Pros:
- GitHub auto-deploy
- Predictable pricing
- Zero-downtime deploys
- Built-in SSL
- Simple environment management
❌ Rejected Alternatives:
- AWS: Complex, time-consuming setup
- Heroku: More expensive, being sunset
- Vercel: Serverless cold starts
- Railway Free: 500 hours/month limit
🚀 Implementation Journey
Week 1: Agent Development
Built 4 specialized agents (Database, Infrastructure, Cost, Security) with base class architecture and LLM integration. Implemented 8 tools for knowledge retrieval and computation.
LOC: ~1,000 | Key Tech: Python, Anthropic Claude, Protocol-based tools
Week 2: LangGraph Orchestration
Designed sequential workflow pipeline with state management. Implemented query parser for extracting DAU, compliance, and budget from natural language.
LOC: ~500 | Key Tech: LangGraph, TypedDict state, Correlation IDs
Week 3: REST API Development
Built production FastAPI with rate limiting, cost controls, and comprehensive error handling. Added Swagger/ReDoc documentation.
LOC: ~400 | Key Tech: FastAPI, slowapi, Pydantic, CORS
Week 4: RAG System
Implemented vector search with Qdrant. Curated 34 technical documents covering databases, infrastructure, and security. Used sentence-transformers for embeddings.
LOC: ~500 | Key Tech: Qdrant, sentence-transformers, 384-d vectors
Week 5: Authentication & Frontend
Built Modern Web UI with vanilla JavaScript. Implemented JWT authentication, Google OAuth 2.0, and admin dashboard. Replaced Streamlit for simpler deployment.
LOC: ~400 | Key Tech: HTML/CSS/JS, JWT, bcrypt, Google OAuth
Week 6: Deployment & Polish
Deployed to Railway. Fixed NumPy compatibility issues. Switched from free tier to Hobby plan ($5/month) due to 500 hour limit. Added comprehensive documentation.
Status: ✅ Production-ready | Platform: Railway
⚡ Challenges & Solutions
Challenge 1: sentence-transformers Compatibility
Problem: ImportError with NumPy 2.0 breaking sentence-transformers
ImportError: cannot import name 'cached_download' from 'huggingface_hub'
# Root cause: sentence-transformers 2.x incompatible with NumPy 2.0
✅ Solution
Pinned NumPy to version <2.0.0 in pyproject.toml:
[project]
dependencies = [
    "sentence-transformers>=2.2.2,<3.0.0",
    "numpy>=1.21.0,<2.0.0",  # Pin to NumPy 1.x
    "transformers>=4.30.0",
]
Lesson Learned: Always pin major versions in production dependencies
Challenge 2: Railway Free Tier Exceeded
Problem: App went down with "exceeded usage limit" error
Investigation:
- Free tier: 500 hours/month
- Our usage: 24/7 × 30 days = 720 hours/month
- Overage: 220 hours → app suspended
✅ Solution
Upgraded to Railway Hobby plan ($5/month) for unlimited execution hours
Why this was the right choice:
- Predictable costs vs pay-per-use
- Continuous availability
- Still cheaper than AWS (when factoring in setup time)
Challenge 3: Streamlit Deployment Complexity
Problem: Streamlit required separate service, WebSocket configuration, and complex CORS setup
✅ Solution
Rewrote frontend in vanilla HTML/CSS/JavaScript served directly from FastAPI
Benefits:
- Single service deployment (simplified architecture)
- No WebSocket issues
- Faster page loads (no framework overhead)
- Single port (8000) instead of two
Challenge 4: Cost Control at Scale
Problem: Needed to prevent runaway API costs from abuse or bugs
✅ Solution
Implemented multi-layer protection:
- Rate limiting: 50 req/hour (demo), 100 req/hour (authenticated)
- Daily budget cap: $2.00 default, configurable
- Token tracking: Monitor per-request costs
- Query validation: Limit input length (10-1000 chars)
🧠 Memory Management & Conversation Design
Short-Term Memory (Request Scope)
Each request gets a unique correlation ID that tracks the request through all agents:
import uuid
from contextvars import ContextVar

correlation_id_var: ContextVar[str] = ContextVar('correlation_id')

@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    correlation_id = str(uuid.uuid4())
    correlation_id_var.set(correlation_id)
    logger.info("request_start", correlation_id=correlation_id)
    response = await call_next(request)
    return response
Purpose: Debug issues, trace requests, performance analysis
Long-Term Memory (Implemented with Qdrant)
Persistent storage using Qdrant vector database with semantic search capabilities:
Three Qdrant Collections:
- users: Authentication data, user profiles, usage statistics (total_queries, total_cost_usd)
- user_queries: Query history with 384-dimensional semantic embeddings for similarity search
- user_feedback: User feedback on recommendations for continuous improvement
Semantic Search Implementation:
import uuid

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct

class UserMemoryStore:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim
        self.client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

    def store_query(self, user_id, query, recommendations, tokens_used, cost_usd):
        # Generate semantic embedding
        query_embedding = self.embedding_model.encode(query).tolist()
        # Store with vector for similarity search
        self.client.upsert(
            collection_name="user_queries",
            points=[PointStruct(
                id=str(uuid.uuid4()),  # each point needs a unique ID
                vector=query_embedding,
                payload={
                    "user_id": user_id,
                    "query": query,
                    "recommendations": recommendations,
                    "tokens_used": tokens_used,
                    "cost_usd": cost_usd,
                }
            )]
        )

    def search_similar_queries(self, user_id, query, limit=5):
        # Find semantically similar past queries for this user
        query_embedding = self.embedding_model.encode(query).tolist()
        results = self.client.search(
            collection_name="user_queries",
            query_vector=query_embedding,
            query_filter=Filter(must=[
                FieldCondition(key="user_id", match=MatchValue(value=user_id))
            ]),
            limit=limit
        )
        return results  # Scored points with similarity scores
Enabled Features:
- Query History: "You asked something similar 2 days ago for a chat app"
- Semantic Search: Find related queries even with different wording
- User Statistics: Track total queries, cumulative cost per user
- Feedback Loop: Store and analyze user feedback on recommendations
- Cost Tracking: Monitor per-user API costs for budget controls
Multi-Turn Conversations (Implemented)
Conversation Manager agent enables intelligent multi-turn dialogues with session-based memory:
SessionStore Implementation:
import time
import uuid

_sessions: dict[str, dict] = {}

class SessionStore:
    """In-memory short-term conversation memory (30-minute timeout)"""

    @staticmethod
    def create_session(user_id: str) -> str:
        session_id = str(uuid.uuid4())
        _sessions[session_id] = {
            "user_id": user_id,
            "conversation_history": [],   # All messages in conversation
            "extracted_context": {},      # Accumulated project requirements
            "completion_percentage": 0,   # How much info gathered
            "ready_for_recommendation": False
        }
        return session_id

    @staticmethod
    def add_message(session_id: str, role: str, content: str):
        session = _sessions[session_id]  # Look up the session first
        session["conversation_history"].append({
            "role": role,
            "content": content,
            "timestamp": time.time()
        })
Conversation Flow:
- User starts conversation: "I need a tech stack for my project"
- Agent asks follow-up: "How many daily active users do you expect?"
- User responds: "Around 100K users"
- Agent continues: "What type of data will you be storing?"
- Context accumulates: extracted_context = {"dau": 100000, "data_type": "..."}
- Completion tracked: completion_percentage increases from 0% → 100%
- Ready signal: When ready_for_recommendation = True, system generates full recommendation
Enabled Multi-Turn Queries:
- "What if I increase the budget to $1000?" → Updates context, regenerates recommendations
- "Can you recommend alternatives to PostgreSQL?" → Refines database recommendations
- "How would this change for 1M users instead?" → Re-runs all agents with new scale
Note: Production systems should migrate from in-memory SessionStore to Redis for persistence across server restarts and multi-instance deployments.
🔐 Authentication & Security
Why Authentication Was Necessary
- Cost Control: Prevent abuse of expensive LLM API calls (~$0.0015/query)
- Rate Limiting: Enforce per-user limits instead of per-IP
- Audit Trail: Track who makes what requests for debugging
- Feature Access: Enable user profiles, query history, saved recommendations
- Admin Features: Manage users, view feedback, monitor system health
Authentication Implementation
JWT Tokens
Stateless authentication with 24-hour expiration. Tokens include user email and role (user/admin).
Password Security
bcrypt hashing with salt rounds. Passwords never stored in plain text or logged.
Google OAuth 2.0
Social login with state parameter for CSRF protection. User passwords stay at Google.
Rate Limiting
Per-user limits: 100 req/hour authenticated vs 50 req/hour demo mode.
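The project uses PyJWT for token handling; as an illustration of what HS256 signing and verification do under the hood, here is a stdlib-only sketch (not the production code):

```python
import base64
import hashlib
import hmac
import json
import time

def _b64(data: bytes) -> str:
    # URL-safe base64 without padding, as JWT requires
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict, secret: str) -> str:
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64(json.dumps(payload).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64(sig)}"

def verify_jwt(token: str, secret: str):
    """Return the claims dict, or None if the signature or expiry is invalid."""
    header, body, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64(expected), sig):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))
    if payload.get("exp", float("inf")) < time.time():
        return None
    return payload
```
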
Security Measures
- Input Validation: Pydantic schemas validate all inputs
- XSS Prevention: Content-Security-Policy headers
- CSRF Protection: JWT tokens + OAuth state parameter
- CORS Configuration: Restrict origins in production
- SQL Injection: Parameterized queries with SQLAlchemy
Rate Limiting Implementation (SlowAPI)
The system implements comprehensive rate limiting using SlowAPI, a FastAPI rate-limiting extension that tracks requests in a sliding time window with in-memory storage by default. This protects against abuse and controls API costs.
Architecture
# backend/src/api/main.py
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
# Initialize limiter with IP-based tracking
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# Register exception handler for HTTP 429 responses
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
Configuration
# backend/src/core/config.py
class Settings(BaseSettings):
    rate_limit_demo: str = "50/hour"            # Demo/unauthenticated users
    rate_limit_authenticated: str = "100/hour"  # Authenticated users
    daily_query_cap: int = 100                  # Daily limit per user
Applied to Endpoints
@app.post("/recommend")
@limiter.limit(settings.rate_limit_demo)  # 50 requests/hour by IP
async def get_recommendation(request: Request, req: RecommendationRequest):
    # Endpoint logic
    pass

@app.post("/generate-diagram")
@limiter.limit(settings.rate_limit_demo)
async def generate_architecture_diagram(request: Request, req: dict):
    pass

@app.post("/conversation/start")
@limiter.limit(settings.rate_limit_demo)
async def start_conversation(request: Request):
    pass
How It Works
- IP-Based Tracking: get_remote_address extracts the client IP from request headers
- Sliding Window Algorithm: Tracks requests per IP in a time window (e.g., the last hour)
- Automatic Enforcement: Returns HTTP 429 (Too Many Requests) with a Retry-After header when the limit is exceeded
- Per-Endpoint Limits: Each decorated endpoint maintains independent rate limits
- In-Memory Storage: Fast lookup with minimal latency (suitable for single-instance deployments)
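Conceptually, the sliding-window check can be sketched as follows — a toy in-process limiter illustrating the idea, not SlowAPI's actual implementation:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Toy per-key sliding-window limiter: keep recent hit timestamps per key."""
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)

    def allow(self, key: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        while q and now - q[0] >= self.window:  # drop hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False  # caller would return HTTP 429 with Retry-After
        q.append(now)
        return True
```
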
Benefits
💰 Cost Control
Prevents LLM API cost spiral from excessive requests
🛡️ Abuse Prevention
Protects against denial-of-service attempts
⚖️ Fair Resource Allocation
Ensures equitable access among all users
🚀 Production-Ready
Battle-tested library with minimal overhead
⚙️ Configurable
Different limits for demo vs authenticated users (50/hour vs 100/hour)
Limitations & Future Enhancements
- In-Memory Storage: Limits reset on server restart; consider Redis backend for production clusters
- IP-Based Only: Sophisticated users can bypass with IP rotation; consider user-based limits
- No Distributed Sync: Multi-instance deployments need shared state (Redis/Memcached)
📈 Performance & Scalability
Current Performance Metrics
| Metric | Value | Notes |
|---|---|---|
| Total Response Time | 2-4 seconds | Includes all agents + parsing |
| LLM Latency | ~3.3 seconds | 99.5% of total time |
| RAG Search | ~30ms | Vector search across 34 docs |
| Query Parsing | 1-5ms | NLP extraction |
| Tokens Per Query | ~6,250 | Across all agents |
| Cost Per Query | $0.0015 | Claude Haiku pricing |
Bottleneck Analysis
Current Architecture (Sequential):
- Parse Query: 5ms
- Database Agent: 800ms
- Infrastructure Agent: 900ms
- Cost Agent: 850ms
- Security Agent: 700ms
- Total: 3,255ms
Optimized Architecture (Parallel - Future):
- Parse Query: 5ms
- All Agents (Parallel): 900ms (slowest agent)
- Total: 905ms (3.7× faster!)
Scalability Analysis
| Load Level | Requests/Day | Monthly Cost | Infrastructure |
|---|---|---|---|
| Demo | 100 | $4.50 API + $5 hosting = $9.50 | Single Railway instance |
| Small Business | 1,000 | $45 API + $5 hosting = $50 | Single Railway instance |
| Growing Startup | 10,000 | $450 API + $25 hosting = $475 | 2-3 Railway instances + load balancer |
| Enterprise | 100,000 | $4,500 API + $500 infrastructure = $5,000 | Kubernetes cluster, Redis cache |
📊 Monitoring & Observability
The Tech Stack Advisor includes comprehensive monitoring capabilities with Prometheus-format metrics, structured logging, and Grafana Cloud integration for production-grade observability.
Prometheus Metrics Endpoint
The system exposes metrics at /metrics/prometheus in Prometheus format for seamless integration with monitoring systems:
# Access Prometheus metrics (requires JWT authentication)
curl http://localhost:8000/metrics/prometheus \
-H "Authorization: Bearer <your-jwt-token>"
Available Metrics
HTTP Request Metrics
http_requests_total{method, endpoint, status_code} - Total HTTP requests counter with labels for method, endpoint, and status code
http_request_duration_seconds{method, endpoint} - HTTP request duration histogram for calculating p50, p95, p99 latencies
LLM Usage & Cost Tracking
llm_tokens_total{agent, token_type} - Token usage by agent (input/output tokens)
llm_cost_usd_total{agent} - Cumulative API cost per agent in USD
llm_requests_total{agent, status} - LLM request count by agent and status (success/error)
llm_daily_tokens - Daily token usage gauge (resets at midnight UTC)
llm_daily_cost_usd - Daily cost in USD gauge
llm_daily_queries - Daily query count gauge
Application Metrics
active_conversation_sessions - Number of active conversation sessions
user_registrations_total{oauth_provider} - Total user registrations by OAuth provider (local/google)
user_logins_total{oauth_provider} - Total user logins by provider
recommendations_total{status, authenticated} - Total recommendations generated with status and auth labels
Grafana Cloud Integration
The application integrates seamlessly with Grafana Cloud for real-time monitoring dashboards and alerting. The free tier provides:
📊 Metrics Storage
10,000 metric series with 14-day retention
📈 Real-time Dashboards
Customizable dashboards for HTTP, LLM, and application metrics
🔔 Alerting
Alert on cost thresholds, error rates, and latency spikes
💰 Cost
$0/month for free tier (suitable for demo/small projects)
Setup Guide: See GRAFANA_CLOUD_SETUP.md for complete configuration instructions. (Private repo - request access if needed)
Example PromQL Queries
Common queries for monitoring the application in Grafana:
# Request rate (requests per second)
rate(http_requests_total[5m])
# P95 latency across all endpoints
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Daily LLM cost tracking
llm_daily_cost_usd
# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Token usage by agent
sum by (agent) (llm_tokens_total)
# Active sessions gauge
active_conversation_sessions
Structured Logging
All logs are emitted in structured JSON format using structlog with correlation IDs for request tracing:
{
  "event": "recommendation_generated",
  "correlation_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "user_id": "user@example.com",
  "tokens_used": 6250,
  "cost_usd": 0.0015,
  "duration_ms": 3245,
  "timestamp": "2024-01-15T10:30:45.123Z"
}
Benefits:
- Request Tracing: Correlation IDs track requests through all agents and services
- Debugging: Structured logs enable powerful filtering and aggregation (e.g., "show all errors for correlation_id X")
- Performance Analysis: Track duration and cost for individual requests
- Cost Control: Monitor per-user API costs and daily spending
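The project emits these records via structlog; a stdlib-only sketch of producing the same record shape (field names mirror the example log line above):

```python
import json
import time

def make_log_record(event: str, correlation_id: str, **fields) -> str:
    """Emit one structured JSON log line with event, correlation ID, and timestamp."""
    record = {
        "event": event,
        "correlation_id": correlation_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        **fields,
    }
    return json.dumps(record)
```
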
🛠️ Technology Stack
Backend
Python 3.11+
FastAPI
Pydantic
LangChain
LangGraph
Anthropic Claude
sentence-transformers
Qdrant
structlog
slowapi
bcrypt
PyJWT
Frontend
HTML5
CSS3
JavaScript (ES6+)
JWT localStorage
Development & Testing
pytest
mypy
ruff
uvicorn
Infrastructure
Railway
GitHub Auto-deploy
SSL/HTTPS
SQLite
📚 Lessons Learned
1. Simplicity Wins
Vanilla JavaScript over React saved weeks of complexity. No build step means faster iteration and simpler deployment.
2. Cost-Conscious Architecture
Choosing Claude Haiku over GPT-4 saved 95% on API costs without sacrificing quality. Always benchmark cheaper alternatives.
3. Dependency Hell is Real
The NumPy 2.0 breaking change taught us to pin major versions and test upgrades carefully.
4. Platform Matters
Railway's $5/month hobby plan is worth it vs fighting with free tier limits. Developer time is expensive.
5. Multi-Agent Design
Specialized agents with focused prompts outperform monolithic prompts for complex tasks.
6. Authentication is Non-Negotiable
Even for "free" services, authentication prevents abuse and enables valuable features like personalization.
7. Monitor Everything
Correlation IDs, structured logging, and cost tracking saved countless debugging hours.