GRAFANA_CLOUD_SETUP.md

# Grafana Cloud Monitoring Setup Guide

A complete guide to setting up free monitoring for Tech Stack Advisor using Grafana Cloud and Prometheus metrics.

## What You Get (100% Free Forever)

- 10,000 metric series
- 50 GB logs/month
- 50 GB traces/month
- 14-day retention
- Beautiful dashboards
- Real-time alerts
- **Cost: $0/month**

---

## Step 1: Create Grafana Cloud Account (5 minutes)

### 1.1 Sign Up

1. Visit https://grafana.com/auth/sign-up/create-user
2. Fill in:
   - Email
   - Username
   - Company (can be "Personal" or your name)
3. Click "Create Free Account"
4. Verify your email

### 1.2 Create Your Stack

1. After login, you'll be prompted to create a stack
2. Choose a stack name (e.g., `tech-stack-advisor`)
3. Select region closest to your Railway deployment
4. Click "Create Stack"

---

## Step 2: Configure Prometheus Data Source (3 minutes)

### 2.1 Get Your Prometheus Credentials

1. In Grafana Cloud dashboard, go to **Connections** → **Add new connection**
2. Search for "Prometheus"
3. Click on "Hosted Prometheus"
4. You'll see your credentials:

```
Remote Write URL: https://prometheus-prod-XX-prod-XX-XX.grafana.net/api/prom/push
Username: XXXXXX
Password: glc_XXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

**Save these credentials securely!**
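
The username and password are sent to the remote write URL as HTTP Basic credentials (the agent config later in this guide puts them under `basic_auth`). As a minimal sketch of what that means on the wire, the helper below builds the `Authorization` header value; the function name is hypothetical and the values in the example are placeholders, not real credentials.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP `Authorization` header for Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Placeholder values only; substitute the ones shown for your stack.
    print(basic_auth_header("XXXXXX", "glc_XXXX"))
```

Because the header is only base64-encoded, not encrypted, anyone who sees it can recover the password, which is why the credentials should never be committed to the repository.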

### 2.2 Test the Metrics Endpoint

First, verify your app is exposing metrics locally:

```bash
# Start your app if not running
cd /path/to/tech-stack-advisor  # adjust to wherever you cloned the project
source .venv/bin/activate
python -m backend.src.api.main

# In another terminal, test the metrics endpoint
curl http://localhost:8000/metrics/prometheus
```

You should see output like:
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/",method="GET",status_code="200"} 5.0
# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
...
```
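
To check the endpoint output without reading it by eye, a short script can list the metrics it declares. The sketch below is an illustration only: it looks at `# TYPE` lines in the text format shown above, and the helper names (`declared_metrics`, `missing_metrics`) are hypothetical, not part of the app.

```python
def declared_metrics(exposition_text: str) -> dict[str, str]:
    """Map metric name -> metric type, read from `# TYPE <name> <type>` lines."""
    metrics = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            metrics[parts[2]] = parts[3]
    return metrics


def missing_metrics(exposition_text: str, expected: list[str]) -> list[str]:
    """Return the expected metric names that the endpoint did not declare."""
    declared = declared_metrics(exposition_text)
    return [name for name in expected if name not in declared]


SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/",method="GET",status_code="200"} 5.0
# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
"""

if __name__ == "__main__":
    print(declared_metrics(SAMPLE))
    print(missing_metrics(SAMPLE, ["http_requests_total", "llm_daily_cost_usd"]))
```

In practice you would feed it the body returned by the `curl` command above instead of `SAMPLE`.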

---

## Step 3: Configure Grafana to Scrape Your Metrics (2 methods)

### Method A: Using Grafana Agent (Recommended for Railway)

#### 3.1 Install Grafana Agent on Your Machine (for testing)

**On macOS:**
```bash
brew install grafana-agent
```

**On Linux:**
```bash
# Download the latest release archive, unpack it, and install the binary
wget https://github.com/grafana/agent/releases/latest/download/grafana-agent-linux-amd64.zip
unzip grafana-agent-linux-amd64.zip
chmod +x grafana-agent-linux-amd64
sudo mv grafana-agent-linux-amd64 /usr/local/bin/grafana-agent
```

#### 3.2 Create Agent Configuration

Create `grafana-agent-config.yaml`:

```yaml
server:
  log_level: info

metrics:
  global:
    scrape_interval: 60s
    remote_write:
      - url: https://prometheus-prod-XX-prod-XX-XX.grafana.net/api/prom/push
        basic_auth:
          username: XXXXXX
          password: glc_XXXXXXXXXXXXXXXXXXXXXXXXXXXX

  configs:
    - name: tech-stack-advisor
      scrape_configs:
        - job_name: 'fastapi-app'
          static_configs:
            - targets: ['localhost:8000']
          metrics_path: '/metrics/prometheus'
          scrape_interval: 30s
```

Replace `url`, `username`, and `password` with your actual Grafana Cloud credentials.
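
If you would rather not type credentials into a file by hand, one option is to generate the config from environment variables. This is a sketch under assumptions: the variable names `GRAFANA_REMOTE_WRITE_URL`, `GRAFANA_USERNAME`, and `GRAFANA_PASSWORD` are hypothetical, and the template simply mirrors the YAML above.

```python
import json
import os

# Mirrors the YAML above; only the three credential fields are filled in.
TEMPLATE = """\
server:
  log_level: info

metrics:
  global:
    scrape_interval: 60s
    remote_write:
      - url: {url}
        basic_auth:
          username: {username}
          password: {password}

  configs:
    - name: tech-stack-advisor
      scrape_configs:
        - job_name: 'fastapi-app'
          static_configs:
            - targets: ['localhost:8000']
          metrics_path: '/metrics/prometheus'
          scrape_interval: 30s
"""


def render_agent_config(url: str, username: str, password: str) -> str:
    """Fill the template; json.dumps yields valid double-quoted YAML scalars."""
    return TEMPLATE.format(
        url=json.dumps(url),
        username=json.dumps(username),
        password=json.dumps(password),
    )


if __name__ == "__main__":
    # Hypothetical variable names; placeholders are used when they are unset.
    config = render_agent_config(
        os.environ.get("GRAFANA_REMOTE_WRITE_URL", "https://example.invalid/api/prom/push"),
        os.environ.get("GRAFANA_USERNAME", "XXXXXX"),
        os.environ.get("GRAFANA_PASSWORD", "glc_XXXX"),
    )
    print(config)
```

Redirect the output to `grafana-agent-config.yaml` and keep that file out of version control.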

#### 3.3 Run Grafana Agent

```bash
grafana-agent --config.file=grafana-agent-config.yaml
```

Keep this running in a terminal. With the job-level `scrape_interval` above, it scrapes metrics every 30 seconds and sends them to Grafana Cloud.

---

### Method B: Using Grafana Cloud Integration (for Railway Production)

For Railway deployment, you'll need to make your `/metrics/prometheus` endpoint publicly accessible.

#### 3.1 Deploy to Railway

```bash
# Push your code to GitHub (metrics already implemented!)
git add .
git commit -m "Add Prometheus metrics for Grafana Cloud"
git push

# Railway will auto-deploy
```

#### 3.2 Get Your Railway URL

After deployment, get your app URL from Railway dashboard:
```
https://your-app-name.up.railway.app
```

#### 3.3 Configure Grafana Cloud to Scrape

In Grafana Cloud:

1. Go to **Connections** → **Add new connection**
2. Search for "Prometheus"
3. Click "Configure"
4. Add scrape config:

```yaml
scrape_configs:
  - job_name: 'tech-stack-advisor-railway'
    scrape_interval: 60s
    scrape_timeout: 30s
    metrics_path: /metrics/prometheus
    scheme: https  # Railway serves the app over HTTPS; the scrape default is http
    static_configs:
      - targets:
          - your-app-name.up.railway.app
```

5. Click "Save & Test"

---

## Step 4: Verify Data is Flowing (2 minutes)

### 4.1 Check Grafana Explore

1. In Grafana Cloud, go to **Explore** (compass icon)
2. Select your Prometheus data source
3. In the metrics browser, search for: `http_requests_total`
4. Click "Run Query"

You should see a graph with your metrics!

### 4.2 Test All Metrics

Try these queries in Explore:

**HTTP Requests by Endpoint:**
```promql
sum by (endpoint) (rate(http_requests_total[5m]))
```

**Request Duration P95:**
```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

**LLM Daily Cost:**
```promql
llm_daily_cost_usd
```

**Active Conversation Sessions:**
```promql
active_conversation_sessions
```
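
The `rate(...[5m])` queries above turn an ever-increasing counter into a per-second rate. The sketch below shows the core idea in simplified form: it handles counter resets but skips the extrapolation to window boundaries that Prometheus also performs, so treat it as an approximation. The function name is hypothetical.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero (for example after a
        # redeploy), so the whole current value counts as new increase.
        increase += cur if cur < prev else cur - prev
    return increase / elapsed


if __name__ == "__main__":
    # Three scrapes, 30 seconds apart: 30 requests in 60 seconds.
    print(simple_rate([(0, 10), (30, 25), (60, 40)]))
```

This is why a counter such as `http_requests_total` is almost always wrapped in `rate()` before graphing: the raw value only ever climbs, and resets on restart.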

---

## Step 5: Create Dashboards (10 minutes)

### 5.1 Import Pre-built Dashboard Template

1. Go to **Dashboards** → **Import**
2. Use the template below or create your own

### 5.2 Custom Dashboard JSON (Copy & Import)

Save this as `tech-stack-advisor-dashboard.json`:

```json
{
  "title": "Tech Stack Advisor Monitoring",
  "panels": [
    {
      "title": "HTTP Request Rate",
      "type": "timeseries",
      "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
      "targets": [
        { "expr": "rate(http_requests_total[5m])" }
      ]
    },
    {
      "title": "Request Duration P95",
      "type": "timeseries",
      "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
      "targets": [
        { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))" }
      ]
    },
    {
      "title": "LLM Daily Cost",
      "type": "stat",
      "gridPos": { "x": 0, "y": 8, "w": 12, "h": 6 },
      "targets": [
        { "expr": "llm_daily_cost_usd" }
      ]
    },
    {
      "title": "Active Sessions",
      "type": "stat",
      "gridPos": { "x": 12, "y": 8, "w": 12, "h": 6 },
      "targets": [
        { "expr": "active_conversation_sessions" }
      ]
    }
  ]
}
```

---

## Step 6: Set Up Alerts (5 minutes)

### 6.1 Create Alert Rules

1. Go to **Alerting** → **Alert rules**
2. Click "New alert rule"

### 6.2 Example Alert: High Error Rate

**Alert Name:** High HTTP Error Rate

**Query:**
```promql
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
```

**Condition:** Alert when error rate > 5%

**Notification:** Email or Slack
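
To make the alert condition concrete, here is the same arithmetic as a small sketch: 5xx request rate divided by total request rate, compared with 5%. The input dictionary and function names are hypothetical; in Grafana the rates come from the query above.

```python
import re


def error_ratio(rate_by_status: dict[str, float]) -> float:
    """Share of requests whose status code matches the regex `5..`."""
    total = sum(rate_by_status.values())
    if total == 0:
        # With no traffic the PromQL division yields NaN or no data, which does
        # not trigger the alert; this sketch returns 0.0 for simplicity.
        return 0.0
    errors = sum(
        rate for code, rate in rate_by_status.items() if re.fullmatch(r"5..", code)
    )
    return errors / total


def should_alert(rate_by_status: dict[str, float], threshold: float = 0.05) -> bool:
    """True when the error ratio is above the threshold (5% by default)."""
    return error_ratio(rate_by_status) > threshold


if __name__ == "__main__":
    rates = {"200": 9.0, "404": 0.4, "500": 0.6}  # requests per second
    print(error_ratio(rates), should_alert(rates))
```

Note that 4xx responses count toward the total but not toward the errors, matching the `status_code=~"5.."` selector.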

### 6.3 Example Alert: Daily Budget Exceeded

**Alert Name:** Daily Budget Exceeded

**Query:**
```promql
llm_daily_cost_usd > 1.8
```

**Condition:** Alert when cost > 90% of $2 budget
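
The `1.8` in the query is simply 90% of the $2 daily budget. If your budget differs, the threshold can be recomputed as in this small sketch (function names are hypothetical):

```python
def alert_threshold(daily_budget_usd: float, fraction: float = 0.9) -> float:
    """Cost in USD at which the budget alert should fire."""
    return daily_budget_usd * fraction


def utilization_percent(daily_cost_usd: float, daily_budget_usd: float) -> float:
    """Share of the daily budget already spent, as a percentage."""
    return daily_cost_usd / daily_budget_usd * 100


if __name__ == "__main__":
    print(alert_threshold(2.0))           # threshold for the $2 budget used in this guide
    print(utilization_percent(1.5, 2.0))  # utilization after spending $1.50
```

Substitute the result of `alert_threshold` for `1.8` in the alert query when you change the budget.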

---

## Available Metrics Reference

### HTTP Metrics

- `http_requests_total{method, endpoint, status_code}` - Total HTTP requests
- `http_request_duration_seconds{method, endpoint}` - Request duration histogram

### LLM Metrics

- `llm_tokens_total{agent, token_type}` - Token usage by agent
- `llm_cost_usd_total{agent}` - Cost by agent
- `llm_requests_total{agent, status}` - LLM request count
- `llm_daily_tokens` - Daily token usage
- `llm_daily_cost_usd` - Daily cost
- `llm_daily_queries` - Daily query count

### Application Metrics

- `active_conversation_sessions` - Active conversation count
- `user_registrations_total{oauth_provider}` - User registrations
- `user_logins_total{oauth_provider}` - User logins
- `recommendations_total{status, authenticated}` - Recommendations generated

---

## Useful Dashboard Queries

### API Performance

**Request Rate (requests/second):**
```promql
sum(rate(http_requests_total[5m]))
```

**P50, P95, P99 Latency:**
```promql
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```
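
Percentiles from a histogram are estimates: `histogram_quantile` finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below reproduces that idea in simplified form (it omits some edge cases that Prometheus handles); the bucket values in the example are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    The list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation between the bucket's lower and upper bound.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    for q in (0.50, 0.95, 0.99):
        print(q, histogram_quantile(q, example))
```

The practical consequence is that latency percentiles are only as precise as the bucket boundaries, and a P99 that lands in the `+Inf` bucket is capped at the largest finite boundary.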

**Error Rate Percentage:**
```promql
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
```

### Cost Monitoring

**Daily Cost Trend:**
```promql
llm_daily_cost_usd
```

**Cost by Agent (if implemented):**
```promql
sum(llm_cost_usd_total) by (agent)
```

**Budget Utilization:**
```promql
(llm_daily_cost_usd / 2.0) * 100
```

### User Activity

**Active Users (sessions):**
```promql
active_conversation_sessions
```

**New Registrations (per hour):**
```promql
rate(user_registrations_total[1h]) * 3600
```

**Logins per Registration (local accounts):**
```promql
sum(user_logins_total{oauth_provider="local"}) / sum(user_registrations_total{oauth_provider="local"})
```

---

## Troubleshooting

### Metrics Not Showing Up

**Check 1: Verify endpoint is accessible**
```bash
curl https://your-app-name.up.railway.app/metrics/prometheus
```

**Check 2: Verify Grafana Agent is running**
```bash
pgrep -fl grafana-agent  # unlike `ps aux | grep`, this does not match itself
```

**Check 3: Check agent logs**
```bash
# If using systemd
journalctl -u grafana-agent -f
```

### No Data in Grafana

**Check 1: Verify data source connection**
- Go to **Connections** → **Data sources**
- Click on your Prometheus data source
- Click "Save & Test"

**Check 2: Check scrape targets**
- In Grafana Explore, run: `up{job="tech-stack-advisor-railway"}` (Method B) or `up{job="fastapi-app"}` (Method A)
- A value of `1` means the last scrape succeeded; `0` means it failed

### High Memory Usage

If Grafana Agent uses too much memory, reduce scrape frequency:

```yaml
scrape_interval: 120s  # was 30s for the fastapi-app job (global default is 60s)
```

---

## Cost Optimization Tips

### Stay Within Free Tier

Grafana Cloud free tier limits:
- 10,000 series
- 50 GB logs/month
- 14-day retention

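**Current metrics count:** ~15 metric names. Each distinct label combination (and each histogram bucket) is stored as its own series, so the real series count is higher, but it should still sit well within the limit.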

To check your usage:
1. Go to **Administration** → **Usage Insights**
2. Monitor "Active Series" count
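
For a rough upper bound on active series before checking the usage page, you can multiply label cardinalities. The sketch below assumes every label combination actually occurs, which overestimates in practice; the cardinalities and bucket count in the example are hypothetical, not measured from the app.

```python
from math import prod


def counter_series(label_cardinalities: list[int]) -> int:
    """Worst-case series for a counter or gauge: one per label combination."""
    return prod(label_cardinalities)


def histogram_series(label_cardinalities: list[int], finite_buckets: int) -> int:
    """Worst-case series for a histogram.

    Each label combination yields one series per finite bucket, plus the
    +Inf bucket, plus the `_sum` and `_count` series.
    """
    return prod(label_cardinalities) * (finite_buckets + 3)


if __name__ == "__main__":
    # Hypothetical cardinalities: 3 methods, 10 endpoints, 4 status codes.
    requests = counter_series([3, 10, 4])
    durations = histogram_series([3, 10], finite_buckets=14)
    print(requests, durations, requests + durations)
```

The takeaway is that labels with many values (user IDs, raw URLs with path parameters) multiply the series count quickly, so keep label values bounded.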

### Reduce Scrape Frequency

If approaching limits:
- Change scrape_interval from 30s to 60s or 120s
- Still get good visibility with less frequent updates
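
Each scrape records one sample per series, so the scrape interval directly sets how many data points are ingested per minute. A minimal sketch of that arithmetic follows (the function name is hypothetical, and how Grafana Cloud counts this against the free tier is not covered in this guide):

```python
def data_points_per_minute(active_series: int, scrape_interval_s: float) -> float:
    """Samples ingested per minute for a given series count and scrape interval."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    return active_series * 60 / scrape_interval_s


if __name__ == "__main__":
    for interval in (30, 60, 120):
        print(interval, data_points_per_minute(15, interval))
```

Doubling the interval halves the ingested samples while leaving the number of series unchanged.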

---

## Next Steps

### Recommended Dashboards to Create

1. **API Performance Dashboard**
   - Request rate
   - Latency percentiles
   - Error rate
   - Active sessions

2. **Cost Monitoring Dashboard**
   - Daily cost gauge
   - Cost trend graph
   - Budget utilization
   - Token usage

3. **User Activity Dashboard**
   - New registrations
   - Login activity
   - Active conversations
   - Recommendations generated

### Recommended Alerts

1. Error rate > 5%
2. P95 latency > 5 seconds
3. Daily cost > 90% of budget
4. API down (no requests for 5 minutes)

---

## Support & Resources

- **Grafana Docs:** https://grafana.com/docs/
- **Prometheus Query Language:** https://prometheus.io/docs/prometheus/latest/querying/basics/
- **Grafana Community:** https://community.grafana.com/

---

## Summary

You now have:
- ✅ Prometheus metrics endpoint (`/metrics/prometheus`)
- ✅ Grafana Cloud account (free tier)
- ✅ Metrics being scraped and stored
- ✅ Real-time dashboards
- ✅ Alerting configured
- ✅ **Total cost: $0/month**

**Enjoy professional-grade monitoring for free!**