# Grafana Cloud Monitoring Setup Guide
A complete guide to setting up free monitoring for Tech Stack Advisor using Grafana Cloud and Prometheus metrics.
## What You Get (100% Free Forever)
- 10,000 metric series
- 50 GB logs/month
- 50 GB traces/month
- 14-day retention
- Beautiful dashboards
- Real-time alerts
- **Cost: $0/month**
---
## Step 1: Create Grafana Cloud Account (5 minutes)
### 1.1 Sign Up
1. Visit https://grafana.com/auth/sign-up/create-user
2. Fill in:
   - Email
   - Username
   - Company (can be "Personal" or your name)
3. Click "Create Free Account"
4. Verify your email
### 1.2 Create Your Stack
1. After login, you'll be prompted to create a stack
2. Choose a stack name (e.g., `tech-stack-advisor`)
3. Select the region closest to your Railway deployment
4. Click "Create Stack"
---
## Step 2: Configure Prometheus Data Source (3 minutes)
### 2.1 Get Your Prometheus Credentials
1. In Grafana Cloud dashboard, go to **Connections** → **Add new connection**
2. Search for "Prometheus"
3. Click on "Hosted Prometheus"
4. You'll see your credentials:
```
Remote Write URL: https://prometheus-prod-XX-prod-XX-XX.grafana.net/api/prom/push
Username: XXXXXX
Password: glc_XXXXXXXXXXXXXXXXXXXXXXXXXXXX
```
**Save these credentials securely!**
### 2.2 Test the Metrics Endpoint
First, verify your app is exposing metrics locally:
```bash
# Start your app if not running
cd /path/to/tech-stack-advisor  # your local checkout of the project
source .venv/bin/activate
python -m backend.src.api.main
# In another terminal, test the metrics endpoint
curl http://localhost:8000/metrics/prometheus
```
You should see output like:
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/",method="GET",status_code="200"} 5.0
# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
...
```
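If you would rather check the response programmatically than by eye, the following is a minimal sketch that parses the text format shown above. The `EXPECTED` set and the function names are illustrative assumptions, not part of the app; adjust them to the metrics you care about.

```python
import re

# Metric names taken from the sample output above; extend as needed (assumption).
EXPECTED = {"http_requests_total", "http_request_duration_seconds"}

# A sample line looks like: name{label="value",...} 5.0  (labels are optional).
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def declared_metrics(text):
    """Return {metric_name: type} collected from the '# TYPE' lines."""
    types = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


def sample_count(text):
    """Count sample lines (everything that is not a comment or blank)."""
    return sum(1 for line in text.splitlines() if SAMPLE_LINE.match(line))


def missing_metrics(text, expected=EXPECTED):
    """List expected metric names that have no '# TYPE' declaration."""
    return sorted(set(expected) - set(declared_metrics(text)))


SAMPLE = (
    "# HELP http_requests_total Total HTTP requests\n"
    "# TYPE http_requests_total counter\n"
    'http_requests_total{endpoint="/",method="GET",status_code="200"} 5.0\n'
)
print(declared_metrics(SAMPLE), sample_count(SAMPLE), missing_metrics(SAMPLE))
```

To use it against a running app, pass the body returned by `curl http://localhost:8000/metrics/prometheus` to `missing_metrics`; an empty list means every expected metric is declared.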
---
## Step 3: Configure Grafana to Scrape Your Metrics (2 methods)
### Method A: Using Grafana Agent (Recommended for Railway)
#### 3.1 Install Grafana Agent on Your Machine (for testing)
**On macOS:**
```bash
brew install grafana-agent
```
**On Linux:**
```bash
# Download, unzip, and install (release assets are published as .zip archives)
wget https://github.com/grafana/agent/releases/latest/download/grafana-agent-linux-amd64.zip
unzip grafana-agent-linux-amd64.zip
chmod +x grafana-agent-linux-amd64
sudo mv grafana-agent-linux-amd64 /usr/local/bin/grafana-agent
```
#### 3.2 Create Agent Configuration
Create `grafana-agent-config.yaml`:
```yaml
server:
  log_level: info

metrics:
  global:
    scrape_interval: 60s
    remote_write:
      - url: https://prometheus-prod-XX-prod-XX-XX.grafana.net/api/prom/push
        basic_auth:
          username: XXXXXX
          password: glc_XXXXXXXXXXXXXXXXXXXXXXXXXXXX
  configs:
    - name: tech-stack-advisor
      scrape_configs:
        - job_name: 'fastapi-app'
          static_configs:
            - targets: ['localhost:8000']
          metrics_path: '/metrics/prometheus'
          scrape_interval: 30s
```
Replace `url`, `username`, and `password` with your actual Grafana Cloud credentials.
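To avoid pasting secrets into a file that might end up in version control, one option is to render the config from environment variables. The sketch below assumes three hypothetical variable names (`GRAFANA_PROM_URL`, `GRAFANA_PROM_USERNAME`, `GRAFANA_PROM_PASSWORD`); they are not defined by Grafana or by this project.

```python
from string import Template

# Same structure as grafana-agent-config.yaml, with placeholders for credentials.
CONFIG_TEMPLATE = Template("""\
server:
  log_level: info

metrics:
  global:
    scrape_interval: 60s
    remote_write:
      - url: $url
        basic_auth:
          username: $username
          password: $password
  configs:
    - name: tech-stack-advisor
      scrape_configs:
        - job_name: 'fastapi-app'
          static_configs:
            - targets: ['localhost:8000']
          metrics_path: '/metrics/prometheus'
          scrape_interval: 30s
""")


def render_config(env):
    """Fill the template from a mapping such as os.environ.

    Raises KeyError if a variable is missing, so a half-filled config
    is never written. The variable names are hypothetical.
    """
    return CONFIG_TEMPLATE.substitute(
        url=env["GRAFANA_PROM_URL"],
        username=env["GRAFANA_PROM_USERNAME"],
        password=env["GRAFANA_PROM_PASSWORD"],
    )


# Typical use (not executed here):
#   import os
#   with open("grafana-agent-config.yaml", "w") as fh:
#       fh.write(render_config(os.environ))
```

Add the generated `grafana-agent-config.yaml` to `.gitignore` so the rendered credentials stay out of the repository.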
#### 3.3 Run Grafana Agent
```bash
grafana-agent --config.file=grafana-agent-config.yaml
```
Keep this running in a terminal. It scrapes metrics every 30 seconds and sends them to Grafana Cloud.
---
### Method B: Using Grafana Cloud Integration (for Railway Production)
For Railway deployment, you'll need to make your `/metrics/prometheus` endpoint publicly accessible.
#### 3.1 Deploy to Railway
```bash
# Push your code to GitHub (metrics already implemented!)
git add .
git commit -m "Add Prometheus metrics for Grafana Cloud"
git push
# Railway will auto-deploy
```
#### 3.2 Get Your Railway URL
After deployment, get your app URL from Railway dashboard:
```
https://your-app-name.up.railway.app
```
#### 3.3 Configure Grafana Cloud to Scrape
In Grafana Cloud:
1. Go to **Connections** → **Add new connection**
2. Search for "Prometheus"
3. Click "Configure"
4. Add scrape config:
```yaml
scrape_configs:
  - job_name: 'tech-stack-advisor-railway'
    scheme: https  # Railway URLs are served over HTTPS; the default scheme is http
    scrape_interval: 60s
    scrape_timeout: 30s
    metrics_path: /metrics/prometheus
    static_configs:
      - targets:
          - your-app-name.up.railway.app
```
5. Click "Save & Test"
---
## Step 4: Verify Data is Flowing (2 minutes)
### 4.1 Check Grafana Explore
1. In Grafana Cloud, go to **Explore** (compass icon)
2. Select your Prometheus data source
3. In the metrics browser, search for: `http_requests_total`
4. Click "Run Query"
You should see a graph with your metrics!
### 4.2 Test All Metrics
Try these queries in Explore:
**HTTP Requests by Endpoint:**
```promql
rate(http_requests_total[5m])
```
**Request Duration P95:**
```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
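For intuition about what the P95 query returns: `histogram_quantile` finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below is a simplified re-implementation for illustration only. It works on raw cumulative counts with made-up bucket bounds, whereas the real query operates on the per-second `rate()` of each bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, including the
    (math.inf, total_count) bucket, mirroring the *_bucket series.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0 or len(buckets) < 2:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical data: 100 requests, 90 under 0.5 s, 98 under 1 s.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

With this data the 95th request falls in the 0.5 s to 1.0 s bucket, so the estimate lands between those bounds (about 0.81 s). The result is only as precise as the bucket boundaries allow.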
**LLM Daily Cost:**
```promql
llm_daily_cost_usd
```
**Active Conversation Sessions:**
```promql
active_conversation_sessions
```
---
## Step 5: Create Dashboards (10 minutes)
### 5.1 Import Pre-built Dashboard Template
1. Go to **Dashboards** → **Import**
2. Use the template below or create your own
### 5.2 Custom Dashboard JSON (Copy & Import)
Save this as `tech-stack-advisor-dashboard.json`:
```json
{
  "dashboard": {
    "title": "Tech Stack Advisor Monitoring",
    "panels": [
      {
        "title": "HTTP Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Duration P95",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "LLM Daily Cost",
        "targets": [
          {
            "expr": "llm_daily_cost_usd"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Active Sessions",
        "targets": [
          {
            "expr": "active_conversation_sessions"
          }
        ],
        "type": "stat"
      }
    ]
  }
}
```
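If you expect to add panels over time, generating the JSON from a short list can be easier than editing it by hand. This is a minimal sketch that reproduces the structure above; the `build_dashboard` helper is an illustrative name, not a Grafana API.

```python
import json

# (title, PromQL expression, panel type) for each panel in the template above.
PANELS = [
    ("HTTP Request Rate", "rate(http_requests_total[5m])", "graph"),
    (
        "Request Duration P95",
        "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
        "graph",
    ),
    ("LLM Daily Cost", "llm_daily_cost_usd", "stat"),
    ("Active Sessions", "active_conversation_sessions", "stat"),
]


def build_dashboard(title, panels):
    """Return a dict with the same shape as tech-stack-advisor-dashboard.json."""
    return {
        "dashboard": {
            "title": title,
            "panels": [
                {"title": name, "targets": [{"expr": expr}], "type": kind}
                for name, expr, kind in panels
            ],
        }
    }


dashboard = build_dashboard("Tech Stack Advisor Monitoring", PANELS)
print(json.dumps(dashboard, indent=2))
```

Redirect the output to `tech-stack-advisor-dashboard.json` and import it as described in 5.1.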
---
## Step 6: Set Up Alerts (5 minutes)
### 6.1 Create Alert Rules
1. Go to **Alerting** → **Alert rules**
2. Click "New alert rule"
### 6.2 Example Alert: High Error Rate
**Alert Name:** High HTTP Error Rate
**Query:**
```promql
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
```
**Condition:** Alert when error rate > 5%
**Notification:** Email or Slack
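To see what the alert expression evaluates, the sketch below computes the same ratio from two snapshots of the `http_requests_total` counters taken five minutes apart. It is a simplification with made-up numbers: the real `rate()` also extrapolates to the window edges, and the division yields no result (rather than 0) when there is no traffic.

```python
import re


def per_second_rate(prev, curr, seconds):
    """Simplified counter rate: increase over the window divided by its length.

    A drop in value is treated as a counter reset (the counter restarted at 0).
    """
    increase = curr - prev if curr >= prev else curr
    return increase / seconds


def error_ratio(prev, curr, seconds=300):
    """prev/curr map status_code -> cumulative request count, `seconds` apart."""
    total = errors = 0.0
    for code, value in curr.items():
        rate = per_second_rate(prev.get(code, 0.0), value, seconds)
        total += rate
        if re.fullmatch(r"5..", code):  # same matcher as status_code=~"5.."
            errors += rate
    return errors / total if total else 0.0


# Hypothetical snapshots: 900 new 200s and 60 new 500s in five minutes.
before = {"200": 1000, "500": 10}
after = {"200": 1900, "500": 70}
ratio = error_ratio(before, after)
print(f"error ratio: {ratio:.4f}, alert fires: {ratio > 0.05}")
```

Here 60 of 960 new requests failed, so the ratio is 0.0625 and the 5% threshold is crossed.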
### 6.3 Example Alert: Daily Budget Exceeded
**Alert Name:** Daily Budget Exceeded
**Query:**
```promql
llm_daily_cost_usd > 1.8
```
**Condition:** Alert when cost > 90% of $2 budget
---
## Available Metrics Reference
### HTTP Metrics
- `http_requests_total{method, endpoint, status_code}` - Total HTTP requests
- `http_request_duration_seconds{method, endpoint}` - Request duration histogram
### LLM Metrics
- `llm_tokens_total{agent, token_type}` - Token usage by agent
- `llm_cost_usd_total{agent}` - Cost by agent
- `llm_requests_total{agent, status}` - LLM request count
- `llm_daily_tokens` - Daily token usage
- `llm_daily_cost_usd` - Daily cost
- `llm_daily_queries` - Daily query count
### Application Metrics
- `active_conversation_sessions` - Active conversation count
- `user_registrations_total{oauth_provider}` - User registrations
- `user_logins_total{oauth_provider}` - User logins
- `recommendations_total{status, authenticated}` - Recommendations generated
---
## Useful Dashboard Queries
### API Performance
**Request Rate (requests/second):**
```promql
sum(rate(http_requests_total[5m]))
```
**P50, P95, P99 Latency:**
```promql
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```
**Error Rate Percentage:**
```promql
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
```
### Cost Monitoring
**Daily Cost Trend:**
```promql
llm_daily_cost_usd
```
**Cost by Agent (if implemented):**
```promql
sum(llm_cost_usd_total) by (agent)
```
**Budget Utilization:**
```promql
(llm_daily_cost_usd / 2.0) * 100
```
### User Activity
**Active Users (sessions):**
```promql
active_conversation_sessions
```
**New Registrations (per hour):**
```promql
rate(user_registrations_total[1h]) * 3600
```
**Logins per Registered User (local accounts):**
```promql
sum(user_logins_total{oauth_provider="local"}) / sum(user_registrations_total{oauth_provider="local"})
```
---
## Troubleshooting
### Metrics Not Showing Up
**Check 1: Verify endpoint is accessible**
```bash
curl https://your-app-name.up.railway.app/metrics/prometheus
```
**Check 2: Verify Grafana Agent is running**
```bash
ps aux | grep grafana-agent
```
**Check 3: Check agent logs**
```bash
# If using systemd
journalctl -u grafana-agent -f
```
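The first check can also be scripted so it returns a short hint instead of raw output. This is a minimal sketch using only the standard library; `diagnose` and `check` are illustrative names, and the URL is the placeholder used throughout this guide.

```python
import urllib.error
import urllib.request


def diagnose(status, body):
    """Turn an HTTP status and response body into a short hint."""
    if status != 200:
        return f"HTTP {status}: endpoint not reachable at this path, or it requires auth"
    if "# TYPE" not in body:
        return "HTTP 200 but no '# TYPE' lines: response is not Prometheus text format"
    return "OK: endpoint returns Prometheus metrics"


def check(url, timeout=10):
    """Fetch the metrics URL and describe what came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return diagnose(resp.status, body)
    except urllib.error.HTTPError as exc:  # 4xx/5xx responses
        return diagnose(exc.code, "")
    except OSError as exc:  # DNS failure, refused connection, timeout
        return f"Connection failed: {exc}"


# Typical use (not executed here, replace the placeholder host):
#   print(check("https://your-app-name.up.railway.app/metrics/prometheus"))
```

A "not Prometheus text format" result usually means the path is wrong and the app served an HTML page instead.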
### No Data in Grafana
**Check 1: Verify data source connection**
- Go to **Connections** → **Data sources**
- Click on your Prometheus data source
- Click "Save & Test"
**Check 2: Check scrape targets**
- In Grafana Explore, run: `up{job="tech-stack-advisor-railway"}`
- Should show `1` if scraping successfully
### High Memory Usage
If Grafana Agent uses too much memory, reduce scrape frequency:
```yaml
scrape_interval: 120s # Changed from 60s
```
---
## Cost Optimization Tips
### Stay Within Free Tier
Grafana Cloud free tier limits:
- 10,000 series
- 50 GB logs/month
- 14-day retention
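**Current metrics count:** ~15 metrics. Each label combination and histogram bucket is stored as its own series, so the active series count will be higher than 15, but it should remain well within the limit for an app of this size.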
To check your usage:
1. Go to **Administration** → **Usage Insights**
2. Monitor "Active Series" count
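Active series grow with label combinations rather than with the number of metric names, so a rough estimate helps before adding new labels. The sketch below multiplies out label cardinalities; the cardinalities and the bucket count are hypothetical numbers chosen for illustration, not measurements from this app.

```python
from math import prod


def series_for(kind, label_cardinalities, buckets=0):
    """Estimate the series one metric produces.

    kind: "counter", "gauge", or "histogram"
    label_cardinalities: number of distinct values for each label
    buckets: number of `le` buckets including +Inf (histograms only)
    """
    combos = prod(label_cardinalities)  # prod of an empty list is 1
    if kind == "histogram":
        # One series per bucket, plus the _sum and _count series.
        return combos * (buckets + 2)
    return combos


# Hypothetical cardinalities: 3 methods, 10 endpoints, 4 status codes.
estimate = {
    "http_requests_total": series_for("counter", [3, 10, 4]),
    "http_request_duration_seconds": series_for("histogram", [3, 10], buckets=15),
    "llm_daily_cost_usd": series_for("gauge", []),
}
print(estimate, "total:", sum(estimate.values()))
```

Under these assumptions two HTTP metrics alone account for several hundred series, which is why high-cardinality labels (user IDs, raw URLs) are worth avoiding.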
### Reduce Scrape Frequency
If approaching limits:
- Change `scrape_interval` from 30s to 60s or 120s
- Still get good visibility with less frequent updates
---
## Next Steps
### Recommended Dashboards to Create
1. **API Performance Dashboard**
   - Request rate
   - Latency percentiles
   - Error rate
   - Active sessions
2. **Cost Monitoring Dashboard**
   - Daily cost gauge
   - Cost trend graph
   - Budget utilization
   - Token usage
3. **User Activity Dashboard**
   - New registrations
   - Login activity
   - Active conversations
   - Recommendations generated
### Recommended Alerts
1. Error rate > 5%
2. P95 latency > 5 seconds
3. Daily cost > 90% of budget
4. API down (no requests for 5 minutes)
---
## Support & Resources
- **Grafana Docs:** https://grafana.com/docs/
- **Prometheus Query Language:** https://prometheus.io/docs/prometheus/latest/querying/basics/
- **Grafana Community:** https://community.grafana.com/
---
## Summary
You now have:
- ✅ Prometheus metrics endpoint (`/metrics/prometheus`)
- ✅ Grafana Cloud account (free tier)
- ✅ Metrics being scraped and stored
- ✅ Real-time dashboards
- ✅ Alerting configured
- ✅ **Total cost: $0/month**
**Enjoy professional-grade monitoring for free!**