ML Model Serving Protocol Comparison

A comprehensive benchmark comparing REST, gRPC, and GraphQL for serving machine learning models

Project Motivation

When deploying machine learning models in production, choosing the right API protocol can significantly impact performance, scalability, and developer experience. This project aims to provide data-driven insights to help teams make informed decisions.

Goals

Current Status: This benchmark compares three popular API protocols (REST, gRPC, and GraphQL) for serving machine learning models, specifically text embeddings using the SentenceTransformer model.

Note: All implementations currently use HTTP/2. HTTP/1.1 comparison is planned for future work.

System Architecture

Executive Summary

Key Finding: For this ML serving use case, gRPC delivers roughly 30x lower latency and about 3x higher throughput than REST and GraphQL.

All implementations use HTTP/2 and the same underlying ML model (SentenceTransformer all-MiniLM-L6-v2) to ensure fair comparison.
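One likely driver of the gap is encoding: Protocol Buffers serialize a 384-dimensional embedding as packed binary floats, while REST and GraphQL return the same vector as JSON text. The stdlib-only sketch below illustrates the size difference; `struct.pack` stands in for protobuf's packed `repeated float` wire format (which adds only a few bytes of framing), so no protobuf dependency is needed.

```python
import json
import random
import struct

# A 384-dimensional embedding, the shape produced by all-MiniLM-L6-v2.
embedding = [random.uniform(-1.0, 1.0) for _ in range(384)]

# REST/GraphQL: floats serialized as decimal text inside a JSON array.
json_payload = json.dumps({"embedding": embedding}).encode("utf-8")

# gRPC: packed 32-bit floats, approximating protobuf's packed encoding.
binary_payload = struct.pack(f"{len(embedding)}f", *embedding)

print(f"JSON:   {len(json_payload)} bytes")
print(f"Binary: {len(binary_payload)} bytes")  # 384 * 4 = 1536 bytes
```

The binary payload is a fixed 1,536 bytes; the JSON form is several times larger because each float costs roughly 18-20 characters of text.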

Performance at a Glance

  Metric         REST   GraphQL   gRPC
  P50 Latency    --     --        --
  P95 Latency    --     --        --
  Throughput     --     --        --
  Success Rate   --     --        --

(Values are populated by a benchmark run; see Reproduce These Results below.)
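The metrics above follow directly from the raw per-request latencies each run records. A stdlib-only sketch of the arithmetic (the sample latencies here are illustrative, not measured values):

```python
import statistics

# Hypothetical per-request latencies in milliseconds from one 60 s run.
latencies_ms = [12, 15, 11, 14, 90, 13, 16, 12, 110, 14]
test_duration_s = 60.0

# quantiles(n=100) returns the 1st..99th percentiles;
# index 49 is the 50th percentile (P50), index 94 is the 95th (P95).
pcts = statistics.quantiles(latencies_ms, n=100)
p50, p95 = pcts[49], pcts[94]

# Throughput is completed requests over wall-clock test time.
throughput_rps = len(latencies_ms) / test_duration_s

print(f"P50={p50:.1f} ms  P95={p95:.1f} ms  throughput={throughput_rps:.2f} req/s")
```

Success rate is simply successful requests divided by total requests, expressed as a percentage.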

Detailed Benchmarking Results

Response Time Comparison

[Chart: Response Time Comparison]

Latency Percentiles

[Chart: Latency Percentiles]

Throughput Analysis

[Chart: Throughput Comparison]

Response Time Distribution

[Chart: Response Time Distribution]

Payload Size Comparison

[Chart: Payload Size Comparison]

Response Time Over Test Duration

[Chart: Response Time Over Time]

Success Rate

[Chart: Success Rate]

Model Inference Time

[Chart: Inference Time Comparison]

Test Methodology

System Under Test

  • Model: SentenceTransformer (all-MiniLM-L6-v2) - 384-dimensional embeddings
  • REST API: FastAPI + Hypercorn (HTTP/2)
  • GraphQL API: Strawberry GraphQL + Hypercorn (HTTP/2)
  • gRPC API: gRPC server (HTTP/2 native with Protocol Buffers)
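The gRPC server's wire contract is defined in Protocol Buffers. The repository's actual `.proto` file is not reproduced here; the following is a sketch of what such an interface typically looks like for this workload (all message, field, and service names are assumptions, not taken from the project):

```proto
syntax = "proto3";

package embeddings;

// Single-text request; the server returns a 384-float embedding.
message EmbedRequest {
  string text = 1;
}

message EmbedResponse {
  repeated float embedding = 1;  // packed binary floats on the wire
}

// Batch variant mirroring the 5-texts-per-batch workload.
message EmbedBatchRequest {
  repeated string texts = 1;
}

message EmbedBatchResponse {
  repeated EmbedResponse embeddings = 1;
}

service EmbeddingService {
  rpc Embed(EmbedRequest) returns (EmbedResponse);
  rpc EmbedBatch(EmbedBatchRequest) returns (EmbedBatchResponse);
}
```

`repeated float` fields are packed by default in proto3, which is what keeps gRPC response payloads compact relative to JSON.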

Test Configuration

  • Load Testing Tool: Locust
  • Concurrent Users: 10
  • Spawn Rate: 2 users/second
  • Test Duration: 60 seconds per API
  • Test Scenarios: Single embeddings (75%), Batch embeddings (25%)

Workload

  • Text lengths: Short (10-50 words), Medium (50-150 words), Long (150-300 words)
  • Batch sizes: 5 texts per batch request
  • Think time: 1-3 seconds between requests
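The 75/25 single-versus-batch split and the 1-3 s think time can be expressed as a small driver loop. The real harness uses Locust; this stdlib-only sketch just models the request mix, and the function names are illustrative:

```python
import random

def pick_scenario(rng: random.Random) -> dict:
    """Choose one request per iteration: 75% single, 25% batch."""
    if rng.random() < 0.75:
        return {"kind": "single", "texts": 1}
    return {"kind": "batch", "texts": 5}  # 5 texts per batch request

def think_time(rng: random.Random) -> float:
    """Uniform 1-3 s pause between requests, per the test configuration."""
    return rng.uniform(1.0, 3.0)

rng = random.Random(42)
mix = [pick_scenario(rng)["kind"] for _ in range(10_000)]
print("single fraction:", mix.count("single") / len(mix))  # ~0.75
```

In Locust itself the same mix is usually expressed with `@task(3)` and `@task(1)` weights plus `wait_time = between(1, 3)`.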

Environment

  • Deployment: Docker containers on the same host
  • Monitoring: Prometheus + Grafana
  • Network: Local host (minimal network latency)

Analysis and Recommendations

When to Use REST

Choose REST when broad client compatibility, simple debugging, and human-readable JSON matter more than raw latency. It remains the easiest protocol to consume from browsers, scripts, and third-party integrations.

When to Use GraphQL

Choose GraphQL when clients need to select exactly the fields they receive or fetch several resources in one round trip. For a single-purpose embedding endpoint like the one benchmarked here, that flexibility mostly adds query-parsing overhead.

When to Use gRPC

Choose gRPC for internal service-to-service inference where latency and throughput dominate. In this benchmark it delivered roughly 30x lower latency and about 3x higher throughput, at the cost of Protocol Buffer tooling and HTTP/2-capable clients.

Reproduce These Results

Prerequisites

  • Docker & Docker Compose
  • Python 3.9+
  • 8GB RAM minimum

Steps

  1. Clone the repository and enter it:
    git clone https://github.com/ranjanarajendran/ml-serving-comparison.git
    cd ml-serving-comparison
  2. Start all services:
    docker-compose up -d
  3. Run the tests:
    ./run_tests.sh
  4. View results in results/charts/