Whisper API

This case study demonstrates how to build a production-ready audio transcription service using the Whisper API and FastAPI, all containerized with Docker.
By the end of this guide, you’ll have a fully functional microservice capable of transcribing audio and video files in multiple formats through a simple HTTP API.

The service is split into two components:

  1. A Whisper Service powered by faster-whisper, responsible for loading models and performing the transcription.
  2. An API Gateway, which provides authentication, consistent endpoints, and request forwarding to the model service.

With this setup, you can deploy a self-hosted, scalable transcription system in just a few minutes: perfect for teams integrating ASR (Automatic Speech Recognition) into AI or media pipelines.

What You’ll Learn

  • How the Whisper API architecture is structured with two FastAPI services.
  • How to transcribe audio and video through simple REST endpoints.
  • How to containerize and orchestrate everything with Docker Compose.
  • How to tune parameters for speed, accuracy, and GPU usage.
{"section": "introduction", "version": "1.1.0"}

Architecture at a Glance

The project follows a modular microservices architecture, separating concerns between the API gateway and the transcription engine. This design simplifies scaling, monitoring, and maintenance.

Components Overview

  • Whisper Service:
    Runs the faster-whisper models locally. Each model (small, medium, large-v3, large-v3-turbo) can be preloaded on startup.
    It handles the heavy lifting: loading weights, managing devices (CPU/GPU), and performing the actual speech-to-text.
  • API Gateway:
    Acts as a proxy for client requests. It validates bearer tokens, checks allowed models, and forwards files to the Whisper Service.
    This ensures clients interact with a unified API surface, independent of the underlying model details.
  • Docker Compose:
    Orchestrates both containers, manages networking, exposes ports (8080 for Whisper, 8081 for Gateway), and persists model cache volumes to avoid repeated downloads.

Request Flow

  1. The user sends an audio file (e.g., .mp3, .wav, .mp4) to the Gateway via an endpoint like /transcribe/small.
  2. The Gateway validates the API key, then forwards the file to the internal Whisper Service.
  3. The Whisper Service transcribes the file using the requested model and returns the JSON response.
  4. The Gateway sends the result back to the client.

Health and Monitoring

Both services expose /health endpoints for readiness checks:

  • http://localhost:8080/health -> Whisper Service status and loaded models.
  • http://localhost:8081/health -> API Gateway + underlying model health.

These endpoints make it easy to integrate with monitoring tools like Prometheus, Grafana, or Docker healthchecks.


Project Structure

The project is organized into two main folders (one for the API Gateway and another for the Whisper Service), both fully containerized. At the root, a docker-compose.yml file defines how these two services interact.

Here’s the full layout:

whisper_transcriptor/
├── api-gateway/
│   ├── app/
│   │   ├── main.py
│   │   └── settings.py
│   ├── Dockerfile
│   └── requirements.txt
├── whisper-service/
│   ├── app/
│   │   ├── main.py
│   │   └── settings.py
│   ├── Dockerfile
│   └── requirements.txt
└── docker-compose.yml

Each component runs as an independent FastAPI app, communicating over the internal Docker network.

Key Files

  • main.py: defines routes, including /transcribe* and /health.
  • settings.py: configuration via Pydantic and environment variables.
  • Dockerfile: container build instructions for each service.
  • requirements.txt: Python dependencies for each environment.
  • docker-compose.yml: orchestration of services, volumes, ports, and healthchecks.

This modular organization ensures separation of concerns:

  • You can update the transcription engine without changing the gateway.
  • You can extend the API easily with new routes or authentication layers.
{"section": "project-structure", "containers": ["api-gateway", "whisper-service"]}

Whisper Service Deep Dive

The Whisper Service loads models with faster-whisper and handles all transcription requests. It is lightweight, self-contained, and GPU-aware.

The faster-whisper project is a highly optimized CTranslate2-based reimplementation of Whisper, providing better speed and memory efficiency.

1) Service Initialization

On startup, the service loads all models defined in WHISPER_MODELS. Each model is cached in memory and ready on first request.

# app.main (excerpt)

from faster_whisper import WhisperModel

models: dict[str, WhisperModel] = {}

@app.on_event("startup")
def load_models() -> None:
    device = _resolve_device()
    for name in settings.models:
        # Resolve an alias (e.g. "large-v3-turbo") to its Hugging Face repo id, if configured.
        repo_or_size = settings.aliases.get(name, name)
        models[name] = WhisperModel(
            repo_or_size,
            device=device,
            compute_type=settings.compute_type,
        )

_resolve_device() automatically picks the best available device, as sketched below:

  • If CUDA is available: uses GPU (cuda).
  • Otherwise: falls back to CPU.
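
A minimal sketch of such a helper, assuming the WHISPER_DEVICE setting is exposed as settings.device and using ctranslate2 (installed with faster-whisper) for the CUDA check; the actual implementation may differ:

# Hypothetical sketch of _resolve_device(); honours WHISPER_DEVICE and otherwise auto-detects.
import ctranslate2

def _resolve_device() -> str:
    if settings.device != "auto":
        return settings.device
    return "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"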

2) Environment Configuration

Tune service behavior with environment variables:

  • WHISPER_MODELS: comma-separated list of models to preload.
  • WHISPER_DEVICE: cpu, cuda, or auto (auto-detect).
  • WHISPER_COMPUTE_TYPE: precision (e.g., int8, float16).
  • WHISPER_BEAM_SIZE: beam search size (integer).
  • WHISPER_VAD_FILTER: enable/disable VAD (true/false).
  • WHISPER_LANGUAGE: optional language hint (e.g., en, pt).
  • WHISPER_ALIASES: JSON object mapping a model alias to a Hugging Face repo id.

Defaults in our compose:

{
   "WHISPER_MODELS": "small,medium,large-v3,large-v3-turbo",
   "WHISPER_DEVICE": "auto",
   "WHISPER_COMPUTE_TYPE": "int8",
   "WHISPER_BEAM_SIZE": "5",
   "WHISPER_VAD_FILTER": "true",
   "WHISPER_ALIASES": {
      "large-v3-turbo": "mobiuslabsgmbh/faster-whisper-large-v3-turbo"
   }
}

For detailed configuration options, refer to the faster-whisper documentation.

3) Transcription Workflow

  1. Receive multipart/form-data with the audio/video file.
  2. Validate the media type or extension (.mp3, .wav, .m4a, .mp4, .webm, .ogg).
  3. Persist temporarily to disk.
  4. Transcribe with selected model using beam search, VAD, and optional language hint.
  5. Return JSON with the full text and detected language.

Implementation excerpt:

# app.main (excerpt)

segments, info = models[model_key].transcribe(
    tmp_path,
    beam_size=settings.beam_size,
    vad_filter=settings.vad_filter,
    language=settings.language
)
text = " ".join(seg.text.strip() for seg in segments).strip()
response = {"text": text, "language": info.language}

This design keeps the service stateless and horizontally scalable.

4) Endpoints

Predefined routes for convenience:

  • /transcribe-small
  • /transcribe-medium
  • /transcribe-large-v3
  • /transcribe-large-v3-turbo

Generic route for flexibility:

  • /transcribe/{model}
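
A hedged sketch of how the generic route might look, reusing the pieces from the excerpts above (models, settings, and the hypothetical save_upload helper); the predefined routes can simply delegate to it:

# Hypothetical generic-route sketch; not the verbatim project code.
import os

from fastapi import File, HTTPException, UploadFile

@app.post("/transcribe/{model}")
async def transcribe_generic(model: str, file: UploadFile = File(...)):
    if model not in models:
        raise HTTPException(status_code=404, detail=f"Unknown or unloaded model: {model}")
    tmp_path = await save_upload(file)  # hypothetical helper sketched earlier
    try:
        segments, info = models[model].transcribe(
            tmp_path,
            beam_size=settings.beam_size,
            vad_filter=settings.vad_filter,
            language=settings.language,
        )
        text = " ".join(seg.text.strip() for seg in segments).strip()
        return {"text": text, "language": info.language}
    finally:
        os.remove(tmp_path)

@app.post("/transcribe-small")
async def transcribe_small(file: UploadFile = File(...)):
    return await transcribe_generic("small", file)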

5) Health Endpoint

A simple GET /health confirms model availability and readiness.

# app.main (excerpt)

@app.get("/health")
def health():
   return {"status": "ok", "loaded_models": list(models.keys())

6) Resource Management

Temporary files are deleted after processing. Model artifacts are cached on a shared Docker volume to avoid repeated downloads across restarts.


API Gateway Deep Dive

The API Gateway provides a unified, secure entry point for clients to interact with the transcription service.
It acts as a bridge between external requests and the internal Whisper models, ensuring consistent authentication and controlled access.

The Gateway is built with FastAPI, a modern, high-performance Python web framework; you can learn more about its dependency injection, type hints, and async capabilities in the FastAPI documentation.

1) Purpose and Design

While the Whisper Service focuses on transcription, the API Gateway handles:

  • Authentication via Bearer tokens.
  • Model whitelisting to control which models can be accessed.
  • Request forwarding with file streaming to the internal service.
  • Error handling and consistent HTTP responses.

This layer isolates the model backend from direct exposure and makes it easier to evolve or scale each part independently.

2) Core Structure

The main components of the Gateway are implemented in main.py and settings.py.

api-gateway/
└── app/
    ├── main.py
    └── settings.py

  • main.py: Implements FastAPI routes (/transcribe/*, /health) and the forwarding logic using httpx.
  • settings.py: Loads environment variables for API_KEY, MODEL_URL, and ALLOWED_MODELS using Pydantic.

3) Authentication Flow

Every transcription request must include a Bearer token in the Authorization header.
The gateway compares it to the configured API_KEY.

If the header is missing or invalid:

  • 401 (missing token).
  • 403 (invalid token).

async def check_auth(authorization: str | None = Header(default=None)):
    if settings.api_key:
        if not authorization or not authorization.startswith("Bearer "):
            raise HTTPException(status_code=401, detail="Missing bearer token")
        token = authorization.removeprefix("Bearer ")
        if token != settings.api_key:
            raise HTTPException(status_code=403, detail="Invalid token")

This ensures that only authenticated users can reach the transcription endpoints.

4) Forwarding Logic

The _forward function reads the uploaded file and sends it to the Whisper Service.
It automatically builds a multipart request and proxies the JSON response back to the client.

async def _forward(file: UploadFile, path: str):
    files = {
        "file": (
            file.filename,
            await file.read(),
            file.content_type or "application/octet-stream"
        )
    }
    async with httpx.AsyncClient(timeout=300) as client:
        r = await client.post(f"{settings.service_url}{path}", files=files)
    if r.status_code != 200:
        raise HTTPException(status_code=r.status_code, detail=r.text)
    return JSONResponse(r.json())

The httpx client used for forwarding requests is described in the httpx documentation.

5) Routes Overview

Predefined routes mirror the internal Whisper endpoints:

  • /transcribe/small
  • /transcribe/medium
  • /transcribe/large-v3
  • /transcribe/large-v3-turbo
  • /transcribe/{model} (generic route for allowed models)
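
A hedged sketch of the generic gateway route, tying together check_auth, the ALLOWED_MODELS whitelist, and _forward (assuming settings.allowed_models holds the comma-separated list):

# Hypothetical gateway route sketch; check_auth and _forward are the helpers shown in this section.
from fastapi import Depends, File, HTTPException, UploadFile

@app.post("/transcribe/{model}", dependencies=[Depends(check_auth)])
async def transcribe(model: str, file: UploadFile = File(...)):
    allowed = {m.strip() for m in settings.allowed_models.split(",")}
    if model not in allowed:
        raise HTTPException(status_code=404, detail=f"Model not allowed: {model}")
    return await _forward(file, f"/transcribe/{model}")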

A health check endpoint is also available:

  • /health -> validates gateway and underlying model service.
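
A minimal sketch of that combined health route, shaped to match the response shown in the Docker section (httpx client and settings.service_url are assumptions):

# Hypothetical combined health check: report the gateway itself plus the Whisper Service status.
import httpx
from fastapi import HTTPException

@app.get("/health")
async def health():
    async with httpx.AsyncClient(timeout=10) as client:
        r = await client.get(f"{settings.service_url}/health")
    if r.status_code != 200:
        raise HTTPException(status_code=503, detail="Whisper Service unhealthy")
    return {"gateway": "ok", "model": r.json()}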

6) Settings and Environment Variables

{
   "MODEL_URL": "http://whisper-service:8080",
   "API_KEY": "change-me",
   "ALLOWED_MODELS": "small,medium,large-v3,large-v3-turbo"
}

These parameters can be overridden at runtime or injected via the docker-compose.yml file.
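
A minimal settings.py sketch using pydantic-settings; field names, aliases, and defaults are assumptions based on the variables above:

# Hypothetical settings.py sketch (pydantic-settings v2); not the verbatim project code.
from pydantic import Field
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    service_url: str = Field("http://whisper-service:8080", validation_alias="MODEL_URL")
    api_key: str = Field("change-me", validation_alias="API_KEY")
    allowed_models: str = Field("small,medium,large-v3,large-v3-turbo", validation_alias="ALLOWED_MODELS")

settings = Settings()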


Docker & Compose

Both services (the API Gateway and the Whisper Service) are orchestrated through Docker Compose, enabling quick deployment and isolated execution.

To understand how Docker Compose orchestrates multi-container applications, check out the official Docker Compose documentation.

1) Compose Overview

The docker-compose.yml file defines:

  • Service builds from local folders.
  • Shared Docker network.
  • Persistent volume for model caching.
  • Healthchecks and inter-service dependencies.

services:
  whisper-service:
    build: ./whisper-service
    image: local/whisper-service:1.1.0
    environment:
      WHISPER_MODELS: "small,medium,large-v3,large-v3-turbo"
      WHISPER_DEVICE: "auto"
      WHISPER_COMPUTE_TYPE: "int8"
      WHISPER_BEAM_SIZE: "5"
      WHISPER_VAD_FILTER: "true"
    ports:
      - "8080:8080"
    volumes:
      - models-cache:/root/.cache
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 30
      start_period: 300s

  api-gateway:
    build: ./api-gateway
    image: local/transcription-gateway:1.1.0
    environment:
      MODEL_URL: "http://whisper-service:8080"
      API_KEY: "change-me"
      ALLOWED_MODELS: "small,medium,large-v3,large-v3-turbo"
    ports:
      - "8081:8081"
    depends_on:
      whisper-service:
        condition: service_healthy

volumes:
  models-cache:

2) Running the Services

Build and start everything with a single command:

docker compose up --build

Once both containers are healthy:

  • Gateway is available at http://localhost:8081
  • Whisper Service is available at http://localhost:8080

3) Health Verification

You can test both services with simple curl commands:

curl http://localhost:8081/health
curl http://localhost:8080/health

Expected response:

{"gateway": "ok", "model": {"status": "ok", "loaded_models": ["small","medium","large-v3","large-v3-turbo"]}}

Docker’s healthcheck mechanism is explained in the Dockerfile reference.

4) Persistent Model Cache

The models-cache volume ensures that once models are downloaded, subsequent startups are instantaneous: ideal for production deployments and CI/CD pipelines.

5) Common Deployment Notes

  • To rebuild from scratch: docker compose build --no-cache
  • To detach from console: docker compose up -d
  • To check logs: docker compose logs -f

Trying It Out: cURL and Python Client

With both containers running, you can easily test the transcription endpoints using cURL or a simple Python script.

1) Testing with cURL

You can transcribe any supported audio or video file directly from your terminal:

curl -X POST "http://localhost:8081/transcribe/small" \
-H "Authorization: Bearer change-me" \
-F "file=@sample.mp3"

The Gateway forwards the file to the Whisper Service and returns the transcription result in JSON format.

Expected output:

{
   "text": "Hello and welcome to our demo on building a transcription API using Whisper and FastAPI.",
   "language": "en"
}

You can also specify other models via their endpoints, such as:

curl -X POST "http://localhost:8081/transcribe/large-v3-turbo" \
-H "Authorization: Bearer change-me" \
-F "file=@meeting_audio.wav"

2) Testing with Python

If you prefer to automate transcription from your own code, you can use the httpx or requests library.

import requests

API_URL = "http://localhost:8081/transcribe/small"
TOKEN = "change-me"

with open("meeting.mp3", "rb") as f:
   files = {"file": ("meeting.mp3", f, "audio/mpeg")}
   headers = {"Authorization": f"Bearer {TOKEN}"}
   response = requests.post(API_URL, headers=headers, files=files)
   
print(response.status_code)
print(response.json())

This returns a response similar to:

{
   "text": "Good morning everyone, let's start our meeting.",
   "language": "en"
}

3) Handling Errors

If you encounter any error responses, check for:

  • Missing or invalid Bearer token (401, 403)
  • Unsupported file type (400)
  • Model not allowed (404)
  • Gateway or network timeout (5xx)

All responses include clear JSON-formatted error messages for easier debugging.

{
   "detail": "Invalid token"
}

Handling Videos and Common Formats

The system supports a wide range of audio and video formats out of the box, thanks to FFmpeg, which is installed in the Whisper Service container.

1) Supported Formats

The service accepts:

  • Audio: .mp3, .wav, .m4a, .ogg, .webm
  • Video: .mp4, .mov, .mkv, .webm

The transcription pipeline automatically extracts the audio track before running the model, so there’s no manual pre-processing required.

2) Converting Unsupported Formats

If you have a media file in an uncommon format, convert it easily using FFmpeg:

ffmpeg -i input.mov -ar 16000 -ac 1 -c:a pcm_s16le output.wav

This creates a high-quality mono WAV file optimized for ASR processing.
Learn more about FFmpeg and supported codecs in the official FFmpeg documentation.

3) Working with Large Files

For files exceeding several minutes:

  • Prefer the large-v3 or large-v3-turbo models for higher accuracy.
  • Ensure sufficient memory or GPU resources.
  • Enforce an upload-size limit in the Gateway endpoint itself or at a reverse proxy (e.g., Nginx's client_max_body_size); Uvicorn does not expose a request-body size flag. A hedged sketch of an endpoint guard follows.

4) Language Detection and Multilingual Transcription

The Whisper models automatically detect the language of the input.
However, you can force a specific language through an environment variable:

WHISPER_LANGUAGE=pt

This is useful if you’re processing content that mixes multiple languages or when you want consistent output in a single target language.

5) Example: Transcribing a Video File

curl -X POST "http://localhost:8081/transcribe/large-v3-turbo" 
-H "Authorization: Bearer change-me" \
-F "file=@webinar.mp4"

Expected response:

{   
   "text": "Today we're exploring how to build transcription systems using Whisper and Docker.",
   "language": "en"
}

6) Tip: Speed vs Accuracy

Model            Speed         Accuracy       Ideal Use Case
small            ⚡ Fast        🟡 Moderate    Real-time or lightweight environments
medium           ⚖️ Balanced    🟢 High        General purpose
large-v3         🐢 Slower      🟢 Very High   Long or noisy recordings
large-v3-turbo   ⚡⚡ Fastest    🟢 High        Near real-time apps

These options give you control over the trade-off between processing time and transcription quality.

Tuning for Accuracy & Speed

The Whisper API offers several ways to fine-tune transcription speed, accuracy, and resource usage.
These adjustments let you optimize performance for your specific workloads: from lightweight edge deployments to GPU-backed servers.

1) Choosing the Right Model

Whisper models vary in size, accuracy, and speed. The Gateway exposes them through intuitive endpoints:

Endpoint                     Model            Speed         Accuracy       Ideal Use
/transcribe/small            small            ⚡ Fast        🟡 Moderate    Real-time or CPU usage
/transcribe/medium           medium           ⚖️ Balanced    🟢 High        Most common use case
/transcribe/large-v3         large-v3         🐢 Slower      🟢 Very High   Long-form or noisy audio
/transcribe/large-v3-turbo   large-v3-turbo   ⚡⚡ Fastest    🟢 High        Real-time GPU workloads

# Example using a specific model

curl -X POST "http://localhost:8081/transcribe/large-v3-turbo" \
-H "Authorization: Bearer change-me" \
-F "file=@podcast.mp3"

2) Compute Type and Device

Performance heavily depends on the WHISPER_COMPUTE_TYPE and WHISPER_DEVICE environment variables.

  • WHISPER_DEVICE=auto -> uses GPU if available, otherwise CPU.
  • WHISPER_COMPUTE_TYPE=int8 -> faster and smaller memory footprint, minor accuracy loss.
  • WHISPER_COMPUTE_TYPE=float16 -> higher accuracy, slower on CPU.

# Example configuration for GPU optimization (if available)
WHISPER_DEVICE=cuda
WHISPER_COMPUTE_TYPE=float16
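
If you deploy through Docker Compose rather than docker run, GPU access can be requested like this (a sketch; requires the NVIDIA Container Toolkit on the host):

# Hypothetical Compose addition for GPU access; merge into the whisper-service definition.
services:
  whisper-service:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]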

3) Beam Search and VAD

You can control transcription quality by adjusting these parameters:

  • WHISPER_BEAM_SIZE: Larger values (5–10) improve accuracy but slow down processing.
  • WHISPER_VAD_FILTER: Removes silence and non-speech intervals for cleaner results.

WHISPER_BEAM_SIZE=8
WHISPER_VAD_FILTER=true

4) Language Parameter

Although Whisper can auto-detect language, setting it explicitly may improve consistency and reduce inference time.

WHISPER_LANGUAGE=en

For example, setting WHISPER_LANGUAGE=pt forces Portuguese recognition, avoiding unnecessary language detection overhead.

5) Measuring Performance

The Whisper Service logs runtime performance and model loading time.
You can measure throughput using time in your terminal or a simple benchmarking script.

time curl -X POST "http://localhost:8081/transcribe/small" \
-H "Authorization: Bearer change-me" \
-F "file=@audio.wav"

Security & Operational Tips

When deploying your transcription system in production, security and reliability are essential.
The architecture already provides solid isolation, but here are a few best practices to reinforce it.

1) API Security

Use the Gateway as the only public entry point.
Protect it with authentication, rate limiting, and HTTPS.

API_KEY=my-secret-token

Best practices:

  • Rotate the API_KEY periodically.
  • Store secrets using environment variables or a vault, never commit them to Git.
  • Use reverse proxies (e.g., Nginx or Traefik) for SSL termination and extra rate-limiting.

For secure key management, see the 12-Factor App configuration guidelines.
You can also use Docker secrets for secret storage.
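
For example, a hedged Compose sketch that supplies the key as a Docker secret instead of a plain environment variable (the gateway code would need a small change to read the key from the mounted file):

# Hypothetical sketch: API key delivered as a Compose secret mounted at /run/secrets/api_key.
services:
  api-gateway:
    secrets:
      - api_key
    environment:
      API_KEY_FILE: /run/secrets/api_key

secrets:
  api_key:
    file: ./secrets/api_key.txt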

2) Resource Limits

Define resource limits for each service in Docker Compose to prevent overload.

services:
  whisper-service:
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 8G

This keeps Whisper Service stable under high load.

3) Timeout Management

Set appropriate request timeouts to avoid long-running connections, especially for large audio files.

# In the API Gateway
httpx.AsyncClient(timeout=360)

Adjust the value according to your use case (e.g., 360 seconds = 6 minutes).

4) Monitoring and Healthchecks

Both services provide /health endpoints. You can integrate them with monitoring tools such as Prometheus or Grafana.

curl http://localhost:8081/health
curl http://localhost:8080/health

For advanced setups, expose metrics endpoints using prometheus-fastapi-instrumentator.
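
A minimal sketch of wiring that in (assuming prometheus-fastapi-instrumentator is added to requirements.txt; it is not part of the current codebase):

# Adds default HTTP metrics and a /metrics endpoint to a FastAPI app.
from prometheus_fastapi_instrumentator import Instrumentator

Instrumentator().instrument(app).expose(app)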

5) Logging and Observability

Enable structured logging and store logs in centralized systems like ELK.
You can extend the FastAPI apps to include request IDs and correlation tokens for distributed tracing.

LOG_LEVEL=info
ENABLE_TRACING=true

6) Scaling and Load Balancing

Because the services are stateless, you can easily scale horizontally.
Add replicas behind a load balancer or an API gateway manager.

7) Backup and Recovery

Preserve the models-cache volume to avoid redownloading models after restarts.
You can also snapshot the cache for faster redeployment in cloud environments.

Troubleshooting

Even with a stable setup, you may encounter occasional issues related to configuration, file types, or system resources.
This section summarizes the most common problems and their solutions.

1) Slow First Request

Symptom: The first transcription request takes a long time (sometimes minutes).
Cause: Models are being downloaded and loaded into memory for the first time.
Solution: Allow the service to initialize fully on first boot. Once cached in models-cache, subsequent runs are instant.

2) CUDA or GPU Not Detected

Symptom: The service runs on CPU even when a GPU is available.
Causes:

  • Missing NVIDIA Container Toolkit.
  • Incorrect device configuration in Docker.
  • WHISPER_DEVICE not set to cuda.

Fix:
Install the NVIDIA toolkit and run with GPU access enabled.

sudo apt install nvidia-container-toolkit
docker run --gpus all local/whisper-service:1.1.0

3) Out of Memory (OOM)

Symptom: Container crashes during transcription of large files.
Cause: Model too large or insufficient memory/GPU resources.
Solution:

  • Switch to a smaller model (e.g., medium or small).
  • Increase container memory limits.
  • Use the int8 compute type for reduced footprint.

WHISPER_COMPUTE_TYPE=int8

4) Invalid Token or 403 Errors

Symptom: Requests return 401 or 403 errors.
Cause: Missing or invalid Bearer token.
Solution: Include the correct header in each request.

-H "Authorization: Bearer change-me"

Check the API Gateway logs for Invalid token messages.

5) 415 Unsupported Media Type

Symptom: Upload fails for certain video files.
Cause: Missing or unsupported codec, wrong content type, or extension not recognized.
Solution: Convert the file with FFmpeg to a supported format (e.g., .wav, .mp3, .mp4).

ffmpeg -i input.mov -ar 16000 -ac 1 -c:a pcm_s16le output.wav

6) Empty Transcriptions

Symptom: The transcription returns an empty string.
Cause: File contained silence, corrupted audio, or incorrect format.
Solution:

  • Check input volume and duration.
  • Validate audio with ffprobe (see the example below) or play it locally.
  • Retry using a different model.
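
For example, to confirm the file actually contains an audio stream with a non-zero duration:

ffprobe -v error -show_format -show_streams input.mp3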

7) Gateway Timeout or 5xx Errors

Symptom: Request times out or returns a 500 error.
Cause: Whisper Service took longer than the configured timeout.
Solution:

  • Increase timeout in the Gateway (httpx.AsyncClient(timeout=600)).
  • Use smaller models for quicker responses.

8) Healthcheck Fails in Compose

Symptom: The API Gateway remains stuck waiting for the Whisper Service to be “healthy”.
Cause: Whisper Service startup taking longer than the healthcheck start_period.
Solution: Increase the start_period or healthcheck retries in the Compose file.

healthcheck:
  test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
  interval: 30s
  timeout: 10s
  retries: 30
  start_period: 600s

What’s Next

You now have a complete, containerized Whisper API transcription system.
The foundation is solid, but there are many exciting improvements you can build on top of it.

1) Asynchronous Transcription

For long audio files, consider an async pattern:

  • Queue the job.
  • Return a job_id.
  • Process the transcription in background workers.
  • Retrieve results via /results/{job_id}.

This can be implemented using Celery, Redis, or RabbitMQ.
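
A minimal in-process sketch of the pattern using FastAPI's BackgroundTasks (route names and the in-memory job store are illustrative; a production setup would use a real queue and persistent storage):

# Hypothetical async-job sketch; jobs live in memory only for illustration.
import uuid

from fastapi import BackgroundTasks, FastAPI, File, UploadFile

app = FastAPI()
jobs: dict[str, dict] = {}

def run_transcription(job_id: str, data: bytes) -> None:
    # Forward `data` to the Whisper Service here and store the JSON result when done.
    jobs[job_id] = {"status": "done"}

@app.post("/jobs")
async def create_job(background: BackgroundTasks, file: UploadFile = File(...)):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "processing"}
    background.add_task(run_transcription, job_id, await file.read())
    return {"job_id": job_id}

@app.get("/results/{job_id}")
def get_result(job_id: str):
    return jobs.get(job_id, {"status": "not_found"})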

2) Batch Processing

Allow users to upload multiple files in one request.
The Gateway can iterate over all uploaded files and process them concurrently.
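
A hedged sketch of a batch route on the gateway (endpoint name and response shape are assumptions):

# Hypothetical batch endpoint: forwards each uploaded file to the Whisper Service concurrently.
import asyncio

import httpx
from fastapi import Depends, File, UploadFile

@app.post("/transcribe-batch/{model}", dependencies=[Depends(check_auth)])
async def transcribe_batch(model: str, files: list[UploadFile] = File(...)):
    async with httpx.AsyncClient(timeout=600) as client:

        async def send(f: UploadFile) -> dict:
            payload = {"file": (f.filename, await f.read(), f.content_type or "application/octet-stream")}
            r = await client.post(f"{settings.service_url}/transcribe/{model}", files=payload)
            return {"filename": f.filename, "result": r.json()}

        return {"results": await asyncio.gather(*(send(f) for f in files))}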

3) Timestamps and Subtitle Generation

Extend the Whisper Service to output detailed segments with timestamps.
You can then generate .srt or .vtt subtitle files automatically.
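
A hedged sketch of the conversion, assuming segments expose start, end, and text (as faster-whisper's Segment objects do):

# Hypothetical helper: format transcription segments as SubRip (.srt) text.
def to_srt(segments) -> str:
    def ts(seconds: float) -> str:
        h, rem = divmod(int(seconds * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{seg.text.strip()}\n")
    return "\n".join(blocks)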

4) Speaker Diarization

Integrate diarization models (like pyannote.audio) to separate different speakers.
This is ideal for meetings, interviews, and podcasts.

5) Multilingual Translation

Instead of transcribing in the original language, automatically translate output to English or another target language using a transformer model such as NLLB or M2M100.

6) Real-Time Streaming API

Transform the architecture into a streaming transcription service using WebSockets or Server-Sent Events (SSE).
This allows partial transcription results while audio is still uploading.

7) Authentication and Multi-Tenant API Keys

Add multi-user API key management, rate limits, and billing metrics to make it SaaS-ready.
You can integrate with payment gateway systems.

8) Deployment to Cloud

The current setup is ideal for local or private deployments.
For production:

  • Deploy to AWS ECS, GCP Cloud Run, or Azure Container Apps.
  • Use object storage (S3, GCS) for input/output files.
  • Add HTTPS with managed certificates (e.g., Let’s Encrypt via Traefik).

9) Observability and Metrics

Integrate Prometheus exporters and Grafana dashboards to monitor throughput, latency, and GPU utilization.

Conclusion

Congratulations! You’ve just built a scalable, containerized, and extensible Whisper API transcription system that’s both practical and production-ready.

Building a transcription system with Whisper, FastAPI, and Docker isn’t just a proof of concept: it’s a blueprint for scalable, real-world AI infrastructure.

Beyond technical achievement, this approach demonstrates how open-source tools can empower teams to deploy production-grade speech recognition pipelines without relying on external APIs or costly proprietary models.
With a few configuration tweaks, the same foundation can serve countless use cases (from podcast indexing and meeting transcription to multilingual content processing).

As you extend this project, consider adding real-time streaming, translation layers, and cloud deployment automation.
These next steps will transform your transcription API into a powerful, intelligent, and fully self-hosted audio intelligence platform ready for modern AI-driven applications.

You can download a version of the code showcased in this article from my GitHub repository.

If you want to review the Docker setup process, take a look at My DevOps Adventure: Building with Elasticsearch, Kibana, CouchDB, and Logstash. Thank you for reading this article!
