
As my journey with the Elastic Stack progressed, I became increasingly curious about how artificial intelligence could enhance my projects beyond data storage and search. One area that has always fascinated me is speech-to-text transcription — the ability to transform spoken language into accurate, searchable text.
Instead of relying on closed, third-party services, I wanted to explore open-source AI tools that give me full control of the process. That’s when I came across Vosk, a lightweight yet powerful speech recognition toolkit. With ready-to-use models in multiple languages, including Portuguese, Vosk made it possible to bring transcription into my DevOps adventure in a self-contained and reproducible way.
In this part, we’ll start by preparing a Docker container dedicated to downloading and hosting the Portuguese Vosk model. Later, this model will serve as the foundation for a Python-based service capable of receiving audio or video files and producing transcriptions — opening the door to new AI-powered
workflows in my environment.
Preparing the Vosk Model Container
To keep things modular, I created a container whose sole purpose is downloading and storing the Vosk Portuguese Model. This makes it reusable by other services that may need access to the model files.
Here’s the docker-compose.yml snippet for the vosk_pt service:
vosk_pt:
  image: alpine:3.20
  container_name: vosk_pt
  command: >
    sh -c "
      set -eux;
      apk add --no-cache wget unzip ca-certificates;
      mkdir -p /models/vosk-pt;
      if [ ! -f /models/.pt_downloaded ]; then
        echo 'Downloading pt-BR Vosk Model (small-pt-0.3)...';
        wget -O /tmp/vosk-pt.zip https://alphacephei.com/vosk/models/vosk-model-small-pt-0.3.zip;
        unzip -o /tmp/vosk-pt.zip -d /tmp;
        mv -f /tmp/vosk-model-small-pt-0.3/* /models/vosk-pt/ || true;
        touch /models/.pt_downloaded;
        echo 'Vosk Model ready: /models/vosk-pt';
      else
        echo 'Vosk Model is already deployed. Skipping download.';
      fi
    "
  volumes:
    - vosk-data:/models
How This Works
- Base image: We use alpine:3.20 for its minimal size and simplicity.
- Dependencies: Installs wget, unzip, and ca-certificates to fetch the model over HTTPS and unpack it.
- Model directory: Ensures /models/vosk-pt exists for storing the model files.
- Idempotency: Uses a marker file (.pt_downloaded) to avoid re-downloading the model every time the container restarts.
- Persistent storage: Mounts a Docker volume (vosk-data) so the model persists even if the container is removed.
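One detail the snippet doesn't show: the named vosk-data volume must also be declared at the top level of docker-compose.yml (next to services), or Compose will reject the file. If it isn't already present in your file, the declaration can be as simple as:
volumes:
  vosk-data: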
Result
Once this container runs, the Portuguese Vosk model will be available under:
/models/vosk-pt
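To double-check that the files actually landed on the volume, you can run a one-off container against the same service and list the directory (a quick sanity check; run it from the folder that holds docker-compose.yml):
docker-compose run --rm vosk_pt ls /models/vosk-pt
The ls argument overrides the default download command, so nothing is re-downloaded; you should simply see the extracted model files.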
This prepares the ground for the Python transcription container, which will use this shared model directory to process audio and video files in real time.
That will be the focus of the next step.
Building the Python Transcriber Service
With the Portuguese Vosk model downloaded and persisted in the previous step, the next challenge was to create a microservice that could actually use this model.
I decided to build a Python application with FastAPI, packaged inside Docker, capable of receiving audio or video files (or even URLs), normalizing them with FFmpeg, and returning a clean transcription as JSON.
Defining the Transcriber Service in Docker Compose
The new service is called transcriber, and it depends on the vosk_pt container, ensuring the model is available before the API starts.
transcriber:
  build:
    context: ./transcriber
    dockerfile: Dockerfile
  container_name: transcriber
  depends_on:
    vosk_pt:
      condition: service_completed_successfully
  environment:
    VOSK_MODEL_DIR: /models/vosk-pt
    SAMPLE_RATE: 16000
    CHANNELS: 1
    MAX_DURATION_MIN: 30
  volumes:
    - vosk-data:/models
  ports:
    - "8000:8000"
Key Points
- Builds from a custom Dockerfile (./transcriber/Dockerfile).
- Waits for the vosk_pt container to finish downloading the model.
- Exposes port 8000 for API access.
- Uses environment variables for model location, sample rate, and limits.
- Mounts the same vosk-data volume, ensuring both services share the model files.
The Transcriber Dockerfile
This container includes Python 3.12, FFmpeg, and the required Python libraries.
FROM python:3.12-slim
# Install ffmpeg for audio/video transcoding
RUN apt-get update && apt-get install -y --no-install-recommends \
        ffmpeg curl ca-certificates && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application source
COPY app ./app
# Model mount point
VOLUME ["/models"]
ENV VOSK_MODEL_DIR=/models/vosk-pt
ENV PYTHONUNBUFFERED=1
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
This ensures:
- FFmpeg is available for preprocessing.
- Dependencies are installed via requirements.txt.
- The application lives under /app.
- The model is mounted at /models/vosk-pt.
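If you want to validate the image on its own before starting the whole stack, it can be built explicitly (optional, since docker-compose also builds it on first start):
docker-compose build transcriber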
Python Dependencies
fastapi==0.115.0
uvicorn==0.30.6
vosk==0.3.45
pydub==0.25.1
httpx==0.27.2
python-multipart==0.0.9
These bring together:
- FastAPI → web API framework.
- Uvicorn → ASGI server.
- Vosk → speech recognition engine.
- Pydub + FFmpeg → audio handling and normalization.
- HTTPX → fetching remote files.
- Multipart → supporting file uploads.
The Transcriber API (main.py)
The application exposes a single endpoint /transcribe, which accepts either an
uploaded file or a remote URL.
from fastapi import FastAPI, UploadFile, File, HTTPException, Form
from fastapi.responses import JSONResponse
from vosk import Model, KaldiRecognizer
import httpx, json, logging, os, re, subprocess, tempfile, uuid, wave, time, threading
from typing import Optional

# Logging
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, LOG_LEVEL, logging.INFO))
logger = logging.getLogger("vosk-transcriber")

# Config
MODEL_DIR = os.environ.get("VOSK_MODEL_DIR", "/models/vosk-pt")
SAMPLE_RATE = int(os.environ.get("SAMPLE_RATE", "16000"))
CHANNELS = int(os.environ.get("CHANNELS", "1"))
MAX_DURATION_MIN = int(os.environ.get("MAX_DURATION_MIN", "30"))

# Lazy-load model
_model_lock = threading.Lock()
_MODEL: Optional[Model] = None

def get_model() -> Model:
    global _MODEL
    if _MODEL is None:
        with _model_lock:
            if _MODEL is None:
                if not os.path.isdir(MODEL_DIR):
                    raise RuntimeError(f"Vosk model not found at {MODEL_DIR}")
                _MODEL = Model(MODEL_DIR)
    return _MODEL

# FastAPI
app = FastAPI(title="Vosk Transcriber")

@app.on_event("startup")
async def on_startup():
    logger.info(f"Loading model from {MODEL_DIR}")
    get_model()

# Helpers
def to_wav_16k_mono(src_path: str, dst_path: str):
    filters = ["afftdn", "silenceremove=1:0:-50dB", "loudnorm=I=-16:TP=-1.5:LRA=11"]
    cmd = ["ffmpeg", "-hide_banner", "-nostdin", "-y",
           "-i", src_path, "-vn", "-ac", str(CHANNELS), "-ar", str(SAMPLE_RATE),
           "-acodec", "pcm_s16le", "-af", ",".join(filters), "-f", "wav", dst_path]
    subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

def duration_seconds_wav(path: str) -> float:
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())

def transcribe_wav(path: str, with_words: bool = False):
    model = get_model()
    with wave.open(path, "rb") as wf:
        rec = KaldiRecognizer(model, wf.getframerate())
        rec.SetWords(with_words)
        while True:
            data = wf.readframes(8000)
            if not data:
                break
            rec.AcceptWaveform(data)
        final = json.loads(rec.FinalResult())
    result = {"text": final.get("text", "")}
    if with_words and "result" in final:
        result["words"] = final["result"]
    return result

@app.post("/transcribe")
async def transcribe(
    file: UploadFile | None = File(default=None),
    sourceUrl: str | None = Form(default=None),
    return_words: bool = Form(default=False),
):
    if not file and not sourceUrl:
        raise HTTPException(status_code=400, detail="Send 'file' (multipart) or 'sourceUrl' (Form).")
    with tempfile.TemporaryDirectory() as tmp:
        src_bin = os.path.join(tmp, f"input-{uuid.uuid4().hex}")
        wav16 = os.path.join(tmp, "audio.wav")
        # Input: upload or URL
        if file:
            with open(src_bin, "wb") as f:
                f.write(await file.read())
        else:
            async with httpx.AsyncClient(follow_redirects=True, timeout=60) as client:
                r = await client.get(sourceUrl)
                r.raise_for_status()
                with open(src_bin, "wb") as f:
                    f.write(r.content)
        # Convert, check duration, transcribe
        to_wav_16k_mono(src_bin, wav16)
        sec = duration_seconds_wav(wav16)
        if sec > MAX_DURATION_MIN * 60:
            raise HTTPException(status_code=413, detail=f"Audio exceeds {MAX_DURATION_MIN} minutes")
        result = transcribe_wav(wav16, with_words=return_words)
        return JSONResponse(content=result, media_type="application/json; charset=utf-8")
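For reference, the Dockerfile and the uvicorn app.main:app entry point imply this layout inside the ./transcriber build context (a sketch of the expected structure; only main.py is listed in this article):
transcriber/
  Dockerfile
  requirements.txt
  app/
    main.py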
Result
At this stage, we have a fully working transcription API:
- Upload an audio/video file or send a URL.
- The service normalizes it into 16kHz mono WAV.
- Vosk processes the file and returns text (and optionally word-level timings).
This service is lightweight, reproducible, and integrates seamlessly into a modern DevOps pipeline — bringing AI-powered transcription into the same ecosystem as Elasticsearch, CouchDB, and Logstash.
Running the Transcriber Service
With both the vosk_pt (model) and transcriber (API) containers defined in our docker-compose.yml, we can start the transcription service by running:
docker-compose up -d vosk_pt transcriber
This will:
- Start the vosk_pt container, which downloads and prepares the Portuguese Vosk model.
- Launch the transcriber container, mounting the shared model volume and exposing the API at http://localhost:8000.
- Ensure that the transcription service is ready to process audio and video files.
You can confirm that everything is running correctly with:
docker-compose ps
Once both containers show as Up, the service is live and ready for testing.
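You can also tail the API logs to confirm the model loaded correctly; the startup hook in main.py logs the model path as soon as FastAPI boots:
docker-compose logs -f transcriber
A line such as "Loading model from /models/vosk-pt" means the shared volume is mounted and the model was found.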
Testing the Transcriber API with curl
With the transcriber service up and running on port 8000, it’s time to validate that everything works as expected.
Instead of writing a custom client, we can simply use curl to interact with the API.
Checking Service Health
The service is powered by FastAPI, so if you open a browser at:
http://localhost:8000/docs
you’ll see the Swagger UI where you can try out the /transcribe endpoint interactively.
But let’s stick to command line tests with curl.
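As a first smoke test, you can probe the auto-generated OpenAPI document and check for an HTTP 200 (this relies only on FastAPI's default routes, nothing custom in main.py):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/openapi.json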
Uploading a Local Audio File
To send a file to the service for transcription:
curl -X POST "http://localhost:8000/transcribe" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@sample.mp3" \
-F "return_words=true"
Explanation:
- file=@sample.mp3 → Replace sample.mp3 with your audio or video file.
- return_words=true → Optional flag to include word-level timings.
- Content-Type: multipart/form-data → Required for file uploads.
The output will look like:
{
"text": "Olá, este é um teste de transcrição em português.",
"words": [
{"word": "olá", "start": 0.1, "end": 0.5, "conf": 0.95},
{"word": "este", "start": 0.6, "end": 0.9, "conf": 0.92},
...
]
}
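If you only need the plain text, the JSON response can be piped through jq (assuming jq is installed on your host):
curl -s -X POST "http://localhost:8000/transcribe" \
  -F "file=@sample.mp3" | jq -r '.text'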
Transcribing from a Remote URL
You can also pass a remote file (audio or video) via the sourceUrl form parameter:
curl -X POST "http://localhost:8000/transcribe" \
-H "accept: application/json" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "sourceUrl=https://www.example.com/audio/teste.mp3" \
-d "return_words=false"
This will:
- Download the remote file inside the container.
- Normalize it to 16kHz mono WAV with FFmpeg.
- Run transcription and return just the text output.
Result
With these two curl tests, we now have:
- File upload support for local testing.
- Remote URL ingestion for flexible input sources.
- Word-level timings available when requested.
This confirms the transcriber API is operational and ready to be integrated into broader pipelines — from document indexing in Elasticsearch to more advanced AI workflows.
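As a small taste of that integration, the API's JSON output can be piped straight into an Elasticsearch index with a second curl call (a rough sketch: the transcripts index name is arbitrary, and the Elasticsearch URL, credentials, and TLS options depend on how your cluster from the earlier parts is exposed):
curl -s -X POST "http://localhost:8000/transcribe" -F "file=@sample.mp3" | \
  curl -s -X POST "http://localhost:9200/transcripts/_doc" \
    -H "Content-Type: application/json" -d @-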
Conclusion
By the end of this stage in my DevOps journey, I managed to extend my stack beyond search and analytics into the world of AI-powered transcription.
With Vosk running inside Docker, I now have a reproducible, open-source environment capable of turning spoken Portuguese into accurate text — without depending on external providers.
What I find most valuable here is how this approach aligns with the same principles that guided my Elastic Stack setup:
- Modularity: Each container (CouchDB, Logstash, Elasticsearch, Kibana, and now Vosk) has a clear, isolated purpose.
- Security & Control: From TLS certificates in CouchDB to the self-managed Vosk model, every part of the pipeline is transparent and under my control.
- Extensibility: Just like I used Logstash filters to normalize company data, I can now imagine applying NLP pipelines or text-mining tools on top of the transcribed content.
This project shows that it’s possible to blend data engineering with machine learning in a way that is both educational and practical. From here, the possibilities are endless:
- Enhancing transcription with speaker diarization or punctuation models
- Sending transcripts into Elasticsearch for search and analytics
- Applying LLMs for summarization or entity extraction
And most importantly: all of it can run locally, securely, and reproducibly.
You can download a version of the code showcased in this article from my GitHub repository.
If you want to review the setup process, take a look at Part 1. Thank you for reading this article!