
As my journey with the Elastic Stack progressed, I became increasingly curious about how artificial intelligence could enhance my projects beyond data storage and search. One area that has always fascinated me is speech-to-text transcription — the ability to transform spoken language into accurate, searchable text.
Instead of relying on closed, third-party services, I wanted to explore open-source AI tools that give me full control of the process. That’s when I came across Vosk, a lightweight yet powerful speech recognition toolkit. With ready-to-use models in multiple languages, including Portuguese, Vosk made it possible to bring transcription into my DevOps adventure in a self-contained and reproducible way.
In this part, we’ll start by preparing a Docker container dedicated to downloading and hosting the Portuguese Vosk model. Later, this model will serve as the foundation for a Python-based service capable of receiving audio or video files and producing transcriptions — opening the door to new AI-powered
workflows in my environment.
Preparing the Vosk Model Container
To keep things modular, I created a container whose sole purpose is downloading and storing the Vosk Portuguese Model. This makes it reusable by other services that may need access to the model files.
Here’s the docker-compose.yml snippet for the vosk_pt service:
vosk_pt:
  image: alpine:3.20
  container_name: vosk_pt
  command: >
    sh -c "
      set -eux;
      apk add --no-cache wget unzip ca-certificates;
      mkdir -p /models/vosk-pt;
      if [ ! -f /models/.pt_downloaded ]; then
        echo 'Downloading pt-BR Vosk Model (small-pt-0.3)...';
        wget -O /tmp/vosk-pt.zip https://alphacephei.com/vosk/models/vosk-model-small-pt-0.3.zip;
        unzip -o /tmp/vosk-pt.zip -d /tmp;
        mv -f /tmp/vosk-model-small-pt-0.3/* /models/vosk-pt/ || true;
        touch /models/.pt_downloaded;
        echo 'Vosk Model ready: /models/vosk-pt';
      else
        echo 'Vosk Model is already deployed. Skipping download.';
      fi
    "
  volumes:
    - vosk-data:/models
How This Works
- Base image: We use alpine:3.20 for its minimal size and simplicity.
- Dependencies: Installs wget, unzip, and ca-certificates to fetch the model over HTTPS and unpack it.
- Model directory: Ensures /models/vosk-pt exists for storing the model files.
- Idempotency: Uses a marker file (.pt_downloaded) to avoid re-downloading the model every time the container restarts.
- Persistent storage: Mounts a Docker volume (vosk-data) so the model persists even if the container is removed.
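One detail the snippet doesn't show: the named vosk-data volume must also be declared at the top level of docker-compose.yml (next to services), or Compose will reject the file. If it isn't already present in your file, the declaration can be as simple as:
volumes:
  vosk-data: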
Result
Once this container runs, the Portuguese Vosk model will be available under:
/models/vosk-pt
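To double-check that the files actually landed on the volume, you can run a one-off container against the same service and list the directory (a quick sanity check; run it from the folder that holds docker-compose.yml):
docker-compose run --rm vosk_pt ls /models/vosk-pt
The ls argument overrides the default download command, so nothing is re-downloaded; you should simply see the extracted model files.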
This prepares the ground for the Python transcription container, which will use this shared model directory to process audio and video files in real time.
That will be the focus of the next step.
Building the Python Transcriber Service
With the Portuguese Vosk model downloaded and persisted in the previous step, the next challenge was to create a microservice that could actually use this model.
I decided to build a Python application with FastAPI, packaged inside Docker, capable of receiving audio or video files (or even URLs), normalizing them with FFmpeg, and returning a clean transcription as JSON.
Defining the Transcriber Service in Docker Compose
The new service is called transcriber, and it depends on the vosk_pt container, ensuring the model is available before the API starts.
transcriber:
  build:
    context: ./transcriber
    dockerfile: Dockerfile
  container_name: transcriber
  depends_on:
    vosk_pt:
      condition: service_completed_successfully
  environment:
    VOSK_MODEL_DIR: /models/vosk-pt
    SAMPLE_RATE: 16000
    CHANNELS: 1
    MAX_DURATION_MIN: 30
  volumes:
    - vosk-data:/models
  ports:
    - "8000:8000"
Key Points
- Builds from a custom Dockerfile (./transcriber/Dockerfile).
- Waits for the vosk_pt container to finish downloading the model.
- Exposes port 8000 for API access.
- Uses environment variables for model location, sample rate, and limits.
- Mounts the same vosk-data volume, ensuring both services share the model files.
The Transcriber Dockerfile
This container includes Python 3.12, FFmpeg, and the required Python libraries.
FROM python:3.12-slim
# Install ffmpeg for audio/video transcoding
RUN apt-get update && apt-get install -y --no-install-recommends \
        ffmpeg curl ca-certificates && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application source
COPY app ./app
# Model mount point
VOLUME ["/models"]
ENV VOSK_MODEL_DIR=/models/vosk-pt
ENV PYTHONUNBUFFERED=1
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
This ensures:
- FFmpeg is available for preprocessing.
- Dependencies are installed via requirements.txt.
- The application lives under /app.
- The model is mounted at /models/vosk-pt.
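If you want to validate the image on its own before starting the whole stack, it can be built explicitly (optional, since docker-compose also builds it on first start):
docker-compose build transcriber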
Python Dependencies
fastapi==0.115.0
uvicorn==0.30.6
vosk==0.3.45
pydub==0.25.1
httpx==0.27.2
python-multipart==0.0.9
These bring together:
- FastAPI → web API framework.
- Uvicorn → ASGI server.
- Vosk → speech recognition engine.
- Pydub + FFmpeg → audio handling and normalization.
- HTTPX → fetching remote files.
- Multipart → supporting file uploads.
The Transcriber API (main.py)
The application exposes a single endpoint /transcribe, which accepts either an
uploaded file or a remote URL.
from fastapi import FastAPI, UploadFile, File, HTTPException, Form
from fastapi.responses import JSONResponse
from vosk import Model, KaldiRecognizer
import httpx, json, logging, os, re, subprocess, tempfile, uuid, wave, time, threading
from typing import Optional

# Logging
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, LOG_LEVEL, logging.INFO))
logger = logging.getLogger("vosk-transcriber")

# Config
MODEL_DIR = os.environ.get("VOSK_MODEL_DIR", "/models/vosk-pt")
SAMPLE_RATE = int(os.environ.get("SAMPLE_RATE", "16000"))
CHANNELS = int(os.environ.get("CHANNELS", "1"))
MAX_DURATION_MIN = int(os.environ.get("MAX_DURATION_MIN", "30"))

# Lazy-load model
_model_lock = threading.Lock()
_MODEL: Optional[Model] = None

def get_model() -> Model:
    global _MODEL
    if _MODEL is None:
        with _model_lock:
            if _MODEL is None:
                if not os.path.isdir(MODEL_DIR):
                    raise RuntimeError(f"Vosk model not found at {MODEL_DIR}")
                _MODEL = Model(MODEL_DIR)
    return _MODEL

# FastAPI
app = FastAPI(title="Vosk Transcriber")

@app.on_event("startup")
async def on_startup():
    logger.info(f"Loading model from {MODEL_DIR}")
    get_model()

# Helpers
def to_wav_16k_mono(src_path: str, dst_path: str):
    filters = ["afftdn", "silenceremove=1:0:-50dB", "loudnorm=I=-16:TP=-1.5:LRA=11"]
    cmd = ["ffmpeg", "-hide_banner", "-nostdin", "-y",
           "-i", src_path, "-vn", "-ac", str(CHANNELS), "-ar", str(SAMPLE_RATE),
           "-acodec", "pcm_s16le", "-af", ",".join(filters), "-f", "wav", dst_path]
    subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

def duration_seconds_wav(path: str) -> float:
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())

def transcribe_wav(path: str, with_words: bool = False):
    model = get_model()
    with wave.open(path, "rb") as wf:
        rec = KaldiRecognizer(model, wf.getframerate())
        rec.SetWords(with_words)
        while True:
            data = wf.readframes(8000)
            if not data:
                break
            rec.AcceptWaveform(data)
        final = json.loads(rec.FinalResult())
    result = {"text": final.get("text", "")}
    if with_words and "result" in final:
        result["words"] = final["result"]
    return result

@app.post("/transcribe")
async def transcribe(
    file: UploadFile | None = File(default=None),
    sourceUrl: str | None = Form(default=None),
    return_words: bool = Form(default=False),
):
    if not file and not sourceUrl:
        raise HTTPException(status_code=400, detail="Send 'file' (multipart) or 'sourceUrl' (Form).")
    with tempfile.TemporaryDirectory() as tmp:
        src_bin = os.path.join(tmp, f"input-{uuid.uuid4().hex}")
        wav16 = os.path.join(tmp, "audio.wav")
        # Input: upload or URL
        if file:
            with open(src_bin, "wb") as f:
                f.write(await file.read())
        else:
            async with httpx.AsyncClient(follow_redirects=True, timeout=60) as client:
                r = await client.get(sourceUrl)
                r.raise_for_status()
                with open(src_bin, "wb") as f:
                    f.write(r.content)
        # Convert, check duration, transcribe
        to_wav_16k_mono(src_bin, wav16)
        sec = duration_seconds_wav(wav16)
        if sec > MAX_DURATION_MIN * 60:
            raise HTTPException(status_code=413, detail=f"Audio exceeds {MAX_DURATION_MIN} minutes")
        result = transcribe_wav(wav16, with_words=return_words)
        return JSONResponse(content=result, media_type="application/json; charset=utf-8")
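For reference, the Dockerfile and the uvicorn app.main:app entry point imply this layout inside the ./transcriber build context (a sketch of the expected structure; only main.py is listed in this article):
transcriber/
  Dockerfile
  requirements.txt
  app/
    main.py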
Result
At this stage, we have a fully working transcription API:
- Upload an audio/video file or send a URL.
- The service normalizes it into 16kHz mono WAV.
- Vosk processes the file and returns text (and optionally word-level timings).
This service is lightweight, reproducible, and integrates seamlessly into a modern DevOps pipeline — bringing AI-powered transcription into the same ecosystem as Elasticsearch, CouchDB, and Logstash.
Running the Transcriber Service
With both the vosk_pt (model) and transcriber (API) containers defined in our docker-compose.yml, we can start the transcription service by running:
docker-compose up -d vosk_pt transcriber
This will:
- Start the vosk_pt container, which downloads and prepares the Portuguese Vosk model.
- Launch the transcriber container, mounting the shared model volume and exposing the API at http://localhost:8000.
- Ensure that the transcription service is ready to process audio and video files.
You can confirm that everything is running correctly with:
docker-compose ps
Once both containers show as Up, the service is live and ready for testing.
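You can also tail the API logs to confirm the model loaded correctly; the startup hook in main.py logs the model path as soon as FastAPI boots:
docker-compose logs -f transcriber
A line such as "Loading model from /models/vosk-pt" means the shared volume is mounted and the model was found.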
Testing the Transcriber API with curl
With the transcriber service up and running on port 8000, it’s time to validate that everything works as expected.
Instead of writing a custom client, we can simply use curl to interact with the API.
Checking Service Health
The service is powered by FastAPI, so if you open a browser at:
http://localhost:8000/docs
you’ll see the Swagger UI where you can try out the /transcribe endpoint interactively.
But let’s stick to command line tests with curl.
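As a first smoke test, you can probe the auto-generated OpenAPI document and check for an HTTP 200 (this relies only on FastAPI's default routes, nothing custom in main.py):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/openapi.json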
Uploading a Local Audio File
To send a file to the service for transcription:
curl -X POST "http://localhost:8000/transcribe" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@sample.mp3" \
-F "return_words=true"
Explanation:
- file=@sample.mp3 → Replace sample.mp3 with your audio or video file.
- return_words=true → Optional flag to include word-level timings.
- Content-Type: multipart/form-data → Required for file uploads.
The output will look like:
{
"text": "Olá, este é um teste de transcrição em português.",
"words": [
{"word": "olá", "start": 0.1, "end": 0.5, "conf": 0.95},
{"word": "este", "start": 0.6, "end": 0.9, "conf": 0.92},
...
]
}
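If you only need the plain text, the JSON response can be piped through jq (assuming jq is installed on your host):
curl -s -X POST "http://localhost:8000/transcribe" \
  -F "file=@sample.mp3" | jq -r '.text'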
Transcribing from a Remote URL
You can also pass a remote file (audio or video) via the sourceUrl form parameter:
curl -X POST "http://localhost:8000/transcribe" \
-H "accept: application/json" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "sourceUrl=https://www.example.com/audio/teste.mp3" \
-d "return_words=false"
This will:
- Download the remote file inside the container.
- Normalize it to 16kHz mono WAV with FFmpeg.
- Run transcription and return just the text output.
Result
With these two curl tests, we now have:
- File upload support for local testing.
- Remote URL ingestion for flexible input sources.
- Word-level timings available when requested.
This confirms the transcriber API is operational and ready to be integrated into broader pipelines — from document indexing in Elasticsearch to more advanced AI workflows.
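As a small taste of that integration, the API's JSON output can be piped straight into an Elasticsearch index with a second curl call (a rough sketch: the transcripts index name is arbitrary, and the Elasticsearch URL, credentials, and TLS options depend on how your cluster from the earlier parts is exposed):
curl -s -X POST "http://localhost:8000/transcribe" -F "file=@sample.mp3" | \
  curl -s -X POST "http://localhost:9200/transcripts/_doc" \
    -H "Content-Type: application/json" -d @-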
Conclusion
By the end of this stage in my DevOps journey, I managed to extend my stack beyond search and analytics into the world of AI-powered transcription.
With Vosk running inside Docker, I now have a reproducible, open-source environment capable of turning spoken Portuguese into accurate text — without depending on external providers.
What I find most valuable here is how this approach aligns with the same principles that guided my Elastic Stack setup:
- Modularity: Each container (CouchDB, Logstash, Elasticsearch, Kibana, and now Vosk) has a clear, isolated purpose.
- Security & Control: From TLS certificates in CouchDB to the self-managed Vosk model, every part of the pipeline is transparent and under my control.
- Extensibility: Just like I used Logstash filters to normalize company data, I can now imagine applying NLP pipelines or text-mining tools on top of the transcribed content.
This project shows that it’s possible to blend data engineering with machine learning in a way that is both educational and practical. From here, the possibilities are endless:
- Enhancing transcription with speaker diarization or punctuation models
- Sending transcripts into Elasticsearch for search and analytics
- Applying LLMs for summarization or entity extraction
And most importantly: all of it can run locally, securely, and reproducibly.
You can download a version of the code showcased in this article from my GitHub repository.
If you want to review the setup process, take a look at Part 1. Thank you for reading this article!