Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -7,4 +7,5 @@
samples
/output
TODO.md
BENCHMARKS.local.md
vendor/
3 changes: 2 additions & 1 deletion Cargo.lock

Some generated files are not rendered by default.

7 changes: 4 additions & 3 deletions Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "transcribeit"
version = "1.3.0"
version = "1.4.0"
edition = "2024"
rust-version = "1.96"
license-file = "LICENSE"
@@ -14,7 +14,7 @@ strip = "symbols"
incremental = false

[features]
default = ["sherpa-onnx"]
default = []

[build-dependencies]
dotenvy = "0.15"
@@ -33,8 +33,9 @@ serde = { version = "1", features = ["derive"] }
serde_json = "1"
tempfile = "3"
regex = "1"
urlencoding = "2"
tokio = { version = "1", features = ["full"] }
sherpa-onnx = { version = "1.13", optional = true }
sherpa-onnx = { version = "1.13", default-features = false, features = ["shared"], optional = true }
tar = "0.4"
bzip2 = "0.6"
libc = "0.2"
28 changes: 20 additions & 8 deletions README.md
@@ -1,6 +1,6 @@
# transcribeit

A Rust CLI for speech-to-text transcription. Supports local inference via [whisper.cpp](https://github.com/ggerganov/whisper.cpp), local inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx), remote transcription via OpenAI-compatible APIs, Azure OpenAI, and Qwen ASR file transcription.
A Rust CLI for speech-to-text transcription. Supports local inference via [whisper.cpp](https://github.com/ggerganov/whisper.cpp), local inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx), remote transcription via OpenAI-compatible APIs, Azure OpenAI, Qwen ASR file transcription, and Gemini multimodal transcription.

Accepts any audio or video format — FFmpeg handles conversion automatically.

@@ -15,11 +15,11 @@ Accepts any audio or video format — FFmpeg handles conversion automatically.
## Quick start

```bash
# Build (reads SHERPA_ONNX_LIB_DIR from .env automatically via build.rs)
# Build the default binary
cargo build --release

# Build without sherpa-onnx (no shared library dependency needed)
cargo build --release --no-default-features
# Build with sherpa-onnx (reads SHERPA_ONNX_LIB_DIR from .env automatically via build.rs)
cargo build --release --features sherpa-onnx

# Download a GGML model (default format, for --provider local)
transcribeit download-model -s base
@@ -57,13 +57,21 @@ transcribeit run -i meeting.mp4 -m base -f srt -o ./output
# Transcribe via OpenAI API
transcribeit run -p openai -i recording.mp3

# Transcribe via OpenAI hosted diarization
transcribeit run -p openai --remote-model gpt-4o-transcribe-diarize \
-i meeting.mp3 -f srt -o ./output

# Transcribe via Azure OpenAI
transcribeit run -p azure -i recording.mp3 \
--azure-deployment my-whisper -b https://myresource.openai.azure.com

# Transcribe whole files with Qwen ASR via S3/R2 pre-signed URLs
transcribeit run -p qwen-filetrans -i recording.mp3 -f vtt -o ./output

# Transcribe whole files with Gemini Files API + generateContent
transcribeit run -p gemini --remote-model gemini-3.5-flash \
-i recording.mp3 -f vtt -o ./output

# Force language and normalize before transcription
transcribeit run -i recording.wav -m base --language en --normalize

@@ -79,18 +87,20 @@ transcribeit run -i interview.mp3 -m base --speakers 2 \
## Features

- **Any input format** — MP3, MP4, WAV, FLAC, OGG, etc. FFmpeg converts to mono 16kHz WAV automatically.
- **5 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI, and Qwen file transcription. Extensible via the `Transcriber` trait.
- **6 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI, Qwen file transcription, and Gemini. Extensible via the `Transcriber` trait.
- **Qwen ASR whole-file transcription** — `qwen-filetrans` stages audio in S3-compatible storage, passes a pre-signed URL to DashScope, polls the async task, and maps Qwen timestamps into the transcript model.
- **Stable manifest schema** — Manifests use `transcribeit.manifest.v2` with canonical millisecond timestamps, provider-neutral capabilities/quality fields, and provider-specific metadata under `provider_metadata.data`.
- **Qwen provider metadata** — Manifests include Qwen task timing/usage, audio info, per-segment language/emotion, and word-level timestamps. Temporary pre-signed URLs are not persisted.
- **Qwen model guardrails** — Accidental short-audio `qwen3-asr-flash` model selection is rejected before conversion and S3 upload; use `qwen3-asr-flash-filetrans` for this provider.
- **Gemini whole-file transcription** — `gemini` uploads prepared audio through Gemini Files API, calls `generateContent` with structured JSON output, and maps segment timestamps, speaker labels, language, and emotion when returned.
- **3 model architectures via sherpa-onnx** — Whisper, Moonshine, and SenseVoice are auto-detected from the model directory contents. Just point `--model` at any supported model directory.
- **Model aliases** — `-m base`, `-m tiny`, etc. resolve from `MODEL_CACHE_DIR` for both `local` and `sherpa-onnx` providers. The sherpa-onnx resolver also supports glob matching (e.g., `-m moonshine-base`, `-m sense-voice`).
- **Language hinting** — Pass `--language` to force local and API transcription language.
- **FFmpeg audio normalization** — Optional `--normalize` to apply loudnorm before transcription.
- **VAD-based segmentation** — Speech-aware segmentation via Silero VAD (sherpa-onnx). Detects speech boundaries with padding and gap merging to avoid mid-word cuts. Use `--vad-model .cache/silero_vad.onnx`.
- **Silence-based segmentation** — Fallback segmentation via FFmpeg `silencedetect` for API providers or when VAD model is not available.
- **sherpa-onnx auto-segmentation** — Whisper ONNX models only support ≤30s per call; segmentation is enabled automatically.
- **sherpa-onnx is optional** — Enabled by default as a Cargo feature. Build without it: `cargo build --no-default-features`.
- **sherpa-onnx is optional** — Enable it explicitly with `cargo build --features sherpa-onnx` when you need ONNX providers or Sherpa-backed diarization.
- **Auto-split for API limits** — Files exceeding 25MB are automatically segmented when using remote providers.
- **Progress spinner** — Shows live terminal feedback during transcription (single file and segmented mode).
- **Parallel API segment transcription** — Multiple segment requests can be processed concurrently with `--segment-concurrency`.
@@ -110,6 +120,8 @@ HF_TOKEN=hf_your_token_here
MODEL_CACHE_DIR=.cache
SHERPA_ONNX_LIB_DIR=/path/to/sherpa-onnx/lib
OPENAI_API_KEY=sk-your_key_here
GEMINI_API_KEY=your_gemini_key_here
GEMINI_API_BASE_URL=https://generativelanguage.googleapis.com/v1beta
AZURE_API_KEY=your_azure_key_here
AZURE_OPENAI_ENDPOINT=https://myresource.openai.azure.com
AZURE_DEPLOYMENT_NAME=whisper
@@ -149,7 +161,7 @@ On first run, use `transcribeit setup` to download models and additional compone
To build a distributable binary:

```bash
cargo build --release
cargo build --release --features sherpa-onnx
# Copy binary + libs
cp target/release/transcribeit dist/
cp vendor/sherpa-onnx-*/lib/lib*.dylib dist/lib/
@@ -158,7 +170,7 @@ cp vendor/sherpa-onnx-*/lib/lib*.dylib dist/lib/
To build without sherpa-onnx (no shared library dependency):

```bash
cargo build --release --no-default-features
cargo build --release
```

## License
4 changes: 4 additions & 0 deletions Taskfile.yaml
@@ -6,6 +6,10 @@ dotenv:
- .env

tasks:
test-hf-token:
cmds:
- |
test -n "$HF_TOKEN" || (echo "HF_TOKEN is not set" >&2; exit 1)
print-openai-base-url:
cmds:
- |
42 changes: 35 additions & 7 deletions docs/architecture.md
@@ -26,6 +26,7 @@ src/
├── sherpa_onnx.rs # Local sherpa-onnx engine (auto-detects Whisper, Moonshine, SenseVoice)
├── openai_api.rs # OpenAI-compatible REST API
├── azure_openai.rs # Azure OpenAI REST API
├── gemini.rs # Gemini Files API + generateContent
├── qwen_filetrans.rs # Qwen async file transcription provider
├── qwen_filetrans/ # Qwen request/response types and model limits
├── rate_limit.rs # Retry logic and 429 handling
@@ -55,6 +56,7 @@ pub trait Transcriber: Send + Sync {
- **Sherpa-ONNX engine** (`sherpa_onnx`) uses `transcribe()` — it needs decoded samples for the ONNX runtime.
- **OpenAI/Azure API engines** override `transcribe_path()` to upload files directly via multipart, and `transcribe_wav()` to upload in-memory bytes — avoiding the decode→re-encode round-trip.
- **Qwen file transcription** overrides `transcribe_path()` to upload prepared audio to S3-compatible storage, generate a pre-signed URL, and submit that URL to DashScope.
- **Gemini** overrides `transcribe_path()` to upload prepared audio through Gemini Files API and call `generateContent` with structured JSON output.
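
The override pattern in the list above can be sketched as follows. This is a simplified, synchronous illustration and not the project's actual trait: the method signatures, `decode_stub`, `LocalEngine`, and `ApiEngine` are all hypothetical, and the real trait is presumably async with richer return types.

```rust
use std::path::Path;

/// Hypothetical, synchronous stand-in for the real trait.
trait Transcriber {
    /// Transcribe decoded mono samples.
    fn transcribe(&self, samples: &[f32]) -> String;

    /// Default path-based entry point: decode first, then reuse `transcribe`.
    fn transcribe_path(&self, path: &Path) -> String {
        let samples = decode_stub(path);
        self.transcribe(&samples)
    }
}

/// Placeholder decoder (assumption); a real one would read and resample the file.
fn decode_stub(_path: &Path) -> Vec<f32> {
    vec![0.0; 16_000] // one second of silence at 16 kHz
}

/// Local-style engine: implements only the sample-based method.
struct LocalEngine;

impl Transcriber for LocalEngine {
    fn transcribe(&self, samples: &[f32]) -> String {
        format!("local: {} samples", samples.len())
    }
}

/// API-style engine: overrides `transcribe_path` to skip the decode step.
struct ApiEngine;

impl Transcriber for ApiEngine {
    fn transcribe(&self, samples: &[f32]) -> String {
        format!("api (in-memory): {} samples", samples.len())
    }

    fn transcribe_path(&self, path: &Path) -> String {
        // A real engine would upload the file here instead of decoding it.
        format!("api upload: {}", path.display())
    }
}
```

Calling `transcribe_path` on `LocalEngine` goes through the default decode step, while `ApiEngine` handles the path directly, which is the round-trip the API engines are described as avoiding.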

## Processing pipeline

@@ -64,7 +66,7 @@ The `pipeline.rs` module orchestrates the full flow:
Input file (any format)
├─ needs_conversion()? ──→ extract_to_wav(normalize) for local provider
├─ upload_as_mp3(normalize) for API providers and Qwen filetrans (16kHz mono MP3)
├─ upload_as_mp3(normalize) for OpenAI/Azure, Qwen filetrans, and Gemini (16kHz mono MP3)
├─ get_duration() via ffprobe
@@ -99,11 +101,23 @@ Input file (any format)
├─ Text to stdout or `<input_stem>.txt`
├─ VTT to file or stdout (with `<v Speaker N>` tags when diarized)
├─ SRT to file or stdout (with `[Speaker N]` labels when diarized)
└─ JSON manifest to output directory (includes speaker field per segment)
└─ JSON manifest to output directory (`transcribeit.manifest.v2`)
```

Temporary files use the `tempfile` crate and are cleaned up automatically on drop.

## Manifest contract

When `--output-dir` is set, the JSON manifest is the stable machine-readable contract for downstream applications. The current schema is `transcribeit.manifest.v2`.

- `transcript.text` and `transcript.segments` are the preferred consumer-facing transcript fields.
- Segment and word timestamps include canonical integer millisecond fields (`start_ms`, `end_ms`) plus second fields for readability.
- `capabilities` describes which optional fields are present, such as word timestamps, speaker labels, segment language, and emotion.
- `quality` describes how reliable timing/speaker metadata is, including `timing_source`, `timing_reliable`, and `timestamps_clamped`.
- `provider_metadata` is a stable envelope: `{ "provider": "...", "schema_version": "...", "data": { ... } }`.
- Provider-specific payloads live only under `provider_metadata.data`; temporary URLs and secrets must not be persisted.
- The top-level `segments` array remains as a compatibility mirror for older consumers.
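
As an illustration of the timestamp rules above, the following sketch derives integer millisecond fields from second values and reports whether a segment had to be clamped to the audio duration. The struct and function names are hypothetical, and the clamping rule is an assumption about how a flag like `timestamps_clamped` might be computed, not the project's implementation.

```rust
/// Hypothetical segment timing; the real manifest carries more fields.
#[derive(Debug, PartialEq)]
struct SegmentTiming {
    start_ms: u64,
    end_ms: u64,
    start_secs: f64,
    end_secs: f64,
}

/// Convert seconds to whole milliseconds, rounding to nearest.
/// Negative inputs are treated as zero.
fn secs_to_ms(secs: f64) -> u64 {
    (secs.max(0.0) * 1000.0).round() as u64
}

/// Build a segment timing clamped to the audio duration.
/// Returns the timing and whether any clamping was applied.
fn clamp_segment(start_secs: f64, end_secs: f64, duration_secs: f64) -> (SegmentTiming, bool) {
    let duration_ms = secs_to_ms(duration_secs);
    let raw_start = secs_to_ms(start_secs);
    let raw_end = secs_to_ms(end_secs);

    // Keep start within the audio, and end between start and the audio end.
    let start_ms = raw_start.min(duration_ms);
    let end_ms = raw_end.clamp(start_ms, duration_ms);
    let clamped = start_ms != raw_start || end_ms != raw_end;

    let timing = SegmentTiming {
        start_ms,
        end_ms,
        // Second fields are derived from the canonical millisecond values.
        start_secs: start_ms as f64 / 1000.0,
        end_secs: end_ms as f64 / 1000.0,
    };
    (timing, clamped)
}
```

Deriving the second fields from the millisecond values keeps the two representations consistent, so a consumer reading either one sees the same boundaries.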

## Engines

### Local (`whisper_local.rs`)
@@ -147,6 +161,20 @@ The S3 staging implementation lives in `storage::s3` and works with AWS S3-compa

Short synchronous Qwen models such as `qwen3-asr-flash` use a different API path and have strict 10 MB / 300 second limits. If one is selected with `-p qwen-filetrans`, the CLI fails before conversion or S3 upload.
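
A pre-flight check of this kind might look like the sketch below. The function name and the suffix rule are assumptions made for illustration; the actual guard may use an explicit model list or different criteria.

```rust
/// Hypothetical pre-flight check, intended to run before conversion or upload.
fn validate_filetrans_model(model: &str) -> Result<(), String> {
    // Assumption: file-transcription models carry a `-filetrans` suffix.
    if model.ends_with("-filetrans") {
        Ok(())
    } else {
        Err(format!(
            "model `{}` is not a file-transcription model; \
             try `qwen3-asr-flash-filetrans` with -p qwen-filetrans",
            model
        ))
    }
}
```

Running the check first means a wrong model name costs nothing beyond argument parsing, since no FFmpeg conversion or S3 upload has started yet.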

### Gemini (`gemini.rs`)

Uses Gemini Files API and `generateContent` for whole-file multimodal transcription. The provider:

- converts input audio/video to 16 kHz mono MP3
- uploads the prepared file with a resumable Files API upload
- waits for the file to become `ACTIVE`
- requests structured JSON with `text`, segment timestamps, speaker, language, and emotion fields
- maps valid segments into the normalized transcript/manifest model
- falls back to generated transcript text when structured JSON is missing or invalid
- deletes the temporary Gemini file after the transcription request

Gemini is not a dedicated ASR endpoint. Timestamp, speaker, language, and emotion values come from the model's structured output, so benchmark quality before relying on them for subtitle workflows.
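
The "wait for the file to become `ACTIVE`" step can be sketched as a bounded polling loop. This is a minimal illustration, not the provider's code: `FileState`, `wait_until_active`, and the attempt and delay parameters are hypothetical, and the state check is passed in as a closure so the loop stays independent of any HTTP client.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical file states; only `ACTIVE` is named in the text above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FileState {
    Processing,
    Active,
    Failed,
}

/// Poll `check` until the file is active, fails, or attempts run out.
/// Returns the number of the attempt on which the file became active.
fn wait_until_active<F>(mut check: F, max_attempts: u32, delay: Duration) -> Result<u32, String>
where
    F: FnMut() -> FileState,
{
    for attempt in 1..=max_attempts {
        match check() {
            FileState::Active => return Ok(attempt),
            FileState::Failed => return Err("file processing failed".to_string()),
            FileState::Processing => {
                // Sleep only between attempts, not after the last one.
                if attempt < max_attempts {
                    sleep(delay);
                }
            }
        }
    }
    Err(format!("file not active after {} attempts", max_attempts))
}
```

Bounding the loop by attempt count keeps a stuck upload from blocking the run indefinitely; an async variant would swap `sleep` for the runtime's timer.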

### Sherpa-ONNX (`sherpa_onnx.rs`)

Local inference using [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) with automatic model architecture detection. Uses a **dedicated worker thread pattern**: the `OfflineRecognizer` is created on a plain `std::thread` (not on the Tokio runtime) and stays there for its entire lifetime. Transcription requests are sent to the thread via an `mpsc` channel and results come back through `tokio::sync::oneshot` channels. This design avoids:
@@ -190,7 +218,7 @@ All settings (timeout, retries, wait times) are configurable via CLI flags and e

### Shared WAV encoding

OpenAI/Azure engines can send file uploads directly and choose the correct container format for compatibility (WAV for local transcribe path, MP3 for API provider uploads). Qwen file transcription stages MP3 in S3-compatible storage and sends DashScope a pre-signed URL. The `audio::wav::encode_wav()` helper is still used by local engines and non-file upload paths.
OpenAI/Azure engines can send file uploads directly and choose the correct container format for compatibility (WAV for local transcribe path, MP3 for API provider uploads). Qwen file transcription stages MP3 in S3-compatible storage and sends DashScope a pre-signed URL. Gemini uploads MP3 through Gemini Files API. The `audio::wav::encode_wav()` helper is still used by local engines and non-file upload paths.

## Model cache (`model_cache.rs`)

@@ -202,7 +230,7 @@ OpenAI/Azure engines can send file uploads directly and choose the correct conta

## Build requirements

The `sherpa-onnx` Cargo feature is **enabled by default**. It requires the sherpa-onnx shared libraries at both compile time and runtime. The `build.rs` script loads a `.env` file and reads `SHERPA_ONNX_LIB_DIR` to configure the linker search path and embed an `rpath` so the binary can find the dylibs at runtime.
The `sherpa-onnx` Cargo feature is opt-in. It requires the sherpa-onnx shared libraries at both compile time and runtime. The `build.rs` script loads a `.env` file and reads `SHERPA_ONNX_LIB_DIR` to configure the linker search path and embed an `rpath` so the binary can find the dylibs at runtime.

Set `SHERPA_ONNX_LIB_DIR` in your `.env` file or environment before building:

@@ -211,13 +239,13 @@ Set `SHERPA_ONNX_LIB_DIR` in your `.env` file or environment before building:
SHERPA_ONNX_LIB_DIR=/path/to/sherpa-onnx/lib
```

To build without the sherpa-onnx dependency entirely:
To build with sherpa-onnx enabled:

```bash
cargo build --release --no-default-features
cargo build --release --features sherpa-onnx
```

This removes the sherpa-onnx provider and eliminates the need for `SHERPA_ONNX_LIB_DIR`.
The default build omits the sherpa-onnx provider and eliminates the need for `SHERPA_ONNX_LIB_DIR`.
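
For readers unfamiliar with how a build script passes a library directory to the linker, here is a rough sketch of the directives involved. It is not the project's `build.rs`: the helper names are hypothetical, `.env` loading is omitted, and the `-Wl,-rpath` argument assumes a GCC- or Clang-style linker.

```rust
/// Build the Cargo directives for a shared-library directory.
fn linker_directives(lib_dir: &str) -> Vec<String> {
    vec![
        // Add the directory to the linker search path.
        format!("cargo:rustc-link-search=native={}", lib_dir),
        // Embed an rpath so the binary can locate the shared libraries at runtime.
        format!("cargo:rustc-link-arg=-Wl,-rpath,{}", lib_dir),
        // Re-run the build script when the variable changes.
        "cargo:rerun-if-env-changed=SHERPA_ONNX_LIB_DIR".to_string(),
    ]
}

/// Return directives only when the variable is set and non-empty.
fn directives_for_env(value: Option<&str>) -> Vec<String> {
    match value {
        Some(dir) if !dir.trim().is_empty() => linker_directives(dir.trim()),
        _ => Vec::new(),
    }
}
```

In a build script, `main` would read `std::env::var("SHERPA_ONNX_LIB_DIR").ok()`, pass it to `directives_for_env` via `as_deref()`, and print each returned line to stdout, which is how Cargo receives the directives.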

## VAD-based segmentation (`audio/vad.rs`)
