transcriptintel · skitsanos · Jun 10, 2026 · Jun 9, 2026 · Jun 10, 2026 · Jun 10, 2026
diff --git a/.gitignore b/.gitignore
@@ -7,4 +7,5 @@
 samples
 /output
 TODO.md
+BENCHMARKS.local.md
 vendor/
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "transcribeit"
-version = "1.3.0"
+version = "1.4.0"
 edition = "2024"
 rust-version = "1.96"
 license-file = "LICENSE"
@@ -14,7 +14,7 @@ strip = "symbols"
 incremental = false
 
 [features]
-default = ["sherpa-onnx"]
+default = []
 
 [build-dependencies]
 dotenvy = "0.15"
@@ -33,8 +33,9 @@ serde = { version = "1", features = ["derive"] }
 serde_json = "1"
 tempfile = "3"
 regex = "1"
+urlencoding = "2"
 tokio = { version = "1", features = ["full"] }
-sherpa-onnx = { version = "1.13", optional = true }
+sherpa-onnx = { version = "1.13", default-features = false, features = ["shared"], optional = true }
 tar = "0.4"
 bzip2 = "0.6"
 libc = "0.2"
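
Reviewer note on the `default = []` change above: with `sherpa-onnx` now opt-in, code that depends on it has to be gated on the Cargo feature. The sketch below shows one way such a gate could look. The function `available_providers` is hypothetical and not taken from this repository; the provider names are the `-p` values shown in the README.

```rust
/// Returns the provider names compiled into this binary.
/// Hypothetical helper for illustration; not part of the transcribeit codebase.
fn available_providers() -> Vec<&'static str> {
    // Providers that need no optional native library.
    let mut providers = vec!["local", "openai", "azure", "qwen-filetrans", "gemini"];

    // `cfg!` expands to a compile-time boolean. It is true only when the crate
    // is built with `--features sherpa-onnx`, so the default build skips this.
    if cfg!(feature = "sherpa-onnx") {
        providers.push("sherpa-onnx");
    }
    providers
}

fn main() {
    println!("providers: {}", available_providers().join(", "));
}
```

A real implementation would more likely put `#[cfg(feature = "sherpa-onnx")]` on the module and its CLI variant so the dependency is not compiled at all; `cfg!` is used here only to keep the example in one function.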

diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # transcribeit
 
-A Rust CLI for speech-to-text transcription. Supports local inference via [whisper.cpp](https://github.com/ggerganov/whisper.cpp), local inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx), remote transcription via OpenAI-compatible APIs, Azure OpenAI, and Qwen ASR file transcription.
+A Rust CLI for speech-to-text transcription. Supports local inference via [whisper.cpp](https://github.com/ggerganov/whisper.cpp), local inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx), remote transcription via OpenAI-compatible APIs, Azure OpenAI, Qwen ASR file transcription, and Gemini multimodal transcription.
 
 Accepts any audio or video format — FFmpeg handles conversion automatically.
 
@@ -15,11 +15,11 @@ Accepts any audio or video format — FFmpeg handles conversion automatically.
 ## Quick start
 
 ```bash
-# Build (reads SHERPA_ONNX_LIB_DIR from .env automatically via build.rs)
+# Build the default binary
 cargo build --release
 
-# Build without sherpa-onnx (no shared library dependency needed)
-cargo build --release --no-default-features
+# Build with sherpa-onnx (reads SHERPA_ONNX_LIB_DIR from .env automatically via build.rs)
+cargo build --release --features sherpa-onnx
 
 # Download a GGML model (default format, for --provider local)
 transcribeit download-model -s base
@@ -57,13 +57,21 @@ transcribeit run -i meeting.mp4 -m base -f srt -o ./output
 # Transcribe via OpenAI API
 transcribeit run -p openai -i recording.mp3
 
+# Transcribe via OpenAI hosted diarization
+transcribeit run -p openai --remote-model gpt-4o-transcribe-diarize \
+  -i meeting.mp3 -f srt -o ./output
+
 # Transcribe via Azure OpenAI
 transcribeit run -p azure -i recording.mp3 \
   --azure-deployment my-whisper -b https://myresource.openai.azure.com
 
 # Transcribe whole files with Qwen ASR via S3/R2 pre-signed URLs
 transcribeit run -p qwen-filetrans -i recording.mp3 -f vtt -o ./output
 
+# Transcribe whole files with Gemini Files API + generateContent
+transcribeit run -p gemini --remote-model gemini-3.5-flash \
+  -i recording.mp3 -f vtt -o ./output
+
 # Force language and normalize before transcription
 transcribeit run -i recording.wav -m base --language en --normalize
 
@@ -79,18 +87,20 @@ transcribeit run -i interview.mp3 -m base --speakers 2 \
 ## Features
 
 - **Any input format** — MP3, MP4, WAV, FLAC, OGG, etc. FFmpeg converts to mono 16kHz WAV automatically.
-- **5 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI, and Qwen file transcription. Extensible via the `Transcriber` trait.
+- **6 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI, Qwen file transcription, and Gemini. Extensible via the `Transcriber` trait.
 - **Qwen ASR whole-file transcription** — `qwen-filetrans` stages audio in S3-compatible storage, passes a pre-signed URL to DashScope, polls the async task, and maps Qwen timestamps into the transcript model.
+- **Stable manifest schema** — Manifests use `transcribeit.manifest.v2` with canonical millisecond timestamps, provider-neutral capabilities/quality fields, and provider-specific metadata under `provider_metadata.data`.
 - **Qwen provider metadata** — Manifests include Qwen task timing/usage, audio info, per-segment language/emotion, and word-level timestamps. Temporary pre-signed URLs are not persisted.
 - **Qwen model guardrails** — Accidental short-audio `qwen3-asr-flash` model selection is rejected before conversion and S3 upload; use `qwen3-asr-flash-filetrans` for this provider.
+- **Gemini whole-file transcription** — `gemini` uploads prepared audio through Gemini Files API, calls `generateContent` with structured JSON output, and maps segment timestamps, speaker labels, language, and emotion when returned.
 - **3 model architectures via sherpa-onnx** — Whisper, Moonshine, and SenseVoice are auto-detected from the model directory contents. Just point `--model` at any supported model directory.
 - **Model aliases** — `-m base`, `-m tiny`, etc. resolve from `MODEL_CACHE_DIR` for both `local` and `sherpa-onnx` providers. The sherpa-onnx resolver also supports glob matching (e.g., `-m moonshine-base`, `-m sense-voice`).
 - **Language hinting** — Pass `--language` to force local and API transcription language.
 - **FFmpeg audio normalization** — Optional `--normalize` to apply loudnorm before transcription.
 - **VAD-based segmentation** — Speech-aware segmentation via Silero VAD (sherpa-onnx). Detects speech boundaries with padding and gap merging to avoid mid-word cuts. Use `--vad-model .cache/silero_vad.onnx`.
 - **Silence-based segmentation** — Fallback segmentation via FFmpeg `silencedetect` for API providers or when VAD model is not available.
 - **sherpa-onnx auto-segmentation** — Whisper ONNX models only support ≤30s per call; segmentation is enabled automatically.
-- **sherpa-onnx is optional** — Enabled by default as a Cargo feature. Build without it: `cargo build --no-default-features`.
+- **sherpa-onnx is optional** — Enable it explicitly with `cargo build --features sherpa-onnx` when you need ONNX providers or Sherpa-backed diarization.
 - **Auto-split for API limits** — Files exceeding 25MB are automatically segmented when using remote providers.
 - **Progress spinner** — Shows live terminal feedback during transcription (single file and segmented mode).
 - **Parallel API segment transcription** — Multiple segment requests can be processed concurrently with `--segment-concurrency`.
@@ -110,6 +120,8 @@ HF_TOKEN=hf_your_token_here
 MODEL_CACHE_DIR=.cache
 SHERPA_ONNX_LIB_DIR=/path/to/sherpa-onnx/lib
 OPENAI_API_KEY=sk-your_key_here
+GEMINI_API_KEY=your_gemini_key_here
+GEMINI_API_BASE_URL=https://generativelanguage.googleapis.com/v1beta
 AZURE_API_KEY=your_azure_key_here
 AZURE_OPENAI_ENDPOINT=https://myresource.openai.azure.com
 AZURE_DEPLOYMENT_NAME=whisper
@@ -149,7 +161,7 @@ On first run, use `transcribeit setup` to download models and additional compone
 To build a distributable binary:
 
 ```bash
-cargo build --release
+cargo build --release --features sherpa-onnx
 # Copy binary + libs
 cp target/release/transcribeit dist/
 cp vendor/sherpa-onnx-*/lib/lib*.dylib dist/lib/
@@ -158,7 +170,7 @@ cp vendor/sherpa-onnx-*/lib/lib*.dylib dist/lib/
 To build without sherpa-onnx (no shared library dependency):
 
 ```bash
-cargo build --release --no-default-features
+cargo build --release
 ```
 
 ## License
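
Reviewer note on the "Stable manifest schema" bullet in the README diff above: the manifest carries canonical integer millisecond timestamps next to second fields. The sketch below shows how a segment could derive both from one provider value. The `Segment` struct, the seconds field names `start` and `end`, and the rounding and ordering rules are all assumptions for illustration; the diff only names `start_ms` and `end_ms`.

```rust
/// Hypothetical segment shape. `start_ms` and `end_ms` come from the manifest
/// contract; the seconds field names are assumed.
#[derive(Debug, PartialEq)]
struct Segment {
    start_ms: u64,
    end_ms: u64,
    start: f64,
    end: f64,
    text: String,
}

/// Converts seconds to whole milliseconds.
/// Negative, NaN, or infinite input maps to 0 (an assumed policy).
fn secs_to_ms(secs: f64) -> u64 {
    if !secs.is_finite() || secs <= 0.0 {
        return 0;
    }
    (secs * 1000.0).round() as u64
}

/// Builds a segment whose second fields are derived from the millisecond
/// fields, so the two representations cannot drift apart.
fn make_segment(start_secs: f64, end_secs: f64, text: &str) -> Segment {
    let start_ms = secs_to_ms(start_secs);
    // Keep the end at or after the start (assumed policy).
    let end_ms = secs_to_ms(end_secs).max(start_ms);
    Segment {
        start_ms,
        end_ms,
        start: start_ms as f64 / 1000.0,
        end: end_ms as f64 / 1000.0,
        text: text.to_string(),
    }
}

fn main() {
    println!("{:?}", make_segment(1.25, 3.5, "hello"));
}
```

Treating the integer fields as the source of truth means consumers that compare or sort by time never depend on floating-point formatting.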

diff --git a/Taskfile.yaml b/Taskfile.yaml
@@ -6,6 +6,10 @@ dotenv:
   - .env
 
 tasks:
+  test-hf-token:
+    cmds:
+      - |
+        test -n "$HF_TOKEN" || (echo "HF_TOKEN is not set" >&2; exit 1)
   print-openai-base-url:
     cmds:
       - |

diff --git a/docs/architecture.md b/docs/architecture.md
@@ -26,6 +26,7 @@ src/
     ├── sherpa_onnx.rs     # Local sherpa-onnx engine (auto-detects Whisper, Moonshine, SenseVoice)
     ├── openai_api.rs      # OpenAI-compatible REST API
     ├── azure_openai.rs    # Azure OpenAI REST API
+    ├── gemini.rs          # Gemini Files API + generateContent
     ├── qwen_filetrans.rs  # Qwen async file transcription provider
     ├── qwen_filetrans/    # Qwen request/response types and model limits
     ├── rate_limit.rs      # Retry logic and 429 handling
@@ -55,6 +56,7 @@ pub trait Transcriber: Send + Sync {
 - **Sherpa-ONNX engine** (`sherpa_onnx`) uses `transcribe()` — it needs decoded samples for the ONNX runtime.
 - **OpenAI/Azure API engines** override `transcribe_path()` to upload files directly via multipart, and `transcribe_wav()` to upload in-memory bytes — avoiding the decode→re-encode round-trip.
 - **Qwen file transcription** overrides `transcribe_path()` to upload prepared audio to S3-compatible storage, generate a pre-signed URL, and submit that URL to DashScope.
+- **Gemini** overrides `transcribe_path()` to upload prepared audio through Gemini Files API and call `generateContent` with structured JSON output.
 
 ## Processing pipeline
 
@@ -64,7 +66,7 @@ The `pipeline.rs` module orchestrates the full flow:
 Input file (any format)
   │
   ├─ needs_conversion()? ──→ extract_to_wav(normalize) for local provider
-  ├─ upload_as_mp3(normalize) for API providers and Qwen filetrans (16kHz mono MP3)
+  ├─ upload_as_mp3(normalize) for OpenAI/Azure, Qwen filetrans, and Gemini (16kHz mono MP3)
   │
   ├─ get_duration() via ffprobe
   │
@@ -99,11 +101,23 @@ Input file (any format)
       ├─ Text to stdout or `<input_stem>.txt`
       ├─ VTT to file or stdout (with `<v Speaker N>` tags when diarized)
       ├─ SRT to file or stdout (with `[Speaker N]` labels when diarized)
-      └─ JSON manifest to output directory (includes speaker field per segment)
+      └─ JSON manifest to output directory (`transcribeit.manifest.v2`)
 ```
 
 Temporary files use the `tempfile` crate and are cleaned up automatically on drop.
 
+## Manifest contract
+
+When `--output-dir` is set, the JSON manifest is the stable machine-readable contract for downstream applications. The current schema is `transcribeit.manifest.v2`.
+
+- `transcript.text` and `transcript.segments` are the preferred consumer-facing transcript fields.
+- Segment and word timestamps include canonical integer millisecond fields (`start_ms`, `end_ms`) plus second fields for readability.
+- `capabilities` describes which optional fields are present, such as word timestamps, speaker labels, segment language, and emotion.
+- `quality` describes how reliable timing/speaker metadata is, including `timing_source`, `timing_reliable`, and `timestamps_clamped`.
+- `provider_metadata` is a stable envelope: `{ "provider": "...", "schema_version": "...", "data": { ... } }`.
+- Provider-specific payloads live only under `provider_metadata.data`; temporary URLs and secrets must not be persisted.
+- The top-level `segments` array remains as a compatibility mirror for older consumers.
+
 ## Engines
 
 ### Local (`whisper_local.rs`)
@@ -147,6 +161,20 @@ The S3 staging implementation lives in `storage::s3` and works with AWS S3-compa
 
 Short synchronous Qwen models such as `qwen3-asr-flash` use a different API path and have strict 10 MB / 300 second limits. If one is selected with `-p qwen-filetrans`, the CLI fails before conversion or S3 upload.
 
+### Gemini (`gemini.rs`)
+
+Uses Gemini Files API and `generateContent` for whole-file multimodal transcription. The provider:
+
+- converts input audio/video to 16 kHz mono MP3
+- uploads the prepared file with a resumable Files API upload
+- waits for the file to become `ACTIVE`
+- requests structured JSON with `text`, segment timestamps, speaker, language, and emotion fields
+- maps valid segments into the normalized transcript/manifest model
+- falls back to generated transcript text when structured JSON is missing or invalid
+- deletes the temporary Gemini file after the transcription request
+
+Gemini is not a dedicated ASR endpoint. Timestamp, speaker, language, and emotion values come from the model's structured output, so benchmark quality before relying on them for subtitle workflows.
+
 ### Sherpa-ONNX (`sherpa_onnx.rs`)
 
 Local inference using [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) with automatic model architecture detection. Uses a **dedicated worker thread pattern**: the `OfflineRecognizer` is created on a plain `std::thread` (not on the Tokio runtime) and stays there for its entire lifetime. Transcription requests are sent to the thread via an `mpsc` channel and results come back through `tokio::sync::oneshot` channels. This design avoids:
@@ -190,7 +218,7 @@ All settings (timeout, retries, wait times) are configurable via CLI flags and e
 
 ### Shared WAV encoding
 
-OpenAI/Azure engines can send file uploads directly and choose the correct container format for compatibility (WAV for local transcribe path, MP3 for API provider uploads). Qwen file transcription stages MP3 in S3-compatible storage and sends DashScope a pre-signed URL. The `audio::wav::encode_wav()` helper is still used by local engines and non-file upload paths.
+OpenAI/Azure engines can send file uploads directly and choose the correct container format for compatibility (WAV for local transcribe path, MP3 for API provider uploads). Qwen file transcription stages MP3 in S3-compatible storage and sends DashScope a pre-signed URL. Gemini uploads MP3 through Gemini Files API. The `audio::wav::encode_wav()` helper is still used by local engines and non-file upload paths.
 
 ## Model cache (`model_cache.rs`)
 
@@ -202,7 +230,7 @@ OpenAI/Azure engines can send file uploads directly and choose the correct conta
 
 ## Build requirements
 
-The `sherpa-onnx` Cargo feature is **enabled by default**. It requires the sherpa-onnx shared libraries at both compile time and runtime. The `build.rs` script loads a `.env` file and reads `SHERPA_ONNX_LIB_DIR` to configure the linker search path and embed an `rpath` so the binary can find the dylibs at runtime.
+The `sherpa-onnx` Cargo feature is opt-in. It requires the sherpa-onnx shared libraries at both compile time and runtime. The `build.rs` script loads a `.env` file and reads `SHERPA_ONNX_LIB_DIR` to configure the linker search path and embed an `rpath` so the binary can find the dylibs at runtime.
 
 Set `SHERPA_ONNX_LIB_DIR` in your `.env` file or environment before building:
 
@@ -211,13 +239,13 @@ Set `SHERPA_ONNX_LIB_DIR` in your `.env` file or environment before building:
 SHERPA_ONNX_LIB_DIR=/path/to/sherpa-onnx/lib
 ```
 
-To build without the sherpa-onnx dependency entirely:
+To build with sherpa-onnx enabled:
 
 ```bash
-cargo build --release --no-default-features
+cargo build --release --features sherpa-onnx
 ```
 
-This removes the sherpa-onnx provider and eliminates the need for `SHERPA_ONNX_LIB_DIR`.
+The default build omits the sherpa-onnx provider and eliminates the need for `SHERPA_ONNX_LIB_DIR`.
 
 ## VAD-based segmentation (`audio/vad.rs`)
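
Reviewer note on the Gemini section added to `docs/architecture.md` above: the provider maps valid segments and falls back to plain transcript text when the structured JSON is missing or invalid. The sketch below illustrates that decision under stated assumptions. The types, the function `build_transcript`, and the validity rule (non-negative start, end after start, end within the audio duration, non-empty text) are hypothetical; the diff does not define what counts as a valid segment, and the real provider may clamp timestamps instead of dropping segments.

```rust
/// Hypothetical segment as parsed from the model's structured output.
#[derive(Debug, Clone, PartialEq)]
struct RawSegment {
    start_ms: i64,
    end_ms: i64,
    text: String,
}

/// Result of the mapping step: timed segments, or text only.
#[derive(Debug, PartialEq)]
enum Transcript {
    Segmented(Vec<RawSegment>),
    TextOnly(String),
}

/// Keeps segments that pass the assumed validity rule and falls back to the
/// plain text when parsing failed (`None`) or no segment survives.
fn build_transcript(
    parsed: Option<Vec<RawSegment>>,
    fallback_text: &str,
    duration_ms: i64,
) -> Transcript {
    let valid: Vec<RawSegment> = parsed
        .unwrap_or_default()
        .into_iter()
        .filter(|s| {
            s.start_ms >= 0
                && s.end_ms > s.start_ms
                && s.end_ms <= duration_ms
                && !s.text.trim().is_empty()
        })
        .collect();

    if valid.is_empty() {
        Transcript::TextOnly(fallback_text.to_string())
    } else {
        Transcript::Segmented(valid)
    }
}

fn main() {
    let segments = vec![RawSegment { start_ms: 0, end_ms: 500, text: "hello".into() }];
    println!("{:?}", build_transcript(Some(segments), "hello", 1_000));
    println!("{:?}", build_transcript(None, "hello", 1_000));
}
```

Because Gemini timestamps come from model output rather than a dedicated ASR endpoint, a check like this is a cheap guard before the values reach subtitle formats.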