2 changes: 1 addition & 1 deletion Cargo.lock

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "transcribeit"
-version = "1.5.0"
+version = "1.6.0"
edition = "2024"
rust-version = "1.96"
license-file = "LICENSE"
41 changes: 38 additions & 3 deletions README.md
@@ -1,6 +1,6 @@
# transcribeit

-A Rust CLI for speech-to-text transcription. Supports local inference via [whisper.cpp](https://github.com/ggerganov/whisper.cpp), local inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx), remote transcription via OpenAI-compatible APIs, Azure OpenAI, Qwen ASR file transcription, Gemini multimodal transcription, and NVIDIA hosted Riva ASR.
+A Rust CLI for speech-to-text transcription. Supports local inference via [whisper.cpp](https://github.com/ggerganov/whisper.cpp), local inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx), remote transcription via OpenAI-compatible APIs, Azure OpenAI, Qwen ASR file transcription, Gemini multimodal transcription, NVIDIA hosted Riva ASR, and Deepgram.

Accepts any audio or video format — FFmpeg handles conversion automatically.

@@ -10,8 +10,9 @@ Accepts any audio or video format — FFmpeg handles conversion automatically.
- [FFmpeg](https://ffmpeg.org/) installed and on PATH
- C/C++ toolchain and CMake (for building whisper.cpp)
- sherpa-onnx shared libraries (if using the `sherpa-onnx` provider) — set `SHERPA_ONNX_LIB_DIR` in `.env` to the directory containing them
-- S3-compatible storage credentials when using `qwen-filetrans`; Cloudflare R2 is supported through `S3_ENDPOINT_URL`
+- S3-compatible storage credentials when using `qwen-filetrans` or Deepgram pre-signed URL mode; Cloudflare R2 is supported through `S3_ENDPOINT_URL`
- NVIDIA API key and hosted Riva function id when using `nvidia-riva`
- Deepgram API key when using `deepgram`

## Quick start

@@ -80,6 +81,14 @@ transcribeit run -p gemini --gemini-file-cache \
transcribeit run -p gemini --gemini-explicit-cache --gemini-cache-ttl-secs 3600 \
-i recording.mp3 -f vtt -o ./output

# Use S3/R2 pre-signed URL input for a one-off Gemini run
transcribeit run -p gemini --gemini-use-presigned-url \
-i recording.mp3 -f vtt -o ./output

# Delete temporary staged provider resources after the provider consumes them
transcribeit run -p qwen-filetrans --autoclean \
-i recording.mp3 -f vtt -o ./output

# Transcribe with Gemini and add a structured summary to the manifest
transcribeit run -p gemini --analysis summary \
-i interview.mp4 -f vtt -o ./output
@@ -90,6 +99,20 @@ transcribeit run -p nvidia-riva -i recording.wav \
--nvidia-riva-function-id "$NVIDIA_RIVA_FUNCTION_ID" \
-f vtt -o ./output

# Transcribe with Deepgram Nova-3 batch ASR and provider-native diarization
transcribeit run -p deepgram --remote-model nova-3 --diarize \
-i recording.wav -f vtt -o ./output

# Transcribe with Deepgram by staging the prepared audio in S3/R2 first
transcribeit run -p deepgram --remote-model nova-3 --deepgram-use-presigned-url \
-i recording.wav -f vtt -o ./output

# Transcribe with Deepgram Nova-3 Medical, intelligence metadata, and domain keyterms
transcribeit run -p deepgram --remote-model nova-3-medical \
--diarize --deepgram-intelligence \
--deepgram-keyterm Ofev --deepgram-keyterm Esbriet --deepgram-keyterm IPF \
-i interview.wav -f vtt -o ./output

# Force language and normalize before transcription
transcribeit run -i recording.wav -m base --language en --normalize

@@ -105,17 +128,21 @@ transcribeit run -i interview.mp3 -m base --diarize --speakers 2 \
## Features

- **Any input format** — MP3, MP4, WAV, FLAC, OGG, etc. FFmpeg converts to mono 16kHz WAV automatically.
-- **7 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI, Qwen file transcription, Gemini, and NVIDIA Riva. Extensible via the `Transcriber` trait.
+- **8 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI, Qwen file transcription, Gemini, NVIDIA Riva, and Deepgram. Extensible via the `Transcriber` trait.
- **Qwen ASR whole-file transcription** — `qwen-filetrans` stages audio in S3-compatible storage, passes a pre-signed URL to DashScope, polls the async task, and maps Qwen timestamps into the transcript model.
- **Stable manifest schema** — Manifests use `transcribeit.manifest.v2` with canonical millisecond timestamps, provider-neutral capabilities/quality fields, and provider-specific metadata under `provider_metadata.data`.
- **Cache telemetry** — Manifests normalize provider token-cache signals under `cache`, including Gemini `cachedContentTokenCount` and OpenAI/Azure-style `cached_tokens` when returned.
- **Qwen provider metadata** — Manifests include Qwen task timing/usage, audio info, per-segment language/emotion, and word-level timestamps. Temporary pre-signed URLs are not persisted.
- **Qwen model guardrails** — Accidental short-audio `qwen3-asr-flash` model selection is rejected before conversion and S3 upload; use `qwen3-asr-flash-filetrans` for this provider.
- **Gemini whole-file transcription** — `gemini` uploads prepared audio through Gemini Files API, streams `generateContent` response chunks with structured JSON output, and maps segment timestamps, speaker labels, language, and emotion when returned.
- **Gemini file reuse** — `--gemini-file-cache` keeps a local index of Gemini Files API uploads keyed by SHA-256 of the prepared 16 kHz mono MP3 bytes, verifies the remote file before reuse, and records reuse metadata in the manifest.
- **Gemini signed URL input** — `--gemini-use-presigned-url` stages prepared MP3 audio in S3/R2 and sends the signed URL as Gemini `file_uri` for one-off inputs up to 100 MB. Files API cache and explicit cached content remain Files API-only.
- **Gemini explicit cache** — `--gemini-explicit-cache` creates and reuses Gemini `cachedContent` objects with a configurable TTL, producing deterministic `cachedContentTokenCount` telemetry when Gemini accepts the cache.
- **Gemini summary analysis** — `--analysis summary` runs a second Gemini JSON pass over the transcript and stores a provider-neutral summary, key points, topics, questions, and follow-ups in the manifest.
- **Temporary resource cleanup** — `--autoclean` performs best-effort cleanup of temporary provider resources created by the run, including S3/R2 staged objects for Qwen, Gemini signed URL mode, and Deepgram signed URL mode.
- **NVIDIA hosted Riva ASR** — `nvidia-riva` calls hosted NVIDIA Riva gRPC endpoints with provider-native word timestamps, optional server-side diarization, and manifest metadata.
- **Deepgram Nova batch ASR** — `deepgram` calls Deepgram's `/listen` API, defaults to `nova-3`, requests utterances and smart formatting, supports provider-native diarization through `--diarize`, and can submit either direct audio bytes or an S3/R2 pre-signed URL with `--deepgram-use-presigned-url`.
- **Deepgram audio intelligence** — `--deepgram-intelligence` captures Deepgram summary, topics, intents, entity detection, and sentiment in `provider_metadata.data.intelligence`; `--deepgram-keyterm` passes Nova-3 keyterm prompts for domain terminology.
- **3 model architectures via sherpa-onnx** — Whisper, Moonshine, and SenseVoice are auto-detected from the model directory contents. Just point `--model` at any supported model directory.
- **Model aliases** — `-m base`, `-m tiny`, etc. resolve from `MODEL_CACHE_DIR` for both `local` and `sherpa-onnx` providers. The sherpa-onnx resolver also supports glob matching (e.g., `-m moonshine-base`, `-m sense-voice`).
- **Language hinting** — Pass `--language` to force local and API transcription language.
@@ -145,9 +172,15 @@ SHERPA_ONNX_LIB_DIR=/path/to/sherpa-onnx/lib
OPENAI_API_KEY=sk-your_key_here
GEMINI_API_KEY=your_gemini_key_here
GEMINI_API_BASE_URL=https://generativelanguage.googleapis.com/v1beta
GEMINI_USE_PRESIGNED_URL=false
NVIDIA_API_KEY=your_nvidia_key_here
NVIDIA_RIVA_FUNCTION_ID=your_hosted_riva_function_id
NVIDIA_RIVA_SERVER=grpc.nvcf.nvidia.com:443
DEEPGRAM_API_KEY=your_deepgram_key_here
DEEPGRAM_API_BASE_URL=https://api.deepgram.com/v1
DEEPGRAM_INTELLIGENCE=false
DEEPGRAM_KEYTERM=Ofev,Esbriet,IPF
DEEPGRAM_USE_PRESIGNED_URL=false
AZURE_API_KEY=your_azure_key_here
AZURE_OPENAI_ENDPOINT=https://myresource.openai.azure.com
AZURE_DEPLOYMENT_NAME=whisper
@@ -159,9 +192,11 @@ S3_REGION=auto
S3_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com
S3_ACCESS_KEY_ID=your_s3_access_key
S3_SECRET_ACCESS_KEY=your_s3_secret_key
# Optional; when unset, URL-staging providers choose their own prefix.
S3_PREFIX=transcribeit/qwen-filetrans
S3_PRESIGN_EXPIRES_SECS=3600
S3_FORCE_PATH_STYLE=false
TRANSCRIBEIT_AUTOCLEAN=false
TRANSCRIBEIT_MAX_RETRIES=5
TRANSCRIBEIT_REQUEST_TIMEOUT_SECS=120
TRANSCRIBEIT_RETRY_WAIT_BASE_SECS=10
13 changes: 13 additions & 0 deletions Taskfile.yaml
@@ -68,3 +68,16 @@ tasks:
"enable_itn": false
}
}'
test-deepgram:
cmds:
- |
test -n "$DEEPGRAM_API_KEY" || (echo "DEEPGRAM_API_KEY is not set" >&2; exit 1)
DEEPGRAM_API_BASE_URL="${DEEPGRAM_API_BASE_URL:-https://api.deepgram.com/v1}"

curl --silent --show-error --location \
--request POST \
--write-out "\nHTTP_STATUS:%{http_code}\n" \
--header "Authorization: Token ${DEEPGRAM_API_KEY}" \
--header "Content-Type: application/json" \
--data '{"url":"https://dpgr.am/spacewalk.wav"}' \
"${DEEPGRAM_API_BASE_URL%/}/listen?model=nova-3&smart_format=true"
29 changes: 26 additions & 3 deletions docs/architecture.md
@@ -28,6 +28,7 @@ src/
├── azure_openai.rs # Azure OpenAI REST API
├── gemini.rs # Gemini Files API + streamed generateContent
├── nvidia_riva.rs # NVIDIA hosted Riva gRPC ASR
├── deepgram.rs # Deepgram Nova batch ASR + audio intelligence
├── qwen_filetrans.rs # Qwen async file transcription provider
├── qwen_filetrans/ # Qwen request/response types and model limits
├── rate_limit.rs # Retry logic and 429 handling
@@ -57,8 +58,9 @@ pub trait Transcriber: Send + Sync {
- **Sherpa-ONNX engine** (`sherpa_onnx`) uses `transcribe()` — it needs decoded samples for the ONNX runtime.
- **OpenAI/Azure API engines** override `transcribe_path()` to upload files directly via multipart, and `transcribe_wav()` to upload in-memory bytes — avoiding the decode→re-encode round-trip.
- **Qwen file transcription** overrides `transcribe_path()` to upload prepared audio to S3-compatible storage, generate a pre-signed URL, and submit that URL to DashScope.
-- **Gemini** overrides `transcribe_path()` to upload prepared audio through Gemini Files API and call streamed `streamGenerateContent` with structured JSON output.
+- **Gemini** overrides `transcribe_path()` to upload prepared audio through Gemini Files API and call streamed `streamGenerateContent` with structured JSON output. In signed URL mode, it stages the prepared MP3 in S3-compatible storage and sends the pre-signed URL as Gemini `file_uri` instead.
- **NVIDIA Riva** overrides `transcribe_path()` and `transcribe_wav()` to send WAV bytes to a hosted Riva gRPC endpoint with provider-native timestamps.
- **Deepgram** overrides `transcribe_path()` and `transcribe_wav()` to post WAV bytes to Deepgram's `/listen` endpoint with utterances, word timestamps, optional diarization, and optional audio intelligence flags. In URL mode, it stages the prepared WAV in S3-compatible storage and sends Deepgram a pre-signed URL JSON request instead.
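
The two Deepgram request shapes described above can be sketched roughly as follows. This is an illustration only, not the code in `deepgram.rs`: the `DeepgramInput`, `RequestParts`, and `build_request` names are hypothetical, and the `audio/wav` content type for direct uploads is an assumption. The `{"url": ...}` JSON body with `Content-Type: application/json` mirrors the `test-deepgram` task in `Taskfile.yaml`.

```rust
/// Hypothetical input modes for one Deepgram `/listen` request.
enum DeepgramInput {
    /// Prepared WAV bytes posted directly as the request body.
    WavBytes(Vec<u8>),
    /// Pre-signed S3/R2 URL that Deepgram fetches on its own.
    PresignedUrl(String),
}

/// Content type and raw body for the request (illustrative type, not the crate's).
struct RequestParts {
    content_type: &'static str,
    body: Vec<u8>,
}

/// Minimal JSON string escaping so the URL can be embedded safely.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Chooses the body and content type for the selected input mode.
fn build_request(input: DeepgramInput) -> RequestParts {
    match input {
        // Assumption: direct uploads are labelled as WAV audio.
        DeepgramInput::WavBytes(bytes) => RequestParts {
            content_type: "audio/wav",
            body: bytes,
        },
        // URL mode sends a small JSON document instead of the audio itself.
        DeepgramInput::PresignedUrl(url) => RequestParts {
            content_type: "application/json",
            body: format!("{{\"url\":\"{}\"}}", json_escape(&url)).into_bytes(),
        },
    }
}

fn main() {
    let parts = build_request(DeepgramInput::PresignedUrl(
        "https://example.com/audio.wav?sig=abc".to_string(),
    ));
    println!("{}: {}", parts.content_type, String::from_utf8_lossy(&parts.body));
}
```

Keeping both modes behind one enum means the query parameters (model, utterances, diarization) can stay identical regardless of how the audio reaches the provider.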

## Processing pipeline

@@ -69,7 +71,7 @@ Input file (any format)
├─ needs_conversion()? ──→ extract_to_wav(normalize) for local provider
├─ upload_as_mp3(normalize) for OpenAI/Azure, Qwen filetrans, and Gemini (16kHz mono MP3)
-├─ hosted Riva path keeps WAV audio for gRPC recognition
+├─ hosted Riva and Deepgram paths keep WAV audio for recognition
├─ get_duration() via ffprobe
@@ -190,9 +192,14 @@ Uses Gemini Files API and streamed `streamGenerateContent` for whole-file multim
- deletes the temporary Gemini file after the transcription request by default
- optionally reuses Gemini Files API uploads with `--gemini-file-cache`, using a local index keyed by SHA-256 of the exact prepared upload bytes
- optionally creates and reuses Gemini explicit `cachedContent` objects with `--gemini-explicit-cache`
- optionally bypasses Gemini Files API upload with `--gemini-use-presigned-url`, staging the prepared MP3 in S3/R2 and passing the signed URL as `file_uri`

Gemini is not a dedicated ASR endpoint. Timestamp, speaker, language, and emotion values come from the model's structured output, so benchmark quality before relying on them for subtitle workflows. The default path keeps Gemini whole-file for speaker continuity; explicit segmentation and long-input fallback are available with the expected risk that speakers may not remain stable between chunks.

Gemini signed URL mode is for one-off prepared inputs up to 100 MB. It is rejected for Gemini 2.0 family models and cannot be combined with Gemini Files API cache or explicit cached content.
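
The constraints above amount to a pre-flight check before anything is staged in S3/R2. A minimal sketch, assuming hypothetical names (`GeminiOptions`, `SignedUrlError`, `validate_signed_url_mode`), a `gemini-2.0` model-name prefix for the 2.0 family, and 100 MB read as 100 × 2^20 bytes (the text does not say which definition applies):

```rust
/// Assumed byte limit for "100 MB"; the exact definition is not stated in the docs.
const MAX_SIGNED_URL_BYTES: u64 = 100 * 1024 * 1024;

#[derive(Debug, PartialEq)]
enum SignedUrlError {
    UnsupportedModel(String),
    TooLarge { bytes: u64 },
    ConflictsWithFileCache,
    ConflictsWithExplicitCache,
}

/// Hypothetical subset of the options relevant to signed URL mode.
#[derive(Clone, Copy)]
struct GeminiOptions<'a> {
    model: &'a str,
    prepared_bytes: u64,
    file_cache: bool,
    explicit_cache: bool,
}

/// Rejects option combinations that signed URL mode does not support.
fn validate_signed_url_mode(opts: &GeminiOptions) -> Result<(), SignedUrlError> {
    // Assumption: 2.0 family models are recognisable by this name prefix.
    if opts.model.starts_with("gemini-2.0") {
        return Err(SignedUrlError::UnsupportedModel(opts.model.to_string()));
    }
    if opts.prepared_bytes > MAX_SIGNED_URL_BYTES {
        return Err(SignedUrlError::TooLarge { bytes: opts.prepared_bytes });
    }
    // Both cache modes depend on Files API uploads, so they cannot be combined.
    if opts.file_cache {
        return Err(SignedUrlError::ConflictsWithFileCache);
    }
    if opts.explicit_cache {
        return Err(SignedUrlError::ConflictsWithExplicitCache);
    }
    Ok(())
}

fn main() {
    let opts = GeminiOptions {
        model: "gemini-2.0-flash",
        prepared_bytes: 5_000_000,
        file_cache: false,
        explicit_cache: false,
    };
    println!("{:?}", validate_signed_url_mode(&opts));
}
```

Running the check before conversion and upload keeps a rejected run from leaving staged objects behind.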

`--autoclean` deletes temporary provider resources created during a run when the provider lifecycle makes that safe. For S3/R2 URL-staging providers, cleanup runs after the provider has consumed the URL and records best-effort cleanup metadata without failing a successful transcription.
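
The best-effort behaviour described above can be sketched as below. The `CleanupStatus` enum and `run_with_autoclean` function are hypothetical names used for illustration, and the closures stand in for the real provider call and the S3/R2 delete. What happens to staged objects when transcription itself fails is not covered by the text, so this sketch simply returns the error.

```rust
/// Outcome recorded in run metadata (illustrative shape, not the manifest schema).
#[derive(Debug, PartialEq)]
enum CleanupStatus {
    /// `--autoclean` was not requested.
    Skipped,
    /// The staged object was deleted.
    Deleted,
    /// Deletion failed; the message is kept for the manifest.
    Failed(String),
}

/// Runs the provider call, then attempts cleanup without letting a cleanup
/// failure turn a successful transcription into an error.
fn run_with_autoclean<T, E>(
    autoclean: bool,
    transcribe: impl FnOnce() -> Result<T, E>,
    cleanup: impl FnOnce() -> Result<(), String>,
) -> Result<(T, CleanupStatus), E> {
    // The provider must have consumed the URL before the object is removed.
    let transcript = transcribe()?;

    let status = if !autoclean {
        CleanupStatus::Skipped
    } else {
        match cleanup() {
            Ok(()) => CleanupStatus::Deleted,
            Err(message) => CleanupStatus::Failed(message),
        }
    };
    Ok((transcript, status))
}

fn main() {
    let outcome: Result<(&str, CleanupStatus), String> = run_with_autoclean(
        true,
        || Ok("hello world"),
        || Err("403 Forbidden".to_string()),
    );
    // The transcript survives even though the delete failed.
    println!("{:?}", outcome);
}
```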

### NVIDIA Riva (`nvidia_riva.rs`)

Uses hosted NVIDIA Riva ASR over gRPC through generated protobuf bindings in `proto/riva/proto/`. The provider:
@@ -206,6 +213,22 @@ Uses hosted NVIDIA Riva ASR over gRPC through generated protobuf bindings in `pr

The provider is implemented entirely in Rust with `tonic`/`prost`. It does not download local NVIDIA NIM containers or require Python clients.

### Deepgram (`deepgram.rs`)

Uses Deepgram's pre-recorded `/listen` REST API for batch transcription. The provider:

- defaults to `nova-3`, with `nova-3-medical` available through `--remote-model` when enabled for the account
- requests `smart_format=true` and `utterances=true`
- enables provider-native diarization with `diarize_model=latest` when `--diarize` or `--speakers` is set
- can send either direct audio bytes or a staged pre-signed S3/R2 URL with `--deepgram-use-presigned-url`
- accepts Nova-3 keyterm prompts through `--deepgram-keyterm`
- can enable Deepgram audio intelligence through `--deepgram-intelligence` or individual flags for summary, topics, intents, entities, and sentiment
- maps Deepgram utterances and word timestamps into normalized segments and words
- preserves returned intelligence blocks under `provider_metadata.data.intelligence`
- clamps provider timestamps to `metadata.duration` when necessary and records that under `provider_metadata.data.response.timestamps_clamped`
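
The timestamp clamping in the last bullet could look roughly like this. It is a sketch under assumptions: the `Segment` struct and `clamp_to_duration` function are hypothetical, timestamps are taken to be in milliseconds as in the manifest schema, and the returned flag stands in for the value recorded as `timestamps_clamped`.

```rust
/// Illustrative normalized segment with millisecond timestamps.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

/// Clamps every segment into `[0, duration_ms]` and reports whether any
/// timestamp had to change.
fn clamp_to_duration(segments: &mut [Segment], duration_ms: u64) -> bool {
    let mut clamped = false;
    for seg in segments.iter_mut() {
        let start = seg.start_ms.min(duration_ms);
        // Keep end >= start even after clamping.
        let end = seg.end_ms.min(duration_ms).max(start);
        if start != seg.start_ms || end != seg.end_ms {
            clamped = true;
        }
        seg.start_ms = start;
        seg.end_ms = end;
    }
    clamped
}

fn main() {
    let mut segments = vec![
        Segment { start_ms: 0, end_ms: 1_000, text: "first".into() },
        Segment { start_ms: 900, end_ms: 1_250, text: "second".into() },
    ];
    // The probed duration is slightly shorter than the provider's last timestamp.
    let clamped = clamp_to_duration(&mut segments, 1_200);
    println!("clamped = {clamped}, last end = {}", segments[1].end_ms);
}
```

The same pass would apply to word-level timestamps if they are stored separately from segments.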

Deepgram's intelligence JSON is intentionally kept as provider metadata because it is richer than the normalized transcript schema and because downstream Transcript Intelligence consumers may want to inspect provider-native topics, intents, sentiments, entities, and token usage. URL-mode metadata records only that a file URL was used; temporary pre-signed URLs are not persisted.

## Analysis (`analysis.rs`)

Post-transcription analysis is separate from transcription. The first supported analysis is `--analysis summary`, which currently uses Gemini to run a second structured JSON call over the transcript text. Results are written to the manifest only when `--output-dir` is set:
@@ -263,7 +286,7 @@ All settings (timeout, retries, wait times) are configurable via CLI flags and e

### Shared WAV encoding

-OpenAI/Azure engines can send file uploads directly and choose the correct container format for compatibility (WAV for local transcribe path, MP3 for API provider uploads). Qwen file transcription stages MP3 in S3-compatible storage and sends DashScope a pre-signed URL. Gemini uploads MP3 through Gemini Files API. NVIDIA Riva sends WAV bytes through gRPC. The `audio::wav::encode_wav()` helper is still used by local engines and non-file upload paths.
+OpenAI/Azure engines can send file uploads directly and choose the correct container format for compatibility (WAV for local transcribe path, MP3 for API provider uploads). Qwen file transcription stages MP3 in S3-compatible storage and sends DashScope a pre-signed URL. Gemini uploads MP3 through Gemini Files API by default, or stages MP3 in S3-compatible storage and sends a pre-signed URL when signed URL mode is enabled. NVIDIA Riva sends WAV bytes through gRPC. Deepgram posts WAV bytes to `/listen` by default, or stages WAV in S3-compatible storage and sends a pre-signed URL when URL mode is enabled. The `audio::wav::encode_wav()` helper is still used by local engines and non-file upload paths.

## Model cache (`model_cache.rs`)
