transcriptintel · skitsanos · Jun 17, 2026
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "transcribeit"
-version = "1.5.0"
+version = "1.6.0"
 edition = "2024"
 rust-version = "1.96"
 license-file = "LICENSE"

diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # transcribeit
 
-A Rust CLI for speech-to-text transcription. Supports local inference via [whisper.cpp](https://github.com/ggerganov/whisper.cpp), local inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx), remote transcription via OpenAI-compatible APIs, Azure OpenAI, Qwen ASR file transcription, Gemini multimodal transcription, and NVIDIA hosted Riva ASR.
+A Rust CLI for speech-to-text transcription. Supports local inference via [whisper.cpp](https://github.com/ggerganov/whisper.cpp), local inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx), remote transcription via OpenAI-compatible APIs, Azure OpenAI, Qwen ASR file transcription, Gemini multimodal transcription, NVIDIA hosted Riva ASR, and Deepgram.
 
 Accepts any audio or video format — FFmpeg handles conversion automatically.
 
@@ -10,8 +10,9 @@ Accepts any audio or video format — FFmpeg handles conversion automatically.
 - [FFmpeg](https://ffmpeg.org/) installed and on PATH
 - C/C++ toolchain and CMake (for building whisper.cpp)
 - sherpa-onnx shared libraries (if using the `sherpa-onnx` provider) — set `SHERPA_ONNX_LIB_DIR` in `.env` to the directory containing them
-- S3-compatible storage credentials when using `qwen-filetrans`; Cloudflare R2 is supported through `S3_ENDPOINT_URL`
+- S3-compatible storage credentials when using `qwen-filetrans` or Deepgram pre-signed URL mode; Cloudflare R2 is supported through `S3_ENDPOINT_URL`
 - NVIDIA API key and hosted Riva function id when using `nvidia-riva`
+- Deepgram API key when using `deepgram`
 
 ## Quick start
 
@@ -80,6 +81,14 @@ transcribeit run -p gemini --gemini-file-cache \
 transcribeit run -p gemini --gemini-explicit-cache --gemini-cache-ttl-secs 3600 \
   -i recording.mp3 -f vtt -o ./output
 
+# Use S3/R2 pre-signed URL input for a one-off Gemini run
+transcribeit run -p gemini --gemini-use-presigned-url \
+  -i recording.mp3 -f vtt -o ./output
+
+# Delete temporary staged provider resources after the provider consumes them
+transcribeit run -p qwen-filetrans --autoclean \
+  -i recording.mp3 -f vtt -o ./output
+
 # Transcribe with Gemini and add a structured summary to the manifest
 transcribeit run -p gemini --analysis summary \
   -i interview.mp4 -f vtt -o ./output
@@ -90,6 +99,20 @@ transcribeit run -p nvidia-riva -i recording.wav \
   --nvidia-riva-function-id "$NVIDIA_RIVA_FUNCTION_ID" \
   -f vtt -o ./output
 
+# Transcribe with Deepgram Nova-3 batch ASR and provider-native diarization
+transcribeit run -p deepgram --remote-model nova-3 --diarize \
+  -i recording.wav -f vtt -o ./output
+
+# Transcribe with Deepgram by staging the prepared audio in S3/R2 first
+transcribeit run -p deepgram --remote-model nova-3 --deepgram-use-presigned-url \
+  -i recording.wav -f vtt -o ./output
+
+# Transcribe with Deepgram Nova-3 Medical, intelligence metadata, and domain keyterms
+transcribeit run -p deepgram --remote-model nova-3-medical \
+  --diarize --deepgram-intelligence \
+  --deepgram-keyterm Ofev --deepgram-keyterm Esbriet --deepgram-keyterm IPF \
+  -i interview.wav -f vtt -o ./output
+
 # Force language and normalize before transcription
 transcribeit run -i recording.wav -m base --language en --normalize
 
@@ -105,17 +128,21 @@ transcribeit run -i interview.mp3 -m base --diarize --speakers 2 \
 ## Features
 
 - **Any input format** — MP3, MP4, WAV, FLAC, OGG, etc. FFmpeg converts to mono 16kHz WAV automatically.
-- **7 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI, Qwen file transcription, Gemini, and NVIDIA Riva. Extensible via the `Transcriber` trait.
+- **8 providers** — Local whisper.cpp, sherpa-onnx, OpenAI API, Azure OpenAI, Qwen file transcription, Gemini, NVIDIA Riva, and Deepgram. Extensible via the `Transcriber` trait.
 - **Qwen ASR whole-file transcription** — `qwen-filetrans` stages audio in S3-compatible storage, passes a pre-signed URL to DashScope, polls the async task, and maps Qwen timestamps into the transcript model.
 - **Stable manifest schema** — Manifests use `transcribeit.manifest.v2` with canonical millisecond timestamps, provider-neutral capabilities/quality fields, and provider-specific metadata under `provider_metadata.data`.
 - **Cache telemetry** — Manifests normalize provider token-cache signals under `cache`, including Gemini `cachedContentTokenCount` and OpenAI/Azure-style `cached_tokens` when returned.
 - **Qwen provider metadata** — Manifests include Qwen task timing/usage, audio info, per-segment language/emotion, and word-level timestamps. Temporary pre-signed URLs are not persisted.
 - **Qwen model guardrails** — Accidental short-audio `qwen3-asr-flash` model selection is rejected before conversion and S3 upload; use `qwen3-asr-flash-filetrans` for this provider.
 - **Gemini whole-file transcription** — `gemini` uploads prepared audio through Gemini Files API, streams `generateContent` response chunks with structured JSON output, and maps segment timestamps, speaker labels, language, and emotion when returned.
 - **Gemini file reuse** — `--gemini-file-cache` keeps a local index of Gemini Files API uploads keyed by SHA-256 of the prepared 16 kHz mono MP3 bytes, verifies the remote file before reuse, and records reuse metadata in the manifest.
+- **Gemini signed URL input** — `--gemini-use-presigned-url` stages prepared MP3 audio in S3/R2 and sends the signed URL as Gemini `file_uri` for one-off inputs up to 100 MB. Files API cache and explicit cached content remain Files API-only.
 - **Gemini explicit cache** — `--gemini-explicit-cache` creates and reuses Gemini `cachedContent` objects with a configurable TTL, producing deterministic `cachedContentTokenCount` telemetry when Gemini accepts the cache.
 - **Gemini summary analysis** — `--analysis summary` runs a second Gemini JSON pass over the transcript and stores a provider-neutral summary, key points, topics, questions, and follow-ups in the manifest.
+- **Temporary resource cleanup** — `--autoclean` performs best-effort cleanup of temporary provider resources created by the run, including S3/R2 staged objects for Qwen, Gemini signed URL mode, and Deepgram signed URL mode.
 - **NVIDIA hosted Riva ASR** — `nvidia-riva` calls hosted NVIDIA Riva gRPC endpoints with provider-native word timestamps, optional server-side diarization, and manifest metadata.
+- **Deepgram Nova batch ASR** — `deepgram` calls Deepgram's `/listen` API, defaults to `nova-3`, requests utterances and smart formatting, supports provider-native diarization through `--diarize`, and can submit either direct audio bytes or an S3/R2 pre-signed URL with `--deepgram-use-presigned-url`.
+- **Deepgram audio intelligence** — `--deepgram-intelligence` captures Deepgram summary, topics, intents, entity detection, and sentiment in `provider_metadata.data.intelligence`; `--deepgram-keyterm` passes Nova-3 keyterm prompts for domain terminology.
 - **3 model architectures via sherpa-onnx** — Whisper, Moonshine, and SenseVoice are auto-detected from the model directory contents. Just point `--model` at any supported model directory.
 - **Model aliases** — `-m base`, `-m tiny`, etc. resolve from `MODEL_CACHE_DIR` for both `local` and `sherpa-onnx` providers. The sherpa-onnx resolver also supports glob matching (e.g., `-m moonshine-base`, `-m sense-voice`).
 - **Language hinting** — Pass `--language` to force local and API transcription language.
@@ -145,9 +172,15 @@ SHERPA_ONNX_LIB_DIR=/path/to/sherpa-onnx/lib
 OPENAI_API_KEY=sk-your_key_here
 GEMINI_API_KEY=your_gemini_key_here
 GEMINI_API_BASE_URL=https://generativelanguage.googleapis.com/v1beta
+GEMINI_USE_PRESIGNED_URL=false
 NVIDIA_API_KEY=your_nvidia_key_here
 NVIDIA_RIVA_FUNCTION_ID=your_hosted_riva_function_id
 NVIDIA_RIVA_SERVER=grpc.nvcf.nvidia.com:443
+DEEPGRAM_API_KEY=your_deepgram_key_here
+DEEPGRAM_API_BASE_URL=https://api.deepgram.com/v1
+DEEPGRAM_INTELLIGENCE=false
+DEEPGRAM_KEYTERM=Ofev,Esbriet,IPF
+DEEPGRAM_USE_PRESIGNED_URL=false
 AZURE_API_KEY=your_azure_key_here
 AZURE_OPENAI_ENDPOINT=https://myresource.openai.azure.com
 AZURE_DEPLOYMENT_NAME=whisper
@@ -159,9 +192,11 @@ S3_REGION=auto
 S3_ENDPOINT_URL=https://<account-id>.r2.cloudflarestorage.com
 S3_ACCESS_KEY_ID=your_s3_access_key
 S3_SECRET_ACCESS_KEY=your_s3_secret_key
+# Optional; when unset, URL-staging providers choose their own prefix.
 S3_PREFIX=transcribeit/qwen-filetrans
 S3_PRESIGN_EXPIRES_SECS=3600
 S3_FORCE_PATH_STYLE=false
+TRANSCRIBEIT_AUTOCLEAN=false
 TRANSCRIBEIT_MAX_RETRIES=5
 TRANSCRIBEIT_REQUEST_TIMEOUT_SECS=120
 TRANSCRIBEIT_RETRY_WAIT_BASE_SECS=10
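
Note on the `.env` additions in the README hunks above: the diff shows boolean-style values such as `TRANSCRIBEIT_AUTOCLEAN=false` and a comma-separated `DEEPGRAM_KEYTERM=Ofev,Esbriet,IPF`, but not how the CLI parses them. The sketch below is one plausible reading, not the project's code. Both `parse_flag` and `parse_keyterms` are hypothetical helper names, and the accepted truthy spellings are an assumption.

```rust
/// Interprets an environment value such as `TRANSCRIBEIT_AUTOCLEAN=false`.
/// Assumption: "1", "true", "yes", and "on" (any case) mean enabled;
/// everything else, including an empty string, means disabled.
fn parse_flag(value: &str) -> bool {
    matches!(
        value.trim().to_ascii_lowercase().as_str(),
        "1" | "true" | "yes" | "on"
    )
}

/// Splits a comma-separated value such as `DEEPGRAM_KEYTERM=Ofev,Esbriet,IPF`
/// into individual keyterms, trimming whitespace and dropping empty entries.
/// Assumption: each entry corresponds to one `--deepgram-keyterm` argument.
fn parse_keyterms(value: &str) -> Vec<String> {
    value
        .split(',')
        .map(str::trim)
        .filter(|term| !term.is_empty())
        .map(str::to_owned)
        .collect()
}
```

Under these assumptions, `parse_keyterms("Ofev,Esbriet,IPF")` yields the same three terms that the README example passes as repeated `--deepgram-keyterm` flags.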

diff --git a/Taskfile.yaml b/Taskfile.yaml
@@ -68,3 +68,16 @@ tasks:
                 "enable_itn": false
             }
         }'
+  test-deepgram:
+    cmds:
+      - |
+        test -n "$DEEPGRAM_API_KEY" || (echo "DEEPGRAM_API_KEY is not set" >&2; exit 1)
+        DEEPGRAM_API_BASE_URL="${DEEPGRAM_API_BASE_URL:-https://api.deepgram.com/v1}"
+
+        curl --silent --show-error --location \
+          --request POST \
+          --write-out "\nHTTP_STATUS:%{http_code}\n" \
+          --header "Authorization: Token ${DEEPGRAM_API_KEY}" \
+          --header "Content-Type: application/json" \
+          --data '{"url":"https://dpgr.am/spacewalk.wav"}' \
+          "${DEEPGRAM_API_BASE_URL%/}/listen?model=nova-3&smart_format=true"
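
Note on the `test-deepgram` task above: the curl command builds its endpoint by stripping one trailing slash from `DEEPGRAM_API_BASE_URL` and appending `/listen?model=nova-3&smart_format=true`. A minimal Rust sketch of the same URL construction follows. `listen_url` is a hypothetical name for illustration and is not taken from `deepgram.rs`. It assumes the model name is already URL-safe (as `nova-3` and `nova-3-medical` are), so no percent-encoding is applied.

```rust
/// Builds a Deepgram pre-recorded `/listen` URL from a base URL and model name.
/// Hypothetical helper; the query parameters mirror the Taskfile curl command.
fn listen_url(base_url: &str, model: &str) -> String {
    // Equivalent of the shell expansion `${DEEPGRAM_API_BASE_URL%/}`:
    // remove at most one trailing slash so the join never produces `//listen`.
    let base = base_url.strip_suffix('/').unwrap_or(base_url);
    // Assumption: `model` contains only URL-safe characters.
    format!("{base}/listen?model={model}&smart_format=true")
}
```

With the default base URL from the task, `listen_url("https://api.deepgram.com/v1/", "nova-3")` produces the same endpoint whether or not the base carries a trailing slash.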
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -28,6 +28,7 @@ src/
     ├── azure_openai.rs    # Azure OpenAI REST API
     ├── gemini.rs          # Gemini Files API + streamed generateContent
     ├── nvidia_riva.rs     # NVIDIA hosted Riva gRPC ASR
+    ├── deepgram.rs        # Deepgram Nova batch ASR + audio intelligence
     ├── qwen_filetrans.rs  # Qwen async file transcription provider
     ├── qwen_filetrans/    # Qwen request/response types and model limits
     ├── rate_limit.rs      # Retry logic and 429 handling
@@ -57,8 +58,9 @@ pub trait Transcriber: Send + Sync {
 - **Sherpa-ONNX engine** (`sherpa_onnx`) uses `transcribe()` — it needs decoded samples for the ONNX runtime.
 - **OpenAI/Azure API engines** override `transcribe_path()` to upload files directly via multipart, and `transcribe_wav()` to upload in-memory bytes — avoiding the decode→re-encode round-trip.
 - **Qwen file transcription** overrides `transcribe_path()` to upload prepared audio to S3-compatible storage, generate a pre-signed URL, and submit that URL to DashScope.
-- **Gemini** overrides `transcribe_path()` to upload prepared audio through Gemini Files API and call streamed `streamGenerateContent` with structured JSON output.
+- **Gemini** overrides `transcribe_path()` to upload prepared audio through Gemini Files API and call streamed `streamGenerateContent` with structured JSON output. In signed URL mode, it stages the prepared MP3 in S3-compatible storage and sends the pre-signed URL as Gemini `file_uri` instead.
 - **NVIDIA Riva** overrides `transcribe_path()` and `transcribe_wav()` to send WAV bytes to a hosted Riva gRPC endpoint with provider-native timestamps.
+- **Deepgram** overrides `transcribe_path()` and `transcribe_wav()` to post WAV bytes to Deepgram's `/listen` endpoint with utterances, word timestamps, optional diarization, and optional audio intelligence flags. In URL mode, it stages the prepared WAV in S3-compatible storage and sends Deepgram a pre-signed URL JSON request instead.
 
 ## Processing pipeline
 
@@ -69,7 +71,7 @@ Input file (any format)
   │
   ├─ needs_conversion()? ──→ extract_to_wav(normalize) for local provider
   ├─ upload_as_mp3(normalize) for OpenAI/Azure, Qwen filetrans, and Gemini (16kHz mono MP3)
-  ├─ hosted Riva path keeps WAV audio for gRPC recognition
+  ├─ hosted Riva and Deepgram paths keep WAV audio for recognition
   │
   ├─ get_duration() via ffprobe
   │
@@ -190,9 +192,14 @@ Uses Gemini Files API and streamed `streamGenerateContent` for whole-file multim
 - deletes the temporary Gemini file after the transcription request by default
 - optionally reuses Gemini Files API uploads with `--gemini-file-cache`, using a local index keyed by SHA-256 of the exact prepared upload bytes
 - optionally creates and reuses Gemini explicit `cachedContent` objects with `--gemini-explicit-cache`
+- optionally bypasses Gemini Files API upload with `--gemini-use-presigned-url`, staging the prepared MP3 in S3/R2 and passing the signed URL as `file_uri`
 
 Gemini is not a dedicated ASR endpoint. Timestamp, speaker, language, and emotion values come from the model's structured output, so benchmark quality before relying on them for subtitle workflows. The default path keeps Gemini whole-file for speaker continuity; explicit segmentation and long-input fallback are available with the expected risk that speakers may not remain stable between chunks.
 
+Gemini signed URL mode is for one-off prepared inputs up to 100 MB. It is rejected for Gemini 2.0 family models and cannot be combined with Gemini Files API cache or explicit cached content.
+
+`--autoclean` deletes temporary provider resources created during a run when the provider lifecycle makes that safe. For S3/R2 URL-staging providers, cleanup runs after the provider has consumed the URL and records best-effort cleanup metadata without failing a successful transcription.
+
 ### NVIDIA Riva (`nvidia_riva.rs`)
 
 Uses hosted NVIDIA Riva ASR over gRPC through generated protobuf bindings in `proto/riva/proto/`. The provider:
@@ -206,6 +213,22 @@ Uses hosted NVIDIA Riva ASR over gRPC through generated protobuf bindings in `pr
 
 The provider is implemented entirely in Rust with `tonic`/`prost`. It does not download local NVIDIA NIM containers or require Python clients.
 
+### Deepgram (`deepgram.rs`)
+
+Uses Deepgram's pre-recorded `/listen` REST API for batch transcription. The provider:
+
+- defaults to `nova-3`, with `nova-3-medical` available through `--remote-model` when enabled for the account
+- requests `smart_format=true` and `utterances=true`
+- enables provider-native diarization with `diarize_model=latest` when `--diarize` or `--speakers` is set
+- can send either direct audio bytes or a staged pre-signed S3/R2 URL with `--deepgram-use-presigned-url`
+- accepts Nova-3 keyterm prompts through `--deepgram-keyterm`
+- can enable Deepgram audio intelligence through `--deepgram-intelligence` or individual flags for summary, topics, intents, entities, and sentiment
+- maps Deepgram utterances and word timestamps into normalized segments and words
+- preserves returned intelligence blocks under `provider_metadata.data.intelligence`
+- clamps provider timestamps to `metadata.duration` when necessary and records that under `provider_metadata.data.response.timestamps_clamped`
+
+Deepgram's intelligence JSON is intentionally kept as provider metadata because it is richer than the normalized transcript schema and because downstream Transcript Intelligence consumers may want to inspect provider-native topics, intents, sentiments, entities, and token usage. URL-mode metadata records only that a file URL was used; temporary pre-signed URLs are not persisted.
+
 ## Analysis (`analysis.rs`)
 
 Post-transcription analysis is separate from transcription. The first supported analysis is `--analysis summary`, which currently uses Gemini to run a second structured JSON call over the transcript text. Results are written to the manifest only when `--output-dir` is set:
@@ -263,7 +286,7 @@ All settings (timeout, retries, wait times) are configurable via CLI flags and e
 
 ### Shared WAV encoding
 
-OpenAI/Azure engines can send file uploads directly and choose the correct container format for compatibility (WAV for local transcribe path, MP3 for API provider uploads). Qwen file transcription stages MP3 in S3-compatible storage and sends DashScope a pre-signed URL. Gemini uploads MP3 through Gemini Files API. NVIDIA Riva sends WAV bytes through gRPC. The `audio::wav::encode_wav()` helper is still used by local engines and non-file upload paths.
+OpenAI/Azure engines can send file uploads directly and choose the correct container format for compatibility (WAV for local transcribe path, MP3 for API provider uploads). Qwen file transcription stages MP3 in S3-compatible storage and sends DashScope a pre-signed URL. Gemini uploads MP3 through Gemini Files API by default, or stages MP3 in S3-compatible storage and sends a pre-signed URL when signed URL mode is enabled. NVIDIA Riva sends WAV bytes through gRPC. Deepgram posts WAV bytes to `/listen` by default, or stages WAV in S3-compatible storage and sends a pre-signed URL when URL mode is enabled. The `audio::wav::encode_wav()` helper is still used by local engines and non-file upload paths.
 
 ## Model cache (`model_cache.rs`)