Backend for the YouTube Intelligence Platform: it ingests YouTube data, extracts health-related claims with an LLM, groups claims into narratives, and exposes the results through a versioned REST API.
- Pulls YouTube metadata and transcripts into Supabase.
- Filters and prioritizes content for public-health relevance (including semantic filtering and impact-style signals in batch flows).
- Uses an LLM to extract generalizable health-related claims from transcripts.
- Matches claims to narratives using embeddings and creates new narratives when needed.
- Serves versioned API endpoints (/api/v1/...) so clients can query health, overview, claims, narratives, ingestion, and related resources.
Requests hit a FastAPI service, which reads from and writes to Supabase via SQLModel/SQLAlchemy. Ingestion pipelines fetch YouTube data (Data API + transcripts) and upsert it into core tables. A separate LLM pipeline extracts claims and runs narrative matching (embedding similarity, backed by Amazon Bedrock Titan embeddings in the default setup). Deployment can be configured via render.yaml.
| Area | Technologies |
|---|---|
| API & runtime | FastAPI, Uvicorn |
| DB & Schema | Supabase (Postgres), SQLAlchemy 2.x, SQLModel |
| Migrations | Alembic (alembic/) |
| YouTube Data Ingestion | YouTube Data API v3, youtube-transcript-api |
| Large Language Models (LLMs) | Switchable Ollama / Amazon Bedrock (environment-driven) |
| Narrative Embeddings | Amazon Bedrock - Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0) via bedrock-runtime invoke_model |
| Quality of life | Ruff (format + lint), python-dotenv, Pydantic settings |
- app/: the live API. It holds routers under /api/v1, configuration, database sessions, SQLModel tables, and request/response schemas. Pipeline helpers used by HTTP endpoints (for example, single-video ingest) live here under app/pipelines/.
- pipelines/: batch jobs that run locally or, if configured, asynchronously via AWS Lambda. They cover YouTube search, filtering, and persistence; LLM claim extraction and narrative linking; embedding-based matching helpers; and shared pipeline utilities.
- alembic/: database migrations (versions/) and Alembic runtime configuration (env.py).
- scripts/: developer utilities (for example, git hook setup and local orchestration scripts).

From backend/, create a virtual environment and install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Optional formatting, linting, and pre-commit hooks:
pip install -r requirements-dev.txt
bash scripts/setup-hooks.sh

Format and lint with Ruff:
ruff format .
ruff check .

If you ran scripts/setup-hooks.sh, Ruff also runs on staged files via the pre-commit hook.

API server: starts the FastAPI app with auto-reload for local development:
uvicorn app.main:app --reload

Health check: GET /api/v1/health

Ingest a single video (via API): fetches video, channel, and transcript data and upserts it into Supabase (channels, videos, transcripts):
POST /api/v1/ingest/video with JSON { "video_id": "<id>" }
(Implementation uses app/pipelines/yt_ingest.py.)
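
As an illustration of the single-video ingest call described above, the following sketch builds the request with the standard library only. The base URL (Uvicorn's default local address) and the helper names `build_ingest_request` and `send` are assumptions made for this example; the response shape is not documented here, so the sketch just returns the raw body.

```python
import json
import urllib.request

# Assumption: the API runs locally on Uvicorn's default host and port.
BASE_URL = "http://127.0.0.1:8000"


def build_ingest_request(video_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for single-video ingest."""
    payload = json.dumps({"video_id": video_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/ingest/video",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body (server must be running)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server started via uvicorn, `send(build_ingest_request("<id>"))` would trigger the ingest for that video.
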
Batch ingest + filter: searches YouTube, applies an LLM semantic filter for public-health relevance, applies impact-style filtering, and persists videos and transcripts to Supabase:
python -m pipelines.yt_data_ingestion

Example provider overrides:
# local test
LLM_PROVIDER=ollama LLM_MODEL=gemma2 python -m pipelines.yt_data_ingestion
# cloud runtime
LLM_PROVIDER=bedrock LLM_MODEL=gemma2 python -m pipelines.yt_data_ingestion

Claims + narratives (LLM insight generation): reads transcripts that have no claims yet, extracts health-related claims, matches them to narratives with embeddings (default: Amazon Bedrock Titan Text Embeddings V2), opens new narratives when appropriate, and writes claims, narratives, and claim_narratives:
python -m pipelines.llm_insight_generation

Create a local .env (never commit it). The app loads it through Pydantic settings and python-dotenv.

Required for typical local development
| Variable | Purpose |
|---|---|
| DATABASE_URL | Postgres connection for SQLModel/Alembic |
| SUPABASE_URL | Supabase project URL |
| SUPABASE_SERVICE_ROLE_KEY | Service role access for Supabase |
| YOUTUBE_DATA_API_KEY or YOUTUBE_API_KEY | YouTube Data API v3 |

Required for narrative matching / embeddings (default Bedrock path)
| Variable | Purpose |
|---|---|
| AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY | AWS credentials |
| AWS_REGION or AWS_DEFAULT_REGION | Region for Bedrock calls |

Common optional
| Variable | Notes |
|---|---|
| FRONTEND_URL, PORT, ENV | App/runtime tuning |
| LLM_PROVIDER | Default ollama; switch with env |
| LLM_MODEL, OLLAMA_BASE_URL | Model and Ollama endpoint |
| YT_QUOTA_DAILY_BUDGET_UNITS | Caps YouTube quota usage |
| NARR_EMBEDDING_BACKEND | Default bedrock |
| NARR_EMBEDDING_MODEL | Default amazon.titan-embed-text-v2:0 |
| NARR_EMBEDDING_DIMENSIONS | Default 512 |
| NARR_EMBEDDING_NORMALIZE | Default true |
| NARR_EMBEDDING_TIMEOUT, NARR_STRONG_MATCH, NARR_MULTI_LINK, NARR_NEW_MIN, NARR_MAX_PER_CLAIM | Matching tuning knobs |
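
To make the documented embedding defaults concrete, here is a minimal sketch of how they could be resolved from the environment. It uses a plain dataclass rather than the project's Pydantic settings, and the names `EmbeddingSettings` and `load_embedding_settings` are hypothetical; only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EmbeddingSettings:
    backend: str
    model: str
    dimensions: int
    normalize: bool


def load_embedding_settings(env: Mapping[str, str] = os.environ) -> EmbeddingSettings:
    """Read the NARR_EMBEDDING_* variables, falling back to the documented defaults."""
    return EmbeddingSettings(
        backend=env.get("NARR_EMBEDDING_BACKEND", "bedrock"),
        model=env.get("NARR_EMBEDDING_MODEL", "amazon.titan-embed-text-v2:0"),
        dimensions=int(env.get("NARR_EMBEDDING_DIMENSIONS", "512")),
        normalize=_as_bool(env.get("NARR_EMBEDDING_NORMALIZE", "true")),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests without touching the real environment.
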
render.yaml defines a basic Render deployment for the FastAPI service.

- pipelines/yt_data_ingestion.py uses LLM_PROVIDER + LLM_MODEL (ollama or bedrock). If LLM_MODEL is unset, ingestion defaults to gemma2.
- pipelines/llm_insight_generation.py uses the same variables; if LLM_MODEL is unset, it defaults to qwen3.
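
The per-pipeline defaults above can be pictured with a small resolver. This is a sketch under stated assumptions, not the pipelines' actual code: the function `resolve_llm`, the pipeline keys, and the choice to reject unknown providers are all hypothetical, while the default provider and default model names are taken from this README.

```python
import os
from typing import Mapping, Tuple

DEFAULT_PROVIDER = "ollama"
SUPPORTED_PROVIDERS = {"ollama", "bedrock"}

# Default model per pipeline when LLM_MODEL is unset (values from this README).
DEFAULT_MODELS = {
    "yt_data_ingestion": "gemma2",
    "llm_insight_generation": "qwen3",
}


def resolve_llm(pipeline: str, env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (provider, model) for a pipeline from LLM_PROVIDER / LLM_MODEL."""
    provider = env.get("LLM_PROVIDER", DEFAULT_PROVIDER).strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        # Assumption: unknown providers are treated as a configuration error.
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    # An empty LLM_MODEL is treated the same as an unset one.
    model = env.get("LLM_MODEL") or DEFAULT_MODELS[pipeline]
    return provider, model
```

For example, `resolve_llm("llm_insight_generation", {"LLM_PROVIDER": "bedrock"})` yields the Bedrock provider with the qwen3 default.
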
- Google Cloud: enable YouTube Data API v3, create an API key, and set YOUTUBE_DATA_API_KEY.
- AWS: in your chosen region, request access to amazon.titan-embed-text-v2:0 in the Bedrock model catalog and ensure the IAM principal can call bedrock:InvokeModel for that model.
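
Once model access is granted, a Titan Text Embeddings V2 call through bedrock-runtime invoke_model could look roughly like the sketch below. The helper names `build_titan_body` and `embed` are hypothetical and are not taken from pipelines/narrative_matching.py; the sketch assumes boto3 is installed and AWS credentials are available, and it imports boto3 lazily so the body builder can be used without it.

```python
import json
from typing import List

MODEL_ID = "amazon.titan-embed-text-v2:0"


def build_titan_body(text: str, dimensions: int = 512, normalize: bool = True) -> str:
    """Serialize the JSON request body for Titan Text Embeddings V2."""
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": normalize}
    )


def embed(text: str, region: str) -> List[float]:
    """Call Bedrock and return the embedding vector (needs credentials and model access)."""
    import boto3  # lazy import: only required when actually calling AWS

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_titan_body(text),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream containing JSON with an "embedding" field.
    return json.loads(response["body"].read())["embedding"]
```

The defaults of 512 dimensions and normalization mirror the NARR_EMBEDDING_DIMENSIONS and NARR_EMBEDDING_NORMALIZE defaults listed earlier.
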
- Migrations live in alembic/versions/. Alembic reads DATABASE_URL from alembic/env.py; alembic.ini intentionally omits the URL.
alembic upgrade head
alembic current
alembic revision --autogenerate -m "describe your change"

For contributors who want a quick map of the API package:
- app/main.py: FastAPI app, CORS, routers mounted at /api/v1
- app/api/v1/endpoints/: endpoints such as health, overview, claims, narratives, ingest, etc.
- app/core/config.py: settings + .env loading
- app/core/database.py: database sessions for routes
- app/models/: SQLModel models (videos, claims, narratives, joins, etc.)
- app/schemas/: Pydantic request/response shapes

- pipelines/yt_data_ingestion.py: search + semantic filter + impact filter + transcript extraction
- pipelines/llm_insight_generation.py: claim extraction and narrative linking
- pipelines/narrative_matching.py: embedding + cosine similarity (Bedrock Titan Text Embeddings V2 via invoke_model)
- pipelines/shared/: shared helpers, interfaces, and dataclasses
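
The embedding-plus-cosine-similarity matching mentioned for pipelines/narrative_matching.py can be sketched in plain Python as follows. This is an illustrative sketch only: the function names and the 0.8 threshold are hypothetical, and the real thresholds are governed by the NARR_* tuning variables, whose values are not documented here.

```python
import math
from typing import Mapping, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_narrative(
    claim_vector: Sequence[float],
    narratives: Mapping[str, Sequence[float]],
    threshold: float = 0.8,  # hypothetical value, not a documented default
) -> Optional[Tuple[str, float]]:
    """Return (narrative_id, score) for the closest narrative, or None if below threshold."""
    best: Optional[Tuple[str, float]] = None
    for narrative_id, vector in narratives.items():
        score = cosine_similarity(claim_vector, vector)
        if best is None or score > best[1]:
            best = (narrative_id, score)
    if best is None or best[1] < threshold:
        return None  # the caller could open a new narrative in this case
    return best
```

A `None` result corresponds to the case where no existing narrative is close enough and a new one may be created.
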