pavex/mcp-web-fetch

mcp-web-fetch

Token-efficient web reading and HTTP requests for MCP agents.

An MCP server with two tools: fetch_text strips web pages down to clean readable text — dramatically reducing token usage when an agent needs to read a URL. http_request is a full HTTP client for REST API calls, form submissions, and anything requiring raw control.

Built for Claude, Cursor, and any MCP-compatible agent. No browser required. Pure Node.js, single bundled file.


Why fetch_text matters for agents

A typical web page weighs 300–800 KB of raw HTML — scripts, styles, nav bars, footers. Most of it is noise. An agent reading that page burns thousands of tokens on markup it cannot use.

fetch_text scrapes the page and returns only the readable content:

google.com raw HTML   →  ~480 000 chars
google.com fetch_text →       177 chars
manifesto page HTML   →  ~42 000 chars  
manifesto fetch_text  →    5 800 chars  (~7× smaller)
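
As a rough illustration of what these figures mean for an agent, the sketch below computes the size ratio and an approximate token count for each sample. The 4-characters-per-token figure is an assumption (a common rule of thumb for English text), not something measured by this project, and `summarize` is a hypothetical helper name.

```javascript
// Character counts quoted above (approximate figures from this README).
const samples = [
  { page: "google.com", rawChars: 480_000, textChars: 177 },
  { page: "manifesto page", rawChars: 42_000, textChars: 5_800 },
];

// Assumption: ~4 characters per token. Real tokenizers vary by model and content.
const CHARS_PER_TOKEN = 4;

// Returns the reduction ratio and rough token estimates for one sample.
function summarize({ page, rawChars, textChars }) {
  return {
    page,
    ratio: rawChars / textChars,
    rawTokens: Math.round(rawChars / CHARS_PER_TOKEN),
    textTokens: Math.round(textChars / CHARS_PER_TOKEN),
  };
}

for (const s of samples.map(summarize)) {
  console.log(
    `${s.page}: ~${s.ratio.toFixed(1)}x smaller, ~${s.rawTokens} -> ~${s.textTokens} tokens`
  );
}
```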

This is a simple HTML scraper, not a full browser renderer. It does not execute JavaScript, handle SPAs, or bypass bot protection. That is the tradeoff for needing no browser, no node_modules at runtime, and minimal overhead. For static pages, documentation, articles, and llms.txt files it works well.


Tools

fetch_text — low-token web content

Fetches a URL and returns clean readable text. Skips all scripts, styles, navigation, and layout noise. Extracts <title> separately. Prefers <main> or <article> when available.
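
The extraction order described above can be sketched as follows. This is a deliberately simplified, regex-based illustration and not the project's implementation (the stack lists node-html-parser for parsing). The function name `extractReadable` and the exact list of dropped tags are assumptions, and HTML entities are left undecoded.

```javascript
// Simplified sketch: regexes stand in for a real HTML parser.
function extractReadable(html) {
  // 1. Pull the <title> out separately.
  const titleMatch = html.match(/<title\b[^>]*>([\s\S]*?)<\/title>/i);
  const title = titleMatch ? titleMatch[1].trim() : "";

  // 2. Prefer <main>, then <article>; otherwise fall back to the whole document.
  const region =
    html.match(/<main\b[^>]*>([\s\S]*?)<\/main>/i) ||
    html.match(/<article\b[^>]*>([\s\S]*?)<\/article>/i);
  let body = region ? region[1] : html;

  // 3. Drop elements whose content is not readable text (assumed tag list).
  body = body.replace(
    /<(script|style|noscript|head|title|nav|header|footer)\b[^>]*>[\s\S]*?<\/\1>/gi,
    " "
  );

  // 4. Strip the remaining tags and collapse whitespace.
  const text = body.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
  return { title, text };
}

const demo = extractReadable(
  "<html><head><title> Demo </title></head><body><nav>Home</nav>" +
    "<main><h1>Hello</h1><script>x()</script><p>World</p></main></body></html>"
);
console.log(demo);
```

A regex approach breaks on nested or malformed markup, which is one reason a real parser is the better choice outside of a sketch.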

| param | type | default | description |
|---|---|---|---|
| url | string | required | Any valid URL |
| max_chars | number | 20000 | Output character cap |
| timeout_ms | number | 10000 | Request timeout in ms |

Response:

{
  "ok": true,
  "url": "https://example.com/article",
  "status": 200,
  "title": "Article title",
  "text": "Clean readable content without any HTML...",
  "char_count": 4821,
  "truncated": false,
  "elapsed_ms": 248
}
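
One plausible way the `max_chars` cap could produce the `char_count` and `truncated` fields is sketched below. It assumes that `char_count` reports the length of the returned text after capping, which this README does not state. `capText` is a hypothetical helper name.

```javascript
// Sketch: cap extracted text at maxChars and report whether anything was cut.
// Assumption: char_count is the length of the text actually returned.
function capText(text, maxChars = 20000) {
  const truncated = text.length > maxChars;
  // slice() counts UTF-16 code units, so a cut can land inside a surrogate pair.
  const out = truncated ? text.slice(0, maxChars) : text;
  return { text: out, char_count: out.length, truncated };
}

console.log(capText("abcdef", 4));
console.log(capText("abc", 4));
```

An agent that sees `truncated: true` can retry with a larger `max_chars` when the missing tail matters.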

Examples:

# Read an article or documentation page
fetch_text("https://docs.example.com/guide")

# Read a manifesto or about page
fetch_text("https://unpredictablemachine.com/manifesto")

# Read llms.txt
fetch_text("https://example.com/llms.txt")

# Limit output for large pages
fetch_text("https://en.wikipedia.org/wiki/Node.js", max_chars=5000)

Limits:

  • Does not execute JavaScript — SPAs and dynamically rendered content may return empty or partial text
  • Does not handle bot protection or CAPTCHAs
  • Not a replacement for a headless browser

http_request — full HTTP client

Universal HTTP client with full control over method, headers, and body. Use for REST APIs, form posts, webhooks, localhost, and internal network addresses.

| param | type | default | description |
|---|---|---|---|
| url | string | required | Any valid URL (https, http, localhost, internal IP) |
| method | string | GET | GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS |
| headers | object | {} | Custom request headers |
| body | string | | Raw request body (XML, form-data, plain text) |
| body_json | object | | Auto-serialized JSON + sets Content-Type: application/json |
| timeout_ms | number | 10000 | Request timeout in ms |
| max_bytes | number | 500000 | Response body size cap |

body_json takes priority over body when both are provided.
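
The priority rule can be sketched as a small function that turns tool parameters into options for Node's built-in `fetch`. The name `buildFetchInit` is hypothetical, and overriding a caller-supplied Content-Type when `body_json` is present is an assumption about the behavior, not something this README confirms.

```javascript
// Sketch: map http_request parameters to a fetch() init object.
function buildFetchInit({ method = "GET", headers = {}, body, body_json } = {}) {
  // Headers normalizes header-name casing, so set() cannot create duplicates.
  const init = { method, headers: new Headers(headers) };

  if (body_json !== undefined) {
    // body_json wins over body when both are provided.
    init.body = JSON.stringify(body_json);
    init.headers.set("Content-Type", "application/json");
  } else if (body !== undefined) {
    init.body = body;
  }
  return init;
}

const init = buildFetchInit({
  method: "POST",
  body: "<root/>",
  body_json: { title: "Hello", published: true },
});
console.log(init.body, init.headers.get("content-type"));
```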

Response:

{
  "ok": true,
  "url": "https://api.example.com/posts",
  "method": "POST",
  "status": 201,
  "status_text": "Created",
  "content_type": "application/json",
  "headers": { "content-type": "application/json" },
  "body": "{\"id\": 42}",
  "truncated": false,
  "elapsed_ms": 142
}
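
A minimal sketch of how `timeout_ms` and `max_bytes` might be enforced with Node's built-in `fetch` is shown below. The helper names `capBody` and `requestWithLimits` are hypothetical, and buffering the full response before cutting it is a simplification; the actual server may stream and stop early.

```javascript
// Sketch: cut a response body to maxBytes and flag the truncation.
function capBody(bytes, maxBytes = 500000) {
  const truncated = bytes.length > maxBytes;
  const kept = truncated ? bytes.subarray(0, maxBytes) : bytes;
  // A cut inside a multi-byte character decodes to U+FFFD at the end.
  return { body: new TextDecoder().decode(kept), truncated };
}

// Sketch: one request with a timeout and a size cap, shaped like the response above.
async function requestWithLimits(url, { timeout_ms = 10000, max_bytes = 500000, ...init } = {}) {
  const started = Date.now();
  // AbortSignal.timeout() aborts the request once timeout_ms has elapsed.
  const res = await fetch(url, { ...init, signal: AbortSignal.timeout(timeout_ms) });
  const bytes = new Uint8Array(await res.arrayBuffer());
  return {
    ok: res.ok,
    url,
    method: init.method ?? "GET",
    status: res.status,
    status_text: res.statusText,
    content_type: res.headers.get("content-type"),
    headers: Object.fromEntries(res.headers),
    ...capBody(bytes, max_bytes),
    elapsed_ms: Date.now() - started,
  };
}

console.log(capBody(new TextEncoder().encode("hello world"), 5));
```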

Examples:

# REST POST with JSON body
http_request("https://api.example.com/posts",
  method="POST",
  body_json={"title": "Hello", "published": true})

# PUT with Authorization header
http_request("https://api.example.com/users/1",
  method="PUT",
  headers={"Authorization": "Bearer TOKEN"},
  body_json={"name": "Pavel"})

# Raw XML payload
http_request("https://legacy.api/endpoint",
  method="POST",
  headers={"Content-Type": "application/xml"},
  body="<root><item>value</item></root>")

# DELETE
http_request("http://localhost:8080/api/posts/42", method="DELETE")

# Internal network
http_request("http://192.168.1.100:8080/api/status")

When to use which

| situation | tool |
|---|---|
| Reading articles, docs, blog posts | fetch_text |
| Reading llms.txt or plain text files | fetch_text |
| REST API calls (POST / PUT / DELETE) | http_request |
| Raw response body or headers needed | http_request |
| Localhost or internal network | both work |
| JavaScript-rendered SPA | neither (use a browser) |

Logging

All requests are logged to .var/requests.log, one JSON line per request:

{"ts":"2026-06-10T08:20:00.000Z","tool":"fetch_text","method":"GET","url":"https://example.com","status":200,"ok":true,"elapsed_ms":248}
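
Because each entry is a standalone JSON line, the log is easy to post-process. The sketch below parses log text and reports the request count, failures, and average latency. It assumes every line carries the `ok` and `elapsed_ms` fields shown above, and `summarizeLog` is a hypothetical helper, not part of this project.

```javascript
// Sketch: summarize JSON-lines log text (one JSON object per line).
function summarizeLog(logText) {
  const entries = logText
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));

  const failures = entries.filter((e) => !e.ok).length;
  const totalMs = entries.reduce((sum, e) => sum + e.elapsed_ms, 0);
  return {
    count: entries.length,
    failures,
    avg_elapsed_ms: entries.length ? totalMs / entries.length : 0,
  };
}

// The second line is an invented failure entry, used only for illustration.
const sampleLog = [
  '{"ts":"2026-06-10T08:20:00.000Z","tool":"fetch_text","method":"GET","url":"https://example.com","status":200,"ok":true,"elapsed_ms":248}',
  '{"ts":"2026-06-10T08:21:00.000Z","tool":"http_request","method":"GET","url":"https://example.com/missing","status":404,"ok":false,"elapsed_ms":152}',
].join("\n");

console.log(summarizeLog(sampleLog));
```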

The log rotates at about 1 MB and keeps one .1 backup. Configure or disable it in src/Config.js:

LOG_FILE: '.var/requests.log',  // '' = disabled
LOG_MAX_BYTES: 1_000_000
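
The rotation behavior described above could look roughly like the sketch below. It is an illustration under assumptions, not the project's logger: `appendLog` is a hypothetical name, the size check happens before each write, and CommonJS `require` is used for brevity.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Sketch: append one JSON line, rotating to "<file>.1" once the size limit is reached.
function appendLog(entry, { logFile = ".var/requests.log", maxBytes = 1_000_000 } = {}) {
  if (!logFile) return; // '' = logging disabled

  fs.mkdirSync(path.dirname(logFile), { recursive: true });

  // Rotate before writing; renameSync replaces any previous .1 backup.
  if (fs.existsSync(logFile) && fs.statSync(logFile).size >= maxBytes) {
    fs.renameSync(logFile, `${logFile}.1`);
  }
  fs.appendFileSync(logFile, JSON.stringify(entry) + "\n");
}
```

Keeping a single backup bounds disk usage at roughly twice `LOG_MAX_BYTES`.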

Install & build

build.cmd

The script installs dependencies, bundles everything to dist/mcp.js, and runs the tests. The dist/ folder is self-contained: no node_modules is needed at runtime.

Claude Desktop config

{
  "mcpServers": {
    "mcp-web-fetch": {
      "command": "node",
      "args": ["D:/dev/ai/mcp-web-fetch/dist/mcp.js"]
    }
  }
}

Stack

  • Node.js 22+ (native fetch built-in, no extra HTTP dependency)
  • @modelcontextprotocol/sdk
  • node-html-parser — fast pure-JS HTML parser, no native bindings
  • zod + zod-to-json-schema
  • esbuild (build only)
