diff --git a/content/blog/netlify-edge-excludedpath-ai-crawlers.md b/content/blog/netlify-edge-excludedpath-ai-crawlers.md index 674ee71..fe139f8 100644 --- a/content/blog/netlify-edge-excludedpath-ai-crawlers.md +++ b/content/blog/netlify-edge-excludedpath-ai-crawlers.md @@ -5,7 +5,7 @@ date: "2025-12-21" slug: "netlify-edge-excludedpath-ai-crawlers" published: true tags: ["netlify", "edge-functions", "ai", "troubleshooting", "help"] -readTime: "4 min read" +readTime: "5 min read" featured: false --- @@ -36,78 +36,157 @@ The page could not be loaded with the tools currently available, so its raw mark **Claude:** Works. Loads and reads the markdown successfully. -## Current configuration +## Attempted solutions log -Static files exist in `public/raw/` and are served via `_redirects`: +### December 24, 2025 -``` -/raw/* /raw/:splat 200 -``` +**Attempt 1: excludedPath in netlify.toml** -Edge function configuration in `netlify.toml`: +Added array of excluded paths to the edge function declaration: ```toml [[edge_functions]] path = "/*" function = "botMeta" - excludedPath = "/raw/*" + excludedPath = [ + "/raw/*", + "/assets/*", + "/api/*", + "/.netlify/*", + "/favicon.ico", + "/favicon.svg", + "/robots.txt", + "/sitemap.xml", + "/llms.txt", + "/openapi.yaml" + ] ``` -The `botMeta` function also has a code-level check: +Result: ChatGPT and Perplexity still blocked. 
+ +**Attempt 2: Hard bypass in botMeta.ts** + +Added early return at top of handler to guarantee static markdown is never intercepted: ```typescript -// Skip if it's the home page, static assets, API routes, or raw markdown files +const url = new URL(request.url); if ( - pathParts.length === 0 || - pathParts[0].includes(".") || - pathParts[0] === "api" || - pathParts[0] === "_next" || - pathParts[0] === "raw" // This check exists + url.pathname.startsWith("/raw/") || + url.pathname.startsWith("/assets/") || + url.pathname.startsWith("/api/") || + url.pathname.startsWith("/.netlify/") || + url.pathname.endsWith(".md") ) { return context.next(); } ``` -## Why it's not working +Result: ChatGPT and Perplexity still blocked. -Despite `excludedPath = "/raw/*"` and the code check, the edge function still intercepts requests to `/raw/*.md` before static files are served. +**Attempt 3: AI crawler whitelist** -According to Netlify docs, edge functions run before redirects and static file serving. The `excludedPath` should prevent the function from running, but it appears the function still executes and may be returning a response that blocks static file access. - -## What we've tried - -1. Added `excludedPath = "/raw/*"` in netlify.toml -2. Added code-level check in botMeta.ts to skip `/raw/` paths -3. Verified static files exist in `public/raw/` after build -4. Confirmed `_redirects` rule for `/raw/*` is in place -5. Tested with different URLPattern syntax (`/raw/*`, `/**/*.md`) - -All attempts result in the same behavior: ChatGPT and Perplexity cannot access the files, while Claude can. - -## Why Claude works - -Claude's web fetcher may use different headers or handle Netlify's edge function responses differently. It successfully bypasses whatever is blocking ChatGPT and Perplexity. - -## The question - -How can we configure Netlify edge functions to truly exclude `/raw/*` paths so static markdown files are served directly to all AI crawlers without interception? 
- -Is there a configuration issue with `excludedPath`? Should we use a different approach like header-based matching to exclude AI crawlers from the botMeta function? Or is there a processing order issue where edge functions always run before static files regardless of exclusions? - -## Code reference - -The CopyPageDropdown component sends these URLs to AI services: +Added explicit bypass for known AI user agents: ```typescript -const rawMarkdownUrl = `${origin}/raw/${props.slug}.md`; +const AI_CRAWLERS = [ + "gptbot", "chatgpt", "chatgpt-user", "oai-searchbot", + "claude-web", "claudebot", "anthropic", "perplexitybot" +]; + +if (isAICrawler(userAgent)) { + return context.next(); +} ``` -Example: `https://www.markdown.fast/raw/fork-configuration-guide.md` +Result: ChatGPT and Perplexity still blocked. -The files exist. The redirects are configured. The edge function has exclusions. But AI crawlers still cannot access them. +**Attempt 4: Netlify Function at /api/raw/:slug** + +Created a serverless function to serve markdown files directly: + +```javascript +// netlify/functions/raw.js +exports.handler = async (event) => { + const slug = event.queryStringParameters?.slug; + // Read from dist/raw/${slug}.md or public/raw/${slug}.md + return { + statusCode: 200, + headers: { "Content-Type": "text/plain; charset=utf-8" }, + body: markdownContent + }; +}; +``` + +With redirect rule: + +```toml +[[redirects]] + from = "/api/raw/*" + to = "/.netlify/functions/raw?slug=:splat" + status = 200 + force = true +``` + +Result: Netlify build failures due to function bundling issues and `package-lock.json` dependency conflicts. 
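Attempt 3 above calls `isAICrawler(userAgent)`, but the diff does not show its body. A minimal sketch of what such a helper might look like follows, assuming a case-insensitive substring match against the `AI_CRAWLERS` list from the post. The function body and the handling of a missing header are assumptions for illustration, not the repository's actual implementation.

```typescript
// Tokens copied from the AI_CRAWLERS list in Attempt 3 of the post.
const AI_CRAWLERS: readonly string[] = [
  "gptbot", "chatgpt", "chatgpt-user", "oai-searchbot",
  "claude-web", "claudebot", "anthropic", "perplexitybot",
];

// Hypothetical helper: the real isAICrawler is not shown in this diff.
// Returns true when the User-Agent contains any known token,
// compared case-insensitively. A missing or empty header is treated
// as "not an AI crawler" (an assumption made for this sketch).
function isAICrawler(userAgent: string | null | undefined): boolean {
  if (!userAgent) return false;
  const ua = userAgent.toLowerCase();
  return AI_CRAWLERS.some((token) => ua.includes(token));
}
```

In an edge function the value would typically come from `request.headers.get("user-agent")`, which returns `null` when the header is absent, hence the nullable parameter. Note that the post reports this bypass did not unblock ChatGPT or Perplexity, so a check of this kind addresses only the edge function path and not whatever returns the 400 or 403 responses.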
+ +**Attempt 5: Header adjustments** + +Removed `Link` header from global scope to prevent header merging on `/raw/*`: + +```toml +[[headers]] + for = "/*" + [headers.values] + X-Frame-Options = "DENY" + # Link header removed from global scope + +[[headers]] + for = "/index.html" + [headers.values] + Link = "; rel=\"author\"" +``` + +Removed `X-Robots-Tag = "noindex"` from `/raw/*` headers. + +Result: ChatGPT and Perplexity still blocked. + +### Why these attempts failed + +The core issue appears to be how ChatGPT and Perplexity fetch URLs. Their tools receive 400 or 403 responses even when `curl` from the command line works. This suggests: + +1. Netlify may handle AI crawler user agents differently at the CDN level +2. The edge function exclusions work for browsers but not for AI fetch tools +3. There may be rate limiting or bot protection enabled by default + +## Current workaround + +Users can still share content with AI tools by: + +1. **Copy page** copies markdown to clipboard, then paste into any AI +2. **View as Markdown** opens the raw `.md` file in a browser tab for manual copying +3. **Download as SKILL.md** downloads in Anthropic Agent Skills format + +The direct "Open in ChatGPT/Claude/Perplexity" buttons have been disabled since the URLs don't work reliably. + +## Working features + +Despite AI crawler issues, these features work correctly: + +- `/raw/*.md` files load in browsers +- `llms.txt` discovery file is accessible +- `openapi.yaml` API spec loads properly +- Sitemap and RSS feeds generate correctly +- Social preview bots (Twitter, Facebook, LinkedIn) receive OG metadata +- Claude's web fetcher can access raw markdown ## Help needed -If you've solved this or have suggestions, we'd appreciate guidance. The goal is simple: serve static markdown files at `/raw/*.md` to all clients, including AI crawlers, without edge function interception. +If you've solved this or have suggestions, open an issue. 
We've tried: -GitHub raw URLs work as a workaround, but we'd prefer to use Netlify-hosted files for consistency and to avoid requiring users to configure GitHub repo details when forking. +- netlify.toml excludedPath arrays +- Code-level path checks in edge functions +- AI crawler user agent whitelisting +- Netlify Functions as an alternative endpoint +- Header configuration adjustments + +None have worked for ChatGPT or Perplexity. GitHub raw URLs remain the most reliable option for AI consumption, but require additional repository configuration when forking. diff --git a/netlify.toml b/netlify.toml index c4359c6..5fe7ede 100644 --- a/netlify.toml +++ b/netlify.toml @@ -5,13 +5,6 @@ [build.environment] NODE_VERSION = "20" -# API raw markdown endpoint for AI tools (ChatGPT, Claude, Perplexity) -[[redirects]] - from = "/api/raw/*" - to = "/.netlify/functions/raw?slug=:splat" - status = 200 - force = true - # Raw markdown passthrough - explicit rule prevents SPA fallback from intercepting [[redirects]] from = "/raw/*" diff --git a/netlify/functions/raw.js b/netlify/functions/raw.js deleted file mode 100644 index ab0a6a7..0000000 --- a/netlify/functions/raw.js +++ /dev/null @@ -1,77 +0,0 @@ -const fs = require("fs"); -const path = require("path"); - -/** - * Netlify Function: /api/raw/:slug - * - * Serves raw markdown files for AI tools (ChatGPT, Claude, Perplexity). - * Returns text/plain with minimal headers for reliable AI ingestion. 
- */ - -function normalizeSlug(input) { - return (input || "").trim().replace(/^\/+|\/+$/g, ""); -} - -function tryRead(p) { - try { - if (!fs.existsSync(p)) return null; - const body = fs.readFileSync(p, "utf8"); - if (!body || body.trim().length === 0) return null; - return body; - } catch { - return null; - } -} - -exports.handler = async (event) => { - const slugRaw = - event.queryStringParameters && event.queryStringParameters.slug; - const slug = normalizeSlug(slugRaw); - - if (!slug) { - return { - statusCode: 400, - headers: { - "Content-Type": "text/plain; charset=utf-8", - "Access-Control-Allow-Origin": "*", - }, - body: "missing slug", - }; - } - - const filename = slug.endsWith(".md") ? slug : `${slug}.md`; - const root = process.cwd(); - - const candidates = [ - path.join(root, "public", "raw", filename), - path.join(root, "dist", "raw", filename), - ]; - - let body = null; - for (const p of candidates) { - body = tryRead(p); - if (body) break; - } - - if (!body) { - return { - statusCode: 404, - headers: { - "Content-Type": "text/plain; charset=utf-8", - "Access-Control-Allow-Origin": "*", - }, - body: `not found: ${filename}`, - }; - } - - return { - statusCode: 200, - headers: { - "Content-Type": "text/plain; charset=utf-8", - "Access-Control-Allow-Origin": "*", - "Cache-Control": "public, max-age=3600", - }, - body, - }; -}; - diff --git a/src/components/CopyPageDropdown.tsx b/src/components/CopyPageDropdown.tsx index a762d1e..0ed02fe 100644 --- a/src/components/CopyPageDropdown.tsx +++ b/src/components/CopyPageDropdown.tsx @@ -1,85 +1,9 @@ import { useState, useRef, useEffect, useCallback } from "react"; -import { - Copy, - MessageSquare, - Sparkles, - Search, - Check, - AlertCircle, - FileText, - Download, -} from "lucide-react"; +import { Copy, Check, AlertCircle, FileText, Download } from "lucide-react"; // Maximum URL length for query parameters (conservative limit) const MAX_URL_LENGTH = 6000; -// AI service configurations -interface 
AIService { - id: string; - name: string; - icon: typeof Copy; - baseUrl: string; - description: string; - supportsUrlPrefill: boolean; - // Custom URL builder for services with special formats - buildUrl?: (prompt: string) => string; - // URL-based builder - takes raw markdown file URL for better AI parsing - buildUrlFromRawMarkdown?: (rawMarkdownUrl: string) => string; -} - -// AI services configuration - uses raw markdown URLs for better AI parsing -const AI_SERVICES: AIService[] = [ - { - id: "chatgpt", - name: "ChatGPT", - icon: MessageSquare, - baseUrl: "https://chatgpt.com/", - description: "Analyze with ChatGPT", - supportsUrlPrefill: true, - // Uses raw markdown file URL for direct content access - buildUrlFromRawMarkdown: (rawMarkdownUrl) => { - const prompt = - `Attempt to load and read the raw markdown at the URL below.\n` + - `If successful provide a concise summary and then ask what the user needs help with.\n` + - `If not accessible do not guess the content. State that the page could not be loaded and ask the user how you can help.\n\n` + - `${rawMarkdownUrl}`; - return `https://chatgpt.com/?q=${encodeURIComponent(prompt)}`; - }, - }, - { - id: "claude", - name: "Claude", - icon: Sparkles, - baseUrl: "https://claude.ai/", - description: "Analyze with Claude", - supportsUrlPrefill: true, - buildUrlFromRawMarkdown: (rawMarkdownUrl) => { - const prompt = - `Attempt to load and read the raw markdown at the URL below.\n` + - `If successful provide a concise summary and then ask what the user needs help with.\n` + - `If not accessible do not guess the content. 
State that the page could not be loaded and ask the user how you can help.\n\n` + - `${rawMarkdownUrl}`; - return `https://claude.ai/new?q=${encodeURIComponent(prompt)}`; - }, - }, - { - id: "perplexity", - name: "Perplexity", - icon: Search, - baseUrl: "https://www.perplexity.ai/search", - description: "Research with Perplexity", - supportsUrlPrefill: true, - buildUrlFromRawMarkdown: (rawMarkdownUrl) => { - const prompt = - `Attempt to load and read the raw markdown at the URL below.\n` + - `If successful provide a concise summary and then ask what the user needs help with.\n` + - `If not accessible do not guess the content. State that the page could not be loaded and ask the user how you can help.\n\n` + - `${rawMarkdownUrl}`; - return `https://www.perplexity.ai/search?q=${encodeURIComponent(prompt)}`; - }, - }, -]; - // Extended props interface with optional metadata interface CopyPageDropdownProps { title: string; @@ -321,67 +245,6 @@ export default function CopyPageDropdown(props: CopyPageDropdownProps) { setTimeout(() => setIsOpen(false), 1500); }; - // Generic handler for opening AI services - // Uses /api/raw/:slug endpoint for AI tools (ChatGPT, Claude, Perplexity) - // IMPORTANT: window.open must happen BEFORE any await to avoid popup blockers - const handleOpenInAI = async (service: AIService) => { - // Use /api/raw/:slug endpoint for AI tools - more reliable than static /raw/*.md files - if (service.buildUrlFromRawMarkdown) { - // Build absolute API URL using current origin - // Uses Netlify Function endpoint that returns text/plain with minimal headers - const apiRawUrl = new URL( - `/api/raw/${props.slug}`, - window.location.origin, - ).toString(); - const targetUrl = service.buildUrlFromRawMarkdown(apiRawUrl); - - window.open(targetUrl, "_blank"); - setIsOpen(false); - return; - } - - // Other services: send full markdown content - const markdown = formatAsMarkdown(props); - const prompt = `Please analyze this article:\n\n${markdown}`; - - // Build 
the target URL using the service's buildUrl function - if (!service.buildUrl) { - // Fallback: open base URL FIRST (sync), then copy to clipboard - window.open(service.baseUrl, "_blank"); - const success = await writeToClipboard(markdown); - if (success) { - setFeedback("url-too-long"); - setFeedbackMessage("Copied! Paste in " + service.name); - } else { - setFeedback("error"); - setFeedbackMessage("Failed to copy content"); - } - clearFeedback(); - return; - } - - const targetUrl = service.buildUrl(prompt); - - // Check URL length - if too long, open base URL then copy to clipboard - if (isUrlTooLong(targetUrl)) { - // Open window FIRST (must be sync to avoid popup blocker) - window.open(service.baseUrl, "_blank"); - const success = await writeToClipboard(markdown); - if (success) { - setFeedback("url-too-long"); - setFeedbackMessage("Copied! Paste in " + service.name); - } else { - setFeedback("error"); - setFeedbackMessage("Failed to copy content"); - } - clearFeedback(); - } else { - // URL is within limits, open directly with prefilled content - window.open(targetUrl, "_blank"); - setIsOpen(false); - } - }; - // Handle download skill file (Anthropic Agent Skills format) const handleDownloadSkill = () => { const skillContent = formatAsSkill(props); @@ -423,6 +286,10 @@ export default function CopyPageDropdown(props: CopyPageDropdownProps) { } }; + // Suppress unused variable warnings for functions that may be used later + void isUrlTooLong; + void MAX_URL_LENGTH; + return (