Ever tried feeding API docs to an LLM, only to realize you're stuck clicking through fifty pages manually? Yeah, I've been there. Turns out, there's a better way—one that doesn't involve turning into a human web-scraper for three hours straight.
This n8n workflow handles the boring part: it crawls through entire API reference sites, captures every page (even the nested stuff), and spits out clean Markdown ready for your AI assistant. No more copy-pasting, no more missed pages, just automated documentation extraction that actually works.
The workflow kicks off with a simple form where you drop in the first page of an API reference. From there, it's pretty much hands-off:
First, it sets up the workspace. A fresh Google Doc gets created automatically—this becomes your documentation dump. The workflow grabs the URL you provided and gets ready to loop through every page it can find.
Then comes the scraping magic. Using Puppeteer (specifically the community node from npmjs.com/package/n8n-nodes-puppeteer), the workflow does something most scrapers don't: it actually clicks things. It unfurls all those collapsed sections and nested menus that hide half the content on modern documentation sites. Takes a full-page screenshot too, because context matters.
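A minimal sketch of that navigate-and-capture step, assuming a wrapper function of my own naming; the `goto`, `evaluate`, and `screenshot` calls are standard Puppeteer API:

```javascript
// Sketch of the navigate-and-capture step. The capturePage wrapper is an
// assumption about this workflow; the page.* calls are standard Puppeteer.
async function capturePage(page, url) {
  // Wait for the network to settle so client-side-rendered docs finish loading.
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Grab the visible text of the page (after any expansion clicks have run).
  const text = await page.evaluate(() => document.body.innerText);

  // Full-page screenshot as base64, ready to ship to Gemini.
  const screenshot = await page.screenshot({ fullPage: true, encoding: 'base64' });

  return { text, screenshot };
}
```

The `networkidle2` wait matters on modern doc sites: the HTML shell arrives instantly but the actual reference content renders client-side a beat later.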
Now here's where it gets interesting. Both the screenshot and the raw text get fed to Gemini. The LLM looks at the visual structure alongside the content and converts everything into proper Markdown. This dual-input approach catches formatting that pure text scraping misses—code blocks, headers, nested lists, all that structure developers actually need.
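The dual-input request is roughly this shape. The prompt wording here is my guess, but the `contents`/`parts`/`inline_data` layout is the Gemini `generateContent` request format, with text and image parts sitting side by side:

```javascript
// Hedged sketch of the Gemini generateContent payload: one parts array mixing
// the instruction prompt, the scraped text, and the screenshot. Prompt text
// is an assumption, not copied from the workflow.
function buildGeminiRequest(pageText, screenshotBase64) {
  return {
    contents: [{
      parts: [
        { text: 'Convert this API documentation page to clean Markdown. ' +
                'Preserve headers, code blocks, and nested lists.' },
        { text: pageText },
        { inline_data: { mime_type: 'image/png', data: screenshotBase64 } },
      ],
    }],
  };
}
```

Because the screenshot rides along in the same `parts` array as the text, the model can cross-reference the visual hierarchy against the raw content when deciding what's a header versus a code block.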
The scraped content flows into your Google Doc, accumulating page by page. And here's the kicker: the workflow automatically hunts for the "Next" button on each page. Found it? Loop continues. No next button? Job's done.
Quick reality check though: This was built with Fern documentation in mind. If your API docs don't use a "Next" button navigation pattern, you'll need to tweak the script. But the core logic is solid—you're mostly just adjusting selectors to match whatever structure you're scraping.
Oh, and fair warning: this thing scrapes everything. Deprecated endpoints, beta features, that one weird legacy method nobody uses anymore—it all gets captured. You'll probably want to prune the output afterward, but at least you won't accidentally miss something important.
When dealing with complex documentation structures or sites that aggressively block automated access, having a robust scraping solution becomes essential. 👉 Get reliable API documentation extraction with ScraperAPI's infrastructure that handles anti-bot measures, rotating proxies, and JavaScript rendering automatically—perfect for maintaining consistent documentation workflows without the headache of manual intervention.

The workflow chain breaks down into specific nodes, each handling one piece of the puzzle:
Form Trigger → Gets the starting URL from you
Google Docs Creation → Spins up the storage document
URL Initialization → Sets up the loop variables
Puppeteer Script → The heavy lifter—navigates, expands collapsibles, extracts content
Code Node → Escapes text for safe API transmission
Screenshot Capture → Grabs the visual layout
Gemini Upload → Ships the screenshot to Google's API
Gemini Processing → Converts everything to Markdown
Google Docs Update → Appends the new content
Conditional Check → Looks for the next page
Loop Preparation → Sets up the next iteration
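The initialization and loop-preparation nodes boil down to carrying a small state object between iterations. A sketch, with all field names assumed rather than taken from the actual workflow:

```javascript
// Sketch of the loop bookkeeping done by "URL Initialization" and
// "Loop Preparation". Field names (currentUrl, pageCount, visited) are assumed.
function prepareNextIteration(state, nextUrl) {
  if (!nextUrl) return null; // conditional check: no Next button, loop ends

  return {
    currentUrl: nextUrl,
    pageCount: state.pageCount + 1,
    visited: [...state.visited, state.currentUrl], // guards against nav cycles
  };
}
```

Returning `null` when no next URL exists is what the conditional-check node keys off to terminate cleanly.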
The Puppeteer script does most of the work. It tries multiple strategies to find and click expandable sections—looking for data-state attributes, aria-expanded properties, details elements, even text-based patterns like "show more" buttons. Casts a wide net to catch different documentation frameworks.
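Inside `page.evaluate`, those strategies reduce to a couple of checks per candidate element. A sketch of the predicates; the exact selector list and text patterns are illustrative, not lifted from the script:

```javascript
// Predicates the expansion pass can apply to each candidate element.
// attrs is a plain map of the element's attributes; text is its visible label.
function looksCollapsed(attrs) {
  return attrs['data-state'] === 'closed'     // Radix/Fern-style collapsibles
      || attrs['aria-expanded'] === 'false';  // ARIA disclosure widgets
}

function looksLikeExpander(text) {
  // Text-based fallback for buttons carrying no helpful attributes.
  return /show more|expand|view all/i.test(text.trim());
}

// Selectors to query before applying the predicates; <details> elements can
// simply have their `open` attribute set instead of being clicked.
const candidateSelectors = ['[data-state]', '[aria-expanded]', 'details', 'button'];
```

In the real script these would run inside `page.evaluate`, clicking anything the predicates flag, so the text extraction afterward sees the fully expanded DOM.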
For the next-page detection, it runs through four fallback strategies: specific CSS selectors for Fern-style navigation, text-based searches for "Next" links, arrow icon detection, and finally checking pagination containers. First match wins, absolute URL gets extracted, loop continues.
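The fallback chain itself is just "first truthy result wins," plus URL resolution for relative hrefs. A sketch under the assumption that each strategy is a closure returning an href or nothing:

```javascript
// Run each detection strategy in order and keep the first hit. In the
// workflow, the four strategies would be: Fern-style CSS selector, "Next"
// text search, arrow icon lookup, pagination container check.
function firstMatch(strategies) {
  for (const strategy of strategies) {
    const href = strategy();
    if (href) return href;
  }
  return null; // no Next link anywhere: the conditional check stops the loop
}

// Relative hrefs from the page get resolved against the current page URL.
function toAbsoluteUrl(href, baseUrl) {
  return new URL(href, baseUrl).href;
}
```

Ordering the strategies from most specific (framework selectors) to most generic (pagination containers) keeps false positives down: the generic patterns only fire when nothing better matched.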
The escaping logic in the Code node might seem overkill, but trust me—API documentation is full of weird characters, quotes inside quotes, backslashes in code examples. You need solid escaping or your API calls start failing randomly halfway through a 40-page scrape.
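A sketch of that escaping pass; the replacement set below is what a typical Code node needs before inlining scraped text into a JSON request body (not necessarily the workflow's exact implementation):

```javascript
// Escape scraped text for inlining into a JSON string field.
// Backslashes must be handled first, or the later rules would double-escape
// the backslashes they themselves introduce.
function escapeForApi(text) {
  return text
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r')
    .replace(/\t/g, '\\t');
}
```

In practice, `JSON.stringify(text).slice(1, -1)` does the same job and also covers stray control characters, which is the safer choice if you're rebuilding this node from scratch.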
At the end, you've got a single Google Doc containing every page of the API reference in Markdown format. Headers are preserved, code blocks are intact, structure makes sense. Copy it into your LLM context window and you're good to go—no manual cleanup needed beyond maybe removing sections you don't care about.
The workflow's pretty forgiving too. If a page fails to load or Puppeteer hits an error, it logs what happened and stops the loop cleanly. Won't leave you with a half-scraped mess wondering what went wrong.
For anyone regularly working with APIs—whether you're building integrations, writing documentation, or just trying to understand someone else's endpoints—having the full reference in a format your AI can actually process changes the game. No more "I need to look that up real quick" interrupting your flow every five minutes.
Look, manually scraping documentation is tedious. Building something that does it automatically? That's just practical. This workflow takes maybe twenty minutes to set up (longer if you're adapting it for non-Fern docs), then it just works. Feed it a starting URL, walk away, come back to complete Markdown documentation.
The JSON export's included in the original post if you want to import it directly into n8n. Tweak the selectors if needed, point it at your docs, let it run. Pretty straightforward once you see it in action.