Struggling with delivering scraped data on demand? Learn how to convert Python scrapers into high-performance APIs using FastAPI—handling thousands of requests with caching, webhooks, and async processing in under 100 lines of code.
Picture this: you've built a solid web scraper that extracts valuable data. It works beautifully when you run it manually. But now you need to deliver that data in real-time to multiple users or applications. What do you do?
The answer is simpler than you think. We're going to turn a Python web scraper into a proper data API—one that can handle concurrent requests, cache results intelligently, and even support webhooks for long-running scrapes.
We'll build this using Yahoo Finance as our example, scraping stock data on demand. By the end, you'll have a production-ready API that serves stock data to thousands of requests with minimal overhead.
FastAPI isn't just another Python framework. It's asynchronous by nature, which is exactly what we need for web scraping APIs. When you're waiting for HTTP responses or parsing HTML, async code keeps everything running smoothly without blocking.
The beauty is in the efficiency. Our complete Yahoo stock API—with caching and webhook support—will clock in at less than 100 lines. That's the power of modern Python tooling.
If you're dealing with distributed scraping or need to scale beyond a single machine, async programming becomes even more critical. It's the difference between handling dozens versus hundreds of concurrent requests on the same hardware.
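To make that difference concrete, here's a minimal, standard-library-only sketch (no scraping involved, the symbols are just labels): five simulated one-second I/O waits run concurrently with `asyncio.gather`, so the whole batch finishes in roughly one second instead of five.

```python
import asyncio
import time

async def fake_scrape(symbol: str) -> str:
    # Stands in for an HTTP round trip that takes ~1 second
    await asyncio.sleep(1.0)
    return symbol

async def main() -> list[str]:
    symbols = ["AAPL", "MSFT", "GOOG", "AMZN", "TSLA"]
    # Schedule all five "scrapes" at once; they overlap instead of queueing
    return await asyncio.gather(*(fake_scrape(s) for s in symbols))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(f"{len(results)} scrapes in {elapsed:.2f}s")  # ~1s, not ~5s
```

Sequential code would pay for each wait in turn; async code pays for them once, which is exactly what happens when an API endpoint is waiting on a remote server.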
We need a few solid packages for this project. FastAPI handles our API server, uvicorn serves it up, and loguru gives us clean logging so we can actually see what's happening.
For the scraping side, httpx acts as our HTTP client (it's like requests but async-ready), and parsel makes HTML parsing straightforward with XPath selectors.
Install everything with pip:
```bash
pip install fastapi uvicorn loguru httpx parsel
```
That's it. No complicated setup, no virtual environment wrestling (though you should probably use one anyway).
FastAPI's documentation is excellent, but we only need the basics. Let's start with a single file called main.py:
```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/scrape/stock/{symbol}")
async def scrape_stock(symbol: str):
    return {"symbol": symbol, "status": "placeholder"}
```
This defines our first route: /scrape/stock/<symbol>. The symbol parameter tells us which stock to scrape. Apple's stock symbol is AAPL, Microsoft is MSFT, and so on.
Run the API:
```bash
uvicorn main:app --reload
```
Visit http://127.0.0.1:8000/scrape/stock/aapl and you'll see JSON:
```json
{"symbol": "aapl", "status": "placeholder"}
```
Every time someone hits this endpoint, our scrape_stock function executes and returns data. Now let's make it actually scrape something.
Yahoo Finance follows a predictable URL pattern: finance.yahoo.com/quote/<symbol>. As long as we know the stock symbol, we can scrape any company's data.
Let's grab Apple's page at finance.yahoo.com/quote/AAPL. We'll extract the price and all those summary details: previous close, day's range, market cap, PE ratio, and more.
Here's the scraper integrated into our API:
```python
import httpx
from parsel import Selector
from time import time

stock_client = httpx.AsyncClient(timeout=httpx.Timeout(10.0))

async def scrape_yahoo_finance(symbol):
    response = await stock_client.get(
        f"https://finance.yahoo.com/quote/{symbol}?p={symbol}"
    )
    sel = Selector(response.text)
    parsed = {}
    # Summary-table cells carry a data-test attribute like "PE_RATIO-value"
    for row in sel.xpath(
        '//div[re:test(@data-test,"(left|right)-summary-table")]//td[@data-test]'
    ):
        label = row.xpath("@data-test").get().split("-value")[0].lower()
        value = " ".join(row.xpath(".//text()").getall())
        parsed[label] = value
    # The live price streams into a <fin-streamer> custom element
    parsed["price"] = sel.css(
        f'fin-streamer[data-field="regularMarketPrice"][data-symbol="{symbol}"]::attr(value)'
    ).get()
    parsed["_scraped_on"] = time()
    return parsed

@app.get("/scrape/stock/{symbol}")
async def scrape_stock(symbol: str):
    return await scrape_yahoo_finance(symbol.upper())
```
@app.get("/scrape/stock/{symbol}")
async def scrape_stock(symbol: str):
return await scrape_yahoo_finance(symbol.upper())
Now when you curl the endpoint:
```bash
curl http://127.0.0.1:8000/scrape/stock/aapl
```
You get real stock data:
```json
{
  "prev_close": "156.90",
  "open": "157.34",
  "days_range": "153.67 - 158.74",
  "market_cap": "2.47T",
  "pe_ratio": "25.41",
  "price": "153.72",
  "_scraped_on": 1663838493.6148243
}
```
Perfect. But there's a problem lurking here.
What happens when ten users request Apple's stock data within the same minute? We scrape the same page ten times. That's wasteful, slow, and might even get us rate-limited.
The solution is caching. We'll store recent scrape results in memory and serve those instead of hitting Yahoo Finance repeatedly.
👉 If you're building production scrapers that need to handle complex anti-bot systems at scale, tools like ScraperAPI can handle rotating proxies and automatic retries so you can focus on your API logic instead of infrastructure headaches.
Here's caching added with minimal code:
```python
from loguru import logger as log

STOCK_CACHE = {}
CACHE_TIME = 10  # seconds

async def scrape_yahoo_finance(symbol):
    cache = STOCK_CACHE.get(symbol)
    # A cached entry is fresh if it was scraped within the last CACHE_TIME seconds
    if cache and time() - CACHE_TIME < cache["_scraped_on"]:
        log.debug(f"{symbol}: returning cached item")
        return cache
    log.info(f"{symbol}: scraping data")
    # ... scraping code from before ...
    STOCK_CACHE[symbol] = parsed
    return parsed
```
Now if the same stock is requested multiple times within 10 seconds, only the first request scrapes. Everyone else gets cached data instantly.
We also need to clean up expired cache periodically:
```python
import asyncio

async def clear_expired_cache(period=5.0):
    global STOCK_CACHE
    while True:
        # Keep only entries scraped within the last CACHE_TIME seconds
        STOCK_CACHE = {
            k: v for k, v in STOCK_CACHE.items()
            if time() - CACHE_TIME < v["_scraped_on"]
        }
        await asyncio.sleep(period)

@app.on_event("startup")
async def app_startup():
    asyncio.create_task(clear_expired_cache())
```
This background task runs every few seconds, removing stale entries. Python dictionaries are fast enough to cache thousands of results in memory without breaking a sweat.
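The expiry check is easiest to see in isolation. Here's a standard-library-only sketch of the same freshness test with a deliberately short TTL (the 0.2-second value and sample entry are illustrative; the API uses 10 seconds):

```python
import time

CACHE_TIME = 0.2  # shortened TTL so the flip is visible; the API uses 10s

cache = {"AAPL": {"price": "153.72", "_scraped_on": time.time()}}

def is_fresh(entry: dict) -> bool:
    # Same test the API uses: scraped within the last CACHE_TIME seconds
    return time.time() - CACHE_TIME < entry["_scraped_on"]

print(is_fresh(cache["AAPL"]))  # True right after scraping
time.sleep(0.3)
print(is_fresh(cache["AAPL"]))  # False once the TTL has elapsed
# The background task would now drop the stale entry:
cache = {k: v for k, v in cache.items() if is_fresh(v)}
print(cache)  # {}
```

A plain dict plus a timestamp is all the cache machinery this API needs; reach for Redis or similar only when you need persistence or multiple processes sharing one cache.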
Some scrape jobs take minutes to complete. You can't keep an HTTP connection open that long: clients will time out, and you tie up resources on both ends.
Enter webhooks. Instead of waiting for the scrape to finish, the API accepts the request immediately and promises to send results to a callback URL later.
Here's how it looks:
```python
import asyncio
from typing import Optional

async def with_webhook(cor, webhook, retries=3):
    # Run the scrape coroutine to completion, then deliver the result
    result = await cor
    async with httpx.AsyncClient(timeout=httpx.Timeout(15.0)) as client:
        for i in range(retries):
            try:
                await client.post(webhook, json=result)
                return
            except Exception:
                log.exception(f"Failed to send webhook {i + 1}/{retries}")
                await asyncio.sleep(5)

@app.get("/scrape/stock/{symbol}")
async def scrape_stock(symbol: str, webhook: Optional[str] = None):
    scrape_cor = scrape_yahoo_finance(symbol.upper())
    if webhook:
        # Respond immediately; the scrape continues in the background
        asyncio.create_task(with_webhook(scrape_cor, webhook))
        return {"success": True, "webhook": webhook}
    else:
        return await scrape_cor
```
Now you can call the API with a webhook parameter:
```bash
curl "http://127.0.0.1:8000/scrape/stock/aapl?webhook=https://webhook.site/your-unique-id"
```
The API responds instantly with confirmation, then scrapes in the background and posts results to your webhook URL. Test this with webhook.site to see it in action.
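If you'd rather test locally instead of webhook.site, a few lines of standard library will do. This receiver sketch (the port and the `run()` helper are illustrative, not part of the article's code) prints whatever JSON the API posts to it:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookReceiver(BaseHTTPRequestHandler):
    received = []  # collected payloads, handy for inspection

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        WebhookReceiver.received.append(payload)
        print("webhook payload:", payload)
        self.send_response(200)
        self.end_headers()

def run(port: int = 9000) -> None:
    """Block and serve; point ?webhook=http://127.0.0.1:9000 at this."""
    HTTPServer(("127.0.0.1", port), WebhookReceiver).serve_forever()
```

Run it in a second terminal, then call the API with `webhook=http://127.0.0.1:9000` and watch the scraped payload arrive.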
This pattern scales beautifully. As your scraping workload grows, you can spin up more worker processes to handle webhook tasks without blocking your main API.
Real-world scraping gets messy fast. Target sites add CAPTCHAs, block your IPs, serve different content to bots, or throttle requests. Your API ends up doing more anti-blocking work than actual API logic.
This is where specialized scraping infrastructure makes sense. ScrapFly offers web scraping, screenshot, and extraction APIs that handle all the complicated stuff—rotating proxies, browser automation, CAPTCHA solving, and more.
Your API stays clean and focused on business logic while ScrapFly deals with getting blocked or throttled. 👉 Plus, ScrapFly's built-in cache and webhook features integrate seamlessly with the patterns we've covered here.
We started with a simple scraper and transformed it into a production-ready API. It handles concurrent requests through async processing, serves cached results when appropriate, and supports webhooks for long-running jobs.
The complete implementation—caching, webhooks, error handling, and all—fits in under 100 lines of actual code. That's the power of FastAPI combined with Python's async ecosystem.
You can extend this pattern to any scraping project. Swap out Yahoo Finance for e-commerce sites, real estate listings, job boards, or whatever data you need. The architecture stays the same: async API layer, intelligent caching, and webhooks for heavy lifting.
Building data APIs from web scrapers isn't complicated when you have the right tools. FastAPI's async foundation makes it natural to handle multiple scraping requests efficiently. Add caching to avoid redundant scrapes, implement webhooks for long tasks, and suddenly you've got a scalable data service.
The patterns we covered—async processing, in-memory caching, background tasks, and webhook delivery—apply far beyond this Yahoo Finance example. Whether you're scraping product prices, aggregating social media data, or monitoring competitor websites, this architecture adapts to your needs. And when the anti-blocking challenges get serious, dedicated scraping infrastructure like ScrapFly handles the infrastructure complexity so you can focus on building great APIs.