Web scraping has been around for ages, powering everything from search engines to data science projects. If you've ever wondered how to extract data from websites programmatically, you're in the right place. Today, we're going to build a real-world scraper that pulls product reviews from Amazon using Node.js.
The best part? You don't need to be a JavaScript wizard to follow along.
We're going to create a script that crawls Amazon product pages and extracts customer reviews. Think of it as teaching your computer to read web pages and pull out the information you actually care about.
Here's what we'll be working with:
Node.js - Our runtime environment
Cheerio - A jQuery-like library for parsing HTML
A reliable crawling service - To handle the heavy lifting of web requests
First things first, let's get your development environment ready. Create a new folder for your project and initialize it:
npm init -y
Then install the dependencies we'll need:
npm install --save cheerio proxycrawl
Now, before we dive into the code, you'll need to handle one important aspect of web scraping: avoiding blocks and managing requests efficiently. Major websites like Amazon have sophisticated anti-bot measures, which is where professional crawling tools come in handy.
If you're serious about web scraping at scale, you'll want to look into reliable API solutions that handle proxy rotation and browser fingerprinting automatically. These services save you countless hours of dealing with CAPTCHAs and IP bans.
Here's where things get interesting. Create a file called crawl.js and let's break down what we're building:
Step 1: Load Your Dependencies and Set Up Configuration
```javascript
const fs = require('fs');
const { ProxyCrawlAPI } = require('proxycrawl');
const cheerio = require('cheerio');

const productFile = fs.readFileSync('amazon-products-list.txt');
const urls = productFile.toString().split('\n');
const api = new ProxyCrawlAPI({ token: '<< put your token here >>' });
const outputFile = 'reviews.txt';
```
You'll need a list of Amazon product URLs in a text file (one URL per line). This gives your crawler a roadmap of pages to visit.
Step 2: Create Your HTML Parser
This function does the actual extraction work:
```javascript
function parseHtml(html) {
  const $ = cheerio.load(html);
  const reviews = $('.review');
  reviews.each((i, review) => {
    const textReview = $(review).find('.review-text').text();
    console.log(textReview);
    // Append a newline so consecutive reviews don't run together in the file
    fs.appendFile(outputFile, textReview + '\n', (err) => {
      if (err) {
        console.log('Error writing file...');
      } else {
        console.log('review is saved to file');
      }
    });
  });
}
```
Cheerio lets you use CSS selectors just like you would in jQuery. It's intuitive and powerful.
Step 3: Implement Smart Request Throttling
Here's the engine that drives your crawler:
```javascript
const requestsThreshold = 10;
let currentIndex = 0;

const interval = setInterval(() => {
  for (let index = 0; index < requestsThreshold; index++) {
    if (currentIndex >= urls.length) {
      clearInterval(interval); // stop once every URL has been requested
      break;
    }
    api.get(urls[currentIndex]).then(response => {
      if (response.statusCode === 200 && response.originalStatus === 200) {
        parseHtml(response.body);
      } else {
        console.log('Failed: ', response.statusCode, response.originalStatus);
      }
    });
    currentIndex++;
  }
}, 1000);
```
This code fires off 10 requests per second, which strikes a good balance between speed and being respectful to the target server. When dealing with large-scale data extraction projects, proper rate limiting isn't just courteous; it's essential for maintaining stable, long-term crawlers.
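Under the hood, the interval above is just working through the URL list in fixed-size batches, one batch per tick. That slicing logic can be sketched as a pure function (this `batchUrls` helper is an illustration, not part of the script):

```javascript
// Split a list of URLs into consecutive batches of at most `size` items.
// With a 1-second interval, each batch corresponds to one tick.
function batchUrls(urls, size) {
  const batches = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push(urls.slice(i, i + size));
  }
  return batches;
}

const demo = Array.from({ length: 25 }, (_, i) => `https://example.com/p/${i}`);
console.log(batchUrls(demo, 10).map(b => b.length)); // [ 10, 10, 5 ]
```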
Run your script with node crawl.js and watch the magic happen. You'll see reviews streaming into your console, and they're simultaneously being saved to reviews.txt.
The reviews file gives you raw data that you can process however you like. Want to analyze sentiment? Feed it to a machine learning model? Build a competitive intelligence dashboard? The possibilities are endless once you have the data.
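As a small taste of that downstream processing, here's a word-frequency count over raw review text using nothing but plain JavaScript (the sample strings are made up, and a real sentiment pipeline would do far more, e.g. stop-word removal and stemming):

```javascript
// Count how often each word appears across a set of reviews.
function wordCounts(reviews) {
  const counts = {};
  for (const review of reviews) {
    // Lowercase and split on non-word characters; filter out empty tokens.
    for (const word of review.toLowerCase().split(/\W+/).filter(Boolean)) {
      counts[word] = (counts[word] || 0) + 1;
    }
  }
  return counts;
}

const sample = ['Great battery life', 'Battery died fast'];
console.log(wordCounts(sample));
// { great: 1, battery: 2, life: 1, died: 1, fast: 1 }
```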
While saving to a text file works for learning purposes, production scrapers need proper databases. Consider these options based on your needs:
MySQL - Great for structured, relational data
MongoDB - Perfect for flexible, document-based storage
Elasticsearch - Excellent for search and analytics workloads
Each has its strengths depending on how you plan to query and analyze your scraped data.
Web scraping opens up countless possibilities. You could build a price monitoring tool that tracks competitors, create a job aggregator that pulls listings from multiple sources, or gather market research data for business intelligence.
The key is understanding both the technical side (which we've covered here) and the ethical side. Always respect robots.txt files, rate limit your requests, and only scrape publicly available data.
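A robots.txt check can even be automated. Here's a deliberately naive sketch that tests a path against `Disallow` rules for the wildcard user-agent; real robots.txt parsing has more rules (`Allow` precedence, `*` and `$` patterns), so treat this as a starting point rather than a complete implementation:

```javascript
// Very simplified robots.txt check: honors Disallow prefixes under
// "User-agent: *" only; ignores Allow lines and wildcard patterns.
function isAllowed(robotsTxt, path) {
  let appliesToUs = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim();
    if (/^user-agent:/i.test(line)) {
      appliesToUs = line.split(':')[1].trim() === '*';
    } else if (appliesToUs && /^disallow:/i.test(line)) {
      const rule = line.split(':')[1].trim();
      if (rule && path.startsWith(rule)) return false;
    }
  }
  return true;
}

const robots = 'User-agent: *\nDisallow: /private/';
console.log(isAllowed(robots, '/private/data')); // false
console.log(isAllowed(robots, '/products/1'));   // true
```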
Modern web scraping often requires more sophisticated infrastructure than a simple script can provide. When you're dealing with JavaScript-heavy sites, dynamic content, or need to scale beyond a few hundred requests, enterprise-grade crawling infrastructure becomes invaluable.
Have you built any web scrapers before? What challenges did you run into? The web scraping community is always learning from each other's experiences, so feel free to share your own crawling adventures.