Sometimes you need data that's sitting on a website but isn't available through an API. Maybe you're building a feature that requires real-time pricing information, or you need to aggregate data from multiple sources. That's where web scraping comes in handy.
I recently ran into this exact situation. I needed comprehensive data about programming languages for a project, but couldn't find a public API that provided it. So I built my own web scraper to extract this information from Wikipedia and turn it into a usable API. Here's how you can do the same.
Web scraping is essentially fetching a webpage's HTML content and pulling out the specific data you need. For this tutorial, we'll use these Node.js libraries:
- **Axios** handles the HTTP requests to fetch HTML content from URLs. It's straightforward and reliable for this purpose.
- **Cheerio** is the real workhorse here. It parses HTML and lets you select elements just like you would with jQuery, which makes extracting specific data much easier than parsing raw HTML strings.
- **Mongoose** stores our extracted data in MongoDB, so we can query it later through an API.
- **Express** creates the API endpoints that serve up our scraped data in JSON format.
If you're dealing with websites that have anti-scraping measures like CAPTCHAs or IP blocking, managing these challenges manually can be time-consuming. Professional web scraping APIs handle these obstacles automatically, letting you focus on extracting the data you need rather than fighting with access issues.
Before diving into the scraping logic, let's set up the development environment. You'll need Node.js version 20.6 or higher and either NPM or Yarn installed on your machine.
For the database, we'll use MongoDB running in a Docker container. If you have Docker installed, spin up a MongoDB instance with this command:
```shell
docker run -d --rm \
  -e MONGO_INITDB_ROOT_USERNAME=user \
  -e MONGO_INITDB_ROOT_PASSWORD=secret \
  -p 27018:27017 \
  --name mongodb \
  mongo:8.0
```
Your MongoDB connection string will be `mongodb://user:secret@localhost:27018/admin`.
Now clone the starter project and install dependencies:
```shell
git clone https://github.com/tericcabrel/node-ts-starter.git -b express-mongo node-web-scraping
cd node-web-scraping
cp .env.example .env
yarn install
yarn start
```
The application should now be running on port 4500. With the foundation in place, install the scraping libraries:
```shell
yarn add axios cheerio
yarn add -D @types/cheerio
```
The first step in any web scraping project is downloading the HTML content. With Axios, this is remarkably simple. We'll scrape Wikipedia's timeline of programming languages as our example.
Create a file called `src/scraper.ts` and add this code:
```typescript
import axios from 'axios';

const PAGE_URL = 'https://en.wikipedia.org/wiki/Timeline_of_programming_languages';

const scraper = async () => {
  const response = await axios.get(PAGE_URL);

  console.log(response.data);
};

(async () => {
  await scraper();
})();
```
Run it with `yarn ts-node src/scraper.ts` and you'll see a massive wall of HTML. This is where Cheerio becomes essential.
Before writing extraction code, you need to know exactly what data you're after. Looking at the Wikipedia page, we want to capture:
- The year category (like "1950s" or "2000s")
- The year of creation
- The programming language name
- The author or creator
- Predecessor languages that influenced it
Here's the TypeScript structure for this data:
```typescript
type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};
```
Notice that `year` is an array of numbers. That's because some languages were developed over several years, and others have uncertain creation dates marked with question marks on Wikipedia.
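To make the shape concrete, here's a hand-written record for a language developed over a range of years (the values are illustrative examples, not copied verbatim from the page; the type is re-declared so the snippet stands alone):

```typescript
type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

// Illustrative record: FLOW-MATIC was developed over several years,
// so `year` holds both the start and end of the range.
const sample: ProgrammingLanguage = {
  yearCategory: '1950s',
  year: [1955, 1959],
  yearConfirmed: true,
  name: 'FLOW-MATIC',
  author: 'Grace Hopper',
  predecessors: ['A-0'],
};

console.log(sample.year.length); // 2
```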
This is where web scraping gets tricky. You need to examine the page's HTML structure to find patterns you can rely on. Modern web scrapers that work with dynamic, JavaScript-heavy sites often need more sophisticated approaches. Specialized scraping tools can handle complex page structures and JavaScript rendering automatically, saving you from wrestling with browser automation libraries.
After inspecting Wikipedia's HTML, the pattern becomes clear: year categories are in `<h2>` tags, followed by `<table>` tags containing the actual language data. Each row in the table has columns for year, name, author, and predecessors.
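Simplified, the markup looks something like this (a sketch of the structure, not Wikipedia's exact HTML — note the `mw-heading2` wrapper div that the scraper selects on):

```html
<div class="mw-heading mw-heading2">
  <h2 id="1950s">1950s</h2>
</div>
<table class="wikitable">
  <tbody>
    <tr>
      <td>1957</td>
      <td>FORTRAN</td>
      <td>John Backus</td>
      <td>Speedcoding</td>
    </tr>
  </tbody>
</table>
```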
Here's the complete scraping logic:
```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

const PAGE_URL = 'https://en.wikipedia.org/wiki/Timeline_of_programming_languages';

const formatYear = (input: string) => {
  const array = input.split('–');

  if (array.length < 2) {
    // Single year; slicing the first four characters drops a trailing '?'
    return [+input.slice(0, 4)];
  }

  // Expand a shortened end year ('1955–56') using the century of the start year
  return [+array[0], +(array[1].length < 4 ? `${array[0].slice(0, 2)}${array[1]}` : array[1])];
};

const extractLanguagesData = (content: string) => {
  const $ = cheerio.load(content);
  const headers = $('body .mw-heading2');
  const languages: ProgrammingLanguage[] = [];

  for (let i = 0; i < headers.length; i++) {
    const header = headers.eq(i);
    const table = header.next('table');

    // Skip headings that aren't directly followed by a data table
    if (!table.is('table')) {
      continue;
    }

    const yearCategory = header.children('h2').first().text();
    const tableRows = table.children('tbody').children('tr');

    for (let j = 0; j < tableRows.length; j++) {
      const rowColumns = tableRows.eq(j).children('td');
      const name = rowColumns.eq(1).text().replace('\n', '');

      // Skip header rows and rows with no language name
      if (!name) {
        continue;
      }

      const language: ProgrammingLanguage = {
        author: rowColumns.eq(2).text().replace('\n', ''),
        name,
        predecessors: rowColumns.eq(3).text().split(',').map((value) => value.trim()),
        year: formatYear(rowColumns.eq(0).text()),
        yearConfirmed: !rowColumns.eq(0).text().endsWith('?'),
        yearCategory,
      };

      languages.push(language);
    }
  }

  return languages;
};
```
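A quick way to sanity-check the year parsing is to run formatYear on a few representative inputs. Here's a standalone copy of the helper so you can try it in isolation (note the en dash, which is the character Wikipedia uses in ranges):

```typescript
// Standalone copy of the year parser for quick testing.
const formatYear = (input: string): number[] => {
  const parts = input.split('–'); // en dash, as used on the Wikipedia page

  if (parts.length < 2) {
    return [+input.slice(0, 4)]; // single year, possibly suffixed with '?'
  }

  // Expand a shortened end year ('1955–56' → 1956) using the century of the start year.
  const end = parts[1].length < 4 ? `${parts[0].slice(0, 2)}${parts[1]}` : parts[1];

  return [+parts[0], +end];
};

console.log(formatYear('1957'));    // [1957]
console.log(formatYear('1955–56')); // [1955, 1956]
console.log(formatYear('1943–45')); // [1943, 1945]
```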
Raw scraped data isn't much use if you can't query it later. Let's create a Mongoose model to store our programming languages. Create `src/models/language.ts`:
```typescript
import mongoose, { Model, Schema, Document } from 'mongoose';

type LanguageDocument = Document & {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

const languageSchema = new Schema(
  {
    name: { type: Schema.Types.String, required: true, index: true },
    yearCategory: { type: Schema.Types.String, required: true, index: true },
    year: { type: [Schema.Types.Number], required: true },
    yearConfirmed: { type: Schema.Types.Boolean, required: true },
    author: { type: Schema.Types.String },
    predecessors: { type: [Schema.Types.String], required: true },
  },
  {
    collection: 'languages',
    timestamps: true,
  },
);

const Language: Model<LanguageDocument> = mongoose.model<LanguageDocument>('Language', languageSchema);

export { Language, LanguageDocument };
```
Update the scraper function to save data:
```typescript
import { Language } from './models/language';

const scraper = async () => {
  const response = await axios.get(PAGE_URL);
  const languages = extractLanguagesData(response.data);

  // connectToDatabase is the MongoDB connection helper that ships with the starter project
  await connectToDatabase();

  const insertPromises = languages.map(async (language) => {
    // Only insert languages we haven't stored yet, so re-running the script stays idempotent
    const isPresent = await Language.exists({ name: language.name });

    if (!isPresent) {
      await Language.create(language);
    }
  });

  await Promise.all(insertPromises);

  console.log('Data inserted successfully!');
};
```
Run it with `node -r ts-node/register --watch --env-file=.env ./src/scraper.ts` and watch your database fill up with programming language data.
The final piece is serving this data through an API. Add this route to `src/index.ts`:
```typescript
import { Language } from './models/language';

app.get('/languages', async (req, res) => {
  const languages = await Language.find().sort({ name: 1 }).exec();

  return res.json({ data: languages });
});
```
Start your application with `yarn start` and navigate to http://localhost:4500/languages. You should see a JSON response containing all your scraped programming languages, sorted alphabetically.
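The response follows the `{ data: [...] }` shape from the route above; each entry carries the schema fields plus the `_id` MongoDB assigns and the timestamps Mongoose adds. The values below are illustrative:

```json
{
  "data": [
    {
      "_id": "65f0c0ffee1234567890abcd",
      "name": "Ada",
      "yearCategory": "1980s",
      "year": [1980],
      "yearConfirmed": true,
      "author": "Jean Ichbiah",
      "predecessors": ["Pascal"],
      "createdAt": "2024-01-01T00:00:00.000Z",
      "updatedAt": "2024-01-01T00:00:00.000Z"
    }
  ]
}
```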
Web scraping isn't without its challenges. The biggest problem is that your code is tightly coupled to the HTML structure of the page. If Wikipedia changes how they organize their tables tomorrow, your scraper breaks.
Always check whether a public API already exists before building a scraper. Many companies provide APIs specifically to avoid people scraping their sites.
Some websites actively prevent scraping with measures like rate limiting, CAPTCHAs, and IP blocking. If you encounter these obstacles frequently, consider using a dedicated web scraping service that handles these challenges automatically.
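Even without a dedicated service, you can reduce your chances of being rate-limited by pacing your own requests. Here's a minimal throttling sketch; `fetchOne` is a placeholder for whatever request function you use (such as a wrapper around axios.get):

```typescript
// Wait for the given number of milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Fetch a list of URLs sequentially, pausing between requests to stay polite.
const fetchPolitely = async (
  urls: string[],
  fetchOne: (url: string) => Promise<string>,
  delayMs = 1000,
): Promise<string[]> => {
  const pages: string[] = [];

  for (const url of urls) {
    pages.push(await fetchOne(url));
    await sleep(delayMs); // throttle: at most one request per delayMs
  }

  return pages;
};
```

Sequential fetching is slower than firing all requests at once, but it keeps your traffic well under most rate limits.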
Finally, respect websites' terms of service. Some explicitly forbid scraping, and you should honor those rules.
Building a web scraper involves four key steps: downloading HTML content with a library like Axios, identifying the HTML selectors that contain your target data, extracting that data with a parsing tool like Cheerio, and storing it in a database for later use.
This tutorial covered static websites, which are relatively straightforward to scrape. Dynamic websites that load content with JavaScript require more advanced tools like Puppeteer, or alternatively, using a web scraping API that handles these complexities for you.
The approach we've covered here works well for personal projects and learning, but as your scraping needs grow more complex, you'll want to explore more robust solutions that can handle JavaScript rendering, rotating proxies, and anti-bot measures at scale.