If you've ever tried scraping a website that loads content through JavaScript, you know the frustration of getting empty HTML instead of actual data. That's where Playwright comes in. This Node.js library lets you control real browsers programmatically, so you can scrape even the trickiest dynamic websites.
In this guide, we'll walk through everything from basic setup to handling authentication walls. No fluff, just practical examples you can use right away.
Playwright is an open-source browser automation library for Node.js. Think of it as a way to tell a browser exactly what to do: open pages, click buttons, fill forms, and grab data. It drives all three major browser engines (Chromium, Firefox, and WebKit), and it's surprisingly lightweight for what it does.
The real magic happens when you need to scrape websites that don't just hand over their content in plain HTML. Many modern sites build their pages with JavaScript, meaning the data you want only appears after the browser executes scripts.
Unlike traditional HTTP libraries, Playwright actually renders pages like a human visitor would see them. This makes it harder for websites to tell you're scraping, and it ensures you're getting the complete, fully-loaded content.
Before diving into code, let's clear up one term you'll see everywhere: headless browser.
A headless browser runs without any visible window. It processes web pages in the background, executing JavaScript and rendering content just like a regular browser, but you never see it happening. For web scraping, this means:
- Faster performance, since there's no GUI to render
- Lower resource usage on your server
- The ability to run multiple browsers simultaneously without cluttering your screen
You can always disable headless mode during development to watch what's happening. Just set headless: false and the browser window will pop up so you can see each step.
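As a quick sketch of that workflow (the URL and the `DEBUG`/`RUN_SCRAPER` environment variables are my own conventions, not anything Playwright requires), you can keep the launch options in one place and flip them with an environment variable:

```javascript
// Headless vs. headed is the switch you'll toggle most during development.
const DEBUG = process.env.DEBUG === '1';

const launchOptions = {
  headless: !DEBUG,        // run invisibly unless DEBUG=1
  slowMo: DEBUG ? 250 : 0, // pause (ms) between actions so you can watch them
};

async function main() {
  const { chromium } = require('playwright');
  const browser = await chromium.launch(launchOptions);
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
}

// Guarded so the file can be loaded without launching a browser:
if (process.env.RUN_SCRAPER) main();
```

Run it with `RUN_SCRAPER=1 DEBUG=1 node script.js` to watch the browser work, or drop `DEBUG` for a silent headless run.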
Let's get your machine ready for web scraping with Playwright. The process is straightforward if you follow these steps in order.
First, make sure Node.js and NPM are installed on your system. NPM manages packages, while Node runs your JavaScript code outside the browser.
Open your terminal and set up a new project folder:
```bash
$ mkdir play
$ cd play
$ npm init -y
```
The `-y` flag skips the setup questions and creates a basic configuration file automatically.
Now install Playwright for browser automation and Cheerio for parsing HTML:
```bash
$ npm i playwright cheerio --save
```
Cheerio works like jQuery for Node.js, letting you navigate and extract data from HTML with simple, familiar syntax.
Playwright needs to download browser executables the first time:
```bash
$ npx playwright install
```
This command grabs the latest versions of Chromium, Firefox, and WebKit.
Let's write a simple script that opens a web page and closes it. Create a file called 1_open_webpage.js:
```javascript
const playwright = require('playwright');

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://www.myntra.com/trousers/the+indian+garage+co/the-indian-garage-co-men-black-slim-fit-solid-joggers/9922235/buy');
  await browser.close();
}

test();
```
Here's what's happening:
- `chromium.launch()` starts a Chromium browser instance
- `newPage()` opens a new tab
- `goto()` navigates to the target URL
- `close()` shuts everything down and frees up resources
Run it with:
```bash
$ node 1_open_webpage.js
```
You should see a browser window flash open and close. That's Playwright in action.
Now for the interesting part: actually extracting data. We'll scrape a product's name and price from an e-commerce site.
First, inspect the page to find where the data lives. In this example, the product name sits in an h1 tag with class pdp-title, and the price is in a span with class pdp-price.
Create 2_parsing_html.js:
```javascript
const playwright = require('playwright');
const cheerio = require('cheerio');

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://www.myntra.com/trousers/the+indian+garage+co/the-indian-garage-co-men-black-slim-fit-solid-joggers/9922235/buy');

  const html = await page.content();
  const $ = cheerio.load(html);

  const obj = {};
  obj.name = $('h1.pdp-title').text();
  obj.price = $('span.pdp-price').text();

  const arr = [obj];
  console.log(arr);

  await browser.close();
}

test();
```
The `content()` method grabs the full HTML after JavaScript has executed. Then Cheerio's `load()` function creates a jQuery-like interface for navigating the DOM. The `text()` method extracts just the text content, stripping away HTML tags.
Run this script:
```bash
$ node 2_parsing_html.js
```
You should see output like:
```
[ { name: 'The Indian Garage Co', price: '₹692' } ]
```
And just like that, you've scraped structured data from a dynamic website.
Screenshots are invaluable when troubleshooting scraping scripts. They show you exactly what the browser sees at any moment.
Create 3_screenshot.js:
```javascript
const playwright = require('playwright');

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://www.myntra.com/trousers/the+indian+garage+co/the-indian-garage-co-men-black-slim-fit-solid-joggers/9922235/buy');
  await page.screenshot({ path: 'myntra.png' });
  await browser.close();
}

test();
```
The `screenshot()` function captures the visible viewport and saves it as a PNG file. To capture the entire page, including content below the fold, add the `fullPage` option:
```javascript
await page.screenshot({ path: 'fp_myntra.png', fullPage: true });
```
This is especially useful when debugging why certain elements aren't being scraped correctly.
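Two more screenshot variants come in handy while debugging. The selector and the crop coordinates below are illustrative, not specific to the Myntra page:

```javascript
// Sketch: pass in an existing `page` from any of the scripts above.
async function captureVariants(page) {
  // Screenshot a single element via an element handle:
  const title = await page.$('h1.pdp-title');
  if (title) await title.screenshot({ path: 'title.png' });

  // Screenshot a fixed rectangle of the page
  // (x/y measured in pixels from the top-left corner):
  await page.screenshot({
    path: 'cropped.png',
    clip: { x: 0, y: 0, width: 800, height: 400 },
  });
}
```

Element screenshots are particularly useful for confirming that a selector matches what you think it does.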
Many valuable data sources sit behind login walls. Playwright makes authentication straightforward with its `fill()` and `click()` methods.
Here's how to log into GitHub and scrape protected content:
```javascript
const playwright = require('playwright');

async function test() {
  const browser = await playwright.chromium.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://github.com/login');

  await page.fill('input[name="login"]', 'your-user-name');
  await page.fill('input[name="password"]', 'your-password');

  // Start waiting for the navigation before clicking, so the redirect
  // can't complete before the wait begins.
  await Promise.all([
    page.waitForNavigation(),
    page.click('input[type="submit"]'),
  ]);

  await page.screenshot({ path: 'logged_git.png' });
  await browser.close();
}

test();
```
The `fill()` method enters text into form fields, while `click()` simulates button presses. The `waitForNavigation()` call pauses execution until the login completes and the page redirects. (Newer Playwright releases deprecate `waitForNavigation()` in favor of `page.waitForURL()`, but it still works for this flow.)
Once authenticated, you can navigate to any protected page and scrape away. Just remember to respect rate limits and terms of service.
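Rather than logging in on every run, you can save the authenticated session and reuse it. A sketch using Playwright's storage state (the `state.json` file name is an arbitrary choice):

```javascript
// Log in once, then persist cookies + localStorage to disk.
async function saveSession() {
  const { chromium } = require('playwright');
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://github.com/login');
  // ...perform the fill()/click() login steps shown earlier...
  await context.storageState({ path: 'state.json' });
  await browser.close();
}

// Later runs: a new context created from the file starts out logged in.
async function reuseSession() {
  const { chromium } = require('playwright');
  const browser = await chromium.launch();
  const context = await browser.newContext({ storageState: 'state.json' });
  const page = await context.newPage();
  await page.goto('https://github.com/settings/profile');
  await browser.close();
}
```

This also reduces the number of logins the site sees from your scraper, which is both politer and less likely to trip security checks.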
Puppeteer has been around longer and has a larger community, but Playwright is catching up fast. Puppeteer is maintained by Google's Chrome team, while Playwright is developed at Microsoft, largely by engineers who originally built Puppeteer, and both offer excellent documentation.
Key differences:
- Browser support: Playwright supports Firefox and WebKit out of the box, while Puppeteer focuses mainly on Chromium
- API design: Playwright has some quality-of-life improvements over Puppeteer's API
- Community size: Puppeteer currently has more Stack Overflow answers and tutorials, but Playwright's community is growing rapidly
- Performance: Both are similarly fast for most use cases
For new projects, Playwright often makes more sense due to its broader browser support and more modern API. But if you're already familiar with Puppeteer, there's no urgent need to switch.
You now have the foundation to scrape practically any website with Playwright. The techniques covered here—browser automation, data extraction, screenshot debugging, and authentication—form the core of most scraping projects.
As you build more complex scrapers, you'll encounter challenges like handling pagination, dealing with rate limits, and managing cookies. Each of these has solutions within Playwright's API.
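As one example, pagination plus basic rate limiting can be sketched in a few lines. The URL pattern and the `.product-title` selector below are hypothetical placeholders for whatever site you're targeting:

```javascript
// Polite delay between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Build the URL for page n of a listing (hypothetical pattern).
const listingUrl = (n) => `https://example.com/products?page=${n}`;

async function crawl(maxPages = 3, delayMs = 2000) {
  const { chromium } = require('playwright');
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const results = [];

  for (let n = 1; n <= maxPages; n++) {
    await page.goto(listingUrl(n));
    // Collect the text of every matching element on this page:
    const titles = await page.$$eval('.product-title', (els) =>
      els.map((e) => e.textContent.trim())
    );
    results.push(...titles);
    await sleep(delayMs); // rate limiting between pages
  }

  await browser.close();
  return results;
}
```

The same loop structure extends naturally to "next" buttons: replace the `goto()` call with a `page.click()` on the pagination control.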
The key is to start simple and add complexity only when needed. Scrape a single page successfully before attempting to crawl an entire site. Debug with screenshots before optimizing for headless performance. Master the basics, and the advanced stuff becomes much more manageable.
Now go build something interesting with your new web scraping skills.