If you've ever needed to gather academic research data at scale, you know Google Scholar is a goldmine. But copying article titles, citation counts, and author profiles one by one? That's a recipe for carpal tunnel syndrome. In this guide, I'll walk you through scraping Google Scholar search results using Node.js, Unirest, and Cheerio—no PhD required.
Academic researchers, data analysts, and developers often need to extract information from Google Scholar for literature reviews, citation analysis, or building research databases. While Google Scholar doesn't offer an official API, web scraping provides a practical solution for gathering this publicly available data.
When dealing with large-scale data extraction from academic sources, using a reliable web scraping infrastructure becomes essential. 👉 Get started with scalable Google Scholar scraping using Scrapingdog's managed service, which handles anti-bot measures and provides clean, structured data through a simple API.
Before diving into the code, you'll need a few tools in your arsenal. Think of this as assembling your scraping toolkit.
We'll use two JavaScript libraries for this project:
Unirest JS - Handles HTTP requests to fetch HTML from Google Scholar
Cheerio JS - Parses the HTML and lets us extract specific data using CSS selectors
Install them by running these commands in your project terminal:
```shell
npm i unirest
npm i cheerio
```
Here's the thing about web scraping: finding the right HTML elements is like finding a needle in a haystack. CSS selectors make this process dramatically easier. I recommend using the CSS Selector Gadget Chrome extension—it visually highlights elements and generates the perfect selector for your needs.
Google Scholar can detect automated scraping bots and may block your requests. The solution? Use a User-Agent header that makes your scraper appear as a regular browser. This simple technique helps you maintain access to the data you need. You can rotate different User-Agent strings to further reduce detection risk.
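A simple way to rotate User-Agents is to keep a small pool and pick one at random per request. The strings below are examples of common desktop browsers; swap in whatever set you like:

```javascript
// Example pool of desktop User-Agent strings (not exhaustive)
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
];

// Pick a fresh User-Agent for each outgoing request
const randomUserAgent = () =>
  USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

console.log(randomUserAgent());
```

You'd then pass `randomUserAgent()` as the `"User-Agent"` header value instead of a hard-coded string.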
Let's start with the main search results page. When you search for something like "IIT MUMBAI" on Google Scholar, you get a list of academic articles. We'll extract the title, link, snippet, citation count, and version information for each result.
Here's the complete code that does the job:
```javascript
const cheerio = require("cheerio");
const unirest = require("unirest");

const getScholarData = async () => {
  try {
    const url = "https://scholar.google.com/scholar?q=IIT+MUMBAI&hl=en";
    // await the request so the catch block actually handles failures
    const response = await unirest.get(url).headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    });
    const $ = cheerio.load(response.body);
    const scholar_results = [];
    $(".gs_ri").each((i, el) => {
      scholar_results.push({
        title: $(el).find(".gs_rt").text(),
        title_link: $(el).find(".gs_rt a").attr("href"),
        id: $(el).find(".gs_rt a").attr("id"),
        displayed_link: $(el).find(".gs_a").text(),
        snippet: $(el).find(".gs_rs").text().replace(/\n/g, ""),
        cited_by_count: $(el).find(".gs_nph+ a").text(),
        cited_link:
          "https://scholar.google.com" + $(el).find(".gs_nph+ a").attr("href"),
        versions_count: $(el).find("a~ a+ .gs_nph").text(),
        versions_link: $(el).find("a~ a+ .gs_nph").text()
          ? "https://scholar.google.com" +
            $(el).find("a~ a+ .gs_nph").attr("href")
          : "",
      });
    });
    // Drop empty or undefined fields from each result
    for (let i = 0; i < scholar_results.length; i++) {
      Object.keys(scholar_results[i]).forEach((key) =>
        scholar_results[i][key] === "" || scholar_results[i][key] === undefined
          ? delete scholar_results[i][key]
          : {}
      );
    }
    console.log(scholar_results);
  } catch (e) {
    console.log(e);
  }
};

getScholarData();
```
This code loops through each search result element and extracts all the relevant fields. The cleanup loop at the end removes any empty or undefined fields to keep your data clean.
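If you end up reusing that cleanup in several scrapers, it can be pulled into a small helper. This is just a sketch; `stripEmptyFields` is a name I've made up, not part of any library:

```javascript
// Remove keys whose values are empty strings or undefined
const stripEmptyFields = (obj) => {
  Object.keys(obj).forEach((key) => {
    if (obj[key] === "" || obj[key] === undefined) delete obj[key];
  });
  return obj;
};

const cleaned = stripEmptyFields({
  title: "Some paper",
  cited_by_count: "",
  versions_link: undefined,
});

console.log(cleaned); // { title: 'Some paper' }
```

You could then replace the `for` loop in each scraper with `scholar_results.forEach(stripEmptyFields)`.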
Scholar profiles contain information about researchers themselves—their names, affiliations, research interests, and citation metrics. This data is particularly valuable for understanding academic networks and research trends.
For production-scale profile extraction across multiple researchers, 👉 leverage Scrapingdog's dedicated Google Scholar scraping endpoints that handle pagination, rate limiting, and data formatting automatically.
Here's how to scrape profile listings:
```javascript
const unirest = require("unirest");
const cheerio = require("cheerio");

const getScholarProfiles = async () => {
  try {
    const url =
      "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=IIT+MUMBAI";
    const response = await unirest.get(url).headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    });
    const $ = cheerio.load(response.body);
    const scholar_profiles = [];
    $(".gsc_1usr").each((i, el) => {
      scholar_profiles.push({
        name: $(el).find(".gs_ai_name").text(),
        name_link:
          "https://scholar.google.com" +
          $(el).find(".gs_ai_name a").attr("href"),
        position: $(el).find(".gs_ai_aff").text(),
        email: $(el).find(".gs_ai_eml").text(),
        departments: $(el).find(".gs_ai_int").text(),
        // "Cited by 12345" -> take the third token, the number
        cited_by_count: $(el).find(".gs_ai_cby").text().split(" ")[2],
      });
    });
    // Drop empty or undefined fields from each profile
    for (let i = 0; i < scholar_profiles.length; i++) {
      Object.keys(scholar_profiles[i]).forEach((key) =>
        scholar_profiles[i][key] === "" ||
        scholar_profiles[i][key] === undefined
          ? delete scholar_profiles[i][key]
          : {}
      );
    }
    console.log(scholar_profiles);
  } catch (e) {
    console.log(e);
  }
};

getScholarProfiles();
```
When you click the "Cite" button on a Google Scholar article, you get formatted citations in various styles like MLA, APA, and Chicago. Here's how to extract those:
```javascript
const cheerio = require("cheerio");
const unirest = require("unirest");

const getData = async () => {
  try {
    const url =
      "https://scholar.google.com/scholar?q=info:TPhPjzP8H_MJ:scholar.google.com&output=cite";
    // Send a browser-like User-Agent here too, or Scholar may block the request
    const response = await unirest.get(url).headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    });
    const $ = cheerio.load(response.body);

    // Formatted citations (MLA, APA, Chicago, ...)
    const cite_results = [];
    $("#gs_citt tr").each((i, el) => {
      cite_results.push({
        title: $(el).find(".gs_cith").text(),
        snippet: $(el).find(".gs_citr").text(),
      });
    });

    // Export links (BibTeX, EndNote, RefMan, RefWorks)
    const links = [];
    $("#gs_citi .gs_citi").each((i, el) => {
      links.push({
        name: $(el).text(),
        link: $(el).attr("href"),
      });
    });

    console.log(cite_results);
    console.log(links);
  } catch (e) {
    console.log(e);
  }
};

getData();
```
Notice how the URL includes that article ID we extracted earlier? Each Scholar article has a unique identifier that you can use to access its citation formats.
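Since the cite URL is just a template around that ID, you can generate it programmatically. A small sketch (the helper name `citeUrl` is my own):

```javascript
// Build the "output=cite" URL from an article id scraped earlier
const citeUrl = (articleId) =>
  `https://scholar.google.com/scholar?q=info:${articleId}:scholar.google.com&output=cite`;

console.log(citeUrl("TPhPjzP8H_MJ"));
// https://scholar.google.com/scholar?q=info:TPhPjzP8H_MJ:scholar.google.com&output=cite
```

Feed each `id` collected by the search-results scraper into this helper to fetch citation formats for every result.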
Individual author profiles are treasure troves of information—they show publication history, citation metrics over time, h-index, and i10-index statistics. Here's the complete code to scrape an entire author profile:
```javascript
const cheerio = require("cheerio");
const unirest = require("unirest");

const getAuthorProfileData = async () => {
  try {
    const url = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ";
    const response = await unirest.get(url).headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    });
    const $ = cheerio.load(response.body);

    // Basic author details
    const author_results = {
      name: $("#gsc_prf_in").text(),
      position: $("#gsc_prf_inw+ .gsc_prf_il").text(),
      email: $("#gsc_prf_ivh").text(),
      departments: $("#gsc_prf_int").text(),
    };

    // Publication list
    const articles = [];
    $("#gsc_a_b .gsc_a_t").each((i, el) => {
      articles.push({
        title: $(el).find(".gsc_a_at").text(),
        link:
          "https://scholar.google.com" + $(el).find(".gsc_a_at").attr("href"),
        authors: $(el).find(".gsc_a_at+ .gs_gray").text(),
        publication: $(el).find(".gs_gray+ .gs_gray").text(),
      });
    });
    // Drop empty or undefined fields from each article
    for (let i = 0; i < articles.length; i++) {
      Object.keys(articles[i]).forEach((key) =>
        articles[i][key] === "" || articles[i][key] === undefined
          ? delete articles[i][key]
          : {}
      );
    }

    // Citation metrics table: one row each for citations, h-index, i10-index
    const cited_by = { table: [] };
    cited_by.table[0] = {
      citations: {
        all: $("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text(),
        since_2017: $("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text(),
      },
    };
    cited_by.table[1] = {
      h_index: {
        all: $("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text(),
        since_2017: $("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text(),
      },
    };
    cited_by.table[2] = {
      i10_index: {
        all: $("tr:nth-child(3) .gsc_rsb_sc1+ .gsc_rsb_std").text(),
        since_2017: $("tr:nth-child(3) .gsc_rsb_std+ .gsc_rsb_std").text(),
      },
    };

    console.log(author_results);
    console.log(articles);
    console.log(cited_by.table);
  } catch (e) {
    console.log(e);
  }
};

getAuthorProfileData();
```
This scraper pulls basic author information, the publication list, and the citation metrics table: total and since-2017 counts for citations, h-index, and i10-index.
When scraping Google Scholar (or any website), you'll want to avoid getting blocked. Here are some practical tips: rotate your User-Agent strings regularly, add delays between requests, use residential proxies when scraping at scale, and always respect robots.txt guidelines. Remember that excessive scraping can strain servers and potentially violate terms of service.
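Adding a delay between requests is a one-liner with a promise-based sleep. A minimal sketch (the `crawlPolitely` helper and its URL list are illustrative, not from any library):

```javascript
// Promise-based delay, usable with await between requests
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const urls = [
  "https://scholar.google.com/scholar?q=IIT+MUMBAI&hl=en",
  "https://scholar.google.com/scholar?q=IIT+MUMBAI&hl=en&start=10",
];

const crawlPolitely = async () => {
  for (const url of urls) {
    // unirest.get(url) with a rotated User-Agent would go here
    console.log("fetching", url);
    await sleep(2000 + Math.random() * 1000); // 2-3 s of jitter between pages
  }
};
```

Calling `crawlPolitely()` fetches the pages one at a time with a randomized pause, which keeps your request rate far gentler than a tight loop.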
You've just learned how to extract search results, profile listings, citation formats, and complete author profiles from Google Scholar using Node.js. This foundational knowledge lets you build custom research tools, automate literature reviews, or analyze academic trends across disciplines.
The techniques shown here work for educational and research purposes, but always consider the ethical and legal implications of web scraping. Use the data responsibly and avoid overwhelming the target servers with requests.