Scraping Reddit data doesn't have to be complicated. This guide walks you through collecting Reddit posts and comments using Python, handling anti-scraping measures, and exporting everything into clean JSON files—all without getting blocked.
By the end of this guide, you'll know how to extract posts, comments, upvotes, and author information from any subreddit. You'll also learn how to bypass Reddit's anti-bot systems and convert messy HTML into structured data. If you're doing market research, tracking sentiment, or feeding data into machine learning models, this is the foundation you need.
Before we dive in, let me show you the complete code. Then we'll break down exactly how it works.
Here's the full scraper in Python:
```python
import json

import requests
from bs4 import BeautifulSoup

scraper_api_key = 'YOUR_API_KEY'


def fetch_comments_from_post(post_data):
    # Build the comment-page URL from the post's own permalink
    payload = {
        'api_key': scraper_api_key,
        'url': f"https://www.reddit.com{post_data['permalink']}"
    }
    r = requests.get('https://api.scraperapi.com/', params=payload)
    soup = BeautifulSoup(r.content, 'html.parser')
    comment_elements = soup.find_all('div', class_='thing', attrs={'data-type': 'comment'})
    parsed_comments = []
    for comment_element in comment_elements:
        try:
            author = comment_element.find('a', class_='author').text.strip() if comment_element.find('a', class_='author') else None
            dislikes = comment_element.find('span', class_='score dislikes').text.strip() if comment_element.find('span', class_='score dislikes') else None
            unvoted = comment_element.find('span', class_='score unvoted').text.strip() if comment_element.find('span', class_='score unvoted') else None
            likes = comment_element.find('span', class_='score likes').text.strip() if comment_element.find('span', class_='score likes') else None
            timestamp = comment_element.find('time')['datetime'] if comment_element.find('time') else None
            text = comment_element.find('div', class_='md').find('p').text.strip() if comment_element.find('div', class_='md') else None
            if not text:
                continue
            parsed_comments.append({
                'author': author,
                'dislikes': dislikes,
                'unvoted': unvoted,
                'likes': likes,
                'timestamp': timestamp,
                'text': text
            })
        except Exception as e:
            print(f"Error parsing comment: {e}")
    return parsed_comments


reddit_query = 'https://www.reddit.com/t/valheim/'
scraper_api_url = f'http://api.scraperapi.com/?api_key={scraper_api_key}&url={reddit_query}'
r = requests.get(scraper_api_url)
soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('article', class_='m-0')

parsed_posts = []
for article in articles:
    post = article.find('shreddit-post')
    if post is None:
        continue  # skip articles that don't wrap a shreddit-post element
    post_title = post['post-title']
    post_permalink = post['permalink']
    content_href = post['content-href']
    comment_count = post['comment-count']
    score = post['score']
    author_id = post.get('author-id', 'N/A')
    author_name = post['author']
    subreddit_id = post['subreddit-id']
    post_id = post['id']
    subreddit_name = post['subreddit-prefixed-name']
    comments = fetch_comments_from_post(post)
    parsed_posts.append({
        'post_title': post_title,
        'post_permalink': post_permalink,
        'content_href': content_href,
        'comment_count': comment_count,
        'score': score,
        'author_id': author_id,
        'author_name': author_name,
        'subreddit_id': subreddit_id,
        'post_id': post_id,
        'subreddit_name': subreddit_name,
        'comments': comments
    })

output_file_path = 'parsed_posts.json'
with open(output_file_path, 'w', encoding='utf-8') as json_file:
    json.dump(parsed_posts, json_file, ensure_ascii=False, indent=2)

print(f"Data has been saved to {output_file_path}")
```
Run this code, and you'll get a parsed_posts.json file with all the posts and comments neatly organized.
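Here's a trimmed sample of what that file looks like. The field values below are made up for illustration, and the comment entries are abridged:

```json
[
  {
    "post_title": "Example post title",
    "post_permalink": "/r/valheim/comments/abc123/example_post_title/",
    "content_href": "https://www.reddit.com/r/valheim/comments/abc123/example_post_title/",
    "comment_count": "42",
    "score": "128",
    "author_name": "example_user",
    "subreddit_name": "r/valheim",
    "comments": [
      { "author": "another_user", "text": "An example comment." }
    ]
  }
]
```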
Now let's walk through how this actually works.
You'll need Python 3.8 or newer for this project. You'll also need two libraries: Requests for fetching web pages and BeautifulSoup4 for parsing HTML.
Install them with:
```bash
pip install requests beautifulsoup4
```
Create a project folder and a Python file:
```bash
mkdir reddit_scraper
cd reddit_scraper
touch app.py
```

(On Windows, replace `touch app.py` with `echo. > app.py`.)
That's it for setup. Now let's talk about what we're actually scraping.
Before writing any code, think about what you actually need. Are you tracking sentiment around a brand? Analyzing user behavior in a niche community? Looking for trending topics?
For this tutorial, we're targeting the Valheim subreddit and extracting posts along with their comments. You can swap this out for any public subreddit you want.
Here's the thing: Reddit doesn't like scrapers. Try scraping at any meaningful scale without protection, and you'll get blocked fast. That's where ScraperAPI comes in. It routes your requests through rotating proxies and handles all the anti-bot measures automatically.
If you're serious about collecting Reddit data reliably, especially at scale, 👉 ScraperAPI handles all the heavy lifting so you can focus on analyzing the data instead of fighting captchas. You get 5,000 free API credits to start, which is plenty for testing.
Grab your API key from the dashboard, and let's move on.
Start by importing the tools you need:
```python
import json

import requests
from bs4 import BeautifulSoup
```
Then create a variable to store your API key:
```python
scraper_api_key = 'ENTER_KEY_HERE'
```
Let's send our first request. We're targeting a specific subreddit, then routing the request through ScraperAPI:
```python
reddit_query = 'https://www.reddit.com/t/valheim/'
scraper_api_url = f'http://api.scraperapi.com/?api_key={scraper_api_key}&url={reddit_query}'
r = requests.get(scraper_api_url)
soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('article', class_='m-0')
```
This pulls down the HTML and finds all `article` elements with the CSS class `m-0`. Each one represents a post on the page.
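One thing this snippet takes for granted is that every request succeeds. In practice, a request can time out or come back with an empty body even through a proxy, so it's worth wrapping the call in a small retry helper. Here's a minimal sketch — the helper name and backoff schedule are my own, not part of ScraperAPI:

```python
import time


def fetch_with_retries(do_request, max_retries=3, base_delay=1.0):
    """Call do_request() until it returns a response, retrying on errors.

    Waits base_delay, then 2x, then 4x between attempts (exponential backoff).
    Returns None if every attempt fails.
    """
    for attempt in range(max_retries):
        try:
            response = do_request()
            if response is not None:
                return response
        except Exception as e:
            print(f"Request failed: {e}")
        if attempt < max_retries - 1:
            time.sleep(base_delay * (2 ** attempt))
    return None
```

You'd then replace the direct call with something like `r = fetch_with_retries(lambda: requests.get(scraper_api_url))`.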
Now we loop through each article and pull out the details:
```python
parsed_posts = []
for article in articles:
    post = article.find('shreddit-post')
    if post is None:
        continue  # skip articles that don't wrap a shreddit-post element
    post_title = post['post-title']
    post_permalink = post['permalink']
    content_href = post['content-href']
    comment_count = post['comment-count']
    score = post['score']
    author_id = post.get('author-id', 'N/A')
    author_name = post['author']
    subreddit_id = post['subreddit-id']
    post_id = post['id']
    subreddit_name = post['subreddit-prefixed-name']
    parsed_posts.append({
        'post_title': post_title,
        'post_permalink': post_permalink,
        'content_href': content_href,
        'comment_count': comment_count,
        'score': score,
        'author_id': author_id,
        'author_name': author_name,
        'subreddit_id': subreddit_id,
        'post_id': post_id,
        'subreddit_name': subreddit_name
    })
```
We're grabbing the title, link, score, author, and other metadata. Everything gets stored in a list for later.
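One caveat: HTML attributes always come through as strings, so `score` and `comment_count` here are values like `"128"`, not numbers. If you plan to sort or aggregate posts later, coerce them first. A small helper handles missing or malformed values — the function name and fallback behavior are my own choice:

```python
def to_int(value, default=0):
    """Coerce an HTML attribute string like '128' to an int.

    Falls back to a default for missing or malformed values.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


# Illustrative values, not real scraped data:
print(to_int('128'))   # → 128
print(to_int('N/A'))   # → 0
print(to_int(None))    # → 0
```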
Comments require their own request. We create a function that fetches a post's comment page and parses the comments:
```python
def fetch_comments_from_post(post_data):
    # Build the comment-page URL from the post's own permalink
    payload = {
        'api_key': scraper_api_key,
        'url': f"https://www.reddit.com{post_data['permalink']}"
    }
    r = requests.get('https://api.scraperapi.com/', params=payload)
    soup = BeautifulSoup(r.content, 'html.parser')
    comment_elements = soup.find_all('div', class_='thing', attrs={'data-type': 'comment'})
    parsed_comments = []
    for comment_element in comment_elements:
        try:
            author = comment_element.find('a', class_='author').text.strip() if comment_element.find('a', class_='author') else None
            text = comment_element.find('div', class_='md').find('p').text.strip() if comment_element.find('div', class_='md') else None
            if not text:
                continue
            parsed_comments.append({
                'author': author,
                'text': text
            })
        except Exception as e:
            print(f"Error parsing comment: {e}")
    return parsed_comments
```
This loops through all comment elements, extracts the author and text, and skips any empty comments. Error handling prevents the script from crashing if the HTML structure changes slightly.
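As a side note, each of those `find(...).text.strip() if find(...) else None` lines runs the same selector twice. If you find that repetitive, a small helper does one lookup per field. This is a refactoring sketch — the helper is my own, not part of BeautifulSoup — shown with a stand-in element so it runs on its own:

```python
def safe_text(element, tag, css_class=None):
    """Return the stripped text of the first matching child, or None.

    Works with any object exposing BeautifulSoup's find(tag, class_=...) API.
    """
    if css_class is not None:
        found = element.find(tag, class_=css_class)
    else:
        found = element.find(tag)
    return found.text.strip() if found else None


# Stand-in for a BeautifulSoup Tag, just to demonstrate the helper:
class FakeElement:
    def __init__(self, text='', child=None):
        self.text = text
        self._child = child

    def find(self, tag, class_=None):
        return self._child


comment = FakeElement(child=FakeElement(text='  Nice build!  '))
print(safe_text(comment, 'a', 'author'))  # → Nice build!

# In the real loop this would read:
#   author = safe_text(comment_element, 'a', 'author')
#   likes = safe_text(comment_element, 'span', 'score likes')
```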
Once you've collected all posts and comments, dump them into a JSON file:
```python
output_file_path = 'parsed_posts.json'
with open(output_file_path, 'w', encoding='utf-8') as json_file:
    json.dump(parsed_posts, json_file, ensure_ascii=False, indent=2)

print(f"Data has been saved to {output_file_path}")
```
`indent=2` makes the file human-readable, and `ensure_ascii=False` keeps Unicode characters intact.
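To see what that second flag actually changes, compare the default output with ours — this is standard `json` module behavior, nothing specific to this scraper:

```python
import json

comment = {'author': 'béla', 'text': 'Skål!'}

# Default: non-ASCII characters get escaped
print(json.dumps(comment))
# → {"author": "b\u00e9la", "text": "Sk\u00e5l!"}

# ensure_ascii=False keeps the characters readable
print(json.dumps(comment, ensure_ascii=False))
# → {"author": "béla", "text": "Skål!"}
```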
If you're feeding Reddit data into language models like Gemini, there's an easier way. Instead of parsing HTML manually, use ScraperAPI's `output_format=markdown` parameter. This returns clean, structured text that's perfect for LLMs.
Here's how:
```python
import requests

API_KEY = "YOUR_SCRAPERAPI_KEY"
url = "https://www.reddit.com/r/valheim/comments/..."

payload = {
    "api_key": API_KEY,
    "url": url,
    "output_format": "markdown"
}

response = requests.get("http://api.scraperapi.com", params=payload)
markdown_data = response.text
print(markdown_data)
```
The response is already formatted as markdown—post title, upvotes, comments, everything. You can feed this directly into Gemini for sentiment analysis or summarization without writing any parsing logic.
Install the Gemini SDK:
```bash
pip install google-generativeai
```
Then send the markdown to Gemini:
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel(model_name="gemini-2.0-flash")

prompt = f"""
You are an online community analyst. Based on the Reddit thread below, provide:
1. A short summary of the discussion
2. The top 3 user opinions or concerns
3. Overall sentiment (positive, negative, or mixed)

Here's the thread:
{markdown_data}
"""

response = model.generate_content(prompt)
print(response.text)
```
Gemini will return a summary of the discussion, top concerns, and overall sentiment. No manual parsing needed.
This approach is perfect if you're building tools to track community sentiment, analyze feedback trends, or compare perspectives across subreddits. You go from raw Reddit data to actionable insights with minimal code.
You now know how to scrape Reddit posts and comments, avoid getting blocked, and export structured data. You can collect public discussions from any subreddit, track sentiment over time, or feed data into machine learning models.
The techniques here work for market research, competitor analysis, or just understanding what people are saying about a topic. 👉 If you're scraping Reddit at scale, ScraperAPI keeps your requests under the radar so you can focus on analysis instead of worrying about IP bans.
Stay curious, scrape responsibly, and remember: the best insights come from asking the right questions.
Why Should I Scrape Reddit?
Scraping Reddit gives you real-time insights into what people actually think. You can track sentiment around brands, identify emerging trends, understand competitor positioning, or gather data for machine learning models. It's market research without the survey costs.
What Can You Do with Reddit Data?
The possibilities are endless. Run sentiment analysis on product launches, map social networks within niche communities, track voting patterns, identify influencers, or train chatbots on real conversations. Reddit data fuels everything from trend forecasting to behavioral analysis.
Can I Scrape Private or Restricted Subreddits?
No. Private and restricted subreddits are off-limits, and scraping them violates Reddit's terms of service. Stick to publicly available data. It's not just about following the rules—it's about respecting the communities you're studying.