Scraping Reddit data doesn't have to be complicated. This guide walks you through collecting Reddit posts and comments using Python, handling anti-scraping measures, and exporting everything into clean JSON files—all without getting blocked.
By the end of this guide, you'll know how to extract posts, comments, upvotes, and author information from any subreddit. You'll also learn how to bypass Reddit's anti-bot systems and convert messy HTML into structured data. If you're doing market research, tracking sentiment, or feeding data into machine learning models, this is the foundation you need.
Before we dive in, let me show you the complete code. Then we'll break down exactly how it works.
Here's the full scraper in Python:
```python
import json

import requests
from bs4 import BeautifulSoup

scraper_api_key = 'YOUR_API_KEY'


def fetch_comments_from_post(post_data):
    # Build the comment-page URL from the post's own permalink
    payload = {
        'api_key': scraper_api_key,
        'url': f"https://www.reddit.com{post_data['permalink']}"
    }
    r = requests.get('https://api.scraperapi.com/', params=payload)
    soup = BeautifulSoup(r.content, 'html.parser')
    comment_elements = soup.find_all('div', class_='thing', attrs={'data-type': 'comment'})
    parsed_comments = []
    for comment_element in comment_elements:
        try:
            author = comment_element.find('a', class_='author').text.strip() if comment_element.find('a', class_='author') else None
            dislikes = comment_element.find('span', class_='score dislikes').text.strip() if comment_element.find('span', class_='score dislikes') else None
            unvoted = comment_element.find('span', class_='score unvoted').text.strip() if comment_element.find('span', class_='score unvoted') else None
            likes = comment_element.find('span', class_='score likes').text.strip() if comment_element.find('span', class_='score likes') else None
            timestamp = comment_element.find('time')['datetime'] if comment_element.find('time') else None
            text = comment_element.find('div', class_='md').find('p').text.strip() if comment_element.find('div', class_='md') else None
            if not text:
                continue
            parsed_comments.append({
                'author': author,
                'dislikes': dislikes,
                'unvoted': unvoted,
                'likes': likes,
                'timestamp': timestamp,
                'text': text
            })
        except Exception as e:
            print(f"Error parsing comment: {e}")
    return parsed_comments


reddit_query = 'https://www.reddit.com/t/valheim/'
scraper_api_url = f'http://api.scraperapi.com/?api_key={scraper_api_key}&url={reddit_query}'
r = requests.get(scraper_api_url)
soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('article', class_='m-0')

parsed_posts = []
for article in articles:
    post = article.find('shreddit-post')
    if post is None:
        continue  # skip articles that don't wrap a shreddit-post element
    post_title = post['post-title']
    post_permalink = post['permalink']
    content_href = post['content-href']
    comment_count = post['comment-count']
    score = post['score']
    author_id = post.get('author-id', 'N/A')
    author_name = post['author']
    subreddit_id = post['subreddit-id']
    post_id = post['id']
    subreddit_name = post['subreddit-prefixed-name']
    comments = fetch_comments_from_post(post)
    parsed_posts.append({
        'post_title': post_title,
        'post_permalink': post_permalink,
        'content_href': content_href,
        'comment_count': comment_count,
        'score': score,
        'author_id': author_id,
        'author_name': author_name,
        'subreddit_id': subreddit_id,
        'post_id': post_id,
        'subreddit_name': subreddit_name,
        'comments': comments
    })

output_file_path = 'parsed_posts.json'
with open(output_file_path, 'w', encoding='utf-8') as json_file:
    json.dump(parsed_posts, json_file, ensure_ascii=False, indent=2)

print(f"Data has been saved to {output_file_path}")
```
Run this code, and you'll get a parsed_posts.json file with all the posts and comments neatly organized.
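Here's a trimmed sample of what that file looks like. The field values below are made up for illustration, and the comment entries are abridged:

```json
[
  {
    "post_title": "Example post title",
    "post_permalink": "/r/valheim/comments/abc123/example_post_title/",
    "content_href": "https://www.reddit.com/r/valheim/comments/abc123/example_post_title/",
    "comment_count": "42",
    "score": "128",
    "author_name": "example_user",
    "subreddit_name": "r/valheim",
    "comments": [
      { "author": "another_user", "text": "An example comment." }
    ]
  }
]
```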
Now let's walk through how this actually works.
You'll need Python 3.8 or newer for this project. You'll also need two libraries: Requests for fetching web pages and BeautifulSoup4 for parsing HTML.
Install them with:
```bash
pip install requests beautifulsoup4
```
Create a project folder and a Python file:
```bash
mkdir reddit_scraper
cd reddit_scraper
touch app.py
```

(On Windows, replace `touch app.py` with `echo. > app.py`.)
That's it for setup. Now let's talk about what we're actually scraping.
Before writing any code, think about what you actually need. Are you tracking sentiment around a brand? Analyzing user behavior in a niche community? Looking for trending topics?
For this tutorial, we're targeting the Valheim subreddit and extracting posts along with their comments. You can swap this out for any public subreddit you want.
Here's the thing: Reddit doesn't like scrapers. Try scraping at any meaningful scale without protection, and you'll get blocked fast. That's where ScraperAPI comes in. It routes your requests through rotating proxies and handles all the anti-bot measures automatically.
If you're serious about collecting Reddit data reliably, especially at scale, 👉 ScraperAPI handles all the heavy lifting so you can focus on analyzing the data instead of fighting captchas. You get 5,000 free API credits to start, which is plenty for testing.
Grab your API key from the dashboard, and let's move on.
Start by importing the tools you need:
```python
import json

import requests
from bs4 import BeautifulSoup
```
Then create a variable to store your API key:
```python
scraper_api_key = 'ENTER_KEY_HERE'
```
Let's send our first request. We're targeting a specific subreddit, then routing the request through ScraperAPI:
```python
reddit_query = 'https://www.reddit.com/t/valheim/'
scraper_api_url = f'http://api.scraperapi.com/?api_key={scraper_api_key}&url={reddit_query}'
r = requests.get(scraper_api_url)
soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('article', class_='m-0')
```
This pulls down the HTML and finds all `article` elements with the CSS class `m-0`. Each one represents a post on the page.
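One thing this snippet takes for granted is that every request succeeds. In practice, a request can time out or come back with an empty body even through a proxy, so it's worth wrapping the call in a small retry helper. Here's a minimal sketch — the helper name and backoff schedule are my own, not part of ScraperAPI:

```python
import time


def fetch_with_retries(do_request, max_retries=3, base_delay=1.0):
    """Call do_request() until it returns a response, retrying on errors.

    Waits base_delay, then 2x, then 4x between attempts (exponential backoff).
    Returns None if every attempt fails.
    """
    for attempt in range(max_retries):
        try:
            response = do_request()
            if response is not None:
                return response
        except Exception as e:
            print(f"Request failed: {e}")
        if attempt < max_retries - 1:
            time.sleep(base_delay * (2 ** attempt))
    return None
```

You'd then replace the direct call with something like `r = fetch_with_retries(lambda: requests.get(scraper_api_url))`.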
Now we loop through each article and pull out the details:
```python
parsed_posts = []
for article in articles:
    post = article.find('shreddit-post')
    if post is None:
        continue  # skip articles that don't wrap a shreddit-post element
    post_title = post['post-title']
    post_permalink = post['permalink']
    content_href = post['content-href']
    comment_count = post['comment-count']
    score = post['score']
    author_id = post.get('author-id', 'N/A')
    author_name = post['author']
    subreddit_id = post['subreddit-id']
    post_id = post['id']
    subreddit_name = post['subreddit-prefixed-name']
    parsed_posts.append({
        'post_title': post_title,
        'post_permalink': post_permalink,
        'content_href': content_href,
        'comment_count': comment_count,
        'score': score,
        'author_id': author_id,
        'author_name': author_name,
        'subreddit_id': subreddit_id,
        'post_id': post_id,
        'subreddit_name': subreddit_name
    })
```
We're grabbing the title, link, score, author, and other metadata. Everything gets stored in a list for later.
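One caveat: HTML attributes always come through as strings, so `score` and `comment_count` here are values like `"128"`, not numbers. If you plan to sort or aggregate posts later, coerce them first. A small helper handles missing or malformed values — the function name and fallback behavior are my own choice:

```python
def to_int(value, default=0):
    """Coerce an HTML attribute string like '128' to an int.

    Falls back to a default for missing or malformed values.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


# Illustrative values, not real scraped data:
print(to_int('128'))   # → 128
print(to_int('N/A'))   # → 0
print(to_int(None))    # → 0
```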
Comments require their own request. We create a function that fetches a post's comment page and parses the comments:
```python
def fetch_comments_from_post(post_data):
    # Build the comment-page URL from the post's own permalink
    payload = {
        'api_key': scraper_api_key,
        'url': f"https://www.reddit.com{post_data['permalink']}"
    }
    r = requests.get('https://api.scraperapi.com/', params=payload)
    soup = BeautifulSoup(r.content, 'html.parser')
    comment_elements = soup.find_all('div', class_='thing', attrs={'data-type': 'comment'})
    parsed_comments = []
    for comment_element in comment_elements:
        try:
            author = comment_element.find('a', class_='author').text.strip() if comment_element.find('a', class_='author') else None
            text = comment_element.find('div', class_='md').find('p').text.strip() if comment_element.find('div', class_='md') else None
            if not text:
                continue
            parsed_comments.append({
                'author': author,
                'text': text
            })
        except Exception as e:
            print(f"Error parsing comment: {e}")
    return parsed_comments
```
This loops through all comment elements, extracts the author and text, and skips any empty comments. Error handling prevents the script from crashing if the HTML structure changes slightly.
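As a side note, each of those `find(...).text.strip() if find(...) else None` lines runs the same selector twice. If you find that repetitive, a small helper does one lookup per field. This is a refactoring sketch — the helper is my own, not part of BeautifulSoup — shown with a stand-in element so it runs on its own:

```python
def safe_text(element, tag, css_class=None):
    """Return the stripped text of the first matching child, or None.

    Works with any object exposing BeautifulSoup's find(tag, class_=...) API.
    """
    if css_class is not None:
        found = element.find(tag, class_=css_class)
    else:
        found = element.find(tag)
    return found.text.strip() if found else None


# Stand-in for a BeautifulSoup Tag, just to demonstrate the helper:
class FakeElement:
    def __init__(self, text='', child=None):
        self.text = text
        self._child = child

    def find(self, tag, class_=None):
        return self._child


comment = FakeElement(child=FakeElement(text='  Nice build!  '))
print(safe_text(comment, 'a', 'author'))  # → Nice build!

# In the real loop this would read:
#   author = safe_text(comment_element, 'a', 'author')
#   likes = safe_text(comment_element, 'span', 'score likes')
```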
Once you've collected all posts and comments, dump them into a JSON file:
```python
output_file_path = 'parsed_posts.json'
with open(output_file_path, 'w', encoding='utf-8') as json_file:
    json.dump(parsed_posts, json_file, ensure_ascii=False, indent=2)

print(f"Data has been saved to {output_file_path}")
```
`indent=2` makes the file human-readable, and `ensure_ascii=False` keeps Unicode characters intact.
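To see what that second flag actually changes, compare the default output with ours — this is standard `json` module behavior, nothing specific to this scraper:

```python
import json

comment = {'author': 'béla', 'text': 'Skål!'}

# Default: non-ASCII characters get escaped
print(json.dumps(comment))
# → {"author": "b\u00e9la", "text": "Sk\u00e5l!"}

# ensure_ascii=False keeps the characters readable
print(json.dumps(comment, ensure_ascii=False))
# → {"author": "béla", "text": "Skål!"}
```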
If you're feeding Reddit data into language models like Gemini, there's an easier way. Instead of parsing HTML manually, use ScraperAPI's `output_format=markdown` parameter. This returns clean, structured text that's perfect for LLMs.
Here's how:
```python
import requests

API_KEY = "YOUR_SCRAPERAPI_KEY"
url = "https://www.reddit.com/r/valheim/comments/..."

payload = {
    "api_key": API_KEY,
    "url": url,
    "output_format": "markdown"
}

response = requests.get("http://api.scraperapi.com", params=payload)
markdown_data = response.text
print(markdown_data)
```
The response is already formatted as markdown—post title, upvotes, comments, everything. You can feed this directly into Gemini for sentiment analysis or summarization without writing any parsing logic.
Install the Gemini SDK:
```bash
pip install google-generativeai
```
Then send the markdown to Gemini:
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel(model_name="gemini-2.0-flash")

prompt = f"""
You are an online community analyst. Based on the Reddit thread below, provide:
1. A short summary of the discussion
2. The top 3 user opinions or concerns
3. Overall sentiment (positive, negative, or mixed)

Here's the thread:
{markdown_data}
"""

response = model.generate_content(prompt)
print(response.text)
```
Gemini will return a summary of the discussion, top concerns, and overall sentiment. No manual parsing needed.
This approach is perfect if you're building tools to track community sentiment, analyze feedback trends, or compare perspectives across subreddits. You go from raw Reddit data to actionable insights with minimal code.
You now know how to scrape Reddit posts and comments, avoid getting blocked, and export structured data. You can collect public discussions from any subreddit, track sentiment over time, or feed data into machine learning models.
The techniques here work for market research, competitor analysis, or just understanding what people are saying about a topic. 👉 If you're scraping Reddit at scale, ScraperAPI keeps your requests under the radar so you can focus on analysis instead of worrying about IP bans.
Stay curious, scrape responsibly, and remember: the best insights come from asking the right questions.
Why Should I Scrape Reddit?
Scraping Reddit gives you real-time insights into what people actually think. You can track sentiment around brands, identify emerging trends, understand competitor positioning, or gather data for machine learning models. It's market research without the survey costs.
What Can You Do with Reddit Data?
The possibilities are endless. Run sentiment analysis on product launches, map social networks within niche communities, track voting patterns, identify influencers, or train chatbots on real conversations. Reddit data fuels everything from trend forecasting to behavioral analysis.
Can I Scrape Private or Restricted Subreddits?
No. Private and restricted subreddits are off-limits, and scraping them violates Reddit's terms of service. Stick to publicly available data. It's not just about following the rules—it's about respecting the communities you're studying.