Learn how to extract headlines, authors, publication dates, and full article text from news websites using Python's Newspaper3k library—without getting blocked or rate-limited.
So you want to scrape news articles. Maybe you're building a news aggregator, doing sentiment analysis, or just tired of manually copying headlines. Whatever the reason, Newspaper3k makes it surprisingly easy.
Here's the thing though: most scraping tutorials make it sound harder than it is. They throw around technical jargon and assume you're already a web scraping expert. This guide? It's different. We're going to walk through everything step-by-step, like two friends figuring this out together over coffee.
Think of Newspaper3k as your personal news extraction assistant. It's a Python library specifically designed to grab content from web pages that look like articles—you know, headlines, body text, author names, that kind of stuff.
The cool part is it handles the messy work automatically. You don't need to dig through HTML tags or figure out CSS selectors. Just point it at a news URL, and it pulls out what you need.
First things first—create a project folder and toss a file called index.py in there. Now let's actually build something.
Open your terminal and run:
```shell
pip install newspaper3k
```
That's it. Seriously.
Here's where it gets fun. Grab any news article URL (I'm using a CNN sports article here, but any news site works):
```python
from newspaper import Article

url = 'https://edition.cnn.com/2025/06/10/sport/manchester-city-wins-champions-league-for-first-time-beating-inter-milan-1-0-in-tense-istanbul-final/index.html'

article = Article(url)
article.download()
article.parse()
```
Two methods here: download() grabs the HTML, and parse() makes sense of it. Think of it like downloading a recipe, then actually reading and understanding the instructions.
Once you've parsed the article, you can pull out all sorts of data:
```python
print("Headline:", article.title)
print("Authors:", article.authors)
print("Publication Date:", article.publish_date)
print("Main Text:", article.text)
```
Run it with python index.py and boom—you've got the article's headline, author list, publication date, and the full text sitting right there in your terminal.
The library also grabs:
top_image – that featured image at the top
images – all images in the article
movies – embedded video content (note the attribute is movies, not videos)
html – the full page HTML if you need it
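To see what those media attributes give you in one place, here's a small sketch. The media_report helper is my own illustrative wrapper (not part of Newspaper3k); it works on any object exposing these attributes, so the demo uses a stub instead of a live download:

```python
from types import SimpleNamespace

def media_report(article):
    """Collect media fields from a parsed Article-like object.
    Note the video attribute is `movies`, not `videos`."""
    return {
        "top_image": article.top_image,
        "image_count": len(article.images),
        "movie_count": len(article.movies),
    }

# Demo with a stub so no network call is needed; on a real run,
# pass the `article` you downloaded and parsed above.
stub = SimpleNamespace(
    top_image="https://example.com/hero.jpg",
    images={"hero.jpg", "chart.png"},
    movies=[],
)
print(media_report(stub))
```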
Here's something neat: Newspaper3k works with 40+ languages out of the box. By default, it auto-detects the language, but you can specify one if needed:
```python
url = 'https://www.bbc.com/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics'
article = Article(url, language='zh')  # Chinese
```
It supports everything from Arabic to Vietnamese—check the official docs for the complete list.
Now here's where things get real. When you start scraping dozens or hundreds of articles, websites notice. They see identical requests coming from the same IP and think "bot!" Then you get blocked.
Newspaper3k's built-in download feature doesn't support proxies directly, which becomes a problem fast. The workaround? Use an HTTP client like Python Requests to handle the download, then pass the HTML to Newspaper3k for parsing.
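Here's a minimal sketch of that workaround. The helper names (download_html, parse_with_newspaper) and the User-Agent string are my own; the proxies dict follows Requests' standard format, and `download(input_html=...)` is the Newspaper3k hook that skips its own HTTP call:

```python
import requests

HEADERS = {
    # A browser-like User-Agent trips fewer naive bot filters.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
}

def download_html(url, proxies=None, timeout=15):
    """Fetch the page ourselves so we control proxies, headers, and timeouts."""
    response = requests.get(url, headers=HEADERS, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text

def parse_with_newspaper(url, html):
    """Hand pre-fetched HTML to Newspaper3k for parsing only."""
    from newspaper import Article  # local import keeps the fetch helper standalone
    article = Article(url)
    article.download(input_html=html)  # no second network request
    article.parse()
    return article

# Example (uncomment to make real requests):
# proxies = {"http": "http://user:pass@proxy:8080",
#            "https": "http://user:pass@proxy:8080"}
# html = download_html("https://example.com/some-article", proxies=proxies)
# print(parse_with_newspaper("https://example.com/some-article", html).title)
```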
That's where proxy rotation comes in handy. Instead of managing proxy lists yourself (trust me, it's a headache), you can use a scraping API that handles all that complexity behind the scenes. 👉 Skip the proxy headaches and start scraping news at scale with intelligent rotation built-in
Here's how it works with a scraping API:
```python
import requests
from newspaper import Article

url = 'https://edition.cnn.com/2025/06/10/sport/manchester-city-wins-champions-league-for-first-time-beating-inter-milan-1-0-in-tense-istanbul-final/index.html'

payload = {'api_key': 'YOUR-API-KEY', 'url': url}
response = requests.get('https://api.scraperapi.com/', params=payload)

article = Article(url)
article.download(input_html=response.text)
article.parse()

print("Headline:", article.title)
print("Authors:", article.authors)
print("Publication Date:", article.publish_date)
```
(Requests URL-encodes the payload for you when you pass the dict to params, so there's no need to call urlencode yourself.)
With this setup, you can scrape millions of pages without worrying about CAPTCHAs, rate limits, or IP bans. The scraping API handles proxy rotation, browser fingerprinting, and all the anti-bot stuff automatically.
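If you ever want to see exactly what gets sent, you can assemble the request URL by hand. The build_api_url helper below is my own illustration (the endpoint and parameter names mirror the example above; the extra_params hook is just a convenience for optional API flags):

```python
from urllib.parse import urlencode

def build_api_url(api_key, target_url, extra_params=None):
    """Assemble the full API request URL; handy for debugging what's sent."""
    params = {"api_key": api_key, "url": target_url}
    if extra_params:
        params.update(extra_params)
    return "https://api.scraperapi.com/?" + urlencode(params)

print(build_api_url("YOUR-API-KEY", "https://example.com/article?id=1"))
```

Notice how urlencode percent-escapes the target URL so it survives being nested inside another URL's query string.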
Newspaper3k has a built-in natural language processing feature that's surprisingly useful. It can generate article summaries and extract keywords automatically:
```python
from newspaper import Article

url = 'https://edition.cnn.com/2025/06/10/sport/manchester-city-wins-champions-league-for-first-time-beating-inter-milan-1-0-in-tense-istanbul-final/index.html'

article = Article(url)
article.download()
article.parse()
article.nlp()

print("Text Summary:", article.summary)
print("Keywords:", article.keywords)
```
The nlp() method is computationally expensive, so don't use it unless you actually need the summary or keywords. But when you do need it? It's pretty slick.
First time running nlp(), you might see an error about missing the punkt package. No worries—just add these two lines at the top of your script once:
```python
import nltk
nltk.download('punkt')
```
Run your script once so the package downloads, then you can delete those lines — or just leave them in, since nltk.download() skips the download when the data is already on disk.
Want to scrape several news sites at once? Newspaper3k's multi-threading feature lets you do exactly that:
```python
import newspaper
from newspaper import news_pool

ted = newspaper.build('https://ted.com')
cnbc = newspaper.build('https://cnbc.com')
fox_news = newspaper.build('https://foxnews.com/')

papers = [ted, cnbc, fox_news]
news_pool.set(papers, threads_per_source=2)  # 6 threads total
news_pool.join()

print(cnbc.size())
```
Keeping it to 1–2 threads per source is a good way to stay under each site's rate limits. The join() call blocks until every source finishes downloading; afterwards, each source's articles list holds the downloaded (but not yet parsed) articles, and size() tells you how many were found.
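One gotcha: after news_pool.join(), each article has its HTML downloaded but still needs parse() before you can read .title or .text. Here's a small sketch — the helper name is mine, and it accepts any object with an articles list, which also makes it easy to demo with a stub instead of a live crawl:

```python
def titles_from_paper(paper, limit=5):
    """Parse the first `limit` downloaded articles and return their titles."""
    titles = []
    for article in paper.articles[:limit]:
        article.parse()  # HTML was already downloaded by news_pool.join()
        titles.append(article.title)
    return titles

# Stub demo (no network); with Newspaper3k you'd call titles_from_paper(cnbc).
class StubArticle:
    def __init__(self, title):
        self.title = title
    def parse(self):
        pass  # a real Article extracts fields from its downloaded HTML here

class StubPaper:
    def __init__(self, articles):
        self.articles = articles

demo = StubPaper([StubArticle("Headline A"), StubArticle("Headline B")])
print(titles_from_paper(demo))  # ['Headline A', 'Headline B']
```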
You now know how to extract article headlines, body text, authors, publication dates, and even video content from news websites. You've seen how to handle different languages, use NLP for summaries and keywords, and scrape multiple sources without getting blocked.
The key takeaway? Newspaper3k handles the parsing beautifully, but when you're scaling up, you need proper infrastructure to avoid anti-bot measures. Combining Newspaper3k with proxy rotation gives you the best of both worlds—powerful parsing plus reliable access at scale. 👉 Get started with 5,000 free API credits and see how easy large-scale news scraping can be
For more advanced features like trending terms and popular URLs, check out the official Newspaper3k documentation. Happy scraping!