feedparser is a tool for getting items such as the title, link and article from any RSS or Atom feed.
Click here to see how to download and use feedparser.
Each RSS or Atom feed contains a title and entries, and each entry normally contains a summary or description. We use feedparser.parse() in Python 3 to extract this information from an RSS or Atom feed.
Getting information from an RSS feed with feedparser
Input
import feedparser

# Download and parse the feed
ny = feedparser.parse('https://nypost.com/living/feed/')
# ny['items'] and ny.entries refer to the same list of entries
print(ny.feed.title, '\n', ny.feed.link, '\n', ny['items'][0].title, '\n', ny.entries[0].title)
Output
Living | New York Post
 https://nypost.com
 Study discovers victims are often blamed for workplace bullying
 Study discovers victims are often blamed for workplace bullying

With so much new content being added to the web daily, it can be tough to keep up with what’s happening online. People try a number of different ways, including visiting specific websites every day, doing Google searches, or relying on social media to keep them informed. One solution that sometimes gets overlooked is an old-school one: the RSS feed.
RSS is short for Really Simple Syndication, and it’s a way to have information delivered to you instead of you having to go find it.
If you’re visiting websites and blogs to see if there is anything new to read, you’re probably wasting a lot of time and not always finding anything new. RSS allows you to subscribe to a blog or website and have any new published information sent to you so you don’t have to go looking for it.
RSS is XML-formatted plain text. The format is relatively easy to read, both for automated processes and for humans. An example feed could have contents such as the following:
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
  <title>RSS Title</title>
  <description>This is an example of an RSS feed</description>
  <link>http://www.example.com/main.html</link>
  <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000 </lastBuildDate>
  <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
  <ttl>1800</ttl>
  <item>
    <title>Example entry</title>
    <description>Here is some text containing an interesting description.</description>
    <link>http://www.example.com/blog/post/1</link>
    <guid isPermaLink="false">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
    <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
  </item>
</channel>
</rss>

When retrieved, reading software could use the XML structure to present a neat display to the end users.
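To give a rough idea of how reading software might walk that structure, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree rather than feedparser. The feed text is a shortened copy of the example above, and the names RSS_TEXT and read_feed are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example feed (some channel fields trimmed for brevity).
RSS_TEXT = """<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
  <title>RSS Title</title>
  <description>This is an example of an RSS feed</description>
  <link>http://www.example.com/main.html</link>
  <item>
    <title>Example entry</title>
    <description>Here is some text containing an interesting description.</description>
    <link>http://www.example.com/blog/post/1</link>
  </item>
</channel>
</rss>"""


def read_feed(xml_bytes):
    """Hypothetical helper: return the channel title, link and items as plain dicts."""
    root = ET.fromstring(xml_bytes)
    channel = root.find("channel")
    items = []
    for item in channel.findall("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return {
        "title": channel.findtext("title"),
        "link": channel.findtext("link"),
        "items": items,
    }


# Pass bytes so the parser honours the encoding named in the XML declaration.
feed = read_feed(RSS_TEXT.encode("utf-8"))
print(feed["title"], "-", feed["link"])
for entry in feed["items"]:
    print("*", entry["title"], "->", entry["link"])
```

feedparser does this kind of work for us, and it also handles Atom feeds and the many variations found in real-world feeds, which this sketch does not attempt.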
Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing.
Click here to see the list of stop words.
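The frequency-based strategy for building a stop list can be sketched in a few lines of Python. This is only an illustration under simple assumptions: terms are lowercased and split on whitespace, the hand-filtering step is skipped, and the function names and the tiny document collection are invented for the example.

```python
from collections import Counter


def build_stop_list(documents, k):
    """Return the k terms with the highest collection frequency."""
    counts = Counter()
    for doc in documents:
        # Collection frequency: total occurrences across all documents.
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(k)]


def remove_stop_words(text, stop_list):
    """Drop every term that appears in the stop list."""
    stops = set(stop_list)
    return [term for term in text.lower().split() if term not in stops]


# Toy collection, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog ate a bone",
    "a cat and a dog saw the bird",
]

stop_list = build_stop_list(docs, 2)
print(stop_list)
print(remove_stop_words("the cat saw a bird", stop_list))
```

In practice the candidate list would be reviewed by hand before use, since a frequent term may still carry meaning in a particular domain.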