Tools and concepts

The feedparser

What is feedparser

feedparser is a Python library that we can use to extract the title, link, and article entries from any RSS or Atom feed.

Click here to see how to download and use feedparser.

Basic function

Each RSS or Atom feed contains a title and a list of entries, and each entry normally contains a summary or description. We use feedparser.parse() in Python 3 to extract this information from an RSS or Atom feed.

Getting information from an RSS feed with feedparser

Input

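import feedparser

# Download and parse the feed (requires network access).
ny = feedparser.parse('https://nypost.com/living/feed/')

# 'items' is an alias for 'entries', so the last two values are identical.
print(ny.feed.title, '\n',
      ny.feed.link, '\n',
      ny['items'][0].title, '\n',
      ny.entries[0].title)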

Output

Living |  New York Post
 https://nypost.com
 Study discovers victims are often blamed for workplace bullying
 Study discovers victims are often blamed for workplace bullying

RSS feed

Why do we use RSS?

With so much new content being added to the web every day, it can be tough to keep up with what's happening online. People try a number of different approaches, including visiting specific websites every day, running Google searches, or relying on social media to keep them informed. One solution that sometimes gets overlooked is an old-school one: the RSS feed.

What is RSS?

RSS is short for Really Simple Syndication. It is a way to have information delivered to you instead of you having to go and find it.

If you visit websites and blogs just to check whether there is anything new to read, you are probably spending a lot of time and often finding nothing new. RSS lets you subscribe to a blog or website and have any newly published information sent to you, so you don't have to go looking for it.

RSS is XML-formatted plain text. The format is relatively easy to read for both automated processes and humans. An example feed could have contents such as the following:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
 <title>RSS Title</title>
 <description>This is an example of an RSS feed</description>
 <link>http://www.example.com/main.html</link>
 <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000</lastBuildDate>
 <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
 <ttl>1800</ttl>
 <item>
  <title>Example entry</title>
  <description>Here is some text containing an interesting description.</description>
  <link>http://www.example.com/blog/post/1</link>
  <guid isPermaLink="false">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
  <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
 </item>
</channel>
</rss>

When retrieved, reading software could use the XML structure to present a neat display to the end users.
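To make the structure concrete, here is a minimal sketch of how reading software might pull the channel title and the items out of such a feed. It uses only Python's standard xml.etree.ElementTree module rather than feedparser, and the names RSS_TEXT and read_feed are hypothetical, chosen for illustration. The feed text is a trimmed copy of the example above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example feed (hypothetical constant name).
RSS_TEXT = """<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
 <title>RSS Title</title>
 <description>This is an example of an RSS feed</description>
 <link>http://www.example.com/main.html</link>
 <item>
  <title>Example entry</title>
  <description>Here is some text containing an interesting description.</description>
  <link>http://www.example.com/blog/post/1</link>
 </item>
</channel>
</rss>"""


def read_feed(rss_bytes):
    """Return the channel title and a list of item dicts from RSS 2.0 bytes."""
    root = ET.fromstring(rss_bytes)   # root is the <rss> element
    channel = root.find("channel")    # RSS 2.0 has a single <channel>
    feed_title = channel.findtext("title")
    items = [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        }
        for item in channel.findall("item")
    ]
    return feed_title, items


# Pass bytes so the parser honours the encoding named in the XML declaration.
feed_title, items = read_feed(RSS_TEXT.encode("utf-8"))
print(feed_title)
for entry in items:
    print(entry["title"], entry["link"])
```

A real reader would also have to cope with Atom feeds, namespaces, and malformed XML, which is the kind of work a library such as feedparser takes care of.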

Stop words

What are stop words?

Some extremely common words appear to be of little value in helping select documents that match a user need, so they are sometimes excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing.
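The frequency-based strategy can be sketched in a few lines of Python. This is an illustration under simple assumptions, not a production indexer: the tokenizer is a basic regular expression, the three sample documents are made up, and the function names build_stop_list and remove_stop_words are hypothetical. The hand-filtering step described above is not automated here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))  # total occurrences across all documents
    return {term for term, _ in counts.most_common(size)}


def remove_stop_words(text, stop_list):
    """Discard stop words from a text, as an indexer would before indexing."""
    return [term for term in tokenize(text) if term not in stop_list]


# Made-up sample collection for illustration.
docs = [
    "the cat sat on the mat",
    "the dog saw the bird on the roof",
    "the bird sang on the roof",
]

stop_list = build_stop_list(docs, 2)
print(stop_list)
print(remove_stop_words(docs[0], stop_list))
```

In this tiny collection the two most frequent terms are "the" and "on". Raising the size further would start to pull in content words such as "bird", which is why the candidate list is usually reviewed by hand before it is used.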

List of stop words

Click here to see the list of stop words.
