When the program calculates the most frequent words in an RSS feed, most of the result consists of meaningless words, such as the function words of human language. Function words like 'the', 'is', 'at', and 'which' are very common, but compared with other words they carry little practical meaning. As a result, we may not get useful information from the raw frequency counts.
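The snippet below calls `calcMostFreq`, which is not shown in this post. A minimal sketch of such a helper, assuming (from its usage) that it counts each vocabulary word's occurrences in the combined token list and returns the 30 most frequent, could look like:

```python
import operator

# Hypothetical sketch of calcMostFreq: count how often each vocabulary
# word occurs in the fullText token list and return the 30 most frequent
# as (word, count) pairs, sorted by descending count.
def calcMostFreq(vocabList, fullText):
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]
```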
```python
top30Words = calcMostFreq(vocabList, fullText)
print(top30Words)
```

Output:
```
[('', 635), ('the', 94), ('a', 81), ('of', 59), ('p', 36), ('and', 36), ('to', 34), ('in', 33), ('for', 27), ('s', 25), ('with', 22), ('is', 22), ('i', 19), ('on', 17), ('at', 16), ('was', 16), ('it', 16), ('her', 16), ('that', 15), ('who', 14), ('as', 14), ('from', 14), ('he', 12), ('when', 11), ('this', 11), ('you', 10), ('new', 10), ('have', 10), ('your', 9), ('his', 9)]
```

To solve this problem, we can build a filter that eliminates these meaningless words using a stop-word list.
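`getStopWords` is also not defined in the post. A minimal stand-in is sketched below; the hard-coded list is hypothetical (drawn from the frequent-word output above), and a real project would more likely load a full stop-word file:

```python
# Hypothetical stand-in for getStopWords: a real project would likely
# read a stop-word file; a short hard-coded list is used here for
# illustration only.
def getStopWords():
    return ['the', 'a', 'of', 'and', 'to', 'in', 'for', 'is', 'with',
            'i', 'on', 'at', 'was', 'it', 'her', 'that', 'who', 'as',
            'from', 'he', 'when', 'this', 'you', 'have', 'your', 'his']
```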
```python
stopWords = getStopWords()
for words in stopWords:
    if words in vocabList:
        vocabList.remove(words)
top30Words = calcMostFreq(vocabList, fullText)
print(top30Words)
```

Output:
```
[('t', 8), ('people', 7), ('years', 7), ('food', 7), ('year', 5), ('time', 5), ('lot', 4), ('long', 4), ('julie', 4), ('shop', 4), ('cookbook', 4), ('dinner', 4), ('folks', 4), ('open', 4), ('kitchen', 4), ('martha', 3), ('good', 3), ('soup', 3), ('park', 3), ('bar', 3), ('milk', 3), ('baking', 3), ('ve', 3), ('female', 3), ('culinary', 3), ('born', 3), ('hot', 3), ('american', 3), ('chef', 3), ('island', 3)]
```

In this project, the RSS feeds are taken from the "living" section of the New York Post and the food section of the Los Angeles Times. Each feed provides only 20 entries.
```python
import feedparser

ny = feedparser.parse('https://nypost.com/living/feed/')
sf = feedparser.parse('https://www.latimes.com/food/rss2.0.xml')
feed1 = ny
feed0 = sf
docList = []; classList = []; fullText = []
minLen = min(len(feed1['entries']), len(feed0['entries']))
print('num of information from RSS feed is', minLen)
```

Output:
```
num of information from RSS feed is 20
```

As a result, we may not get enough training samples. In this project we use 20% of the data as one split, which contains only 4 RSS entries. In future work, it will be important to find a stable and abundant RSS source.
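A random hold-out split of this size can be sketched as follows; `splitTrainTest` is a hypothetical helper (the 20% fraction and the 20-entry count come from the text above):

```python
import random

# Hypothetical helper: randomly hold out a fraction of document indices,
# returning (remaining_indices, held_out_indices).
def splitTrainTest(numDocs, heldOutFraction=0.2):
    indices = list(range(numDocs))
    random.shuffle(indices)
    numHeldOut = int(numDocs * heldOutFraction)
    return indices[numHeldOut:], indices[:numHeldOut]
```

With 20 documents and a 0.2 fraction, the held-out split contains only 4 entries, which illustrates how small the sample is.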
Meanwhile, choosing different parts of an RSS entry yields different information.
Title:
```python
information = feed1['entries'][0]['title']
print(information)
```

Output:

```
Mom faces jail in Dubai for calling ex-husband’s new wife a ‘horse’ on Facebook
```

Summary:
```python
information = feed1['entries'][0]['summary']
print(information)
```

Output:

```
A terrified British mom facing jail in Dubai for calling her ex-husband an “idiot” and his new wife a “horse” says her “life’s in ruins.” Laleh Sharavesh, 55, from Surrey, England, faces up to two years in jail after she was detained along with her daughter Paris, 14, when they arrived in the United Arab...
```

Bayes’ theorem is a way to compute conditional probability: the probability of an event happening, given that it has some relationship to one or more other events.
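As a concrete illustration of Bayes' theorem, P(A|B) = P(B|A) · P(A) / P(B); the numbers below are made up for illustration and are not from the project:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

# Illustrative numbers only: if 30% of entries come from feed A, a word
# appears in 20% of feed-A entries, and in 10% of all entries, then
# P(A | word) = 0.2 * 0.3 / 0.1 = 0.6.
```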
After we calculate the frequency of each word appearing in a particular RSS feed, and the probability that a feed comes from each region, we can compute the Bayes probability of each word.
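The training loop below relies on `bagOfWords2VecMN` to turn a document into a word-count vector. That function is not shown here; a minimal sketch, assuming the usual bag-of-words behavior, is:

```python
# Sketch of bagOfWords2VecMN, assumed to return a bag-of-words count
# vector aligned with vocabList; words outside the vocabulary are ignored.
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
```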
```python
for docIndex in trainingSet:
    # print('docIndex', docIndex)
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
```

In this program, the project builds a word vector of length 240 and finds the top 30 most frequent words:
```
[('nyc', 5), ('san', 5), ('downtown', 4), ('mobile', 4), ('deals', 4), ('flooring', 4), ('flyer', 4), ('logo', 4), ('design', 4), ('printing', 4), ('distribution', 4), ('marketing', 4), ('oakland', 3), ('west', 3), ('cash', 3), ('carpet', 3), ('york', 3), ('bay', 3), ('business', 3), ('financial', 2), ('doc', 2), ('services', 2), ('software', 2), ('firm', 2), ('santa', 2), ('recruiting', 2), ('businesses', 2), ('pros', 2), ('low', 2), ('crew', 2)]
```
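The `trainNB0` call used above is not shown in this post. A sketch in the style of the classic naive Bayes trainer, with Laplace smoothing and log probabilities (the exact details are assumed, not taken from the post), could look like:

```python
import numpy as np

# Sketch of a naive Bayes trainer in the style of trainNB0: per-class
# word probabilities with Laplace smoothing, returned as log values,
# plus the prior probability of class 1. Details are assumptions.
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs, numWords = trainMatrix.shape
    pClass1 = np.sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords)   # Laplace smoothing: start counts at 1
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    return np.log(p0Num / p0Denom), np.log(p1Num / p1Denom), pClass1
```

The log transform avoids numerical underflow when many small word probabilities are later multiplied together during classification.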