When the program calculates the most frequent words in an RSS feed, most of the result consists of meaningless words, such as the function words of human language. Function words like 'the', 'is', 'at', and 'which' are very common, but compared with other words they carry little practical meaning. As a result, we may not get useful information from the raw frequency counts.
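The snippet below calls `calcMostFreq`, which is not shown in this post. A minimal sketch of such a helper, assuming (from its usage) that it counts each vocabulary word's occurrences in the combined token list and returns the 30 most frequent, could look like:

```python
import operator

# Hypothetical sketch of calcMostFreq: count how often each vocabulary
# word occurs in the fullText token list and return the 30 most frequent
# as (word, count) pairs, sorted by descending count.
def calcMostFreq(vocabList, fullText):
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]
```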
```python
top30Words = calcMostFreq(vocabList, fullText)
print(top30Words)
```

Output:
```
[('', 635), ('the', 94), ('a', 81), ('of', 59), ('p', 36), ('and', 36), ('to', 34), ('in', 33), ('for', 27), ('s', 25), ('with', 22), ('is', 22), ('i', 19), ('on', 17), ('at', 16), ('was', 16), ('it', 16), ('her', 16), ('that', 15), ('who', 14), ('as', 14), ('from', 14), ('he', 12), ('when', 11), ('this', 11), ('you', 10), ('new', 10), ('have', 10), ('your', 9), ('his', 9)]
```

To solve this problem, we can build a filter that eliminates these meaningless words using a stop-word list.
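`getStopWords` is also not defined in the post. A minimal stand-in is sketched below; the hard-coded list is hypothetical (drawn from the frequent-word output above), and a real project would more likely load a full stop-word file:

```python
# Hypothetical stand-in for getStopWords: a real project would likely
# read a stop-word file; a short hard-coded list is used here for
# illustration only.
def getStopWords():
    return ['the', 'a', 'of', 'and', 'to', 'in', 'for', 'is', 'with',
            'i', 'on', 'at', 'was', 'it', 'her', 'that', 'who', 'as',
            'from', 'he', 'when', 'this', 'you', 'have', 'your', 'his']
```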
```python
stopWords = getStopWords()
for words in stopWords:
    if words in vocabList:
        vocabList.remove(words)
top30Words = calcMostFreq(vocabList, fullText)
print(top30Words)
```

Output:
```
[('t', 8), ('people', 7), ('years', 7), ('food', 7), ('year', 5), ('time', 5), ('lot', 4), ('long', 4), ('julie', 4), ('shop', 4), ('cookbook', 4), ('dinner', 4), ('folks', 4), ('open', 4), ('kitchen', 4), ('martha', 3), ('good', 3), ('soup', 3), ('park', 3), ('bar', 3), ('milk', 3), ('baking', 3), ('ve', 3), ('female', 3), ('culinary', 3), ('born', 3), ('hot', 3), ('american', 3), ('chef', 3), ('island', 3)]
```

In this project, the RSS feeds are taken from the "living" section of the New York Post and the food section of the Los Angeles Times. Each feed provides only 20 entries.
```python
import feedparser

ny = feedparser.parse('https://nypost.com/living/feed/')
sf = feedparser.parse('https://www.latimes.com/food/rss2.0.xml')
feed1 = ny
feed0 = sf
docList = []; classList = []; fullText = []
minLen = min(len(feed1['entries']), len(feed0['entries']))
print('num of information from RSS feed is', minLen)
```

Output:
```
num of information from RSS feed is 20
```

As a result, we may not get enough training samples. In this project we use 20% of the data as one split, which contains only 4 RSS entries. In future work, it will be important to find a stable and abundant RSS source.
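A random hold-out split of this size can be sketched as follows; `splitTrainTest` is a hypothetical helper (the 20% fraction and the 20-entry count come from the text above):

```python
import random

# Hypothetical helper: randomly hold out a fraction of document indices,
# returning (remaining_indices, held_out_indices).
def splitTrainTest(numDocs, heldOutFraction=0.2):
    indices = list(range(numDocs))
    random.shuffle(indices)
    numHeldOut = int(numDocs * heldOutFraction)
    return indices[numHeldOut:], indices[:numHeldOut]
```

With 20 documents and a 0.2 fraction, the held-out split contains only 4 entries, which illustrates how small the sample is.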
Meanwhile, choosing different parts of an RSS entry yields different information.
Title:
```python
information = feed1['entries'][0]['title']
print(information)
```

Output:

```
Mom faces jail in Dubai for calling ex-husband’s new wife a ‘horse’ on Facebook
```

Summary:
```python
information = feed1['entries'][0]['summary']
print(information)
```

Output:

```
A terrified British mom facing jail in Dubai for calling her ex-husband an “idiot” and his new wife a “horse” says her “life’s in ruins.” Laleh Sharavesh, 55, from Surrey, England, faces up to two years in jail after she was detained along with her daughter Paris, 14, when they arrived in the United Arab...
```

Bayes’ theorem is a way to compute conditional probability: the probability of an event happening, given that it has some relationship to one or more other events.
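As a concrete illustration of Bayes' theorem, P(A|B) = P(B|A) · P(A) / P(B); the numbers below are made up for illustration and are not from the project:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

# Illustrative numbers only: if 30% of entries come from feed A, a word
# appears in 20% of feed-A entries, and in 10% of all entries, then
# P(A | word) = 0.2 * 0.3 / 0.1 = 0.6.
```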
After we calculate the frequency of each word appearing in a particular RSS feed, and the probability that a feed comes from each region, we can compute the Bayes probability of each word.
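The training loop below relies on `bagOfWords2VecMN` to turn a document into a word-count vector. That function is not shown here; a minimal sketch, assuming the usual bag-of-words behavior, is:

```python
# Sketch of bagOfWords2VecMN, assumed to return a bag-of-words count
# vector aligned with vocabList; words outside the vocabulary are ignored.
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
```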
```python
for docIndex in trainingSet:
    # print('docIndex', docIndex)
    trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
    trainClasses.append(classList[docIndex])
p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
```

In this program, the project builds a word vector of length 240 and finds the top 30 most frequent words:
```
[('nyc', 5), ('san', 5), ('downtown', 4), ('mobile', 4), ('deals', 4), ('flooring', 4), ('flyer', 4), ('logo', 4), ('design', 4), ('printing', 4), ('distribution', 4), ('marketing', 4), ('oakland', 3), ('west', 3), ('cash', 3), ('carpet', 3), ('york', 3), ('bay', 3), ('business', 3), ('financial', 2), ('doc', 2), ('services', 2), ('software', 2), ('firm', 2), ('santa', 2), ('recruiting', 2), ('businesses', 2), ('pros', 2), ('low', 2), ('crew', 2)]
```
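The `trainNB0` call used above is not shown in this post. A sketch in the style of the classic naive Bayes trainer, with Laplace smoothing and log probabilities (the exact details are assumed, not taken from the post), could look like:

```python
import numpy as np

# Sketch of a naive Bayes trainer in the style of trainNB0: per-class
# word probabilities with Laplace smoothing, returned as log values,
# plus the prior probability of class 1. Details are assumptions.
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs, numWords = trainMatrix.shape
    pClass1 = np.sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords)   # Laplace smoothing: start counts at 1
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += np.sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += np.sum(trainMatrix[i])
    return np.log(p0Num / p0Denom), np.log(p1Num / p1Denom), pClass1
```

The log transform avoids numerical underflow when many small word probabilities are later multiplied together during classification.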