We have chosen a large dataset of over 800,000 text documents. In 2014, there was a major public debate on Net Neutrality. The Federal Communications Commission (FCC) asked citizens of the United States to submit comments through ECFS (Electronic Comment Filing System) before 18 July 2014. As a result, 801,781 users posted their comments online. The FCC released the dataset publicly, and we use this dataset for our project. Our goal is to understand the key points made by the public regarding the Net Neutrality issue.
Our dataset was publicly available on the FCC website as a zip file, 14-28-RAW-Solr.zip, which contained several XML documents. The XML documents were complex and contained around 20 fields per comment. Much of the XML data was unnecessary, and processing XML directly is slow. Therefore, our approach was to first extract the necessary fields from the XML and store them in a simply formatted file. We implemented a Perl script to extract the required fields from the XML documents and store the results in a CSV (Comma Separated Values) file.
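The actual extraction was done with a Perl script; the following is only a minimal sketch of the same idea in Python, with assumed element and field names (`comment`, `id`, `applicant`, `text`), since the exact ECFS XML schema is not reproduced in this report.

```python
# Sketch of the XML-to-CSV extraction step (the project used a Perl script;
# this Python version is for illustration only). The element and field names
# below are assumptions, not the actual ECFS schema.
import csv
import xml.etree.ElementTree as ET

def xml_to_csv(xml_path, csv_path):
    tree = ET.parse(xml_path)
    root = tree.getroot()
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "applicant", "text"])   # keep only the fields we need
        for doc in root.iter("comment"):               # assumed element name
            writer.writerow([
                doc.findtext("id", default=""),
                doc.findtext("applicant", default=""),
                doc.findtext("text", default=""),
            ])

xml_to_csv("14-28-RAW-Solr.xml", "comments.csv")
```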
Figure 1: Output of the Perl Script Comma Separated Values
As mentioned earlier, the XML documents were crude and contained many non-readable artifacts that could not be processed. We therefore removed all non-ASCII characters and all non-alphanumeric characters except a few separators, wrote the results to a CSV file, and stored the dataset on Google Cloud BigQuery. Cleaning need not be limited to these steps, though they were sufficient for our purposes; the following pipeline is recommended and can also serve as future work. Data cleaning is done by: (1) normalizing accented terms, and (2) dropping the illegible artifacts described above. Data preprocessing and tokenizing is done by: (1) normalizing all terms to lowercase, (2) applying English tokenization rules, (3) filtering out terms with fewer than 3 characters, (4) removing common English stop words, and (5) normalizing spelling variations of some important terms such as ISP, paytoplay, Commissioner/Chairman Tom Wheeler, common carriers, etc. [1]
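A minimal sketch of these cleaning and tokenizing rules, assuming NLTK's English stop-word list and a small hand-built normalization map for the key terms; the actual pipeline may differ in detail.

```python
# Sketch of the cleaning/tokenizing rules listed above.
# Assumptions: NLTK stop words (requires nltk.download('stopwords')) and a
# hand-built spelling-normalization map.
import re
import unicodedata
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))
NORMALIZE = {"isps": "isp", "pay-to-play": "paytoplay"}

def clean(text):
    # 1. Normalize accented characters and drop non-ASCII artifacts.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 2. Keep only alphanumerics and a few separators, lowercase everything.
    text = re.sub(r"[^A-Za-z0-9\s'-]", " ", text).lower()
    # 3. Tokenize, drop short tokens and stop words, normalize key terms.
    return [NORMALIZE.get(t, t) for t in text.split()
            if len(t) >= 3 and t not in STOP]

print(clean("ISPs must not create a pay-to-play Internet!"))
```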
As we are dealing with roughly 800,000 records, our computers could not process the data with their current configuration. Therefore, we switched to a faster approach and used cloud computing. We chose Google's BigQuery, as Google already had a publicly available dataset for trigrams. Access to our dataset is very fast; the schema is shown in Figure 3, and a sample of our dataset is shown in Figure 4.
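For illustration, a comment table on BigQuery can be queried from Python with the google-cloud-bigquery client roughly as below; the table name `our_project.netneutrality.comments` is a placeholder, not the project's actual dataset, and credentials are assumed to be configured.

```python
# Sketch of querying the comment table on BigQuery.
# The table name is a placeholder; authentication/project setup is assumed.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT id, applicant, text
    FROM `our_project.netneutrality.comments`   -- placeholder table name
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.id, row.applicant)
```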
Figure 2: Google’s publicdata:samples.trigrams
PageRank is an algorithm developed at Google. It addresses the problem of ranking websites based on the quantity and quality of the links to a given website: the higher the rank, the more important the web page. The importance of a website is determined by the links to it from other websites; this is the quantity part. The quality part is determined by the pages that link to the website of interest, i.e., how many links those pages themselves receive. The name is inspired by Larry Page, one of the founders of Google.
Figure 3: Schema
Figure 4: Our Dataset
The PageRank algorithm is based on the concept of link analysis, which belongs to the study of network analysis. Such techniques are used, for example, by insurance companies to find all of a customer's links to other insurance companies for a particular product, such as a car; this way they can either catch fraudulent customers who have registered for more than one policy, or use the links for marketing strategy development. They are also used by intelligence agencies to track callers and to find unknown links between them. [6]
The basic idea behind the algorithm is that, given N nodes, all nodes are initially assigned equal probability values. Suppose node 1 is connected to node 2, i.e., a link to web page 2 appears on web page 1, and, to generalize, suppose node 1 is also connected to node 3. The algorithm is then run iteratively. In the first iteration, the probability value held by node 1 is distributed equally to node 2 and node 3, the nodes it is connected to. The iterations continue until the probability score of each node settles into a steady state. The probability held by each node then roughly indicates the rank of the corresponding web page. There are many refinements beyond this, but they are out of the scope of this project report; the explanation is given only to provide the reader with a basic intuition before TextRank is presented.
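A minimal sketch of this iterative redistribution, assuming an unweighted directed graph stored as adjacency lists; it is meant only to make the intuition above concrete, not to reproduce the production algorithm.

```python
# Toy PageRank iteration: every node splits its score equally among the nodes
# it links to, plus a damping term (d = 0.85).
def pagerank(links, d=0.85, iters=50):
    nodes = list(links)
    score = {n: 1.0 / len(nodes) for n in nodes}   # equal initial probability
    for _ in range(iters):
        new = {}
        for n in nodes:
            incoming = sum(score[m] / len(links[m]) for m in nodes if n in links[m])
            new[n] = (1 - d) / len(nodes) + d * incoming
        score = new
    return score

# Node 1 links to nodes 2 and 3, as in the example above.
print(pagerank({1: [2, 3], 2: [1], 3: [1]}))
```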
Figure 5: PageRank Representation
PageRank is worth mentioning because the same intuition is used in the TextRank algorithm proposed by Rada Mihalcea et al. The main idea is that every word in a document votes for, and recommends, other words in the document. We exploit this idea and build a graph out of each relevant word in a document, and then run the same algorithm explained above for PageRank over it. So we have a graph G = (V, E), where V is the set of vertices and E is the set of edges. Let In(Vi) be the set of vertices pointing to vertex Vi, and Out(Vi) be the set of vertices that Vi votes for. The score of Vi is given by the formula below, where d is the damping factor, which we set to 0.85. [3]
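The score formula, as defined in the TextRank paper [3], is:

S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}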
We also assign arbitrary initial values to each node and then run the iterative algorithm of voting and recommendation. The paper suggests that, irrespective of the starting values assigned to the nodes, the algorithm converges to the same scores, so we did not experiment much with this and initialized all values to zero.
However, in our text-ranking problem we cannot use the directed approach, so the solution is to treat the graph as undirected and set In(Vi) = Out(Vi). We also take advantage of edge weights in the graph, and modify the previous formula to the weighted form below.
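The weighted score, again following [3], is:

WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)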
Once we have the graph ready, we implement the following algorithm, which we run iteratively until a convergence threshold of 0.005 is reached. The steps, adapted from the paper, are listed below; a condensed code sketch follows the step list.
Further, we first need to identify which text is worth adding to the graph in the first place. The paper proposes several techniques, including frequency-based selection and naïve Bayes, but we chose the lexical and syntactic features technique. The expected end result for this application is a set of words or phrases that are representative of a given natural-language text. The text is tokenized and annotated with part-of-speech tags. All lexical units that pass the syntactic filter are added to the graph, and an edge is added between lexical units that co-occur within a window of words. After obtaining the final score for each vertex in the graph, vertices are sorted in reverse order of their score, and the top vertices in the ranking are retained for post-processing. During the post-processing phase, all lexical units selected as potential keywords by the TextRank algorithm are marked in the text, and sequences of adjacent keywords are collapsed into multi-word keywords.
STEP 1: Construct a graph from part-of-speech (PoS) tags. Scan sentences to construct a graph of relevant morphemes (the smallest grammatical units of a language).
STEP 2: Run TextRank to determine the keywords. Run N iterations of the TextRank algorithm, or stop once the standard error converges below a given threshold, in this case 0.005. Damping factor = 0.85; threshold = 0.005. Iterate through the graph, calculate the ranks, then sort and mark the top vertices.
STEP 3: Lemmatize the selected keywords and phrases. Test in WordNet the lexical value of nouns, adjectives, and collocations, and augment the graph with n-grams added as vertices.
STEP 4: Re-run TextRank in the post-processing phase.
STEP 5: Construct a metric space for the overall ranking (from WordNet: n-gram rank, link rank, count rank, and rank of the set of synonyms).
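The following is a condensed sketch of Steps 1–4, assuming NLTK for tokenization and part-of-speech tagging, a co-occurrence window of 2, and a syntactic filter keeping only nouns and adjectives; the actual implementation (WordNet checks, n-gram augmentation, and the metric space of Step 5) is not reproduced here.

```python
# Condensed sketch of the keyword-extraction steps above.
# Assumptions: NLTK POS tagger, co-occurrence window of 2, nouns/adjectives only.
# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')
from collections import defaultdict
import nltk

def textrank_keywords(text, top_n=10, d=0.85, threshold=0.005):
    # STEP 1: build a graph of candidate words that pass the syntactic filter.
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)
    candidates = [w for w, t in tagged if t.startswith(("NN", "JJ")) and len(w) >= 3]
    graph = defaultdict(set)
    for i, w in enumerate(candidates):
        for v in candidates[i + 1:i + 3]:           # co-occurrence window of 2
            if v != w:
                graph[w].add(v)
                graph[v].add(w)                     # undirected: In(Vi) = Out(Vi)

    # STEP 2: iterate the TextRank scores until they change less than the threshold.
    score = {w: 1.0 for w in graph}
    for _ in range(100):                            # cap iterations for safety
        converged = True
        for w in graph:
            new = (1 - d) + d * sum(score[v] / len(graph[v]) for v in graph[w])
            if abs(new - score[w]) > threshold:
                converged = False
            score[w] = new
        if converged:
            break

    # STEPS 3-4 (simplified): keep the top-scoring words; adjacent keywords in
    # the original token stream would then be collapsed into multi-word phrases.
    return sorted(score, key=score.get, reverse=True)[:top_n]

print(textrank_keywords("Net neutrality means internet service providers treat all data equally."))
```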
Comment 1: Dear FCC My name is Ty Grove and I live in Isla Vista CA. Net neutrality the principle that Internet service providers ISPs treat all data that travels over their networks equally is important to me because without it ISPs could have too much power to determine my Internet experience by providing better access to some services but not others. A paytoplay Internet worries me because ISPs could act as the gatekeepers to their subscribers. The Internet is important to me because as a college student I need to know that there will not be barriers to entry for the new ideas and services that I hope to bring to the marketplace. If ISP subscribers have an easier time loading websites of existing companies than my new innovative product there's no way that I will be able to compete or succeed. Sincerely Ty Grove
Key words: net neutrality, internet service providers isps, paytoplay internet, internet experience, internet, easier time, new innovative product, services, new ideas, isp subscribers, better access, ty grove, isla vista, college student, much power
Figure 6: Number of Iterations
Comment 2: I would like to express my outrage at this document. When I first started reading it I was under the belief that reclassifying ISP as Common carriers would at least be considered in the proposal for rule making. In paragraph 118 you make it very clear that you will not be doing this. I feel with out even the disscussion this document just contributes to the problem and will be the last nail in the coffin of net neutrallity.
Key words: rule making, document, net neutrallity, last nail, common carriers, reclassifying isp
Comment 3: I value the current system of internet service that does not afford people different speeds of internet. I don’t support net neutrality rules that would change this.
Key words: net neutrality rules, people different speeds, internet service
Comment 4: These cable companies have been operating in an obvious debatably illegal monopolyfashion that allows them to exploit consumers. Giving them more power and legally allowing them to control and profit from offering faster access to specific paying websites and companies is an agregrious and ridiculous action to make. I am tired of seeing big money entities muscle their way into taking advantage of the people and it is the government and its institutions’ YOU jobs to protect the people. Governments are representative of the people and their interests and if nearly 50 000 comments are not enough look to the thousands of emails and articles all across the currentlyfree web to find what the people want and deserve. We want a free market and a free internet. Please do not give these companies the power to exploit us. Thank you.
Key words: ridiculous action, more power, currentlyfree web, people, faster access, free internet, free market, cable companies, big money entities muscle, government, obvious debatably illegal monopolyfashion, companies, power
The TextRank algorithm does a good job of identifying the key words and phrases in the text. The best part is that we do not need training data, and the method is not domain-specific: it can be applied to any subject.
Of course, there are many other such algorithms, including LDA, which we explored using the Stanford NLP toolbox. In the future, we propose to use a combination of such techniques and find innovative ways to merge them to get the best out of each.
[1] Adison Wongkar and Christoph Wertz, "What are People Saying about Net Neutrality," project at Stanford University.
[2] "Unsupervised Feature Selection for Text Data," Springer. Web. 3 June 2015.
[3] Rada Mihalcea and Paul Tarau, "TextRank: Bringing Order into Texts," Department of Computer Science, University of North Texas.
[4] Dan Jurafsky and Christopher Manning, Natural Language Processing, Coursera. https://www.coursera.org/course/nlp
[5] Stanford CoreNLP. http://nlp.stanford.edu/software/corenlp.shtml
[6] Wikipedia, "PageRank." http://en.wikipedia.org/wiki/PageRank