The reading list contains files typically taken from the Economist, the Wall Street Journal (WSJ), the New York Times, or trade journals. Since copyright law prohibits sharing them, I am not posting them here. This week-by-week synopsis involves two sets of files: the first contains the lectures and accompanying code; the second contains the files on the reading list.
Week 1 and 2: Provides use cases for text analysis in finance, market efficiency, elementary R (and statistics), and beginner Python. Typically runs about 1.5 weeks. Two sets of associated files, public and private. Each year I work with a new textual dataset. In 2011 we worked with WSJ Abreast of the Market articles; in 2012 we worked with Seekingalpha.com articles and SEC filings; in 2013 we plan to work with SEC filings and firms' press releases.
Week 3: Typically one day is spent going over Week 1 material. I introduce word counting technology and the Tetlock (2007) paper. Two sets of associated files, public and private. The private file contains the Harvard IV dictionaries, the Notre Dame dictionary, and two WSJ articles. The Notre Dame dictionaries are available at http://www3.nd.edu/~mcdonald/Word_Lists.html.
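To give a flavor of the dictionary-based counting behind Tetlock (2007), here is a minimal sketch. The word-list format and file names are assumptions for illustration; the actual Harvard IV and Notre Dame files ship in their own formats.

```python
# Minimal sketch of dictionary-based word counting, assuming a plain-text
# word list with one word per line (the real Harvard IV / Notre Dame
# files use their own formats).

def load_dictionary(path):
    """Read one word per line into a set of lowercase words."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def count_hits(text, dictionary):
    """Count tokens in the text that appear in the dictionary."""
    tokens = text.lower().split()
    return sum(1 for t in tokens if t.strip('.,;:!?"') in dictionary)

if __name__ == "__main__":
    negative = load_dictionary("negative_words.txt")   # hypothetical file
    article = open("wsj_article.txt").read()           # hypothetical file
    print(count_hits(article, negative))
```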
Week 4: I introduce how to count words in a document. I also provide examples of text analysis capabilities at two common data providers, Reuters and Bloomberg. The slides include examples from Bloomberg and describe its text analysis technique. I will introduce items from Reuters later in the semester, and at that point I will also explain the differences between these two models of text analysis.
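To make the basic counting step concrete, here is a short sketch of tallying the most frequent words in a document; the file name is a placeholder, not a course file.

```python
# Short sketch of counting word frequencies in a document;
# the file name is a placeholder.
from collections import Counter

with open("article.txt") as f:            # hypothetical file
    tokens = f.read().lower().split()

freq = Counter(t.strip('.,;:!?"') for t in tokens)
print(freq.most_common(10))               # ten most frequent words
```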
Week 5: I have moved the discussion of the Fang and Peress paper, as well as "Models of Information", to Week 6. Week 5 begins by continuing the word counting exercise. I take students through five versions of the word counting program. Version 5 also reports the words that match in a particular document. Students see that some of the "positive" words, such as "company" and "share", are not really positive. That motivates the alternative dictionary by Loughran and McDonald. Students are also introduced to functions in Python. In the latter half of the week, I introduce regular expressions in Python.
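The course's five program versions are not reproduced here; the toy sketch below only illustrates the "version 5" idea of reporting which dictionary words matched, and shows why words like "company" and "share" inflate positive counts. The dictionary and sentence are made up.

```python
# Toy sketch of the "version 5" idea: report which dictionary words
# appear in a document, not just how many. The dictionary is made up.

def matching_words(text, dictionary):
    """Return the set of dictionary words found in the text."""
    tokens = {w.strip('.,;:!?"') for w in text.lower().split()}
    return tokens & dictionary

positive = {"company", "share", "gain", "improve"}   # toy dictionary
doc = "The company said its share price may improve."
print(matching_words(doc, positive))   # {'company', 'share', 'improve'}
```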
Week 6: In the first part, I introduce students to regular expressions in Python. They use regular expressions to get a list of all stories published on the MarketWatch website. They also learn about findall, groups, and the IGNORECASE flag. In the second part of the first lecture, I introduce stemming; the examples are provided in stemp.py, a Python file. In the next lecture I discuss the Fang and Peress paper and introduce students to Fama-French regressions. The lectures and programs are here.
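The class files are not reproduced here; this is a small sketch of the regex features named above (findall, a capturing group, and the IGNORECASE flag) plus Porter stemming via NLTK. The HTML snippet is made up.

```python
# Sketch of findall with a capturing group and the IGNORECASE flag,
# plus Porter stemming via NLTK. The HTML snippet is made up.
import re
from nltk.stem import PorterStemmer

html = ('<a href="/story/stocks-rally">Stocks Rally</a>'
        '<A HREF="/story/bonds-fall">Bonds Fall</A>')

# Capture the story path inside each link; IGNORECASE also matches <A HREF=...>.
links = re.findall(r'<a href="(/story/[^"]+)"', html, re.IGNORECASE)
print(links)   # ['/story/stocks-rally', '/story/bonds-fall']

# Stemming collapses inflected forms to a common root.
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["rallies", "rallying", "rallied"]])
# ['ralli', 'ralli', 'ralli']
```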
Week 7: I continue the discussion of the Fang and Peress paper with emphasis on the Fama-French regression. Thursday is the midterm, so it is a short week. The R file for the Fama-French regression is available. R tricks learned in class: read.table, read.csv, lagging a variable in a data frame, names, head, class, merge, and lm (multiple regression).
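The class code for this regression is the R file referred to above. Purely as an illustration of the same steps in the course's other language, here is a Python analogue using pandas and statsmodels; every file and column name below is an assumption, not a course file.

```python
# Python analogue of the Fama-French three-factor regression done in R;
# all file and column names are assumptions, not the course's files.
import pandas as pd
import statsmodels.formula.api as smf

returns = pd.read_csv("portfolio_returns.csv")   # hypothetical: date, ret, rf
factors = pd.read_csv("ff_factors.csv")          # hypothetical: date, mktrf, smb, hml
data = returns.merge(factors, on="date")         # analogue of R's merge()

data["excess"] = data["ret"] - data["rf"]        # excess portfolio return
model = smf.ols("excess ~ mktrf + smb + hml", data=data).fit()
print(model.summary())                           # analogue of R's lm() summary
```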
Week 8 and Week 9: I introduce event studies, followed by an event study in R. I also discuss the possibility of using event studies to 'sign' words, and the dangers associated with it; for example, "able" is a positive word in an up market and a negative word in a down market. As usual, the lecture and R files are available. The R file is a little sketchy; I hope to improve it over time.
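The course implements the event study in R; the sketch below is only a Python outline of the core step: fit a market model on an estimation window, then cumulate abnormal returns over an event window. The file name, column names, and window lengths are assumptions.

```python
# Python outline of a market-model event study; the R class file is the
# reference. File name, column names, and window lengths are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("stock_and_market.csv")   # hypothetical: date, ret, mktret

estimation = data.iloc[:200]    # estimation window before the event
event = data.iloc[200:211]      # 11-day window around the event

mm = smf.ols("ret ~ mktret", data=estimation).fit()   # market model

# Abnormal return = actual minus predicted; CAR is the running sum.
ar = event["ret"] - mm.predict(event)
print("CAR:", ar.sum())
```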
Week 10: I introduce supervised learning: "tag", "train", and "classify". I start with two examples: Example 1 is Python code for identifying gender from names; Example 2 is Python code for Rotten Tomatoes reviews. Then I introduce Bayesian learning and the Naive Bayes classifier. Finally, I introduce decision trees. Students also learn how to extract stories from MarketWatch, an affiliate of the WSJ. As usual, the material is available here.
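Example 1 is along the lines of the well-known NLTK name-gender demo; the sketch below follows that pattern, using only the last letter of the name as a feature, and assumes the NLTK "names" corpus has been downloaded.

```python
# Sketch of a Naive Bayes name-gender classifier in the style of the
# NLTK book demo; run nltk.download('names') once beforehand.
import random
import nltk
from nltk.corpus import names

def gender_features(name):
    """One crude feature: the last letter of the name."""
    return {"last_letter": name[-1].lower()}

labeled = ([(n, "male") for n in names.words("male.txt")] +
           [(n, "female") for n in names.words("female.txt")])
random.shuffle(labeled)                    # "tag" step: labeled examples

featuresets = [(gender_features(n), g) for n, g in labeled]
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)   # "train"
print(classifier.classify(gender_features("Neo")))        # "classify"
print(nltk.classify.accuracy(classifier, test_set))       # held-out accuracy
```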
Week 11: I show the code blocks required to get data from a website without authentication. In the second part of the week, or perhaps the first part of Week 12, I show how a commercial text analysis system works. Files for downloading text from a website without authentication are available here.
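As a minimal illustration of fetching a page that requires no authentication, here is a Python 3 standard-library sketch; the URL is a placeholder, not necessarily the course's target site.

```python
# Minimal Python 3 sketch of fetching a public page (no authentication);
# the URL is a placeholder.
import urllib.request

url = "http://www.marketwatch.com/"      # placeholder URL
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])   # first few hundred characters of the page
```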
Week 12: I show one option for tagging articles. I also give an example of how Twitter fine-tunes its searches. Later I introduce entity detection and part-of-speech tagging. The material is available here.
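As a hedged sketch of the last two ideas, NLTK provides both a part-of-speech tagger and a simple named-entity chunker; the sentence is made up, and the NLTK data packages must be downloaded first.

```python
# Sketch of part-of-speech tagging and entity detection with NLTK.
# Requires nltk.download() of 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker', and 'words'. The sentence is made up.
import nltk

sentence = "Apple hired Tim Cook in Cupertino."
tokens = nltk.word_tokenize(sentence)

tagged = nltk.pos_tag(tokens)   # (word, part-of-speech) pairs
print(tagged)

tree = nltk.ne_chunk(tagged)    # subtrees mark PERSON, GPE, ORGANIZATION, ...
print(tree)
```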
Week 13: We talked about some important data sources such as waybackmachine.org. We also talked about Amazon Mechanical Turk as a tagging service and OpenCalais as an entity detection service. We used the lecture file from Week 12.
Week 14: We talked about how to distinguish between the effect of news and the effect of sentiment. I also showed how three popular techniques for sentiment identification perform over almost a million documents. The lecture is available on Blackboard. This is essentially my joint work with Steve Heston.
Week 15: I showed how to get data from Twitter. I also showed how to deal with gzipped and CSV files in Python. The associated files are here.
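The Twitter-specific files are the ones linked above; the short sketch below covers only the generic part, reading a gzipped CSV with the standard library. The file name and column layout are assumptions.

```python
# Sketch of reading a gzipped CSV with the standard library;
# file name and column layout are assumptions.
import csv
import gzip

with gzip.open("tweets.csv.gz", "rt", newline="") as f:   # text mode
    reader = csv.reader(f)
    header = next(reader)                 # first row as column names
    for row in reader:
        print(dict(zip(header, row)))     # one record as a dict
        break                             # just show the first record
```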
Week 16: We talked about my work with Elisabeth Newcomb Sinha and Lucija Muehlenbachs. Guest lecture by Elisabeth Newcomb Sinha.
Final Exam: It is a take-home exam. The files are posted here.
Final Projects: This year Brian Thies looked at the distribution of press releases over time. Suhas Nagabhushan examined whether investor sentiment changed following the Lehman debacle.