Drake Gibson, Bureau of Labor Statistics; David Oh, Bureau of Labor Statistics

Title: Work Stoppages Web Scrape Project


Abstract:


The Work Stoppages Program provides monthly and annual data and analysis of major work stoppages that involve 1,000 or more workers and last one full shift or longer. The monthly and annual data show the establishment and union(s) involved in the work stoppage, along with the location, the number of workers, and the days of idleness.

David Oh and Drake Gibson were tasked with overhauling and automating the data collection process for the Work Stoppages Program by scraping relevant news articles from the internet. These scraped articles will be reviewed and used to report monthly and annual data and analysis of major work stoppages in the United States. Our prototype collects the data from the internet, processes and categorizes the articles, and finally presents the article data and text to a reviewer for a final determination of relevance.

We begin with RSS feeds that return articles matching search terms such as "work stoppage," "strike," and "labor dispute." From the RSS feeds, we use R to scrape each news article's webpage, clean the article text, and annotate the articles on a daily basis. With the rvest package, we isolate and keep the article text contained in the corresponding HTML tags.
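The collection step above can be illustrated with a minimal sketch. The prototype itself uses R and rvest; the Python below is only a standard-library stand-in, and the function names, the title-based term filter, and the choice of the `<p>` tag as the "corresponding HTML tag" are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Search terms named in the abstract; the real feeds may use more.
SEARCH_TERMS = ("work stoppage", "strike", "labor dispute")


def feed_links(rss_xml, terms=SEARCH_TERMS):
    """Return (title, link) pairs for RSS items whose title mentions a term."""
    root = ET.fromstring(rss_xml)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if any(term in title.lower() for term in terms):
            matches.append((title, link))
    return matches


class ParagraphText(HTMLParser):
    """Collect text found inside <p> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def article_text(html):
    """Return the cleaned paragraph text of an article page, one paragraph per line."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty paragraphs.
    cleaned = (" ".join(p.split()) for p in parser.paragraphs)
    return "\n".join(p for p in cleaned if p)
```

In this sketch, `feed_links` would be applied to the XML of each feed and `article_text` to the HTML fetched from each returned link; navigation menus and scripts are dropped because only paragraph text is kept.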

We annotate the articles using elements such as mentions of union names and the number of workers. The annotated articles are stored in an Excel file and then passed to our front end.
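As a rough illustration of this kind of annotation, the sketch below flags union mentions and extracts worker counts with regular expressions. The union list, the regular expression, and the `annotate` function are hypothetical and are not taken from the prototype; only the 1,000-worker threshold comes from the program's definition of a major work stoppage.

```python
import re

# Hypothetical union names; a real list would be maintained by program staff.
UNION_NAMES = ("United Auto Workers", "Teamsters", "SEIU")

# Matches counts such as "1,200 workers" or "3500 employees".
WORKER_COUNT = re.compile(
    r"\b(\d{1,3}(?:,\d{3})+|\d+)\s+(?:workers|employees|members)\b",
    re.IGNORECASE,
)


def annotate(text, unions=UNION_NAMES, threshold=1000):
    """Return simple annotations for one article's text."""
    lowered = text.lower()
    found_unions = [name for name in unions if name.lower() in lowered]
    counts = [int(m.group(1).replace(",", "")) for m in WORKER_COUNT.finditer(text)]
    max_workers = max(counts, default=0)
    return {
        "unions": found_unions,
        "max_workers": max_workers,
        # Major work stoppages involve 1,000 or more workers.
        "meets_threshold": max_workers >= threshold,
    }
```

Each returned dictionary could then become one row of the annotated spreadsheet, alongside the article's title and link.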

Using Python, we created a front-end user interface in which users can search, view, and annotate scraped articles, applying human scores that we compare against our computer-generated scores. Using TF-IDF, our prototype recommends articles based on the content of the article the user has selected. A link to the article's website and a screenshot of the article are also available, which may help reviewers when a scrape is imperfect or the article is no longer available online.
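A minimal sketch of TF-IDF-based recommendation follows. The abstract does not say which library or weighting variant the prototype uses, so the smoothed IDF, the cosine-similarity ranking, and all function names here are assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(docs, selected, top_n=3):
    """Return indices of the articles most similar to docs[selected]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (i, cosine(vectors[selected], v))
        for i, v in enumerate(vectors)
        if i != selected
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scores[:top_n]]
```

Given a list of article texts and the index of the article a reviewer is viewing, `recommend` returns the indices of the closest matches, which a front end could display as suggested reading.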

We will continue to refine this application and its machine-scoring algorithm based on input from our stakeholders, so that it finds the most relevant, in-scope articles for the Work Stoppages Program.