Tweet Pipeline

Task

Twitter can be regarded as a broadcast network. Tweet archives could be processed and distributed in the /tv tree in the normal fashion: that tree is organized with subdirectories by year, sub-sub-directories by month, and sub-sub-sub-directories by UTC day.

There is a public streaming API that provides "a small random sample of all public statuses". Red Hen has consumed this data in its raw JSON format, but would like to convert it to a standard format shared with all of the TV news data.

Given the size of the files, we are currently planning to create one file of tweets per UTC hour. For example, /tv/2018/2018-10/2018-10-07/2018-10-07_1000_WW_Spritzer.twt would contain all the tweets in the public sample stream (known informally as "spritzer") for the one hour from 10:00 (inclusive) to 11:00 (exclusive) UTC on October 7, 2018.
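
The naming scheme above can be sketched in Python as follows. This is only an illustration of the path layout described in the example: the function name hourly_twt_path and its parameters (root, region, source) are hypothetical and not part of any existing Red Hen script.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def hourly_twt_path(moment, root="/tv", region="WW", source="Spritzer"):
    """Return the .twt path for the UTC hour that contains `moment`.

    Hypothetical helper: the defaults mirror the example file name in the
    text, but the function itself is an assumption.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    day = utc.strftime("%Y-%m-%d")
    # The hour is truncated to HH00, so 10:00:00 through 10:59:59 share a file.
    name = f"{day}_{utc:%H}00_{region}_{source}.twt"
    return PurePosixPath(root, utc.strftime("%Y"), utc.strftime("%Y-%m"), day, name)


print(hourly_twt_path(datetime(2018, 10, 7, 10, 59, 59, tzinfo=timezone.utc)))
# /tv/2018/2018-10/2018-10-07/2018-10-07_1000_WW_Spritzer.twt
```

A tweet stamped exactly 11:00:00 UTC falls into the _1100_ file, which matches the inclusive/exclusive hour boundaries described above.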

Each line in the file will correspond to one tweet, and fields will be separated with the | (pipe) character. The exact fields and order are currently being decided, and input is welcome on the GitHub discussion page.

Once the format is decided and the initial data ingested, tweet files could then be processed for sentence segmentation, NLP, Frames, etc., and the results placed in a separate file, perhaps with the extension .twtmeta. This system would need to be designed to be compatible with existing Red Hen data structures and processes. This part of the processing pipeline has not been started, and input and help are welcome! If interested, write to 

 
and Red Hen will try to connect you with a mentor.

Shane Karas and Ram Gullapalli are the initial organizers of this team, which Scott Hale joined in December 2018.

Related links


Recording the public/sample stream. The code used to record tweets will be shared shortly. This code is simple and has only one task: to write content from the stream directly to files on the hard disk. Twitter does not buffer the stream, so no processing occurs in this script. All content is written to disk for later scripts to analyse and format. This pipeline does not aim to be near-real-time, so the later scripts can run in batches.

Formatting and distributing .twt files within the Red Hen file structure. Scott is developing a Python script to handle this step of the process. The code is open source within the GitHub repository.

Open tasks. We want to ingest tweets from other archives in addition to the public/sample stream. The final pipeline should be sufficiently flexible to accommodate a wide variety of input formats.

Other tools.

json-to-csv


Some background for working with JSON tweets in bash:

  1. A tweet can mention another person by Twitter handle, known as screen_name. In that case, the JSON record tags the screen_name and also provides and tags the corresponding real name.
  2. A tweet can be a retweet, with a cascade of info.
  3. A tweet can quote another tweet. More cascades.
  4. Unfortunately, tags are reused. screen_name and real_name are used for everybody involved in the tweet. The JSON record lists the actual top-level author, and the actual diffusion path is not visible.
  5. As a result, lines for tweets do not always have the same number of fields. This would make sophisticated work on fields difficult, and would make statistics on fields difficult.
  6. Red Hen might want to improve the situation by moving what we think is the crucial information to the first N fields, so that each of those N positions always holds a known category, e.g. time of creation of the tweet in UTC, screen name of the tweeter, real name of the tweeter, full text.
  7. Red Hen would then be able to tell her processing scripts (sentence splitter, NLP tagger, Frame tagger, etc.) to ignore everything after the first N fields. Red Hen would also be able to tell a script which field to work on, e.g. split sentences and do NLP tagging on field 4 only.
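
The idea in points 6 and 7 can be sketched as follows. This is a minimal illustration, not the agreed .twt format: the field order is taken from the example in point 6, the sample tweet is invented, and the helper names (clean, to_twt_line, field) are hypothetical. Replacing pipes and line breaks inside values with spaces is also an assumption, made so that each tweet stays on one line with an unambiguous separator.

```python
def clean(value):
    """Make a value safe for a one-line, pipe-separated record.

    Assumption: pipes and line breaks inside a value become single spaces.
    """
    text = "" if value is None else str(value)
    return " ".join(text.replace("|", " ").split())


def to_twt_line(tweet):
    """Put the N = 4 known fields first, then any variable extras."""
    user = tweet.get("user", {})
    leading = [
        tweet.get("created_at"),                    # field 1: creation time
        user.get("screen_name"),                    # field 2: screen name
        user.get("name"),                           # field 3: real name
        tweet.get("full_text", tweet.get("text")),  # field 4: full text
    ]
    # Retweets and quotes add a variable number of trailing fields.
    extras = []
    for key in ("retweeted_status", "quoted_status"):
        nested = tweet.get(key)
        if nested:
            extras.append(key + "=" + str(nested.get("user", {}).get("screen_name", "")))
    return "|".join(clean(v) for v in leading + extras)


def field(line, number):
    """Return one field of a .twt line, counting from 1."""
    return line.split("|")[number - 1]


# Invented sample record, for illustration only.
sample = {
    "created_at": "Mon Jan 01 13:37:52 +0000 2018",
    "full_text": "Hello | world\nsecond line",
    "user": {"screen_name": "example_user", "name": "Example User"},
    "quoted_status": {"user": {"screen_name": "other_user"}},
}
line = to_twt_line(sample)
print(line)
print(field(line, 4))
```

A sentence splitter or NLP tagger would then read only field(line, 4) and could ignore however many trailing fields a retweet or quote adds.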

Manipulating json info into one-tweet-per-line .twt files

Red Hen expects that the json-to-csv library is the place to start, but bash also has many string-manipulation abilities. As an example:
$ cat master_2018.json | python -m json.tool > masterpretty_2018.json 
$ cat masterpretty_2018.json | sed 's/^[ \t]*//' > t1.twt
$ cat t1.twt | grep -v "Wed\ Mar\ 18.*2009" | grep -v "in_reply_to_screen_name" | grep "time_zone\|created_at\|full_text\|retweeted\|screen_name\|^\"name" > t2.twt

[Careful: the command above requires knowing the content of the line that gives the created_at time of the user account, in this case Wed\ Mar\ 18.*2009. That is not a general solution for filtering out that line.]

$ cat t2.twt | sed 's/^.full/\|&/'| sed 's/^.created/\|&/' | sed 's/^.retweeted/\|&/' | sed 's/^.screen_name/\|&/' |sed 's/^.name/\|&/' > t3.twt
$ cat t3.twt | tr -d '\n' | sed 's/\"time_zone/\n&/g' >t4.twt
$ cat t4.twt | sed 's/\(^\"time_zone.*\)\(\"created_at".*[0-9][0-9][0-9][0-9]\"\,\)\(.*\)/\2\1\3/' > t5.twt
$ cat t5.twt | sed 's/\"created_at\":\ \"//' > t6.twt
$ cat t6.twt | sed 's/\"\,\"/\|/' > t7.twt
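
One more general way to separate the tweet's created_at from the account's created_at is to parse the JSON rather than match on text, since the two values sit at different levels of the record. The sketch below assumes the archive is a single JSON array of tweets (as the use of python -m json.tool suggests) and that account details are nested under a "user" key. The function name and the sample values are hypothetical.

```python
import json


def top_level_created_at(archive_text):
    """Return the creation time of each tweet in a JSON array of tweets.

    Only the top-level created_at is read; the account's own created_at,
    nested under "user" (an assumption about the record layout), is ignored.
    """
    tweets = json.loads(archive_text)
    return [tweet["created_at"] for tweet in tweets]


# Invented archive with one tweet, for illustration only.
archive = json.dumps([
    {
        "created_at": "Mon Jan 01 13:37:52 +0000 2018",
        "user": {
            "created_at": "Wed Mar 18 00:00:00 +0000 2009",
            "screen_name": "example_user",
        },
    }
])
print(top_level_created_at(archive))
# ['Mon Jan 01 13:37:52 +0000 2018']
```

Because the lookup is by position in the record, it does not depend on knowing the account creation date in advance.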


Such steps convert a json twitter archive into a file with lines like this:

Mon Jan 01 13:37:52 +0000 2018|time_zone": "Eastern Time (US & Canada)",|"full_text": "Will be leaving Florida for Washington (D.C.) today at 4:00 P.M. Much work to be done, but it will be a great New Year!",||"retweeted": false,|"screen_name": "realDonaldTrump",|"name": "Donald J. Trump",

Mon Jan 01 12:44:40 +0000 2018|time_zone": "Eastern Time (US & Canada)",|"full_text": "Iran is failing at every level despite the terrible deal made with them by the Obama Administration. The great Iranian people have been repressed for many years. They are hungry for food & for freedom. Along with human rights, the wealth of Iran is being looted. TIME FOR CHANGE!",||"retweeted": false,|"screen_name": "realDonaldTrump",|"name": "Donald J. Trump",

Mon Jan 01 12:12:00 +0000 2018|time_zone": "Eastern Time (US & Canada)",|"full_text": "The United States has foolishly given Pakistan more than 33 billion dollars in aid over the last 15 years, and they have given us nothing but lies & deceit, thinking of our leaders as fools. They give safe haven to the terrorists we hunt in Afghanistan, with little help. No more!",||"retweeted": false,|"screen_name": "realDonaldTrump",|"name": "Donald J. Trump",
The date and time in that format are properly converted by the date command:
$ date -ud 'Mon Jan 01 13:37:52 +0000 2018' '+%Y-%m-%d %H:%M %Z %z' 
2018-01-01 13:37 UTC +0000
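
The same conversion can be sketched in Python with the standard library, which may be convenient if the formatting step is written in Python. The function name parse_created_at is hypothetical. Note that %a and %b are locale-dependent, so this assumes an English (or C) locale.

```python
from datetime import datetime, timezone


def parse_created_at(value):
    """Parse a Twitter-style created_at string into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")
    return parsed.astimezone(timezone.utc)


stamp = parse_created_at("Mon Jan 01 13:37:52 +0000 2018")
print(stamp.strftime("%Y-%m-%d %H:%M %Z %z"))
# 2018-01-01 13:37 UTC +0000
```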