This data is from the Internet Movie Database (IMDb). The data can be found at: ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/
The specific files downloaded for this project are:
Since the data comes in GZIP format, I unzipped it in the following manner:
1. I downloaded the program WinSCP
a. found at https://winscp.net/eng/index.php
2. I connected to an external server at my school that hosts file extraction software
3. I connected to the systems1 server at UIC over SSH from Windows PowerShell
4. On systems1, I unzipped the .gz files using the command
a. gunzip <filename.gz>
5. I then transferred the unzipped files back to my PC through WinSCP
After some research, I found that 7-Zip, an open source program available at https://www.7-zip.org/, could also be used to extract the zipped files needed for this project.
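The same extraction can also be done without a remote server by using Python's standard library. The following is a minimal sketch, not the method used for this project: the function name gunzip_file and the file name in the usage comment are hypothetical, and it assumes the archive name ends in .gz.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, out_path=None):
    """Decompress a .gz file and return the path of the extracted file.

    Unlike the gunzip command, the original .gz file is left in place.
    """
    gz_path = Path(gz_path)
    # Default output name: the input name with the trailing ".gz" removed.
    out_path = Path(out_path) if out_path else gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Copy in chunks so a very large file never has to fit in memory.
        shutil.copyfileobj(src, dst)
    return out_path


# Usage (hypothetical file name):
# gunzip_file("movies.list.gz")
```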
1. Removed all text before line 14
MOVIES LIST
===========
2. Removed line 4697742
--------------------------------------------------------------------------------
3. Saved the file for further preprocessing in the Python script movie_pre.py
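The manual trimming steps above could also be scripted. This is a rough sketch under stated assumptions, not part of movie_pre.py: the function name trim_list_file and the file names in the usage comment are hypothetical, all line numbers are taken to refer to the original, untrimmed file, and the files are assumed to be Latin-1 encoded.

```python
def trim_list_file(src_path, dst_path, first_kept_line, drop_lines=(),
                   encoding="latin-1"):
    """Copy src_path to dst_path, skipping the header and selected lines.

    Lines numbered below first_kept_line are dropped, as is every line
    whose number appears in drop_lines. Numbers are 1-indexed and refer
    to the original file (an assumption). Returns the count of lines kept.
    """
    drop = set(drop_lines)
    kept = 0
    # newline="" preserves the original line endings on both read and write.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for number, line in enumerate(src, start=1):
            if number < first_kept_line or number in drop:
                continue
            dst.write(line)
            kept += 1
    return kept


# Usage (hypothetical file names), mirroring the steps above:
# trim_list_file("movies.list", "movies_trimmed.list",
#                first_kept_line=14, drop_lines=[4697742])
```

Reading line by line keeps memory use flat, which matters for files with several million lines.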
1. Removed all text before line 14
CERTIFICATES LIST
=================
2. Removed line 924919
------------------------------------------------------------------------------
3. Saved the file for preprocessing in a Python script
1. Removed text above line 14
RUNNING TIMES LIST
==================
2. Removed line 1517423
--------------------------------------------------------------------------------
3. Saved the file for further preprocessing in the Python script running_pre.py
Genres.List
1. Removed all text above line 383
THE GENRES LIST
==================
2. Saved the file for further preprocessing in the Python script genres_pre.py
1. Removed all text above line 101936
a. Found this line through the Find option under the Search tab
THE KEYWORDS LIST
====================
b. The text can be removed with the Begin/End Select option under the Edit tab
i. Click on line 101936 and select this option
ii. Go back to the beginning of the file and select line 1
iii. Press the Backspace or Del key to remove all the highlighted lines
2. Saved the file for further preprocessing in the Python script keywords_pre.py
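Instead of searching for the header line by hand in an editor, the cut point could be located by its text. The sketch below is an illustration only and is not taken from keywords_pre.py: the function name drop_text_above_marker and the file names in the usage comment are hypothetical, it assumes the first line containing the marker text is the intended cut point, and it assumes Latin-1 encoding.

```python
def drop_text_above_marker(src_path, dst_path, marker, encoding="latin-1"):
    """Copy src_path to dst_path, starting at the first line containing marker.

    The marker line itself is kept; only the text above it is removed.
    Returns the 1-indexed line number where the marker was found, or
    raises ValueError if the marker never appears.
    """
    marker_line = None
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for number, line in enumerate(src, start=1):
            if marker_line is None and marker in line:
                marker_line = number
            if marker_line is not None:
                dst.write(line)
    if marker_line is None:
        raise ValueError("marker not found: %r" % marker)
    return marker_line


# Usage (hypothetical file names):
# drop_text_above_marker("keywords.list", "keywords_trimmed.list",
#                        "THE KEYWORDS LIST")
```

The returned line number can be compared against the one found manually as a sanity check.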
1. Removed all text above line 14
RELEASE DATES LIST
==================
2. Removed line 5433159
--------------------------------------------------------------------------------
3. Saved the file for further preprocessing with the Python script release_pre.py
The code is commented so that each step can be followed, and the scripts will preclean the data for you. The following data files were cleaned with these scripts, which are found in the data folder of the project
The comments in preprocessing.R explain, step by step, the reasoning behind what was done to the Python-cleaned txt files.
The file processing.R then does further processing to create the data structures used in the Shiny app, app.R
Regular Expressions in R
http://www.endmemo.com/program/R/gsub.php
https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html
https://www.youtube.com/watch?v=K8L6KVGG-7o
Regular Expression in Python
https://www.debuggex.com/cheatsheet/regex/python
Join Method in Python
https://www.programiz.com/python-programming/methods/string/join
Writing to File in Python
https://linuxhandbook.com/python-write-list-file/
Adding Tabs in Shiny App