Data

About Data

This data comes from the Internet Movie Database (IMDb). It can be downloaded from: ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/

The specific files downloaded for this project are:

release-dates.list
running-times.list
certificates.list
genres.list
keywords.list
movies.list
ratings.list
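
As a rough sketch, the same files could be fetched with Python's standard library instead of by hand. This assumes each file is stored on the server under its name plus a .gz suffix (for example movies.list.gz), which is not stated above. The function names and the target folder "data" are made up for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/"

# The seven files used in this project.
FILES = [
    "release-dates.list",
    "running-times.list",
    "certificates.list",
    "genres.list",
    "keywords.list",
    "movies.list",
    "ratings.list",
]


def archive_url(name, base_url=BASE_URL):
    """Build the download URL, ASSUMING the server stores <name>.gz."""
    return base_url + name + ".gz"


def download_all(target_dir="data"):
    """Download every archive into target_dir and return the local paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in FILES:
        local = target / (name + ".gz")
        urlretrieve(archive_url(name), str(local))  # urllib handles ftp:// URLs
        paths.append(local)
    return paths
```

Calling download_all() would place the compressed files in the hypothetical "data" folder, ready for the unzipping step.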

Unzipping Data

The data comes in GZIP format, so I unzipped it as follows:

1. I downloaded the program WinSCP

a. found at https://winscp.net/eng/index.php

2. I connected to an external server at my school which hosts file extraction software

3. I connected to the system1 server at UIC over SSH from Windows PowerShell

4. In systems1 I unzipped the .gz files using the command

a. gunzip <filename.gz>

5. I then transferred the unzipped files back to my PC through WinSCP

After some research, I found that the open source program 7-Zip (https://www.7-zip.org/) could also be used to extract the zipped files needed for this project.
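
Another option, since the later steps already use Python, is the gzip module from the standard library. The following is only a minimal sketch of that alternative, not what was done for this project; the function name gunzip is my own choice.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src):
    """Decompress src (a path ending in .gz) next to itself; return the new path."""
    src = Path(src)
    dst = src.with_suffix("")  # movies.list.gz -> movies.list
    # Stream the data so large files never have to fit in memory.
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

For example, gunzip("movies.list.gz") would write movies.list in the same folder. Unlike the gunzip command, this sketch leaves the original .gz file in place.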

Using Notepad++ to Manually Clean Files

Movies.List

1. Removed all text before line 14

MOVIES LIST

===========

2. Removed line 4697742

--------------------------------------------------------------------------------

3. Saved file for further processing in Python script movie_pre.py
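
The manual steps above could also be scripted. This is a minimal sketch, not part of the project scripts: the function name and the output file name are hypothetical, line numbers are assumed to be 1-based and to refer to the original (unedited) file, and the file is assumed to be readable as latin-1.

```python
def clean_by_line_number(src, dst, first_kept, dropped=(), encoding="latin-1"):
    """Copy src to dst, skipping lines before first_kept and any line in dropped.

    Line numbers are 1-based and refer to the ORIGINAL file (an assumption).
    Returns the number of lines written.
    """
    dropped = set(dropped)
    kept = 0
    # newline="" keeps the original line endings untouched.
    with open(src, encoding=encoding, newline="") as f_in, \
         open(dst, "w", encoding=encoding, newline="") as f_out:
        for number, line in enumerate(f_in, start=1):
            if number < first_kept or number in dropped:
                continue
            f_out.write(line)
            kept += 1
    return kept
```

A call such as clean_by_line_number("movies.list", "movies_clean.list", first_kept=14, dropped={4697742}) would mirror steps 1 and 2. If line 4697742 was counted after step 1 had already removed the first 13 lines, the number passed in dropped would need to be shifted by 13.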

Certificates.List

1. Removed all text before line 14

CERTIFICATES LIST

=================

2. Removed line 924919

------------------------------------------------------------------------------

3. Saved file for preprocessing in python script

Running-Times.List

1. Removed text above line 14

RUNNING TIMES LIST

==================

2. Removed line 1517423

--------------------------------------------------------------------------------

3. Saved for further preprocessing in python script running_pre.py

Genres.List

1. Removed all text above line 383

THE GENRES LIST

==================

2. Saved for further preprocessing in python script genres_pre.py

Keywords.List

1. Removed all text above line 101936

a. Found this line through the Find option under the Search tab

THE KEYWORDS LIST

====================

b. The text can be removed with the Begin/End Select option under the Edit tab

i. Click on line 101936 and select this option

ii. Go back to the beginning of the file and select line 1

iii. Press the Backspace or Del key to remove all the highlighted lines

2. Saved for further preprocessing in python script keywords_pre.py
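
Searching for the header and deleting everything above it can be sketched in Python as well. This is an illustration only: the function name and the keep_marker flag are hypothetical, the header is assumed to sit on a line of its own, and the file is assumed to be readable as latin-1.

```python
def strip_above_marker(src, dst, marker, keep_marker=True, encoding="latin-1"):
    """Copy src to dst starting at the first line whose text equals marker.

    Returns the 1-based line number of the marker, or None if it is absent
    (in which case dst is created empty).
    """
    found = None
    with open(src, encoding=encoding, newline="") as f_in, \
         open(dst, "w", encoding=encoding, newline="") as f_out:
        for number, line in enumerate(f_in, start=1):
            if found is None:
                if line.strip() != marker:
                    continue  # still above the header: drop the line
                found = number
                if not keep_marker:
                    continue  # drop the header line itself
            f_out.write(line)
    return found
```

For keywords.list the call would look like strip_above_marker("keywords.list", "keywords_clean.list", "THE KEYWORDS LIST"), where the output file name is made up. The returned line number can be compared with the one found in Notepad++.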

Release-Dates.List

1. Removed all text above line 14

RELEASE DATES LIST

==================

2. Removed line 5433159

--------------------------------------------------------------------------------

3. Saved for further preprocessing with python script release_pre.py

Preprocessing Data through Python Scripting

The code is commented so that each step can be followed, and the scripts pre-clean the data for you. The data files were cleaned with the scripts found in the data folder of the project.

Preprocessing Data Further in RStudio

The comments in the file preprocessing.R explain, step by step, the reasoning behind what was done to the Python-cleaned txt files.

The file processing.R then does further processing to create the data structures used in the Shiny app, app.R.

References

Regular Expressions in R

http://www.endmemo.com/program/R/gsub.php

https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html

https://www.youtube.com/watch?v=K8L6KVGG-7o

Regular Expressions in Python

https://www.debuggex.com/cheatsheet/regex/python

Join Method in Python

https://www.programiz.com/python-programming/methods/string/join

Writing to File in Python

https://linuxhandbook.com/python-write-list-file/

Adding Tabs in Shiny App

https://www.youtube.com/watch?v=2Oda3LfL_qY
