Data Transformation and Visualization
Data science heavily involves cleansing, shaping, and formatting data before any analysis can begin. Data scientists typically spend more of their time finding and preparing data than refining models. Business analysts often wait weeks for their IT team to extract data from source systems and curate relevant datasets before meaningful interrogation can take place.
Powerful Data Transformation Libraries
R and Python offer powerful libraries for exploring your data and specifying a series of operations that transform it into the format, shape, and definition you prefer. These transformations remove many of the limitations of working with spreadsheets, which generally cannot cope with datasets that are too large to fit in memory, or even too large to view in any practical sense. The R and Python programming languages also permit more complex operations built on a coherent syntax.
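As a minimal sketch of what such a transformation pipeline looks like in R, the snippet below uses the dplyr package on the nycflights13 flights data (both appear later in this section); the particular grouping and summary chosen here are illustrative assumptions, not one of the book's worked examples.

    # Illustrative dplyr pipeline: average arrival delay by airline
    library(dplyr)
    library(nycflights13)   # supplies the flights dataset used later in this section

    avg_delay <- flights %>%
      filter(!is.na(arr_delay)) %>%            # drop cancelled flights
      group_by(carrier) %>%                    # one row per airline
      summarise(mean_arr_delay = mean(arr_delay),
                n_flights = n()) %>%
      arrange(desc(mean_arr_delay))            # worst on-time performance first

    head(avg_delay)

Pipelines like this read as a sequence of verbs (filter, group, summarise, arrange), which is what makes transformations on large datasets far more manageable than equivalent spreadsheet formulas.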
Data visualization is the graphical representation of information. By making use of charts, graphs, and maps, data visualization tools provide an accessible way to view and appreciate trends, outliers, and patterns that are not always discernible in the raw spreadsheet. Data visualization tools and technologies distill Big Data down to its essential elements, making decision making more empirically grounded and nimble. The data visualization options available in Excel are substantial, especially if you can leverage R or Python. See the amortization example below, where we want to understand the decomposition of principal and interest on a mortgage over time. The BERT add-in for Excel allows spreadsheets to leverage the resources of R.
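As a sketch of how such an amortization chart could be produced in R (the worksheet layout used with BERT is not reproduced here, so treat the loan parameters and column names below as assumptions), one approach is to compute the schedule and stack the two payment components:

    library(ggplot2)
    library(tidyr)

    # Illustrative loan parameters (assumptions, not the book's worksheet values)
    principal <- 300000; annual_rate <- 0.04; years <- 30
    r <- annual_rate / 12; n <- years * 12
    payment <- principal * r / (1 - (1 + r)^-n)   # standard annuity payment formula

    interest <- numeric(n); paid_principal <- numeric(n)
    bal <- principal
    for (m in seq_len(n)) {
      interest[m] <- bal * r                      # interest accrued this month
      paid_principal[m] <- payment - interest[m]  # remainder reduces the balance
      bal <- bal - paid_principal[m]
    }

    schedule <- data.frame(month = seq_len(n), interest, principal = paid_principal)
    long <- pivot_longer(schedule, c(interest, principal),
                         names_to = "component", values_to = "amount")

    # Stacked area chart: interest dominates early payments, principal later
    ggplot(long, aes(month, amount, fill = component)) +
      geom_area() +
      labs(title = "Monthly payment decomposition", x = "Month", y = "Amount")

The resulting chart makes the crossover point, where principal repayment overtakes interest, immediately visible in a way a table of 360 rows does not.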
In the following few pages we will use the diamonds, mpg, nycflights13, and Titanic datasets and apply key R and Python libraries. We will engage in Exploratory Data Analysis (EDA) to summarize their main characteristics, often with visual methods. EDA draws rich information from data, revealing patterns that can then be investigated further through formal modeling or hypothesis testing. John Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
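As an illustrative first pass of EDA in R (a sketch only, not the specific analysis carried out in the following pages), one might summarize the diamonds dataset and plot a distribution:

    library(ggplot2)   # provides the diamonds and mpg datasets

    # Quick structural and statistical summaries
    str(diamonds)
    summary(diamonds$price)

    # Visual summary: how price varies with cut quality
    ggplot(diamonds, aes(cut, price)) +
      geom_boxplot() +
      labs(title = "Diamond price by cut", x = "Cut", y = "Price (USD)")

A few lines like these often surface outliers, skewed distributions, and surprising relationships that become candidates for formal modeling.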
Tukey's endorsement of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which permitted data scientists to identify outliers, trends, and patterns in data.