Data Transformation and Visualization
Data science heavily involves cleansing, shaping, and formatting data before any analysis can begin. Data scientists typically spend more of their time finding and preparing data than refining models. Business analysts often wait weeks for their IT team to extract data from source systems and curate relevant datasets before meaningful interrogation can take place.
Powerful Data Transformation Libraries
R and Python offer powerful libraries for exploring your data and specifying a series of operations that transform it into the format, shape, and definition you prefer. These transformations remove many of the limitations of working with spreadsheets, which generally cannot cope with datasets that are too large to fit in memory, or even too large to view in any practical sense. The R and Python programming languages also permit more complex operations built on a coherent syntax.
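As a minimal sketch of what such a transformation pipeline looks like in R, the snippet below uses the dplyr package on the nycflights13 flights data (both appear later in this section); the particular grouping and summary chosen here are illustrative assumptions, not one of the book's worked examples.

    # Illustrative dplyr pipeline: average arrival delay by airline
    library(dplyr)
    library(nycflights13)   # supplies the flights dataset used later in this section

    avg_delay <- flights %>%
      filter(!is.na(arr_delay)) %>%            # drop cancelled flights
      group_by(carrier) %>%                    # one row per airline
      summarise(mean_arr_delay = mean(arr_delay),
                n_flights = n()) %>%
      arrange(desc(mean_arr_delay))            # worst on-time performance first

    head(avg_delay)

Pipelines like this read as a sequence of verbs (filter, group, summarise, arrange), which is what makes transformations on large datasets far more manageable than equivalent spreadsheet formulas.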
Data visualization is the graphical representation of information. By making use of charts, graphs, and maps, data visualization tools provide an accessible way to view and appreciate trends, outliers, and patterns that are not always discernible in the raw spreadsheet. Data visualization tools and technologies distill Big Data down to its essential elements, making decision making more empirically grounded and nimble. The data visualization options available in Excel are substantial, especially if you can leverage R or Python. See the amortization example below, where we want to understand the decomposition of principal and interest on a mortgage over time. The BERT add-in for Excel allows spreadsheets to leverage the resources of R.
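As a sketch of how such an amortization chart could be produced in R (the worksheet layout used with BERT is not reproduced here, so treat the loan parameters and column names below as assumptions), one approach is to compute the schedule and stack the two payment components:

    library(ggplot2)
    library(tidyr)

    # Illustrative loan parameters (assumptions, not the book's worksheet values)
    principal <- 300000; annual_rate <- 0.04; years <- 30
    r <- annual_rate / 12; n <- years * 12
    payment <- principal * r / (1 - (1 + r)^-n)   # standard annuity payment formula

    interest <- numeric(n); paid_principal <- numeric(n)
    bal <- principal
    for (m in seq_len(n)) {
      interest[m] <- bal * r                      # interest accrued this month
      paid_principal[m] <- payment - interest[m]  # remainder reduces the balance
      bal <- bal - paid_principal[m]
    }

    schedule <- data.frame(month = seq_len(n), interest, principal = paid_principal)
    long <- pivot_longer(schedule, c(interest, principal),
                         names_to = "component", values_to = "amount")

    # Stacked area chart: interest dominates early payments, principal later
    ggplot(long, aes(month, amount, fill = component)) +
      geom_area() +
      labs(title = "Monthly payment decomposition", x = "Month", y = "Amount")

The resulting chart makes the crossover point, where principal repayment overtakes interest, immediately visible in a way a table of 360 rows does not.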
In the following few pages we will use the diamonds, mpg, nycflights13, and Titanic datasets and apply key R and Python libraries. We will engage in Exploratory Data Analysis (EDA) to summarize their main characteristics, often with visual methods. EDA draws rich information from data, revealing patterns that can then be investigated further through formal modeling or hypothesis testing. John Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
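As an illustrative first pass of EDA in R (a sketch only, not the specific analysis carried out in the following pages), one might summarize the diamonds dataset and plot a distribution:

    library(ggplot2)   # provides the diamonds and mpg datasets

    # Quick structural and statistical summaries
    str(diamonds)
    summary(diamonds$price)

    # Visual summary: how price varies with cut quality
    ggplot(diamonds, aes(cut, price)) +
      geom_boxplot() +
      labs(title = "Diamond price by cut", x = "Cut", y = "Price (USD)")

A few lines like these often surface outliers, skewed distributions, and surprising relationships that become candidates for formal modeling.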
Tukey's endorsement of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which permitted data scientists to identify outliers, trends, and patterns in data.