A good starting point for introducing Data Science to newcomers is the area of visualization and data transformation. These are key tools for writing up reports, developing business plans and presenting ideas. Producing polished graphs and state-of-the-art data dashboards can help project the sophistication that third parties look for. In particular, the tidyverse umbrella package provides a free, enterprise-level set of solutions. The tidyverse suite assembles some of the most versatile R packages: ggplot2, dplyr, tidyr, readr, purrr, and tibble. The packages work in harmony to clean, process, model, and visualize data. Installing the tidyverse in RStudio is straightforward. Please follow the video clip below:
Graphing is highly important for populating written reports and communicating with prospective financial backers, suppliers and clients. ggplot2 is a dedicated data visualization package for R. Hadley Wickham pioneered ggplot2 around 2005; it departed somewhat from base R by dis-aggregating visualization semantics into scales and layers, following the grammar of graphics. ggplot2 can serve as a replacement for, or add-on to, the base graphics in R and contains a number of sensible defaults for web and print display of common scales. It is licensed under the GPL. The ggplot2 package from the tidyverse provides a simplified syntax for producing a sophisticated range of visualizations for small to large datasets. It is ideal for breaking down hard-to-explain concepts and distilling business intelligence into capsule form. ggplot2 offers a highly intuitive grammar whose output is easily projected through dashboards, PDFs, PowerPoint decks and so on. The playlist introduces the basic tools touched on in R for Data Science:
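To make the layered grammar concrete, here is a minimal sketch using the mpg dataset that ships with ggplot2 (so no download is needed); the plot builds up from data, aesthetic mappings and a geom layer:

library(ggplot2)

# Map engine displacement to x, highway mileage to y, car class to colour
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +                                 # layer: scatter of points
  labs(title = "Engine size vs highway mileage",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon")

Each extra layer (a smoother, facets, a theme) is simply added with +, which is the dis-aggregation of semantics referred to above.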
The tidyverse is a collection of packages that can be installed quickly and relatively trouble-free via a single “meta”-package, called “tidyverse”. This provides a convenient way of downloading and installing all tidyverse packages with a unified R command:
install.packages("tidyverse")
The core tidyverse includes the packages that you’re likely to use in everyday data analyses, and these are attached when you attach the tidyverse package:
library(tidyverse)
#> -- Attaching packages ---------------------- tidyverse 1.3.0.9000 --
#> v ggplot2 3.2.1 v purrr 0.3.3
#> v tibble 2.1.3 v dplyr 0.8.3
#> v tidyr 1.0.0 v stringr 1.4.0
#> v readr 1.3.1 v forcats 0.4.0
#> -- Conflicts ------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
The underlying philosophy is regularly presented using the following iconography; the program outline below has become something of a signature of the tidyverse's simplicity and style.
For any enterprise, a mastery of data helps clarify the boundaries and potential of the business. More data can sometimes lead to more confusion unless you have a tool that organizes the key talking points and leverages the vital nuggets of knowledge that you want to convey to peers. dplyr, from the tidyverse suite, likewise offers a simplified syntax for expressing your best ideas through numbers. It is useful for creating tables of summary statistics across specific groups of data, as the sketch below shows. Rarely will your data arrive in exactly the form you anticipate or need in order to analyze it appropriately. As part of the data science workflow, you will need to transform your data. Please follow the video playlist for an introduction to basic commands:
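As a minimal sketch of the grouped summaries described above, again using the mpg dataset bundled with ggplot2 so that the example is self-contained:

library(ggplot2)   # for the bundled mpg dataset
library(dplyr)

# Filter rows, group them, then collapse each group to summary statistics
mpg %>%
  filter(year == 2008) %>%            # keep the most recent model year
  group_by(class) %>%                 # one group per vehicle class
  summarise(n = n(),                  # observations in each group
            avg_hwy = mean(hwy),      # average highway mileage
            avg_cty = mean(cty))      # average city mileage

The same handful of verbs (filter, group_by, summarise, plus mutate, select and arrange) covers most day-to-day transformation tasks.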
To appreciate the full potential of the tidyverse suite, it is worth exploring how the ggplot2, dplyr, tidyr, readr, purrr, and tibble packages sit together. A unified treatment serves to reveal how the different packages spark off one another. I have chosen to set out how the powerful features of the tidyverse combine using the Titanic dataset from Kaggle: https://www.kaggle.com/c/titanic/data . I chose it because most people are already somewhat domain experts, given the pervasiveness of the story in contemporary film and its ability to communicate complex ideas and nuances: https://www.youtube.com/watch?v=kVrqfYjkTdQ The Titanic dataset is also a common staple of most Data Science courses, whether for professional training or degree-based study.
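Here is a brief sketch of how the packages sit together on the Titanic data. It assumes train.csv has been downloaded from the Kaggle page above into the working directory, and the column names (Survived, Sex, Pclass) follow the Kaggle data dictionary:

library(tidyverse)

titanic <- read_csv("train.csv")    # readr parses the file into a tibble

# dplyr: survival rates by sex and passenger class
titanic %>%
  group_by(Sex, Pclass) %>%
  summarise(n = n(), survival_rate = mean(Survived))

# ggplot2: the same pattern as a proportion bar chart, faceted by sex
ggplot(titanic, aes(x = factor(Pclass), fill = factor(Survived))) +
  geom_bar(position = "fill") +
  facet_wrap(~ Sex) +
  labs(x = "Passenger class", y = "Proportion", fill = "Survived")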
In the Titanic Tidyverse 5 video clip below, I introduce some principles of Machine Learning. This branch of Artificial Intelligence relies upon pattern recognition and the premise that computers can learn without being explicitly programmed to perform specific tasks. Models can ingest new data as it surfaces and then independently adapt. Within R there are a number of packages that enable this type of analysis. The more celebrated applications of machine learning are quite recognizable from our contemporary tech landscape:
Data analysis can be viewed through the prism of 1) visualization/summarization, 2) estimation and 3) prediction. To get some sense of how Artificial Intelligence is impacting enterprises in Ireland, you might find the following archive of podcasts interesting. Machine learning is primarily concerned with prediction, but also with finding patterns in the data that can be highly nonlinear. The Titanic dataset can be viewed as analogous to a small business dataset; perhaps that is why it is viewed as a valuable pedagogic tool. The binary outcome of survive/perish is similar to collated binary data where mouse clicks result in sale/no sale, for instance. Micro-entrepreneurs may benefit from applying Machine Learning techniques to garner insights around client decision making and salient influences, including web content and keywords that exert some relevance on a target demographic. To introduce some tools in R that set out visualizations relating to decision trees and conditional inference trees, please follow the link to Hal Varian (chief economist of Google): https://www.aeaweb.org/articles?id=10.1257/jep.28.2.3 In Titanic Tidyverse 5, I implemented in RStudio Hal Varian's R scripts for Machine Learning, which predict and classify the survival of passengers. Check out the video below:
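For readers who want to experiment before watching, the following is a minimal sketch in the spirit of Varian's decision-tree examples rather than his exact scripts; it reuses the titanic tibble from the earlier sketch and assumes the rpart and rpart.plot packages are installed:

library(rpart)
library(rpart.plot)

# Recode the 0/1 outcome as labelled classes for a classification tree
titanic_tree_data <- titanic %>%
  mutate(Survived = factor(Survived, labels = c("Perished", "Survived")))

# Grow a tree predicting survival from passenger class, sex and age
tree <- rpart(Survived ~ Pclass + Sex + Age,
              data = titanic_tree_data, method = "class")

rpart.plot(tree)   # draw the branches and terminal leaves
printcp(tree)      # complexity table, useful when deciding how far to prune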
The Machine Learning exercise implemented above using the Titanic dataset can be viewed as a classification problem. A more mundane example would be how to classify emails as “spam” or “not spam” based on the signature characteristics of the email. This approach, as your own experience might prove, is not infallible. But neither are human beings. Statisticians typically use a range of tools for classification problems, including qualitative choice models like the logit or probit (sketched below). In the video clip above, a decision tree was teased out using the approach described by Hal Varian. These models can be regarded as a sequence of decisions, branching out until they culminate in terminal leaf nodes, each attached to an outcome. A tree classifier has the same general form, but the decision at the end of the process is a choice about how to classify the observation. The goal ultimately is to “grow” a decision tree that produces good out-of-sample predictions or, in more common parlance, ensures that important emails rarely slip into the spam folder.
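For comparison with the tree, here is a short sketch of the qualitative choice models mentioned above, fitted with base R's glm() to the same titanic tibble (no extra packages required):

# Logit and probit models of survival on class, sex and age
logit_fit  <- glm(Survived ~ Pclass + Sex + Age,
                  data = titanic, family = binomial(link = "logit"))
probit_fit <- glm(Survived ~ Pclass + Sex + Age,
                  data = titanic, family = binomial(link = "probit"))

summary(logit_fit)

# Out-of-sample flavour: predicted survival probability for a new case
predict(logit_fit,
        newdata = data.frame(Pclass = 1, Sex = "female", Age = 30),
        type = "response")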
These techniques can be applied across a broad spectrum from business to finance. Businesses generally retain data relating to sales, and a key metric for any business is sales or sales conversion, regardless of venue. In larger organisations, dedicated marketing teams use Machine Learning findings to gain new in-depth insights and optimize marketing strategies directed at targeted consumers. Large-scale businesses traditionally (perhaps up to now exclusively) have had the depth of resources to filter relevant content to engage their target audience. The formula for success can be quite simple: generate content that is compelling to a relevant audience, sometimes in the form of storytelling and anecdotage designed to elicit an action or response. This has been important for the success of many leading businesses today. The cost of implementing these techniques, however, has plummeted in recent years with the widespread availability of code and software.
Python and R contain many free-to-use libraries that support Machine Learning. Digital marketers design strategies to optimize dialogue and develop engagement spanning multiple platforms. This, in turn, drives brand awareness and builds engagement. Machine Learning tools are immensely helpful in analyzing what type of content, keywords, and phrases are most relevant to your desired audience. Here, I have focused on just a small subset of techniques and visualizations that are possible using RStudio, which is generally rated highly as statistical software. Please take a look at some industry ratings: https://www.g2.com/categories/statistical-analysis?tab=highest_rated The major benefit to start-ups and micro-enterprises is that there are no longer financial barriers to entry. The playing field is never level when the Davids confront the Goliaths, but the accessibility of these freeware tools in Python and R takes you a step closer. Take a look at a recent Kaggle competition: https://www.kaggle.com/c/santander-product-recommendation and see the kernels that have been developed to train and validate models through testing.
Economics and Finance
To explore additional worked examples within R using Hal Varian's examples, code and explanations in more detail, please follow the link.