Welcome to Vinegar Data
Intelligent Analytics to unlock insights from your data. VinegarHill-DataLabs is a learning portal designed to promote data literacy and digital storytelling with mainly R and Python libraries. A community resource.
VinegarHill-DataLabs is a resource set up to promote data literacy, data analytics and digitization. We aim to share skills in data analysis, data transformation, visualization and modeling. The technologies we use include Microsoft Excel VBA, Python and R. (Additional technologies include JavaScript and Google Sheets.) All of these can be used at minimal cost and in principle should be within the reach of students, start-ups and cash-strapped micro-entrepreneurs.
Data should be considered an asset as tangible as a company’s hardware or its headquarters office space. Learning data science allows you to understand and creatively develop Business Intelligence to deal with the ever-expanding flow of information passing through organizations. Investing in data tools helps learners navigate digital technologies and extract value from otherwise dormant resources, often vaulted away in orphaned spreadsheets. Cleansing and shaping tabular data in Python and R, with graphical flow editing capability, can be transformative for teamwork. The cost of harnessing this technology is mainly the human cost of learning the techniques; the rest is already sunk in the repositories of code freely available over the internet. Using interactive templates to code operations, functions and logical operators opens a vista of customization via filtering, sorting, combining or removing columns. Mastery of pivot tables, key to revealing Business Intelligence, is accomplished here using the Pandas library in Python and the dplyr package in R. Developing skills in these operations is typically the first step you will take on your Data Odyssey.
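As a rough illustration of the filtering, sorting and pivot-table operations described above, here is a minimal sketch in Pandas. The `sales` data frame, its column names and its values are hypothetical and invented purely for this example; they are not taken from the portal's tutorials.

```python
import pandas as pd

# Hypothetical sales records, invented purely for illustration.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100, 150, 200, 50, 300],
})

# Filter rows (revenue of at least 100), then sort from largest to smallest.
large_orders = sales[sales["revenue"] >= 100].sort_values(
    "revenue", ascending=False
)

# Pivot table: total revenue with regions as rows and products as columns.
pivot = sales.pivot_table(
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)

print(large_orders)
print(pivot)
```

The same summary could be produced in R with dplyr's `group_by()` and `summarise()`; the Pandas version is shown here only to keep the example in one language.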
No prior skill set is assumed here. Step-by-step instruction from scratch is provided, with audio-visual demonstration of key concepts. The standard data analytics and machine learning techniques applied to business/science domain intelligence are presented here in a hands-on, learn-by-doing style. This portal will help you tune into solving routine business/science problems and communicate that analysis to key decision makers and stakeholders. Importantly, the approach very deliberately relies on Freeware (Python, R, RStudio, RStudio Cloud, Google Sheets and Google Colab) and Nearware (nearly everybody has it anyway), e.g. Excel. The reading materials, code and other learning tools are all non-subscription based. This is important for removing barriers to entry, making analytics tools available and removing speedbumps as an organization scales. RStudio Cloud and Google Colab resources are accessible on all smart screens, from tablets to basic Chromebooks, so hardware should not be a barrier to entry either. This portal is designed to let students, startups and seasoned modelers leverage as fully as possible many extraordinary free resources and obviate the need to manage or renew software licenses.
Founder: Brian Byrne PhD
Learn to find, clean, process and transform data. Apply visualization and transformation to explore data in a systematic way. Engage in exploratory data analysis (EDA) to parse through Big Data. Leverage the graphing tools available in ggplot2, created by Hadley Wickham, to create publication-quality visualizations and reports with minimum fuss. ggplot2 is introduced here and explained from scratch. The ggplot2 syntax, executable in R (and incidentally Python), provides a simplified grammar for producing “elegant graphics for data analysis”. Learn how to create highly nuanced charts simply and extract business/scientific intelligence from vast data frames using a more programmatic and intuitive interface. Learn to specify which variables to plot, display templates, and manipulate general visual properties. We will make extensive use of examples developed in R for Data Science, available on the Tidyverse R page. Some nice lecture notes from Alexandra Chouldechova are also available.
Develop a broad insight into and understanding of data analytics tools and the ability to extract useful knowledge from data. Develop a mastery of basic statistical techniques. Employ basic frameworks like the Normal distribution and Student's t-distribution to understand the preponderance or otherwise of trends or patterns. Establish relationships with statistical confidence. We develop models of qualitative choice with examples drawn from mortgage approval, survivorship and wine quality. Employ basic OLS and random forest modeling for making forecasts. Develop business/scientific intelligence from Machine Learning techniques. Demonstrate real-world applications of Artificial Intelligence being deployed to assess mortgage applications.
Develop basic data analysis in Excel and VBA. We demonstrate how to implement OLS modeling in Excel. We demonstrate how to estimate the value of Employee Stock Options, a non-trivial exercise for startups. We also provide some training from scratch on how to automate the estimation of mortgage repayments using VBA. (If Excel is not your thing, no problem: use JavaScript in Google Sheets.)
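To give a flavour of the mortgage repayment calculation mentioned above, here is a minimal sketch in Python rather than VBA. It assumes a fixed-rate, fully amortizing loan with monthly payments and a nominal annual rate divided evenly across twelve months; the function name `monthly_repayment` and the sample figures are hypothetical and not taken from the portal's own tutorial.

```python
def monthly_repayment(principal, annual_rate, years):
    """Monthly payment on a fixed-rate, fully amortizing loan.

    Assumption: the annual rate is nominal and compounded monthly,
    so the periodic rate is simply annual_rate / 12.
    """
    n_payments = years * 12
    monthly_rate = annual_rate / 12

    # With a zero interest rate the loan is repaid in equal slices.
    if monthly_rate == 0:
        return principal / n_payments

    # Standard annuity formula: P * r / (1 - (1 + r)^-n).
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_payments)


# Hypothetical example: 200,000 borrowed at 6% over 30 years.
payment = monthly_repayment(200_000, 0.06, 30)
print(f"Monthly repayment: {payment:.2f}")
```

The same formula underlies Excel's built-in `PMT` function, so a spreadsheet can be used to cross-check the result.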
Perform data analysis in R and Python. Develop exploratory data analysis and pre-modelling using the R tidyverse and Python Pandas libraries. We develop a series of tutorials to explain some of the powerful data transformation and manipulation features of Pandas. These are excellent for preparing professional-style reports.
Tidyverse R and base R are incredibly powerful and widely used to execute and communicate forecasts, statistical analysis and modelling. The Tidyverse R suite assembles some of the most versatile R packages: ggplot2, dplyr, tidyr, readr, purrr and tibble for visualization and data query. The Pandas, Matplotlib and Seaborn packages available in Python similarly provide a full complement of data query and visualization tools. These cutting-edge packages can be transformative in promoting collaboration and disseminating ideas through data intelligence. In particular, the Tidyverse umbrella package from R can be used to tease out many key areas of data analytics. Tidyverse R can also be installed in Google Colab.
We introduce statistical modelling very gently here by making use of Excel, R and Python. We demonstrate how to estimate basic linear relationships and simple model parameters, and introduce how to estimate model error using the Sum of Squared Residuals, Total Sum of Squares and Explained Sum of Squares. R will also be used to introduce newer forms of statistical modelling: Machine Learning. These tools are available seamlessly in R and Python. We deploy sklearn libraries in Google Colab Python notebooks to model and predict house prices in a training and testing framework. We exploit the Kaggle platform, a free-to-use resource, to access both code and data. In particular, the Titanic Kaggle dataset is presented as a sort of handy "proof of concept" for those new to Machine Learning. We develop an analytical model for predicting survival on board the ill-fated Titanic. We also demonstrate Machine Learning and AI by training models on the HMDA dataset for mortgage origination and vetting. Some examples, techniques and code elaborated by Hal Varian (Chief Economist at Google) for Machine Learning are introduced here and explained in detail. See the link to the Journal of Economic Perspectives for the relevant journal article. Train a Machine Learning algorithm to determine which mortgage applicants would be successful or not. Evaluate varying Machine Learning models using confusion matrices. Predict wine quality and prices using standard regression techniques and random forest.
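The three error measures named above can be sketched for a simple one-variable linear regression as follows. This is a minimal illustration in plain Python, assuming ordinary least squares with an intercept; the function name `ols_decomposition` and the five data points are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def ols_decomposition(x, y):
    """Fit y = intercept + slope * x by ordinary least squares and
    return the fitted parameters with the sums-of-squares breakdown."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Closed-form OLS estimates for a single regressor.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    fitted = [intercept + slope * xi for xi in x]

    # Sum of Squared Residuals: variation the model leaves unexplained.
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    # Total Sum of Squares: variation of y around its mean.
    tss = sum((yi - y_bar) ** 2 for yi in y)
    # Explained Sum of Squares: variation of fitted values around the mean.
    ess = sum((fi - y_bar) ** 2 for fi in fitted)

    return {
        "slope": slope,
        "intercept": intercept,
        "ssr": ssr,
        "tss": tss,
        "ess": ess,
        "r_squared": ess / tss,
    }


# Hypothetical data, invented for illustration.
result = ols_decomposition([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(result)
```

For a model with an intercept, the Total Sum of Squares equals the Explained Sum of Squares plus the Sum of Squared Residuals, which is why the ratio of explained to total variation can serve as a goodness-of-fit measure.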
R has grown into a fully fledged data science programming language, replete with a very active community (see: https://www.kaggle.com & http://www.daveondata.com/ ) and web resources fully primed to go. We follow Professor Orley Ashenfelter of Princeton and investigate wine quality. We also leverage content and study materials hosted by MIT OpenCourseWare and disseminated freely to users. Christoph Hanck, Martin Arnold, Alexander Gerber and Martin Schmelzer have published an online text second to none for modeling techniques, which we draw on extensively here.
Python code can be intuitively executed in Anaconda, in Google Colab and on many other platforms. Significant Python resources and implementations are available from the Python Data Science Handbook. This Google Colab makes use of key Python libraries: NumPy, Pandas, Matplotlib and Scikit-learn. The last of these is one of the most popular libraries for machine learning.
// The code that is provided here is free software; you can redistribute it and/or
// modify it under the terms of the GNU General Public License
// as published by the Free Software Foundation.
// These snippets of code here are distributed in the hope that they will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.