Data Wrangling Tools

Data wrangling is the process of cleaning, structuring, and enriching raw data to prepare it for analysis. Raw data usually needs cleaning before any meaningful analysis or presentation is possible. Wrangling often involves correcting inconsistencies, handling missing values, and transforming data into a more useful format, which makes it easier to work with and draw insights from.

Think of it as "getting your data in shape" for analysis, much as you would prepare ingredients for a recipe. In any real-world dataset, you are likely to see, at minimum, duplicates, missing values, inconsistent formats, and incorrect ordering.
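As a rough illustration of those four problems, here is a minimal sketch in Python with pandas. The table, its column names (customer, signup, spend), and the choice to fill the missing value with the median are all assumptions made up for this example, not a recommended recipe for every dataset.

```python
import pandas as pd

# Hypothetical raw records: one exact duplicate, one missing value,
# inconsistent name formatting, numbers stored as text, and rows out of date order.
raw = pd.DataFrame(
    {
        "customer": ["Ann", "bob ", "Ann", "Cara", "Dev"],
        "signup": ["2024-03-01", "2024-01-15", "2024-03-01", "2024-02-10", "2024-02-20"],
        "spend": ["10.5", "20", "10.5", None, "7.25"],
    }
)

clean = (
    raw.assign(
        # Inconsistent formats: trim whitespace and normalize capitalization.
        customer=raw["customer"].str.strip().str.title(),
        # Parse text into real dates and numbers.
        signup=pd.to_datetime(raw["signup"], format="%Y-%m-%d"),
        spend=pd.to_numeric(raw["spend"]),
    )
    .drop_duplicates()        # Duplicates: keep the first copy of identical rows.
    .sort_values("signup")    # Ordering: sort by signup date.
    .reset_index(drop=True)
)

# Missing values: fill with the median here (an assumption; dropping the row
# or leaving it missing may be the better choice for your data).
clean["spend"] = clean["spend"].fillna(clean["spend"].median())

print(clean)
```

Each problem is handled by one small, named step, which makes it easy to swap in a different policy (for example, dropping rows with missing values instead of filling them).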

Data wrangling may include any of the following steps:

Exploring
Structuring
Cleaning
Enriching
Validating
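The steps above can be sketched end to end in Python with pandas. This is only one possible reading of each step. The sales and stores tables, their column names, and the rule of dropping rows with no revenue are hypothetical and chosen just to keep the example small.

```python
import pandas as pd

# Hypothetical wide-format sales table: one column per month.
sales = pd.DataFrame(
    {
        "store_id": [1, 2, 3],
        "jan": [100.0, 80.0, None],
        "feb": [120.0, 90.0, 60.0],
    }
)
# Hypothetical lookup table used for enrichment.
stores = pd.DataFrame(
    {"store_id": [1, 2, 3], "region": ["North", "South", "North"]}
)

# Exploring: count missing values per column before changing anything.
missing_per_column = sales.isna().sum()

# Structuring: reshape from wide (a column per month) to long (a row per store and month).
long_df = sales.melt(id_vars="store_id", var_name="month", value_name="revenue")

# Cleaning: drop rows with no revenue recorded (one possible policy).
long_df = long_df.dropna(subset=["revenue"])

# Enriching: attach each store's region from the lookup table.
enriched = long_df.merge(stores, on="store_id", how="left", validate="many_to_one")

# Validating: fail loudly if basic expectations do not hold.
assert enriched["region"].notna().all()
assert (enriched["revenue"] >= 0).all()
assert not enriched.duplicated(subset=["store_id", "month"]).any()

print(missing_per_column)
print(enriched)
```

In practice the steps are rarely this linear; validating often sends you back to cleaning or structuring.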

Data Wrangling Resources

You have lots of options, with none being "right" or "wrong."

Traditional Programming Languages - Python, with the Pandas library, is very commonly used in industry
Statistical Programming Languages - R is perhaps the most common. RStudio is a great tool that lets you interrogate your data and see the direct results of your code
RapidMiner also has some data wrangling resources
Excel (or similar) - This is an option but is not recommended. Excel is known to reformat your data automatically, which can lead to mistakes. A well-known example is gene names mangled by autocorrect: https://theconversation.com/excel-autocorrect-errors-still-plague-genetic-research-raising-concerns-over-scientific-rigour-166554#:~:text=Our%20research%20shows%20autocorrect%20errors,gene%20name%20mangled%20by%20autocorrect.
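Code-based tools can also change values while loading them, so it helps to be explicit about types. The sketch below assumes a small, made-up CSV with a sample_id column that has leading zeros. It shows pandas inferring integers by default, and how passing dtype=str to read_csv keeps every value exactly as written in the file. The gene symbols are there only as examples of the kind of text that spreadsheet autocorrect is reported to mangle.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory so the example is self-contained.
csv_text = "gene,sample_id,expression\nSEPT2,007,1.5\nMARCH1,012,2.0\n"

# Default type inference reads sample_id as integers, dropping the leading zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading every column as text keeps the values exactly as they appear in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Convert only the columns that are meant to be numeric, and do it explicitly.
as_text["expression"] = pd.to_numeric(as_text["expression"])

print(inferred["sample_id"].tolist())
print(as_text["sample_id"].tolist())
print(as_text["gene"].tolist())
```

Loading as text first and converting column by column keeps every type change visible in your code, rather than leaving it to the tool.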

Online Resources and Videos

Towards Data Science - Getting Started With R
Towards Data Science - R Basics
R For Data Science - Online Book
Introduction to Data Science - Chapters 1-6 (R)
Real Python - Data Cleaning in Python
W3Schools - Pandas Tutorial
Alex the Analyst - YouTube Tutorial
Pandas Tutorial Playlist