BIG DATA
Once the data you identified is gathered and imported, your next step is to make it analysis-ready. This is where the process of Data Wrangling, or Data Munging, comes in. Data Wrangling is an iterative process that involves data exploration, transformation, and validation.
Transformation of raw data includes the tasks you undertake to:
Structurally manipulate and combine the data using Joins and Unions.
Normalize data, that is, restructure it to reduce redundancy and remove unused data, so that each piece of information is stored only once.
Denormalize data, that is, combine data from multiple tables into a single table so that it can be queried faster.
Clean data, which involves profiling data to uncover quality issues, visualizing data to spot outliers, and fixing issues such as missing values, duplicate data, irrelevant data, inconsistent formats, syntax errors, and outliers.
Enrich data, which involves considering additional data points that could add value to the existing data set and lead to a more meaningful analysis.
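The transformation steps above can be sketched with pandas (one of the Python tools mentioned later); the two tables and their column names are invented purely for illustration:

```python
import pandas as pd

# Two small hypothetical tables
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2, 2],
                       "amount": [10.0, None, 5.0, 5.0]})

# Join: combine the tables on a shared key (a denormalized view)
combined = orders.merge(customers, on="cust_id", how="left")

# Union: stack rows from another batch of orders
more_orders = pd.DataFrame({"cust_id": [3], "amount": [7.5]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)

# Clean: drop exact duplicate rows and fill missing amounts with 0
cleaned = all_orders.drop_duplicates().fillna({"amount": 0.0})

# Enrich: add a derived column that could aid later analysis
cleaned["is_large"] = cleaned["amount"] > 6.0
```

Each step is independent, which reflects the iterative nature of wrangling: you can profile, fix, and re-check the data as many times as needed.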
A variety of software and tools are available for the Data Wrangling process. Some of the popularly used ones include Excel Power Query, Spreadsheets, OpenRefine, Google DataPrep, Watson Studio Refinery, Trifacta Wrangler, Python, and R, each with their own set of characteristics, strengths, limitations, and applications.
IBM DATA ANALYTICS COURSE - COURSERA - 2021 COPYRIGHTS
A Data Repository is a general term that refers to data that has been collected, organized, and isolated so that it can be used for reporting, analytics, and also for archival purposes.
The different types of Data Repositories include:
Databases, which can be relational or non-relational, each differing in the organizational principles they follow, the types of data they can store, and the tools that can be used to query, organize, and retrieve that data.
Data Warehouses, that consolidate incoming data into one comprehensive storehouse.
Data Marts, that are essentially sub-sections of a data warehouse, built to isolate data for a particular business function or use case.
Data Lakes, that serve as storage repositories for large amounts of structured, semi-structured, and unstructured data in their native format.
Big Data Stores, that provide distributed computational and storage infrastructure to store, scale, and process very large data sets.
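As a minimal illustration of the relational kind of repository, the sketch below uses Python's built-in sqlite3 module; the table and its columns are invented for the example:

```python
import sqlite3

# An in-memory relational database: data lives in typed rows and columns
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("West", 250.0), ("East", 50.0)])

# SQL is the standard tool for querying and aggregating relational data
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
# rows now holds one aggregated total per region
```

A data warehouse or data mart exposes the same row-and-column querying model, just at a larger, consolidated scale.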
ETL, or Extract, Transform, and Load, is an automated process that converts raw data into analysis-ready data by:
Extracting data from source locations.
Transforming raw data by cleaning, enriching, standardizing, and validating it.
Loading the processed data into a destination system or data repository.
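A toy end-to-end version of those three steps, assuming a CSV source and a SQLite destination (the file contents, field names, and schema are all hypothetical):

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (an in-memory CSV here)
raw = io.StringIO("name,age\nAna,34\nBen,\nAna,34\n")
records = list(csv.DictReader(raw))

# Transform: clean by dropping incomplete rows and duplicates,
# and standardize by trimming names and typing ages as integers
seen, clean = set(), []
for r in records:
    if not r["age"]:
        continue                      # discard rows missing a value
    key = (r["name"], r["age"])
    if key in seen:
        continue                      # discard duplicate rows
    seen.add(key)
    clean.append((r["name"].strip(), int(r["age"])))

# Load: write the processed rows into the destination repository
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", clean)
```

In production the same shape holds; only the source, the transformation rules, and the destination grow more elaborate.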
A Data Pipeline, a term sometimes used interchangeably with ETL, encompasses the entire journey of moving data from the source to a destination data lake or application, and can use the ETL process to do so.
Big Data refers to the vast amounts of data that is being produced each moment of every day, by people, tools, and machines. The sheer velocity, volume, and variety of data challenge the tools and systems used for conventional data. These challenges led to the emergence of processing tools and platforms designed specifically for Big Data, such as Apache Hadoop, Apache Hive, and Apache Spark.
In this digital world, everyone leaves a trace. From our travel habits to our workouts and entertainment, the increasing number of internet-connected devices that we interact with on a daily basis record vast amounts of data about us. There's even a name for it: Big Data. Ernst and Young offers the following definition: big data refers to the dynamic, large, and disparate volumes of data being created by people, tools, and machines. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to drive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value. There is no one definition of big data, but there are certain elements that are common across the different definitions, such as velocity, volume, variety, veracity, and value. These are the V's of big data. Velocity is the speed at which data accumulates.
Data is being generated extremely fast in a process that never stops. Near or real-time streaming, local, and cloud-based technologies can process information very quickly. Volume is the scale of the data or the increase in the amount of data stored. Drivers of volume are the increase in data sources, higher-resolution sensors, and scalable infrastructure. Variety is the diversity of the data. Structured data fits neatly into rows and columns in relational databases, while unstructured data is not organized in a predefined way, like tweets, blog posts, pictures, numbers, and video. Variety also reflects that data comes from different sources: machines, people, and processes, both internal and external to organizations.
Drivers are mobile technologies, social media, wearable technologies, geo-technologies, video, and many, many more. Veracity is the quality and origin of data and its conformity to facts and accuracy. Attributes include consistency, completeness, integrity, and ambiguity. Drivers include cost and the need for traceability. With the large amount of data available, the debate rages on about the accuracy of data in the digital age. Is the information real, or is it false? Value is our ability and need to turn data into value. Value isn't just profit. It may have medical or social benefits, as well as customer, employee, or personal satisfaction.
The main reason that people invest time to understand big data is to derive value from it. Let's look at some examples of the V's in action. Velocity. Every 60 seconds, hours of footage are uploaded to YouTube, which is generating data. Think about how quickly data accumulates over hours, days, and years. Volume.
The world population is approximately 7 billion people, and the vast majority are now using digital devices: mobile phones, desktop and laptop computers, wearable devices, and so on. These devices all generate, capture, and store data, approximately 2.5 quintillion bytes every day. That's the equivalent of 10 million Blu-ray discs. Variety. Let's think about the different types of data: text, pictures, film, sound, health data from wearable devices, and many different types of data from devices connected to the Internet of Things. Veracity. Eighty percent of data is considered to be unstructured, and we must devise ways to produce reliable and accurate insights.
The data must be categorized, analyzed, and visualized. Data scientists, today, derive insights from big data and cope with the challenges that these massive data sets present. The scale of the data being collected means that it's not feasible to use conventional data analysis tools, however, alternative tools that leverage distributed computing power can overcome this problem.
Tools such as Apache Spark and Hadoop and its ecosystem provide ways to extract, load, analyze, and process the data across distributed compute resources, providing new insights and knowledge. This gives organizations more ways to connect with their customers and enrich the services they offer. So next time you strap on your smartwatch, unlock your smartphone, or track your workout, remember: your data is starting a journey that might take it all the way around the world, through big data analysis, and back to you.
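The split-process-combine idea behind those distributed tools can be sketched in miniature with plain Python. This is a conceptual analogy to the map-reduce pattern Hadoop popularized, not the Hadoop or Spark API itself:

```python
from collections import Counter
from functools import reduce

# A tiny "data set" split into chunks, one per notional worker
lines = ["big data is big", "data is everywhere"]

# Map: each worker independently counts words in its own chunk
partial_counts = [Counter(line.split()) for line in lines]

# Reduce: merge the per-worker results into one combined answer
total = reduce(lambda a, b: a + b, partial_counts, Counter())
```

In a real cluster, the map step runs in parallel on machines that hold the data, and the reduce step merges results over the network; the logic, however, is exactly this shape.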