We just talked about data warehousing, so I believe this is a good time to introduce a concept called Bigdata, which is also one of the most discussed and implemented ideas in today's enterprises. Note that I am calling it an idea and not some technology or product. To elaborate and make it simple for you, Bigdata is nothing but the way you handle and process huge amounts of data in an organization. As we already know, data in today's world has been increasing at an exponential rate, and it has to be managed and controlled for various reasons. This idea of managing and processing huge amounts of data is called "Bigdata". You can easily relate it to the data warehouse and ETL concepts we discussed in the previous section: a data warehouse could end up gathering and processing huge amounts of data, for which you need a Bigdata solution. The previous discussion, however, covered only RDBMS, which has real limitations when it comes to managing huge amounts of data.
Before we get into Bigdata and its concepts, let's first talk in general terms, looking at some day-to-day challenges we face with data that is simply big in size:
· Opening a 50 GB Excel file on your desktop
· Sending a 100 MB document over email
· Transferring a 50 GB file via "file transfer" solutions on the network
With these examples I am trying to establish a paradigm in your mind: Bigdata is a term relative to the capability of the system that processes the data. Data can be huge by any measure; what matters is that you, as an organization, can handle and process it successfully, as per your business needs, within the stipulated amount of time.
Before we proceed and define Bigdata further, let's create some hunger for learning by setting up some goals that can be achieved using Bigdata solutions; this will raise our curiosity and help us understand and grasp as much information as possible. So here are some business advantages you get once you employ Bigdata:
Business Insights
In simple words, the ability to look deep into the entire data of the organization and figure out ways to make more profit.
Data Warehousing
You want to collect all the data from all sources in your organization and store it in a central place for reporting, analytics and offloading analytical (OLAP) workload from the production applications. It's just what we learnt in the previous section on data warehousing, but this time it won't be done using the traditional ETL tools and databases; instead it uses specialized Bigdata tools and environments.
Fraud Detection and Prevention
The most familiar use cases involve credit and debit cards: if someone makes a huge transaction from overseas with your card, that transaction may be put on hold, and a customer service agent can call you to confirm whether you are travelling before allowing it. With Bigdata analytics and machine learning you can build much more sophisticated protection against fraud; a toy version of the basic rule is sketched below.
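Here is a minimal sketch of the hold-and-confirm rule just described, assuming made-up field names and an arbitrary amount threshold; a real system would learn such patterns from historical transaction data rather than hard-code them.

```python
# A toy rule-based fraud check; the field names and the threshold are
# illustrative assumptions, not a real fraud engine.

def should_hold(txn, home_country="IN", amount_threshold=50_000):
    """Hold a transaction for confirmation if it is both large
    and made from outside the cardholder's home country."""
    is_overseas = txn["country"] != home_country
    is_large = txn["amount"] >= amount_threshold
    return is_overseas and is_large

print(should_hold({"country": "US", "amount": 80_000}))  # True  -> hold and call
print(should_hold({"country": "IN", "amount": 80_000}))  # False -> allow
```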
Customer 360
A dashboard application can be built to provide a 360-degree view of the customer from a single portal. These dashboards may pull data from varied sources, analyze it and present it to customer service, sales or marketing, or even to the customers themselves.
Recommendation Engines
AI and ML tools can utilize Bigdata to analyze and filter user activity and provide recommendations to the user beforehand. You can see this in action in some daily-used apps such as WhatsApp, where you get the suggested word in your native language before you even type the entire sentence. All these abilities arise out of Bigdata and analytics.
Not offering such features to your customers may result in losing them to your competition, and you may also lose out on upsell and cross-sell opportunities.
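To make the word-suggestion example a little more concrete, here is a minimal sketch that suggests the next word purely from frequency counts over a tiny made-up history; production keyboards use far more sophisticated language models trained on massive datasets.

```python
from collections import Counter, defaultdict

# Tiny made-up typing history; real systems learn from huge usage data.
history = "good morning good night good morning to you"
words = history.split()

# Count which word tends to follow which.
following = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    following[current][nxt] += 1

def suggest(word):
    """Suggest the word most often typed after `word`, if any."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(suggest("good"))   # 'morning' -- seen twice, vs 'night' once
print(suggest("hello"))  # None -- no history for this word
```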
Bigdata is often characterized using three Vs: Velocity, Volume and Variety. Let's look at each of them.

Velocity
Velocity is the speed at which data comes in, something like TPS (transactions per second) in transaction processing. By TPS I mean that every system within the ecosystem should be capable of processing the data at that speed. Take the example of IMPS: you have lots of data coming in at very high speed from many different systems. If the data is coming in at 500 transactions per second, then each system from start to end should be capable of processing 500 TPS; if even a single system does not match that speed, you will not achieve that level of performance overall, as the small sketch below shows.
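A minimal sketch of that bottleneck rule: the end-to-end throughput of a processing pipeline is capped by its slowest stage. The stage names and TPS figures here are illustrative assumptions.

```python
# End-to-end throughput is capped by the slowest stage in the pipeline
# (stage names and TPS figures are illustrative assumptions).
stage_tps = {
    "api_gateway": 800,
    "core_banking": 500,
    "fraud_check": 350,   # the bottleneck
    "settlement": 600,
}

bottleneck = min(stage_tps, key=stage_tps.get)
print(f"End-to-end throughput: {stage_tps[bottleneck]} TPS "
      f"(limited by {bottleneck})")
# Even though most stages can handle 500+ TPS, the pipeline as a
# whole can only sustain 350 TPS.
```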
Volume
Volume means the amount of data being collected every second, every hour or every day. Data can come into the organization through many different channels: the internet, the intranet, APIs, SFTP, emails, anything. Once data is coming in from so many channels, you need a proper mechanism to accumulate, store and process it and to derive meaningful business insights from it.
Variety
Data comes in structured and unstructured forms, and it is mostly structured data that is used in decision making. I will dwell mostly on structured data, as it is easier to explain and understand. Structured data basically means data that is easily searchable and has a defined schema (a table with fixed rows and columns) into which you can write and from which you can read in a systematic manner. Unstructured data is data that cannot be easily searched, such as videos, audio, logs, etc. For example, suppose we have a PIM solution which records videos of administrators performing various activities. If I want to trace a setting that was changed on an application server by an admin, I cannot just type into the PIM search box "retrieve the video where the xyz setting was changed"; that is not possible, because the recording is unstructured data.
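A minimal sketch of that difference, using made-up records: structured rows can be filtered by a named field in one line, while a raw video is just opaque bytes with nothing to query against.

```python
# Structured data: a defined schema, so querying by field is trivial.
# (The records below are made up for illustration.)
config_changes = [
    {"admin": "alice", "server": "app01", "setting": "xyz", "new_value": "off"},
    {"admin": "bob",   "server": "app02", "setting": "abc", "new_value": "2"},
]
print([row for row in config_changes if row["setting"] == "xyz"])

# Unstructured data: a session recording is just opaque bytes; there is
# no "setting" field to filter on, so the same search is impossible
# without first extracting structure from it (e.g., via OCR or ML).
video_bytes = b"\x47\x40\x11\x10..."  # raw video content (placeholder)
```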
To give you a sense of the scale at which data is generated today, here are some commonly quoted examples:
· An Airbus generates 100 to 500 TB of data in one flight.
· Camera phones and smartphones generate enormous amounts of data.
· Internet traffic keeps growing, with countless servers coming online every minute.
· Facebook generates 500-plus TB of data daily.
· Scientific facilities can generate 40 TB per second.
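To put a figure like 500 TB per day in perspective, here is a quick back-of-the-envelope calculation of the sustained ingest rate it implies:

```python
# Back-of-the-envelope: what does 500 TB per day mean per second?
tb_per_day = 500
gb_per_second = tb_per_day * 1000 / (24 * 60 * 60)  # using 1 TB = 1000 GB
print(f"{gb_per_second:.2f} GB ingested every second")
# ~5.79 GB/s, sustained around the clock -- far beyond what a single
# traditional database server is designed to absorb.
```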
Sample Data Types
· Videos
· Audio
· Images
· Photos
· Text messages
· Logs
· Emails
· Transactions
· Documents
· Repositories
Well, by now you should be able to understand what exactly Bigdata is and what the purpose of having it is. But one question that comes to mind is: why can't we use a traditional ETL application with a traditional SQL database to manage Bigdata? Let's try to figure out the answer to this question.
Limitations with RDBMS
· Vertical scaling only (no easy horizontal scale-out)
· Limited parallelism
· Performance at very large data volumes
· Storing and analyzing unstructured data
I will explain the major limitations, and the rest you will understand automatically. Even the best RDBMS in the industry can only scale up to a specific limit, because it depends on the hardware resources of the server it runs on. For example, a database server with the highest level of configuration might have around 60 cores and 2 TB of RAM, with shared storage in a two-node cluster sharing the load.
When you talk about Bigdata, we are looking at data that can easily cross 100 TB, which the above hardware configuration definitely cannot handle and process. Even a simple SELECT * query with a filter is going to take a very long time on a 100 TB database, and the other hassles of database management, backups, encryption and security would pose challenges beyond our control.
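To see why, here is a rough back-of-the-envelope estimate of how long just reading 100 TB would take on a single server; the disk throughput figure is an assumption for illustration.

```python
# Back-of-the-envelope: a full scan of 100 TB on one server, assuming
# ~200 MB/s sequential read from storage (illustrative figure).
data_tb = 100
read_mb_per_s = 200

total_mb = data_tb * 1000 * 1000  # 1 TB = 1,000,000 MB
seconds = total_mb / read_mb_per_s
print(f"~{seconds / 3600:.0f} hours (~{seconds / 86400:.1f} days)")
# ~139 hours, i.e. almost 6 days, just to read the data once --
# before any filtering, joining or aggregation even happens.
```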
So how do we tackle such a situation, where we need to process thousands of terabytes of data and still maintain performance?
Let's take a very simple scenario: a real-life problem faced in an RDBMS, and how Bigdata solutions can resolve it. Imagine there is a huge table in the RDBMS which is taking a lot of time to parse due to its size of 9 million records.
So how do Bigdata tools solve this problem? The answer is the "distributed file system" and "distributed computing"; here I have taken the Hadoop architecture to explain.
Any Bigdata solution would break this huge database file into multiple smaller pieces and distribute them across the different nodes that are part of the Bigdata cluster, and each node would perform the processing individually on its own subset of the files. So, in the graphic shown, instead of parsing 9 million records, each server has to process only 3 million records and produce the required output.
How Bigdata solves the problem by using “Distributed Computing” & “Distributed File System”
Massively parallel software runs on multiple commodity computers, each having its own compute and storage; the pieces are processed individually on each of the nodes, and the results are then combined and returned to the front end. So the secret sauce of this technology is the "distributed file system" and "distributed computing". The sketch below shows the pattern in miniature.
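Here is a minimal single-machine sketch of that split-process-combine pattern, using Python's multiprocessing pool to stand in for the cluster nodes; the 9-million-record dataset and the counting task are illustrative assumptions, and real frameworks such as Hadoop MapReduce run this across many machines with the data already distributed.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Each 'node' processes only its own piece: here, counting
    the records that match a filter (id divisible by 7)."""
    return sum(1 for record_id in chunk if record_id % 7 == 0)

if __name__ == "__main__":
    total_records = 9_000_000
    num_nodes = 3
    size = total_records // num_nodes

    # Split the 9 million records into 3 pieces of 3 million each,
    # the way a distributed file system spreads blocks across nodes.
    chunks = [range(i * size, (i + 1) * size) for i in range(num_nodes)]

    # Process all pieces in parallel, then combine the partial results.
    with Pool(num_nodes) as pool:
        partial_counts = pool.map(process_chunk, chunks)

    print(partial_counts, "->", sum(partial_counts))
```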
Phew, that was a lot to digest, and I can understand your condition, but this is how a Bigdata solution works. Always remember: "Distributed Computing" and "Distributed File System" are the key concepts of any Bigdata solution.
I will not be indulging in any specific Bigdata tools here, because if you have understood this section then you can easily understand any Bigdata tool.
Hadoop is one of the major Bigdata suites used in the industry, and nowadays Bigdata tools are also available from public cloud providers such as Microsoft Azure, AWS and Google.