by Tal Doron · Mar. 22, 23 · GigaSpaces Technologies · Opinion
The world has been moving steadily and surely toward a ubiquitous digital reality for many years. The normalization of instantly available content and personalized data has sharply driven competition among organizations, leading to an explosion in digital services. This has severely stretched organizations’ ability to deliver the ‘always fresh, always on’ data that modern digital applications need. There are differences of opinion among IT, data, and application professionals on how best to overcome this challenge. Some believe that a data warehouse or data lake can offer a solution. Some will assess data warehouses vs. data lakes, while others may set both aside and consider the merits of data warehouses vs. data hubs.
Most IT professionals do agree, however, that enterprises, more than ever, require modernization of their backend and middleware architecture to improve performance for the digital age, lower the TCO of their infrastructure, and optimize the data supply and consumption chain.
In my recent dialogues with IT and business executives, some of the key challenges they raise derive from a gap between the growing appetite for digital applications and the pace at which data can be made available and served to business applications. These professionals recognize that a new approach is needed but frequently struggle to find a solution that meets their needs. Hence, they fall back on familiar solution buckets such as data warehouses, data lakes, data stores, and the like.
The natural tendency to seek solutions that fall into familiar categories is understandable, but nevertheless may limit an organization’s options when seeking to solve new problems. Going beyond the familiar requires organizations to shift their focus from IT operations to delivering positive customer experiences. As part of this shift, organizations face numerous questions and challenges such as:
How to create consistency across all channels, brands and devices?
How to contextualize digital services based on real-time circumstances, location and indirect referential data?
How to serve data to services in a proper fashion and a timely manner to meet an individual customer’s needs and expectations?
How to deliver optimum personalized digital experiences?
To understand the technical gap that organizations must overcome in tackling these challenges, we’ll break down the components that are part of this ecosystem – and then rebuild it, better.
When assessing the most appropriate solution to meet the data needs of modern digital applications, IT and application integration teams should first and foremost ask themselves which use case – or business outcome – they seek to address. Digital services that rely on real-time data have specific needs that may not necessarily be served by an organization’s existing technology or data stack. Many public sources, such as technical blogs, compare the pros and cons of relational databases, NoSQL stores, data warehouses (DWH), and data lakes. This wide range of data stores and database technologies inadvertently causes confusion in the industry about which should be used for what. Ultimately, it’s not a question of data warehouse vs. data lake, but rather whether these solutions address the use case the organization needs to solve.
As a general rule, before jumping into the details of each solution type, it is best to differentiate solutions designed to address analytical use cases from those designed to meet the real-time, low-latency needs of transactional use cases.
Organizations also need to figure out what portion of their data is operational, to avoid turning data warehouse platforms into something they are not. By focusing on the purpose for which each technological solution was designed, we can address each component in the proper context of the enterprise architecture and optimize utilization and costs.
When considering the leading solutions as part of modernizing your enterprise architecture, the following factors should be taken into account:
Continuous data integration
Data consumption and exposure
SQL interfaces
Data compression
Multiple native stacks vs. a fully integrated solution
Supported data formats
How each data store solution updates data
Here’s something that won’t come as a shock to you: building software architecture is complex. Architects need to sync multiple data sources, multiple data types and pipelines, and the transformations that run between these sources.
One well-established notion is that data lakes and data warehouses fall short with Event-Driven Architecture, as they are unable to serve APIs quickly, with high concurrency.
First, the ingress: moving data into data lakes and warehouses is an offline or batch process, which almost always results in a built-in delay and high latency whenever data is served from them.
Second, the egress: most solutions expose SQL and REST APIs on top of the data lake, which are simply not fast enough to meet the latency demands of business applications.
To cope with these shortcomings, application developers started building small databases adjacent to business applications, often referred to as “data marts” or “local caches.” This pattern causes excessive data duplication across the different marts, adds to overall latency, and creates inefficiencies. Even worse, it often compromises data integrity between channels or applications. A common symptom is executing a basic “get my account information” query and receiving different results on the mobile app than on the website – a true story that happened to me with a local credit card company.
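To make the failure mode concrete, here is a minimal, hypothetical sketch (the names, types, and TTLs are illustrative, not taken from any real system): each channel keeps its own local cache of the same account record and refreshes it on its own schedule, so the channels can disagree.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

record Account(String id, long balanceCents) {}

interface SystemOfRecord {               // the backend SoR, e.g. a core-banking database
    Account load(String id);
}

// Each channel keeps its own "data mart"/local cache of the same account record
// and refreshes it on its own schedule, so the channels can drift apart.
class ChannelCache {
    private record Cached(Account account, Instant loadedAt) {}

    private final Duration ttl;
    private final Map<String, Cached> cache = new ConcurrentHashMap<>();

    ChannelCache(Duration ttl) { this.ttl = ttl; }

    Account getAccount(String id, SystemOfRecord sor) {
        Cached hit = cache.get(id);
        if (hit != null && hit.loadedAt().plus(ttl).isAfter(Instant.now())) {
            return hit.account();            // may be stale relative to the SoR
        }
        Account fresh = sor.load(id);        // expensive backend call
        cache.put(id, new Cached(fresh, Instant.now()));
        return fresh;
    }
}
```

If the mobile app’s instance of this cache uses a five-minute TTL and the website’s instance a thirty-second TTL, the two channels will happily report different balances for the same account between refreshes.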
A Digital Integration Hub (DIH) platform is built on the data hub concept. It eliminates this workaround and its related issues by decoupling business applications from the backend systems of record (SoRs) through event-based or batch replication patterns. The organization’s operational data is reflected in a consolidated fabric that powers real-time access through microservices exposing the relevant APIs, thereby accelerating API serving.
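The mechanics of that decoupling can be sketched generically (a pattern illustration, not the GigaSpaces API): the replication pipeline applies change events from the SoRs to one shared in-memory view, and a single data-access microservice serves every channel from that view.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Generic sketch of the DIH pattern (not the GigaSpaces API): one consolidated,
// event-fed, in-memory view shared by every channel instead of a cache per service.

record AccountChanged(String accountId, long balanceCents) {}   // CDC / replication event

class OperationalDataFabric {
    private final Map<String, Long> balancesByAccount = new ConcurrentHashMap<>();

    // Called by the replication pipeline (event-based or batch) fed from the SoRs.
    void apply(AccountChanged event) {
        balancesByAccount.put(event.accountId(), event.balanceCents());
    }

    // Called by the data-access microservice that serves the API for all channels.
    Long currentBalanceCents(String accountId) {
        return balancesByAccount.get(accountId);
    }
}
```

Because mobile and web now call the same data-access service over the same consolidated view, the per-channel inconsistency described above disappears by construction.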
All databases can ingest data from ETL (Extract, Transform and Load) solutions or Change Data Capture (CDC), which can be integrated with common databases and message brokers, so you might ask: what’s the big deal here?
Here’s the thing: the initial integration is not all that complicated. The truly hard work begins after integration, when architects, DBAs, and developers have to do all kinds of wrangling to solve common integration challenges in existing systems, with countless production workflows that often have indirect dependencies due to modern event-driven and API-based patterns. Before diving into the different challenges, let’s examine the simple data extraction and ingestion pipeline and what we need to handle:
Data conflicts and reconciliation
Multiple CDC streams
Concurrent Initial Load and CDC without any downtime to data access or business services
Schema evolution or adding new/existing tables dynamically to an ongoing CDC without restarting the service
Scaling CDC streams to align with higher ingress/egress
Handling logical data misalignments
Metadata management and “tagging” data to map relationships between data and services
Data freshness validation
Data integrity between the DB and the “System of Engagement” (SOE)
Reflecting transactional data from multiple tables in the SOE when pushing to a restreaming service
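To ground a couple of these items, here is a minimal sketch of the “not all that complicated” part (consuming a stream of change events and applying them to a target store) with a basic freshness check bolted on. The event shape, the queue-based hand-off, and the five-second SLA are assumptions for illustration; real CDC payloads from tools such as Debezium or GoldenGate carry far more metadata.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.BlockingQueue;

// Assumed change-event shape; real CDC tools add schema, transaction and offset metadata.
record ChangeEvent(String table, String key, String payloadJson,
                   Instant sourceCommitTime, boolean delete) {}

interface TargetStore {
    void upsert(String table, String key, String payloadJson);
    void delete(String table, String key);
}

class CdcApplier implements Runnable {
    private static final Duration MAX_LAG = Duration.ofSeconds(5);  // illustrative freshness SLA

    private final BlockingQueue<ChangeEvent> stream;   // fed by the CDC connector
    private final TargetStore store;

    CdcApplier(BlockingQueue<ChangeEvent> stream, TargetStore store) {
        this.stream = stream;
        this.store = store;
    }

    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                ChangeEvent ev = stream.take();
                if (ev.delete()) {
                    store.delete(ev.table(), ev.key());
                } else {
                    store.upsert(ev.table(), ev.key(), ev.payloadJson());
                }
                // Data freshness validation: how far behind the source commit are we?
                Duration lag = Duration.between(ev.sourceCommitTime(), Instant.now());
                if (lag.compareTo(MAX_LAG) > 0) {
                    System.err.println("Freshness SLA breached for " + ev.table() + ": lag=" + lag);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();        // shut down cleanly
        }
    }
}
```

Everything else in the list above, such as an initial load running concurrently with this loop, schema changes arriving mid-stream, or scaling to multiple partitions without breaking ordering, is where the real engineering effort goes.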
There might be other post-integration challenges, but most solutions in the market fall into one of the following categories (CDC, ETL, databases/NoSQL stores, or microservices) and thus lack the holistic capabilities to handle the entire data lifecycle between the SoRs and the business services. An off-the-shelf digital integration hub such as Smart DIH, thanks to its unified, holistic architecture and monitoring capabilities, seamlessly unifies and manages that entire data lifecycle.
Data Lakes and Data Warehouses are not optimized to meet transactional and operational workloads. The following table gives an overview of how data hubs, data warehouses and data lakes compare in the ways they handle data:
Organizations face a growing need to scale up their digital services rapidly. This strong digital appetite comes with growing pains in performance, cost, and manageability as the thriving number of applications outgrows a certain comfort threshold.
Leveraging a converged, distributed, real-time data hub solution with an embedded lightweight Java application server provides unprecedented performance and scale that can’t be achieved when different solutions are manually stitched together. The benefits include maintaining data integrity via a combination of collections and normalized relational data, together with the ability to perform operations such as joins across data in different formats.
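As a rough illustration of that last point, a single SQL interface over the fabric lets one query join a document-style collection with a normalized relational table. The JDBC URL, table names, and schema below are hypothetical placeholders rather than the product’s actual connection string; the vendor’s JDBC documentation has the real details.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustration only: "jdbc:datahub://localhost/demo" and the schema are hypothetical.
// The point is a single SQL join spanning a document-style collection (orders)
// and a normalized relational table (customers) exposed through one SQL interface.
public class CrossFormatJoin {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:datahub://localhost/demo");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT c.name, o.total " +
                 "FROM customers c JOIN orders o ON o.customer_id = c.id " +
                 "WHERE o.total > ?")) {
            ps.setBigDecimal(1, new java.math.BigDecimal("100.00"));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " -> " + rs.getBigDecimal("total"));
                }
            }
        }
    }
}
```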
Effective data management is the foundation for delivering strategic business value from digital services. This requires domain-oriented, decentralized data ownership, combined with a microservices-driven architecture for accessing enterprise shared data. This consolidated architecture provides more flexible and easier scaling for parallel reuse of functionality and data.
Classic microservices architecture using collections per service: Data is duplicated between collections.
Multichannel integrity is achieved by reusing the same “data access services” from a single source of truth, as depicted here:
The GigaSpaces Data Hub: Unified multi-model data store pattern
Many organizations have adopted Event-Driven Architecture (EDA) methodologies and design principles as part of their data management strategy (more on this in Kai Waehner’s excellent blog). Companies such as Uber and Netflix are textbook examples of using EDA effectively. But here’s one major caveat: these are technology shops that happen to be streaming movies or orchestrating commutes, and their entire budgets are built around these specific operations – a luxury most organizations don’t have.
To achieve a simpler architecture that also provides lower-latency, real-time responses, embedded-EDA (eEDA) embeds events, message queues, and notifications directly into the extreme low-latency, in-memory workflow. In contrast to traditional SOA, which involves heavy multi-process communication and data transfer, this design is a real-time fabric based on the “Spaces” principles.
To enhance the utilization of events, GigaSpaces created an architecture with the following unique characteristics:
Embedded Event Triggers
Embedded Event Management Engine
Embedded Event Priority Based Queues
Embedded Event Priority Based Clusters (grouping)
Embedded Outbound Messaging System (pub/sub notification pattern)
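The “embedded” part of this list can be illustrated generically (the pattern, not the GigaSpaces implementation): instead of publishing to an external broker and paying a network hop, events are pushed onto an in-process, priority-ordered queue and handled by a consumer running in the same JVM as the data.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Generic illustration of the embedded, priority-based pattern (not the GigaSpaces API):
// events stay inside the same process as the data, ordered by priority.

record BusinessEvent(int priority, String type, String payload) {}

class EmbeddedEventBus {
    private final PriorityBlockingQueue<BusinessEvent> queue =
            new PriorityBlockingQueue<>(1024, Comparator.comparingInt(BusinessEvent::priority));

    void publish(BusinessEvent event) {          // called by co-located business logic
        queue.put(event);
    }

    void runConsumer() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            BusinessEvent event = queue.take();  // events with the smallest priority value are taken first
            handle(event);                       // no network hop: same JVM, same memory space
        }
    }

    private void handle(BusinessEvent event) {
        System.out.println("handling " + event.type() + " (priority " + event.priority() + ")");
    }
}
```

A production-grade version would add durability and back-pressure; the point here is only that the event never leaves the process.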
Event processing improves immensely with co-location: business logic is injected to run in the same memory space as the data on the data fabric. The technological benefits include:
Durable notifications via fully durable pub/sub messaging for data consistency and reliability
FIFO Groups ensure in-order and exclusive processing of events
No need to transfer events from the data tier to the service tier
Related data can be co-located to the same group while parallelizing across additional groups
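The FIFO Groups item above can also be sketched as a pattern (again, not the product API): events that share a group key are processed in order and exclusively on one single-threaded executor, while different groups run in parallel.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Generic sketch of FIFO-group processing (the pattern, not the product API):
// in-order and exclusive per group key, parallel across groups.
class FifoGroupProcessor {
    private final Map<String, ExecutorService> groups = new ConcurrentHashMap<>();

    void submit(String groupKey, Runnable eventHandler) {
        // One single-threaded executor per group preserves ordering within the group;
        // distinct groups get distinct executors and therefore run concurrently.
        // (A real implementation would bound the executors, e.g. by hashing keys onto a fixed pool.)
        groups.computeIfAbsent(groupKey, k -> Executors.newSingleThreadExecutor())
              .submit(eventHandler);
    }

    void shutdown() {
        groups.values().forEach(ExecutorService::shutdown);
    }
}
```

Submitting every event for account “42” with the group key “42” keeps those events ordered and exclusive to one thread, while events for other accounts fan out across the remaining executors.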
With reduced latency for business applications, the IT team can easily add contextual information to queries while increasing the overall volume of customer interactions.
Event-Driven Architecture for inbound and outbound
All architects know a simple truth: a design isn’t viable if its costs are unacceptable. Let’s keep this notion in mind when examining the trend of shifting to cloud computing in order to reduce costs.
The cloud has endless advantages; however, when used irresponsibly, it can backfire without compassion. The following quote from the Firebolt blog captures this irony: “If you look at the Fivetran benchmark, which managed 1TB of data, most of the clusters cost $16 per hour. That was the enterprise pricing for Snowflake ($2 per credit). Running business-critical or Virtual Private Snowflake (VPS) would be $4 or more per credit. Running it full-time with 1TB of data would be roughly $300,000 per year at list price.”
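For context, the back-of-the-envelope math behind that figure (my own reading of the quote, not Firebolt’s): $16 per hour at $2 per credit is 8 credits per hour, so at the $4-per-credit VPS rate the same cluster costs about $32 per hour, and $32 × 8,760 hours comes to roughly $280,000 a year, which lines up with the “roughly $300,000 per year” quoted above.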
Thinking about operational data, we often require tens or even hundreds of TB of data, resulting in an overpriced architecture just for the data tier – before accounting for other middleware components such as CDC, ETL, Cache and others.
With the GigaSpaces digital integration hub solution, a unified and performance-optimized technology creates efficiency at scale. The platform reduces the need to replicate and mobilize data while simplifying data management. It replaces costly standalone components, driving direct and indirect cost savings by optimizing data management, reducing overall footprint, reducing usage of and dependency on existing costly elements, and reducing operational load and maintenance costs.
GigaSpaces customers testify to a reduction in operational costs of 40-75%. This reduction in software and maintenance costs may vary based on the elements being replaced or optimized with the introduction of GigaSpaces into the solution architecture stack. Here’s one example: a fully digital bank operating in Sweden made an entire stack of commercial RDBMS licenses redundant after two years of using the GigaSpaces solution, eventually replacing it with a standalone GigaSpaces DIH as the bank’s operational data store (ODS).
With the GigaSpaces solution in place, enterprises can also retire standalone data replication solutions that extract data to a single ODS, eliminating another costly expenditure. In addition, they are no longer required to add caching solutions, such as Redis, on top of the ODS.
Additional benefits include allowing software engineers to focus on developing new business logic instead of spending time on data-related and integration challenges, resulting in shorter time-to-service, from months to days, and reduced costs associated with human error.
Lastly, ongoing maintenance and support costs are reduced, as is the expertise required per workflow. This is achieved by standardizing data pipelines and data microservices through the no-code and low-code options provided with the GigaSpaces solution.
The blue line indicates a lower operational cost over time when using GigaSpaces Smart DIH versus a “DIY Solution” leveraging multiple products
After careful examination of the different technologies required to build a robust and price-effective solution, GigaSpaces built the solution architecture for the modern operational data store in the form of a Digital Integration Hub.
The DIH enables organizations to focus on converging business and technology, reducing the stack complexity, and providing fast response time for new and upgraded digital services while reducing overall costs.
By simply upgrading a database, or adding a newer middleware component, organizations tend to improve performance in the short term, but the additional costs and overall complexity don’t provide the required ROI.
We can keep diving deep into the IT Gap and closely examine the specs of different data stores, but ironically enough, the biggest challenges organizations face in digital transformation are not technological in nature. Rather, they revolve around changing the thought paradigm of the managers signing off on these changes. More on this in my next blog posts.
Stay tuned.