Where modern data platform concepts come from
When people start working with modern data platforms built on tools like AWS S3, Athena, Glue, and Redshift, they quickly encounter a large set of concepts: databases, schemas, ETL pipelines, data lakes, workflows, and catalogs. It can feel overwhelming and sometimes arbitrary. A natural question arises: Where do these ideas actually come from? Are they rooted in computer science theory, or are they simply conventions that emerged from industry practice?
The answer is that they come from both. Computer science provides the theoretical foundations: data modeling, relational databases, query languages, and distributed systems. Industry practice then builds on these ideas to handle the realities of modern data: large volumes, cloud infrastructure, and operational complexity. What you see today in cloud platforms like AWS is essentially a modern implementation of concepts that have existed for decades, adapted to operate at massive scale.
Historically, the starting point is the database system. Traditional applications stored their data in relational databases such as Oracle, PostgreSQL, or MySQL. In these systems, data is organized into tables defined by a schema. The schema describes the structure of the data (column names, data types, and relationships between tables). SQL is then used to query and manipulate this structured data. For example, a sports organization might store athlete data in a table called athletes with columns such as athlete_id, name, height, and country. In this architecture, the data lives directly inside the database engine, and the schema is strictly enforced.
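As a minimal sketch of this schema-on-write model, the following uses Python's built-in sqlite3 module as a stand-in for a relational engine such as PostgreSQL or MySQL; the athletes table and its columns mirror the example above, and the sample row is made up.

```python
import sqlite3

# SQLite used as a stand-in for a relational engine; the schema is declared
# up front and enforced by the database itself when rows are written.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE athletes (
        athlete_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        height     REAL,
        country    TEXT
    )
""")
conn.execute(
    "INSERT INTO athletes (athlete_id, name, height, country) VALUES (?, ?, ?, ?)",
    (1, "Ana Silva", 1.72, "BRA"),
)

# Queries rely on the schema that was enforced at write time.
for row in conn.execute("SELECT name, country FROM athletes WHERE height > 1.70"):
    print(row)
```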
As organizations began collecting much larger amounts of data (e.g., website logs, sensor data, user activity, and transactions), traditional databases became difficult and expensive to scale for analytical workloads. This led to the concept of the data warehouse, a system specifically designed for analytics rather than operational transactions. In a warehouse, data from many different operational systems is combined and organized to support reporting and business analysis. For example, a retail company might collect data from its point-of-sale systems, website analytics, and inventory systems into a central warehouse where analysts can study purchasing patterns. Modern cloud warehouses such as Amazon Redshift, Snowflake, and BigQuery evolved from this idea.
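To make the retail example concrete, an analytical query of the kind a warehouse is built for might look like the sketch below; the sales table, its columns, and the SQL dialect details are hypothetical, not taken from any specific system.

```python
# Hypothetical warehouse query: monthly revenue per store, the sort of
# aggregation over combined operational data that warehouses are designed for.
PURCHASING_PATTERNS_SQL = """
    SELECT store_id,
           DATE_TRUNC('month', sold_at)  AS month,
           SUM(quantity * unit_price)    AS revenue
    FROM sales
    GROUP BY store_id, DATE_TRUNC('month', sold_at)
    ORDER BY month, revenue DESC
"""
```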
To move data into a warehouse, organizations developed ETL (Extract, Transform, Load) pipelines. ETL describes the workflow of taking data from source systems, transforming it into a consistent and usable structure, and loading it into an analytical system. For instance, an airline might extract flight booking data from several operational databases, convert timestamps into a standardized format, normalize currencies, and load the cleaned dataset into the warehouse. Because these processes often involve many steps and dependencies, they are organized as data workflows or pipelines. Tools like Apache Airflow or AWS Glue orchestrate these workflows.
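A minimal, self-contained sketch of the airline ETL example follows. It assumes hard-coded source records, a fixed EUR-to-USD rate, and print as a stand-in for the load step; real pipelines would read from operational databases and write to a warehouse table.

```python
from datetime import datetime, timezone

EUR_TO_USD = 1.08  # assumed fixed rate, for illustration only

def extract():
    # Stand-in for reading bookings from operational databases.
    return [
        {"booking_id": "B1", "booked_at": "2024-03-01 14:05:00", "amount": 250.0, "currency": "EUR"},
        {"booking_id": "B2", "booked_at": "2024-03-02 09:30:00", "amount": 310.0, "currency": "USD"},
    ]

def transform(records):
    # Standardize timestamps to UTC ISO 8601 and normalize amounts to USD.
    cleaned = []
    for r in records:
        ts = datetime.strptime(r["booked_at"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        amount_usd = r["amount"] * EUR_TO_USD if r["currency"] == "EUR" else r["amount"]
        cleaned.append({
            "booking_id": r["booking_id"],
            "booked_at_utc": ts.isoformat(),
            "amount_usd": round(amount_usd, 2),
        })
    return cleaned

def load(records):
    # Stand-in for writing the cleaned dataset into the warehouse.
    for r in records:
        print(r)

load(transform(extract()))
```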
In recent years, another architectural concept has gained prominence: the data lake. Instead of immediately transforming all data into structured tables, organizations store raw data in large-scale object storage systems such as Amazon S3. The data may remain in its original format, e.g., CSV, JSON, logs, images, or Parquet files. Instead of enforcing a schema when the data is written, the schema is applied later when the data is queried. This approach is called schema-on-read. Systems such as Amazon Athena allow analysts to query files stored in S3 by defining a schema that tells the query engine how to interpret the data.
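A sketch of schema-on-read with Athena is shown below, assuming a hypothetical S3 bucket, database name, and column layout. The files already sit in S3; the DDL only tells the query engine how to interpret them, and the data itself is never rewritten.

```python
import boto3

athena = boto3.client("athena")

# External table definition over raw CSV files in the data lake.
# Bucket, database, and columns are illustrative assumptions.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
    request_time string,
    user_id      string,
    url          string,
    status_code  int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-data-lake/raw/web_logs/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```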
As data platforms grew more complex, another need emerged: a way to centrally manage information about datasets themselves. This led to the concept of a metadata catalog. A catalog stores information such as dataset schemas, storage locations, formats, and ownership. The AWS Glue Data Catalog is an example of such a system. Rather than defining tables separately in every tool, the catalog acts as a shared registry of datasets. Once a dataset is registered there, multiple systems, e.g., Athena, Spark jobs, or even Redshift through Redshift Spectrum, can access it consistently.
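A sketch of registering such a dataset in the Glue Data Catalog with boto3 follows, reusing the hypothetical web_logs layout from the previous example; database, table, and S3 locations are assumptions. Once this entry exists, Athena, Spark, or Redshift Spectrum can all resolve the same table definition.

```python
import boto3

glue = boto3.client("glue")

# Register the dataset once in the shared catalog instead of
# redefining it separately in every tool.
glue.create_table(
    DatabaseName="analytics",
    TableInput={
        "Name": "web_logs",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "request_time", "Type": "string"},
                {"Name": "user_id", "Type": "string"},
                {"Name": "url", "Type": "string"},
                {"Name": "status_code", "Type": "int"},
            ],
            "Location": "s3://example-data-lake/raw/web_logs/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```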
When you combine these pieces, you get the architecture used in many modern data platforms. Data arrives from operational systems, devices, or applications. It is stored in scalable storage such as S3. ETL workflows process and transform the raw data. The resulting datasets are registered in a metadata catalog. Query engines and analytics tools then read from this catalog to perform analysis or power machine learning models. For example, an energy company might collect telemetry from battery systems, store the raw measurements in a data lake, process them with ETL jobs, register the datasets in a catalog, and analyze them with SQL queries or predictive models.
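Pulling the pieces together, a compressed sketch of that flow for the telemetry example might look like the following. It assumes hypothetical bucket, file, database, and table names, and it presumes an ETL job and a catalog entry like the ones sketched earlier.

```python
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# 1. Raw telemetry lands in object storage (the data lake).
s3.upload_file(
    "battery_telemetry_2024-03-01.csv",
    "example-data-lake",
    "raw/battery_telemetry/2024-03-01.csv",
)

# 2. ETL jobs process the raw files and register the result in the catalog
#    (omitted here; see the earlier sketches).

# 3. Analysts query the cataloged dataset with SQL.
athena.start_query_execution(
    QueryString="SELECT battery_id, AVG(temperature) AS avg_temp "
                "FROM battery_telemetry GROUP BY battery_id",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```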
Seen from this perspective, these concepts are not isolated tools but parts of a broader system. Databases and schemas define how data is structured. ETL pipelines and workflows move and transform data. Data lakes and warehouses provide storage and analytics environments. Metadata catalogs allow multiple tools to access the same datasets consistently. Together, these ideas form the backbone of modern data platforms, an evolution shaped by both computer science research and decades of engineering practice in industry.