1. I’m a Data Scientist, why do I need this?
- Data Engineering defined
2. The survival kit
- Supply #1: Some basic concepts
- Different types of workloads
- Data Infrastructure
- Data Pipelines
- Data Warehouse vs Data Lake
- Supply #2: Storage Formats
- Row-Oriented vs Column-Oriented
- Not everything is a CSV
- Supply #3: ETLs
- Toward programmatic ETLs
- SQL-based ETL vs Code-based ETL
- Supply #4: Big Data Frameworks
- Apache Hive
- Presto
- Apache Spark
- Supply #5: Data Modeling
- Dimensional Modeling
- Modern Data Warehousing
- Supply #6: Workflow Orchestration
- Apache Airflow
- Supply #7: ML Technical Debt
- Identifying code smells
- Bibliography:
- Designing Data-Intensive Applications (Martin Kleppmann)
- The Data Warehouse Toolkit, 3rd Edition (Ralph Kimball, Margy Ross)
- Hadoop Application Architectures (Mark Grover et al.)
- Big Data: Principles and best practices of scalable realtime data systems (Nathan Marz, James Warren)
- Number of hours: 8 in-person + 4 virtual
- Evaluation method: Take-home.