Overview

The last few years have seen significant advances in technologies for data collection, transmission, and storage. For the first time in history, thanks to these technologies, we are in a position to instrument a scientific phenomenon or a business workflow at a very fine granularity and then collect the required data for the relevant time period. This simple approach can be applied to diverse phenomena: evolving traffic patterns in a city, how social media enhances or diminishes a brand, or how a protein folds and how that affects its function. Think of it as technology giving us a ringside seat from which to observe the phenomenon in unprecedented detail.


Though we may have earned a ringside seat, we still have to do some work to get a ringside view of the phenomenon. The data that is collected and stored is far from usable, and in its raw form it will not give us the understanding we are looking for. Significant analysis is needed before we can learn from the data and benefit from that learning. This task of analyzing large amounts of data to extract insights is the basis of the field of data science. A new set of technologies has mushroomed to support the use cases demanded by data science: these are the big data technologies.

This course is offered jointly by the LAB group, Chalmers University and Persistent under the Swedish e-Science Education graduate school (SeSE).

Important Details:

    Dates:  Mon 25th May 2015 to Fri 29th May 2015
    Location:  Chalmers Technology University, 
    Timing:  10:00 am to 5:00 pm



Preparatory Week:

  • Read Part 3, on infrastructures, in Fourth Paradigm, along with whichever chapter in that book or in 2020 Science is closest to your application domain of interest.


Contents:

  • Data Science Methodology: You will learn an approach for conducting your data science experiment. This includes the high-level steps to follow when conducting and validating your experiment, as well as the common pitfalls and gotchas.

  • Tools & Technologies: You will be exposed to the tools and technologies used for the different steps of a data science experiment, from data collection all the way to data visualization. There will be assignments around Hadoop & Spark.

  • Case Studies: You will get a peek inside some successful big data projects in scientific and business domains where parts of the methodology and techniques have been applied.

  • Labs and Hands-on Experience on Big Data Technologies: You will get hands-on experience working with a subset of the big data technologies in a lab setting, and you will leave the course with a working big data environment that you can use for your own data science experiments.

Outline:

The course will progressively cover the steps involved in a typical data science experiment, from data ingestion to data visualization. Each day of the course will be divided into 3 parts:
  • An in-class lecture where a particular data science concept will be covered.

  • A hands-on lab session around the concept that was covered in the lecture. The lab session will be conducted under the guidance of the lab assistants. The lab sessions will vary in technical difficulty: for candidates who are comfortable with programming, the lab sessions will focus on hands-on assignments; for others, the labs will be around setting up a data science experiment and evaluating outcomes.

  • A group discussion or a case study relevant to the concept that was covered in the class.

Tentative Agenda



Mon, 25th May
  • Introduction to Big Data & Analytics, Problem Scope. End-to-end implementation (ShareInsights)
  • Lunch Break
  • Big Data Methodology. Example from scientific domain
  • Coffee Break
  • Group Discussion on Problem

Tue, 26th May
  • Data Preparation, Data Quality and Data Integration
  • Lunch Break
  • Labs on relevant technologies
  • Coffee Break
  • Case Study from the Industry

Wed, 27th May
  • Data Analysis, Advanced Analytics
  • Lunch Break
  • Labs on relevant technologies
  • Coffee Break
  • TBD

Thu, 28th May
  • Data Visualization
  • Lunch Break
  • Labs on relevant technologies
  • Coffee Break
  • TBD

Fri, 29th May
  • Wrap-up and applying Big Data Technologies
  • Lunch Break
  • Labs on relevant technologies
  • Coffee Break
  • TBD

Course Pre-requisites:

  • Data management & processing skills. Knowledge of at least one data processing tool, such as Excel, R, SQL, or a database.
  • Basic programming knowledge in a scripting language.
  • You are expected to bring a data science problem.
  • Good to have: Statistics Knowledge
