Introduction to Big Data Processing (TIETA17)

About the 2018 edition of the course

The next course will start in October 2018 (period II).

These are general information pages about the course. The active course pages of the 2018 course will be hosted on the WETO server (note: the pages will become available once the course begins). You may log in to WETO using your basic university account.

General course information

TIETA17 Introduction to Big Data Processing is a 5 ECTS course that aims to provide the participating students with:

  • A high-level general introduction to the concepts of “Big Data” and “data analysis”.
  • Practical experience with some commonly used tools and techniques for (big) data processing.
    • Relevant Python libraries/tools for data processing, analysis and visualization (especially from the SciPy ecosystem, such as NumPy, pandas, matplotlib and scikit-learn).
    • Two commonly used distributed computing frameworks: Apache Hadoop and Apache Spark.
    • Two commonly used distributed data storage systems: the Hadoop Distributed File System (HDFS) and the Apache HBase distributed database. If time permits, MongoDB will also be introduced.

Recommended prior knowledge

The course will require satisfactory programming skills in Python. If you have no prior Python programming experience, it is highly recommended that you first take the course TIETA19 Practical programming in Python (held in period I).

Additional recommended, although not strictly required, background courses:

Course structure

The main elements of the course are (1) weekly lectures, (2) self-study and (3) exercise work. These are described below.

  1. Lectures
    • The lectures will provide a high-level overview of the covered topics, some practical examples and demos, as well as general information about the course itself.
  2. Self-study
    • Students are expected to study many of the finer details of the Python libraries, Apache Hadoop and Apache Spark independently.
  3. Exercise work
    • The students will do practical (programming) exercises using Python (including MapReduce and Spark programs).
    • The exercises are peer-reviewed: students must anonymously review solutions submitted by other students.
    • Exercises will also be discussed/reviewed during the lectures; there are no dedicated “exercise sessions”.
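
To give a rough idea of what a MapReduce-style program in Python can look like, here is a minimal word-count sketch in plain Python. It is not an actual course exercise, and it runs on a single machine without Hadoop or Spark; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the map, shuffle and reduce steps in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Tiny made-up input, used only to show the call.
    print(word_count(["big data", "Big tools for data"]))
```

In a distributed framework the map and reduce steps would run in parallel on different machines and the grouping step would be handled by the framework; the sketch only shows the division of work between the steps.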

Passing the course

Course participants need to complete most of the practical exercises and pass the final exam. The passing and grading criteria will be described in more detail during the course.