Introduction to Big Data Processing (TIETA17)


About the 2019 edition of the course

The next course will start in October 2019 (period II).

Latest information

  • The first lecture will be on Monday 21.10.2019 at 15:15–16:45 in lecture room Pinni B1100.
  • We plan to provide video streams and recordings of all lectures. A link to the videos is provided on the course WETO page that has now been opened.

These are general information pages about the course. The active course pages for the 2019 edition are hosted on the WETO server. You can log in to WETO using your basic university account.

General course information

TIETA17 Introduction to Big Data Processing is a 5 ECTS course that aims to provide the participating students with:

  • A high-level general introduction to the concepts of “Big Data” and “data analysis”.
  • Practical experience with some commonly used tools and techniques for (big) data processing.
    • Relevant Python libraries/tools for data processing, analysis and visualization, especially from the SciPy ecosystem (numpy, pandas, matplotlib, scikit-learn, and others).
    • Two commonly used distributed computing frameworks: Apache Hadoop and Apache Spark.
    • A commonly used distributed data storage system: Hadoop Distributed File System (HDFS). If time permits, the Apache HBase distributed database or MongoDB will also be introduced.
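To give a flavour of the SciPy-ecosystem tools listed above, here is a minimal sketch combining pandas and numpy; the dataset and column names are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical measurement data, invented for illustration.
df = pd.DataFrame({
    "sensor": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0],
})

# pandas: group rows by a key column and aggregate.
means = df.groupby("sensor")["value"].mean()

# numpy: vectorized arithmetic on the underlying array.
values = df["value"].to_numpy()
normalized = (values - values.mean()) / values.std()

print(means["a"], means["b"])  # 2.0 3.0
```

The same split of responsibilities recurs throughout the course material: pandas for labelled, tabular data wrangling, and numpy for fast numerical operations on raw arrays.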

Recommended prior knowledge

The course requires satisfactory programming skills in Python. It is advisable to complete, for example, the course TIETA19 Practical Programming in Python before this course.

Additional recommended, although not strictly required, background courses:

Course structure

The main elements of the course are (1) weekly lectures, (2) self-study and (3) exercise work. These are described below.

  1. Lectures
    • The lectures will provide a high-level overview of the covered topics, some practical examples and demos, as well as general information about the course itself.
  2. Self-study
    • Many of the finer details of the Python libraries, Apache Hadoop and Apache Spark will be studied independently.
  3. Exercise work
    • The students will do practical (programming) exercises using Python (including MapReduce and Spark programs).
    • The exercises are peer-reviewed: students must anonymously review solutions submitted by other students.
    • Exercises will also be discussed/reviewed during the lectures; there are no dedicated “exercise sessions”.
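The MapReduce programs mentioned above follow a simple two-phase pattern. This is not one of the course's actual exercises, but a sketch of the MapReduce programming model in plain Python, using the classic word-count example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the emitted counts per key (word).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data", "data processing", "big big data"]
result = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(result)  # {'big': 3, 'data': 3, 'processing': 1}
```

In Hadoop the map and reduce phases run as distributed tasks over HDFS data, and in Spark the equivalent computation is typically expressed with `flatMap` and `reduceByKey`; the underlying idea is the same as in this single-machine sketch.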

Passing the course

Course participants need to complete most of the practical exercises and pass the final exam. More details about the passing and grading criteria will be given during the course.