Course module: INFOMDIS
Data intensive systems
Course info
Course code: INFOMDIS
EC: 7.5
Course goals
In this course, students learn how to leverage Big Data frameworks: how to configure them, what is needed in order to use them, and which benefits to expect from them.
Knowledge is acquired at two levels.
The first is processing (new programming and data processing approaches), and the second is storage and querying (new systems designed for such data).
Students will also learn how to use these technologies in data preparation tasks, i.e., integration, cleaning, exploration, and querying.
Furthermore, students will learn how to manage some special forms of data, in particular streams and graphs.
At the end of the course, students will be able to face real-world challenges involving Big Data: identify the right solutions, make the right choices in putting in place, configuring, and using Big Data systems, and perform the required maintenance and optimization tasks.

Assessment
  • exam (40% of the final grade)
  • project (60% of the final grade)
At the end of the course, there will be a written exam in which students answer questions that show they have understood the fundamental concepts of the presented technologies.
The project is self-contained and carried out in groups: the group members develop a solution to a specific real-life problem using some of the tools presented in the course, and then produce a report.


To qualify for a repair of the final result, the mark must be at least a 4.
Content

Nowadays, we produce data at rates never seen before, creating datasets characterized by extreme volume, variety, and velocity.
Unfortunately, traditional data management technologies have proven limited in managing data with these characteristics.
This led to the term Big Data, which refers both to this kind of data and to the new technologies developed to cope with such datasets.
This course is an introduction to Big Data management technologies.
It aims to provide an understanding of the fundamental principles on which Big Data systems are built, and a good knowledge of the generic features that each such system has.
The course also covers the use of such tools in data preparation, i.e., all the tasks that data practitioners need to perform before the data is ready for analytics.

Topics touched on in the course include, but are not limited to:

  • advanced SQL and data consistency
  • Big Data systems (MapReduce, HDFS, Spark)
  • heterogeneous data integration (mappings, data cleaning)
  • data imputation
  • NoSQL databases (graph databases, column stores)
  • stream processing
  • Pig Latin
  • graph analytics at large scale

The course is fundamental for modern data science students, since it provides them with the required knowledge of the tools available for achieving their goals.

Course form
In-class lectures. 
Attending lectures may not be mandatory, but students are responsible for all announcements and course material discussed in class; class participation is therefore expected and encouraged.
The lectures present some of the theory on which Big Data technologies are based, as well as specific systems and technologies.

Literature
The course follows selected chapters from books on different tools. An indicative list is:
  • Ian Robinson, Jim Webber, Emil Eifrem, "Graph Databases"
  • Eric Redmond, Jim R. Wilson, "Seven Databases in Seven Weeks"
  • Jure Leskovec, Anand Rajaraman, Jeff Ullman, "Mining of Massive Datasets"
  • Holden Karau, Andy Konwinski, et al., "Learning Spark"
  • Martin Kleppmann, "Designing Data-Intensive Applications"