In this course, students will learn how to use Big Data frameworks, configure them, understand what is required in order to use them, and be clear about the benefits to expect from them.
Knowledge is acquired at two levels.
The first is processing (through new programming and data processing approaches), and the second is storage and querying (through new systems designed for such data).
Students will also learn how to use these technologies in data preparation tasks, i.e., integration, cleaning, exploration, and querying.
Furthermore, students will learn how to manage some special forms of data, in particular streams and graphs.
By the end of the course, students will be able to face real-world challenges: identify the right solutions in real-life situations involving Big Data; make the right choices when setting up, configuring, and using Big Data systems; and perform the required maintenance and optimization tasks.
At the end of the course, there will be a written exam in which students answer questions that demonstrate they have understood the fundamental concepts of the presented technologies.
Furthermore, there will be a course project. The project is self-contained and carried out in groups: the group members develop a solution to a specific real-life problem using some of the tools presented in the course, and then produce a report.
The project counts for 60% of the final mark and the written exam for 40%.
To be eligible for a repair test, a student must have scored at least a 4 on the original test.
Nowadays, we are producing data at rates that we have never seen before, creating datasets characterized by extreme Volume, Variety and Velocity.
Unfortunately, traditional data management technologies have proven limited in managing data with these characteristics. This led to the term Big Data, which refers both to this kind of data and to the new technologies that have been developed to cope with such datasets.
This course is an introduction to Big Data management technologies. It aims to provide an understanding of the fundamental principles on which Big Data systems have been built, and a good knowledge of the generic features that each such system has.
The course also covers the use of such tools in data preparation, i.e., all the tasks that data practitioners need to perform before the data is ready for analytics.
Topics covered in the course include, but are not limited to: advanced SQL and data consistency, Big Data systems (MapReduce, HDFS, Spark), heterogeneous data integration (mappings, data cleaning), data imputation, NoSQL databases (graph databases, column stores), stream processing, Pig Latin, and graph analytics at large scale.
The course is fundamental for modern data science students, since it provides them with the required knowledge of the tools that are available for achieving their goals.
Although attending lectures may not be mandatory, students are responsible for all announcements and course material discussed in class; class participation is therefore expected and encouraged.
The lectures present the theories on which Big Data technologies are based, as well as specific systems and technologies.
The course will follow selected chapters from books on different tools.
An indicative list is:
- Graph Databases
- Seven Databases in Seven Weeks
- Mining of Massive Datasets
- Learning Spark
- Designing Data-Intensive Applications
You must meet the following requirements:
- Assigned study entrance permit for the master