Course module: INFOMBD
Big data
Course code: INFOMBD
ECTS Credits: 7.5
Category / Level: M (Master)
Course type: Course
Language of instruction: English
Offered by: Faculty of Science; Graduate School of Natural Sciences
Contact person: prof. dr. A.P.J.M. Siebes
Telephone: +31 30 2533229
E-mail: A.P.J.M.Siebes@uu.nl
Lecturers
Lecturer: prof. dr. A.P.J.M. Siebes
Course contact: prof. dr. A.P.J.M. Siebes
Teaching period: 3-GS (06/02/2023 to 21/04/2023)
Teaching period in which the course begins: 3-GS
Time slot: D (Wednesday afternoon, Friday)
Study mode: Full-time
Enrolment period: from 31/10/2022 up to and including 25/11/2022
Course application process: Osiris Student
Enrolling through OSIRIS: Yes
Enrolment open to students taking subsidiary courses: Yes
Pre-enrolment: No
Post-registration: Yes
Post-registration open: from 23/01/2023 up to and including 20/02/2023
Waiting list: Yes
Course placement process: administration of the educational institute
Course goals
2021-2022 is the last year this course will be offered.
Assessment
To pass the course you have to:
  • experiment with PAC bounds
  • write a personal essay (i.e., one written on your own) in which you convince the reader that you have mastered the course material
Content
Big Data is as much a buzzword as an apt description of a real problem: the amount of data generated per day is growing faster than our processing abilities. Hence the need for algorithms and data structures that allow us, e.g., to store, retrieve, and analyze vast amounts of widely varied data that stream in at high velocity.

In this course we will limit ourselves to the data mining aspects of the Big Data problem, more specifically to the problem of classification in a Big Data setting. To make algorithms viable for huge amounts of data, they should have low complexity; in fact, it is easy to think of scenarios where only sublinear algorithms are practical. That is, algorithms that see only a (vanishingly small) part of the data: algorithms that only sample the data.
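
To make "sublinear by sampling" concrete, here is a minimal Python sketch (illustrative only, not course material; all names and numbers are assumptions) that estimates the error of a fixed classifier from a random sample whose size depends only on the accuracy eps and confidence delta, via Hoeffding's inequality, and not on the size of the data set.

  import math
  import random

  def hoeffding_sample_size(eps, delta):
      # Smallest m with 2 * exp(-2 * m * eps**2) <= delta: the sample
      # mean is then within eps of the true mean with prob. >= 1 - delta.
      return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

  def estimate_error(classifier, data, eps=0.05, delta=0.01):
      # Inspect only m labelled examples (x, y), however large `data` is;
      # for fixed eps and delta, m is a constant, hence sublinear.
      m = hoeffding_sample_size(eps, delta)
      sample = [random.choice(data) for _ in range(m)]
      return sum(1 for x, y in sample if classifier(x) != y) / m

  # Illustration: the true error of the threshold-at-0.5 rule is 0.2 here.
  data = [(x, x > 0.3) for x in (random.random() for _ in range(10 ** 6))]
  print(estimate_error(lambda x: x > 0.5, data))

With eps = 0.05 and delta = 0.01 this inspects only 1060 of the million examples.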

We start with PAC learning, studying tight bounds on the sample size needed to learn (simple) concepts almost always almost correctly, both in the clean (no noise) and in the agnostic (allowing noise) case. The concepts we study may appear to allow only for very simple, hence often weak, classifiers. However, the boosting theorem shows that they can represent whatever strong classifiers can represent.
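
To give a flavour of such bounds: for a finite concept class H, two standard results from the PAC literature (stated here only for orientation; the course may derive different or tighter variants) are

  m \;\ge\; \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right) \quad \text{(clean case)}
  \qquad
  m \;\ge\; \frac{1}{2\varepsilon^{2}}\,\ln\frac{2|H|}{\delta} \quad \text{(agnostic case)}

In the clean case, any hypothesis consistent with m such examples has true error at most \varepsilon with probability at least 1 - \delta; in the agnostic case, m samples make all empirical errors in H simultaneously \varepsilon-accurate, so empirical risk minimization comes within 2\varepsilon of the best concept in H.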

PAC learning algorithms rest on the assumption that a data set represents just one such concept, which is obviously untrue for almost any real data set. So, next we turn to frequent pattern mining, which is geared to mining all concepts from a data set. After introducing basic algorithms to compute frequent patterns, we will look at ways to speed them up by sampling, using the theoretical concepts from the PAC learning framework.
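
As a concrete, simplified instance of such a basic algorithm, here is a level-wise Apriori-style sketch in Python (the course's actual algorithms and notation may differ; all names are illustrative).

  from itertools import combinations

  def frequent_itemsets(transactions, min_support):
      # Level-wise mining: an itemset can only be frequent if every
      # one of its subsets is frequent (the Apriori property).
      transactions = [frozenset(t) for t in transactions]
      n = len(transactions)

      def support(itemset):
          return sum(1 for t in transactions if itemset <= t) / n

      level = [frozenset([i]) for i in {i for t in transactions for i in t}]
      frequent, k = {}, 1
      while level:
          current = {}
          for s in level:
              sup = support(s)
              if sup >= min_support:
                  current[s] = sup
          frequent.update(current)
          # Candidates of size k+1: unions of frequent k-itemsets, kept
          # only if all their k-subsets are frequent (pruning step).
          candidates = {a | b for a, b in combinations(current, 2)
                        if len(a | b) == k + 1}
          level = [c for c in candidates
                   if all(frozenset(s) in current for s in combinations(c, k))]
          k += 1
      return frequent

  print(frequent_itemsets([{"bread", "milk"}, {"bread", "beer"},
                           {"bread", "milk", "beer"}], min_support=2 / 3))

A sampling-based speed-up would run the same procedure on a random subset of the transactions, with the sample size chosen via PAC-style bounds so that, with high probability, all estimated supports are within some \varepsilon of the true ones.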

Competencies
-
Entry requirements
You must meet the following requirements:
  • Assigned study entrance permit for the master
Required materials
-
Instructional formats
Lecture
Seminar
Tests
Final result
Test weight: 100
Minimum grade: -
