Course: INFOMBD
INFOMBD
Big data
Course information
Course code: INFOMBD
Credits (ECTS): 7.5
Category / Level: M (Master)
Course type: Regular course
Language of instruction: English
Offered by: Faculty of Science; Graduate School of Natural Sciences
Contact person: prof. dr. A.P.J.M. Siebes
Telephone: +31 30 2533229
E-mail: A.P.J.M.Siebes@uu.nl
Lecturers
Lecturer
prof. dr. A.P.J.M. Siebes
Course contact person
prof. dr. A.P.J.M. Siebes
Block
3-GS (07-02-2022 to 22-04-2022)
Starting block
3-GS
Timeslot: D (Wednesday afternoon, Wednesday late afternoon, Friday)
Teaching mode
Full-time
Course enrolment open: from 01-11-2021 to 28-11-2021
Enrolment procedure: Osiris
Enrolment via OSIRIS: Yes
Enrolment for elective students: Yes
Pre-enrolment: No
Post-enrolment: Yes
Post-enrolment open: from 24-01-2022 to 21-02-2022
Waiting list: Yes
Placement procedure: administration of the teaching institute
Course goals

2021-2022 is the last year this course will be offered.

Assessment
To pass the course you have to:

  • experiment with PAC bounds
  • write a personal essay (i.e., written on your own) in which you convince the reader that you have mastered the course material
Content

Big Data is as much a buzz word as an apt description of a real problem: the amount of data generated per day is growing faster than our processing abilities. Hence the need for algorithms and data structures which allow us, e.g., to store, retrieve and analyze vast amounts of widely varied data that streams in at high velocity.

In this course we limit ourselves to the data mining aspects of the Big Data problem, more specifically to the problem of classification in a Big Data setting. To make algorithms viable for huge amounts of data, they should have low complexity; in fact, it is easy to think of scenarios where only sublinear algorithms are practical. That is, algorithms that see only a (vanishingly small) part of the data: algorithms that only sample the data.
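As an illustration of this sampling idea (not part of the official course materials; all function names here are my own), a minimal sketch of a sublinear estimator: it inspects only a fixed number of uniformly sampled records, chosen via a Hoeffding bound, regardless of how large the data set is.

```python
import math
import random

def hoeffding_sample_size(eps, delta):
    """Samples needed so the empirical fraction lies within eps of the
    true fraction with probability at least 1 - delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def estimate_fraction(data, predicate, eps=0.05, delta=0.05, rng=random):
    """Estimate the fraction of items satisfying `predicate` by sampling.

    Looks at only min(n, hoeffding_sample_size(eps, delta)) items; that
    number does not grow with len(data), so the estimator is sublinear
    for large data sets (sampling with replacement).
    """
    m = min(len(data), hoeffding_sample_size(eps, delta))
    sample = [data[rng.randrange(len(data))] for _ in range(m)]
    return sum(1 for x in sample if predicate(x)) / m

# Example: estimate the fraction of even numbers in a large list.
random.seed(0)
data = list(range(1_000_000))
est = estimate_fraction(data, lambda x: x % 2 == 0)
```

With eps = delta = 0.05 the estimator reads only 738 of the million records, yet its answer is, with high probability, within 0.05 of the true fraction 0.5.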

We start by studying PAC learning, which gives tight bounds on the sample size needed to learn (simple) concepts almost always almost correctly, both in the clean (no noise) and in the agnostic (allowing noise) case. The concepts we study may appear to allow only for very simple, hence often weak, classifiers. However, the boosting theorem shows that they can represent whatever can be represented by strong classifiers.
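For concreteness, bounds of this kind can be evaluated numerically. The sketch below uses the standard sample-complexity upper bounds for a finite hypothesis class, as found in the Shalev-Shwartz & Ben-David book referenced under Literature; the function names and the example parameters are my own.

```python
import math

def pac_sample_size_realizable(h_size, eps, delta):
    """Upper bound on the sample size for a finite hypothesis class H in
    the realizable (clean, no-noise) case: m <= ceil(log(|H|/delta) / eps)."""
    return math.ceil(math.log(h_size / delta) / eps)

def pac_sample_size_agnostic(h_size, eps, delta):
    """Upper bound in the agnostic (noise-allowing) case:
    m <= ceil(2 * log(2|H|/delta) / eps**2)."""
    return math.ceil(2.0 * math.log(2.0 * h_size / delta) / eps ** 2)

# Example: |H| = 2**20 hypotheses, 5% excess error, 95% confidence.
m_clean = pac_sample_size_realizable(2 ** 20, 0.05, 0.05)
m_noisy = pac_sample_size_agnostic(2 ** 20, 0.05, 0.05)
```

Note the price of noise: the agnostic bound scales with 1/eps**2 rather than 1/eps, so for the same accuracy the noisy case needs far more samples than the clean one.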

PAC learning algorithms are based on the assumption that a data set represents only one such concept, which obviously isn’t true for almost any real data set. So, next we turn to frequent pattern mining, geared to mine all concepts from a data set. After introducing basic algorithms to compute frequent patterns, we will look at ways to speed them up by sampling using the theoretical concepts from the PAC learning framework.
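As a sketch of the basic level-wise approach (an Apriori-style algorithm; this is illustrative code with invented names, not the course's reference implementation), frequent itemset mining can be done by growing candidates one item at a time and pruning via anti-monotonicity:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise (Apriori-style) frequent itemset mining.

    transactions: list of sets of items; min_support: absolute count.
    Returns {frozenset: support_count} for all frequent itemsets.
    Relies on anti-monotonicity: every subset of a frequent set is frequent.
    """
    def support(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    level = support(items)
    result = dict(level)
    k = 2
    while level:
        # Join: combine frequent (k-1)-sets into k-set candidates, then
        # prune candidates with an infrequent (k-1)-subset.
        prev = set(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = support(candidates)
        result.update(level)
        k += 1
    return result

# Toy market-basket example.
transactions = [{"bread", "milk"},
                {"bread", "butter"},
                {"bread", "milk", "butter"},
                {"milk", "butter"}]
freq = apriori(transactions, min_support=2)
```

The support-counting pass over all transactions is exactly the step that sampling (using the PAC-style guarantees above) can speed up: supports estimated on a sample are close to the true supports with high probability.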

Course form
Lectures and tutorials.

Literature
The slides, complemented by your own lecture notes, are in principle all you need. Background reading material is, however, also available:

  • For the first part of the course, we largely follow the first 8 chapters of the book "Understanding Machine Learning: From Theory to Algorithms" by Shai Shalev-Shwartz and Shai Ben-David.
    • You can legally download the book from a webpage of the first author.
    • You can, of course, also buy this book.
    • It is a good book, so if you want to become a data scientist, buying it is a sensible choice.
  • For the later parts of the course we will point to the papers that the lectures are based on. You can download these papers (again legally) from anywhere in the UU network.
Competences
-
Entry requirements
You must meet the following requirements
  • Admission decision for the Master's programme granted
Required materials
-
Teaching methods
Lecture

Tutorial

Tests
Final result
Weighting: 100
Minimum grade: -
