OSIRIS - Course catalogue INFOMDSS 2020

Course: INFOMDSS

Data science and society

Course information

Course code		INFOMDSS
Credits (EC)		7.5

Course objectives

This is the compulsory first course of the Business Informatics (MBI) programme and of the Applied Data Science profile. As such, its primary objective is to inspire you and to introduce you to the exciting domain of Applied Data Science. At the end of this course, you will be able to:

Understand the role of data science and its societal impact
Recognise the knowledge discovery processes in applied data science
Identify trends and developments in big data technologies
Apply selected big data technologies to solve real-world problems
Analyse unstructured data using natural language processing techniques
Understand the need for self-service data science

The short URL for the official course page is: http://bit.ly/dss2020-cs.
The official course schedule overview is available at: http://bit.ly/dss2020-overview.

Grading

The final grade will be determined based on the following course components:
[A] Mid-term exam
[B] End-term exam
[C] Optional bonus (or penalty) for extraordinary (or poor) participation/performance

Grade = [A]*0.50 + [B]*0.50 + [C]

Note that the minimum grade for each of these exams is 5.0. If your grade for one of the exams is between 4.0 and 5.5, you can repair that specific exam during the “second chance” session. It is not possible to repair both exams. You need a final grade of 6.0 or higher to PASS the course.
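As an illustration only, the grade formula and pass condition can be sketched in Python. The function names are hypothetical, no rounding is applied because none is specified here, and the check assumes that "minimum grade" means each exam must score at least 5.0. The repair rules are deliberately left out.

```python
def final_grade(mid_term, end_term, bonus=0.0):
    """Grade = [A]*0.50 + [B]*0.50 + [C]."""
    return mid_term * 0.50 + end_term * 0.50 + bonus


def passes_course(mid_term, end_term, bonus=0.0):
    # Assumed reading: each exam needs at least a 5.0,
    # and the final grade needs to be 6.0 or higher.
    exams_ok = mid_term >= 5.0 and end_term >= 5.0
    return exams_ok and final_grade(mid_term, end_term, bonus) >= 6.0
```

For example, a 7.0 on the mid-term, a 6.0 on the end-term and a bonus of 0.5 would give a final grade of 7.0 under this sketch.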

All course materials are examinable, including all lecture slides, assignments and weekly readings.

In order to qualify for the Repair Exam, ALL grade components need to be 5.0 or higher, and you also need to have PASSed at least 65% of the assignments.

Content

Applied Data Science

The first course topic that we cover is Applied Data Science (ADS) as positioned in (Braschler et al., 2019) and defined in (Spruit & Lytras, 2018) as “the knowledge discovery process in which analytic systems are designed and evaluated to improve the daily practices of domain experts”. As this is the core theme of the course, we cover the need for data scientists (e.g. Davenport & Patil, 2012) and relate this novel topic to the well-known domain of knowledge discovery processes (Chapman et al., 2000). We refer to standardised NIST definitions (Pritzker & May, 2015) to properly ground our ADS perspective.

Data Analytics

Data analytics is the multidisciplinary field which aims to make sense of data and observations from everyday life. Its data-driven approach to problem solving includes various methods and techniques. In this theme we focus on discussing why certain approaches work and which mistakes are commonly made, using (Lazer et al., 2014; Broniatowski et al., 2014) as a running example. We will also discuss data analytics tasks from both statistical and machine learning perspectives.

Big Data & Cloud Computing

The original trigger for this course was the inability of researchers to analyse datasets which were simply too big to process on a laptop. On the one hand, they can use someone else’s bigger computer (i.e. Cloud Computing); on the other hand, they can employ data analysis techniques that are designed to be limitlessly scalable. The prime example of such an analysis technique is MapReduce, which we will discuss both from the original Hadoop perspective (Dean & Ghemawat, 2008) and from the perspective of its successors within the increasingly popular Spark environment (Chambers & Zacharia, 2018). Furthermore, we also note the more philosophical implications of Big Data technologies using (Ambrose, 2015). How do we know that we know? What are the epistemological implications of Big Data analyses for the theory of knowledge? Would a historical perspective be helpful?
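To give a first feel for the MapReduce idea, here is a minimal single-machine sketch in plain Python of the classic word count example. It is not Hadoop or Spark code: the function names are hypothetical, and a real framework would distribute the map, shuffle and reduce phases over many machines.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values of one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

Because each line is mapped independently and each key is reduced independently, both phases could in principle run in parallel, which is what makes the approach scale.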

Natural Language Processing

We introduce the field of Natural Language Processing (NLP) as a key technology within data science and artificial intelligence. NLP applications are found wherever people communicate, including web search, scientific papers, emails, customer service, language translation, and clinical reports. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. For decades, however, NLP was mostly based on symbolic approaches. Current NLP research aims to meaningfully integrate these two paradigms to better understand human language. Therefore, we will first introduce you to some classical linguistic theories before moving on to more recent neural network-based NLP approaches, based on (Clark et al., 2013). Furthermore, the computational experiment assignment will allow you to experiment in more depth with a state-of-the-art approach within this fast-moving field of NLP.

Automated Machine Learning

As identified in (Spruit & Jagesar, 2016), one of the major challenges in correctly applying Machine Learning techniques in Applied Data Science projects is the so-called Selection vs Configuration dilemma. It is often quite hard to select the best algorithm for a given data analysis task, and even harder to properly configure its (hyper-)parameters, even for data scientists. One promising solution might be Automated Machine Learning (Hutter et al., 2019). AutoML promises to reduce the human effort necessary for applying machine learning, improve the performance of machine learning algorithms, and improve the reproducibility and fairness of scientific studies.
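The Selection vs Configuration dilemma can be illustrated with a toy sketch that searches over algorithms and their settings jointly. Everything here is an assumption made for illustration: the two tiny classifiers, the search space and the exhaustive search are hypothetical, and actual AutoML systems use far more sophisticated search strategies.

```python
from collections import Counter


def threshold_classifier(train, t):
    # Ignores the training data: predicts 1 when x exceeds the threshold t.
    return lambda x: 1 if x > t else 0


def knn_classifier(train, k):
    # Majority vote among the k training points closest to x.
    def predict(x):
        nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
    return predict


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical search space: algorithm name -> (builder, candidate settings).
SEARCH_SPACE = {
    "threshold": (threshold_classifier, [0.0, 2.5, 5.0]),
    "knn": (knn_classifier, [1, 3]),
}


def select_and_configure(train, valid):
    # Try every (algorithm, setting) pair and keep the first best scorer.
    best = None
    for name, (build, settings) in SEARCH_SPACE.items():
        for setting in settings:
            score = accuracy(build(train, setting), valid)
            if best is None or score > best[2]:
                best = (name, setting, score)
    return best
```

The point of the sketch is that selection and configuration are treated as one search problem, scored on held-out validation data, rather than as two separate manual decisions.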

Self-Service Data Science

In the Do-It-Yourself week you will work individually on an NLP computational experiment and experience the course vision of self-service data science. The assignment has many variations in datasets, language models and techniques.

Societal Impact

You decide which popular Data Science book with societal impact you read and pitch!

Other Trends

In the final lecture we will introduce other interesting data science techniques and developments which we could not cover in the course, but which may be worth investigating in a later course or research project.

Course form

This Corona edition of our course is structured somewhat differently. We keep the twice-a-week lecture slots, streamed via MS Teams. However, these sessions will mostly start with an interactive multiple choice quiz, which is just for fun and to informally test your current knowledge, followed by a general Q&A session for any remaining questions. These sessions will be recorded, and it is not mandatory to attend any lectures.

Regular lecture materials will be provided as videos that can be viewed at any time. This is why we will have regular quizzes, to test and help you check whether you actually watched and read all materials. The workshop sessions will take place online as well, in a standard asynchronous discussion channel format on MS Teams. Our TA and SAs will try to answer any queries as soon as possible in the Technical Support channel.

Throughout the course, you are given a number of individual (mostly quite small) assignments. The answers to the assignments are to be submitted to the appropriate channel in our DSS 2020 Teams group before the stated deadline (mostly one week after release). There will be no deadline extensions, so be sure to submit appropriately. These assignments will be assessed but not graded: you either PASS or FAIL. When you have FAILed 20 percent or more of the total number of assignments, you will have FAILed the course due to the 'inspanningsverplichting' (course effort) criterion. However, if you did PASS at least 65% of the assignments, you will be given the opportunity to do the REPAIR assignment (which is a relatively big assignment).

For example, with 16 assignments you need to PASS 13 of the 16 (~81%). With either 11 or 12 PASSes, you qualify for the substantial REPAIR assignment. Should you PASS only 10 (~63%) or fewer assignments, you have FAILed the course without a second chance.
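The assignment thresholds can be sketched as a small Python function. The function name and the returned labels are hypothetical. Integer arithmetic is used so that the 20 percent and 65% boundaries are compared exactly.

```python
def assignment_outcome(passed, total):
    """Classify the assignment result as 'PASS', 'REPAIR' or 'FAIL'."""
    failed = total - passed
    if failed * 100 < 20 * total:
        # Fewer than 20 percent of the assignments were FAILed.
        return "PASS"
    if passed * 100 >= 65 * total:
        # At least 65% PASSed: eligible for the REPAIR assignment.
        return "REPAIR"
    return "FAIL"


# Reproduce the 16-assignment example for 10 to 13 PASSes.
for passed in (13, 12, 11, 10):
    print(passed, assignment_outcome(passed, 16))
```

With 16 assignments this yields PASS for 13, REPAIR for 12 and 11, and FAIL for 10, matching the example.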

To help you complete the assignments, this class is also supported by the DataCamp learning platform for Python, SQL and more, through a combination of short expert videos and hands-on-the-keyboard exercises.

Literature

We provide PDFs for most if not all required literature.
