This is the mandatory introductory course for the Business Informatics (MBI) programme as well as the Applied Data Science profile. As such, its primary objective is to inspire you and introduce you to the exciting domain of Applied Data Science. At the end of this course, you will be able to:
- Understand the role of data science and its societal impact
- Recognise the knowledge discovery processes in applied data science
- Identify trends and developments in big data technologies
- Apply selected big data technologies to solve real-world problems
- Analyse unstructured data using natural language processing techniques
- Understand the need for self-service data science
Please note that the official course page is http://www.cs.uu.nl/education/vak.php?stijl=2&vak=INFOMDSS. Below is an excerpt from the official course documentation.
Applied Data Science
The first course topic that we cover is Applied Data Science (ADS) as positioned in (Braschler et al., 2019) and defined in (Spruit & Lytras, 2018) as “the knowledge discovery process in which analytic systems are designed and evaluated to improve the daily practices of domain experts”. As this is the core theme of the course, we cover the need for data scientists (e.g. Davenport & Patil, 2012) and relate this novel topic to the well-known domain of knowledge discovery processes (Chapman et al., 2000). We refer to standardised NIST definitions (Pritzker & May, 2015) to properly ground our ADS perspective.
Data analytics is the multidisciplinary field which aims to make sense of data and observations from everyday life. Its data-driven approach to problem solving includes various methods and techniques. In this theme we focus on discussing why certain approaches work, what common mistakes are made, and so on, using (Lazer et al., 2014; Broniatowski et al., 2014) as a running example. We will also discuss data analytics tasks from both statistical and machine learning perspectives.
Big Data & Cloud Computing
The original trigger for this course was the inability of researchers to analyse datasets that were simply too big to process on a laptop. On the one hand, they can use someone else’s bigger computer (i.e. Cloud Computing); on the other hand, they can employ data analysis techniques that are designed to scale to arbitrarily large datasets. The prime example of such a technique is MapReduce, which we will discuss both from the original Hadoop perspective (Dean & Ghemawat, 2008) and from the perspective of its successors within the increasingly popular Spark environment (Chambers & Zacharia, 2018). Furthermore, we note the more philosophical implications of Big Data technologies using (Ambrose, 2015). How do we know that we know? What are the epistemological implications of Big Data analyses for the theory of knowledge? Would a historical perspective be helpful?
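To give a first impression of the MapReduce idea, here is a minimal single-machine sketch of the classic word-count example in plain Python. It is an illustration only, not the Hadoop or Spark API: the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical, and a real framework would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key (done by the framework in practice)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data science needs data"]
    print(word_count(docs))
```

Because each call to `map_phase` looks at one document only, and each call to `reduce_phase` at one key only, both steps can in principle be distributed without the workers having to coordinate. That independence is what makes the approach scale.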
Natural Language Processing
We introduce the field of Natural Language Processing (NLP) as a key technology within data science and artificial intelligence. NLP is applied wherever people communicate, including web search, scientific papers, emails, customer service, language translation, and clinical reports. Recently, deep learning approaches have achieved very high performance across many different NLP tasks. For decades, however, NLP was mostly based on symbolic approaches. Current NLP research aims to meaningfully integrate these two paradigms to better understand human language. Therefore, we will first introduce you to some classical linguistic theories before moving on to more recent neural network-based NLP approaches, based on (Clark et al., 2013). Furthermore, the computational experiment assignment will allow you to experiment in more depth with a state-of-the-art approach within this fast-moving field.
Automated Machine Learning
As identified in (Spruit & Jagesar, 2016), one of the major challenges in correctly applying Machine Learning techniques in Applied Data Science projects is the so-called Selection vs Configuration dilemma. Even for data scientists, it is often quite hard to select the best algorithm for a given data analysis task, and even harder to properly configure its (hyper-)parameters. One promising solution might be Automated Machine Learning (Hutter et al., 2019). AutoML promises to reduce the human effort necessary for applying machine learning, improve the performance of machine learning algorithms, and improve the reproducibility and fairness of scientific studies.
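The Selection vs Configuration dilemma can be made concrete with a small sketch in plain Python. It assumes two toy classifiers on one-dimensional data and a hand-written search space, and it simply tries every combination of algorithm and hyperparameter value on a validation set. All names (`SEARCH_SPACE`, `select_and_configure`, the toy data) are hypothetical, and exhaustive search is a naive stand-in here: actual AutoML systems use far more efficient search strategies.

```python
from itertools import product


def threshold_classifier(train, threshold):
    """Predict 1 when x exceeds a fixed threshold (training data is unused)."""
    return lambda x: 1 if x > threshold else 0


def knn_classifier(train, k):
    """Predict by majority vote among the k nearest training points."""
    def predict(x):
        neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
        votes = sum(label for _, label in neighbours)
        return 1 if 2 * votes > len(neighbours) else 0
    return predict


# Hypothetical search space: each algorithm with a grid of hyperparameters.
SEARCH_SPACE = {
    "threshold": (threshold_classifier, {"threshold": [0.0, 2.5, 5.0]}),
    "knn": (knn_classifier, {"k": [1, 3]}),
}


def accuracy(predict, data):
    """Fraction of (x, label) pairs that the classifier predicts correctly."""
    return sum(predict(x) == label for x, label in data) / len(data)


def select_and_configure(train, validation):
    """Return (score, algorithm name, configuration) of the best candidate."""
    best = None
    for name, (build, grid) in SEARCH_SPACE.items():
        keys = list(grid)
        for values in product(*(grid[key] for key in keys)):
            config = dict(zip(keys, values))
            score = accuracy(build(train, **config), validation)
            # Keep the first candidate that reaches the highest score.
            if best is None or score > best[0]:
                best = (score, name, config)
    return best


if __name__ == "__main__":
    train = [(0.5, 0), (1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (6.0, 1)]
    validation = [(0.2, 0), (1.5, 0), (2.8, 1), (3.5, 1)]
    print(select_and_configure(train, validation))
```

Even in this toy setting the outer loop (which algorithm?) and the inner loop (which settings?) multiply, so the number of candidates grows quickly with every algorithm or hyperparameter that is added. This is the effort that AutoML aims to take off the analyst's hands.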
Self-Service Data Science
In the Do-It-Yourself week you will work individually on an NLP computational experiment and experience the course vision of self-service data science. The assignment has many variations in datasets, language models and techniques.
You choose which popular data science book with societal impact you will read and pitch!
In the final lecture we will introduce other interesting data science techniques and developments which we could not cover in the course, but which may be worth investigating in a later course or research project.
Even though this course is not a programming course, you are required to write various data analysis scripts. Therefore, if you don't have any script programming experience yet, it is advisable to familiarise yourself beforehand by taking an online introductory Python programming course.
Required material
We provide literature on all topics as listed in the official course documentation.
General: This course has two lectures per week. The slides are made available afterwards on our DSS Teams group.
General: Throughout the course, you are given a number of individual assignments. Your answers are to be submitted to the appropriate channel in our DSS Teams group.