DATA MINING

Degree course: 
Second cycle degree in COMPUTER SCIENCE
Academic year when starting the degree: 
2020/2021
Year: 
1
Academic year in which the course will be held: 
2020/2021
Course type: 
Compulsory subjects, characteristic of the class
Language: 
English
Credits: 
6
Period: 
Second semester
Standard lectures hours: 
48
Breakdown of lecture hours: 
Lesson (48 hours)
Requirements: 

Basic knowledge of the topics covered in the Intelligent Systems course, taught in the first year of the master's degree.
Knowledge of at least one programming or scripting language is recommended.
Students are advised to have a laptop (Windows, Mac or Linux) that can run the Python interpreter.

Final Examination: 
Oral

The exam consists of a project and a set of questions on the topics covered in class.
The project is proposed by the student based on his or her interests; if the student has no specific proposal, the teacher assigns one. Projects typically ask students to implement simple methods of experimental investigation on data obtained from websites or on benchmark datasets available in online repositories. These investigations are meant to assess the student's ability to adapt the methods studied to real cases and their specific features. The project must be accompanied by a report describing its objectives, contents, results and conclusions.
The project is passed if it receives a grade of at least 18/30.
The theoretical test is a set of questions delivered through the Moodle platform and measures how well the student has learned the topics covered in class. About 30 minutes are normally enough to answer the approximately 20 questions, but there is no fixed time limit. The test grade is multiplied by a scale factor and then averaged with the project grade to obtain the final grade.
The project grade contributes significantly to the final grade.

Assessment: 
Final grade

The term Data Mining refers to a set of techniques and tools used to explore large amounts of data in order to identify and extract significant information and knowledge and make it available to decision-making processes.
This course provides the fundamentals of the discipline, focusing on the Data Mining techniques most relevant to current industrial applications.
The course combines the theory of Data Mining with the use of open-source Python software.
Participants will be guided in finding patterns within datasets and, using the tools provided by Python, will learn to preprocess data and to perform clustering, classification and forecasting.
In summary, the educational objectives and expected learning outcomes are the following:
- Acquire the knowledge and ability to use the Python language and typical data mining libraries in order to process datasets and run machine learning algorithms.
- Acquire basic knowledge of data pre-processing. At the end of the course, the student will be able to deal with the main data-related issues and to tackle a real problem independently.
- Understand a selection of supervised and unsupervised machine learning algorithms for extracting information from data. The aim is to give students the knowledge needed to solve real problems by identifying and developing appropriate machine learning algorithms. This part of the course consists of a theoretical component describing classic machine learning algorithms and a practical component in which these methods are applied using Python.
- Learn methodologies typical of web data mining, covering the main techniques for analyzing the traces that users leave on the web. Learning will be supported by numerous worked case studies.

The lessons will address the following topics:
Languages, libraries and tools for Data Mining (14 h, educational objective 1)
- IPython: intro, help, magic commands, debug.
- The Python language: introduction, data types, basic elements of the language.
- Libraries for data mining: NumPy for array data structures, pandas for data manipulation, Matplotlib for data visualization, scikit-learn for Machine Learning algorithms.
Data pre-processing (14 h, educational objective 2)
- Feature-Engineering
- Solutions for the management of missing data
- Training, validation and testing data
- Techniques for managing unbalanced datasets
- Bigrams, Stemming, Lemmatizing
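As an illustration of two of the pre-processing topics above (missing data and the training/validation/test split), here is a minimal sketch. It uses only the Python standard library so that it runs without extra packages; the function names impute_mean and train_val_test_split, the 60/20/20 proportions and the sample data are assumptions made for this example, not course material.

```python
import random
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of rows and cut it into three disjoint parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical column with two missing values.
ages = [20, None, 40, 30, None]
filled = impute_mean(ages)

# Hypothetical dataset of ten row indices.
train, val, test = train_val_test_split(range(10))
print(filled, len(train), len(val), len(test))
```

In a real workflow the imputation value would normally be computed on the training part only and then applied to the validation and test parts, so that no information leaks from the held-out data.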
Machine learning algorithms (14 h, educational objective 3)
- Introduction to Machine Learning
- Supervised learning: classification
- Decision tree classifier: algorithm and examples
- Ensemble models: boosting, bagging, stacking
- Support Vector Machine
- Linear regression, polynomial basis functions, Gaussian basis functions and regularization
- Frequent itemsets and association rules mining
- Hyperparameter tuning
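To make the frequent itemsets and association rules topic concrete, the sketch below computes itemset support and rule confidence on a toy set of transactions. It enumerates candidates by brute force rather than using Apriori-style pruning, which is only practical for very small examples; the function names, the min_support threshold and the transactions are assumptions made for this illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items.

    Support is the fraction of transactions that contain the itemset.
    Candidates are enumerated exhaustively (no Apriori pruning).
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            support = count / n
            if support >= min_support:
                result[cset] = support
    return result


def confidence(supports, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return supports[antecedent | consequent] / supports[antecedent]


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
supports = frequent_itemsets(transactions, min_support=0.5)
conf = confidence(supports, frozenset({"bread"}), frozenset({"milk"}))
print(supports[frozenset({"bread", "milk"})], conf)
```

Here the rule bread -> milk has support 2/4 and confidence (2/4) / (3/4) = 2/3, while the pair {butter, milk} falls below the threshold and is discarded.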
Web data mining (14 h, educational objective 4)
- Recommender systems, collaborative filtering, content-based filtering. Evaluation of recommender systems.
- Word2Vec, Sentiment Analysis, Document Sentiment Classification, Aspect-Based Sentiment Analysis.
- Information Extraction, Named Entity Recognition
- Social Networks, Face Recognition, Multimodal problems
The topics will be addressed using the Python programming language as a reference. Nevertheless, many of the topics covered in the course are of general validity, and the proposed techniques are generally applicable with different languages.
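As a small example of the collaborative filtering topic listed under web data mining, the following sketch predicts a missing rating as a similarity-weighted average of other users' ratings, with cosine similarity computed over co-rated items. This is one simple user-based variant among several; the function names, user names and ratings are assumptions made for this illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity restricted to the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for item.

    Returns None when no other user has rated the item.
    """
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None


# Hypothetical ratings on a 1-5 scale; "ann" has not rated item "C".
ratings = {
    "ann": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "cat": {"A": 1, "B": 5, "C": 2},
}
pred = predict(ratings, "ann", "C")
print(pred)
```

Because ann's ratings match bob's exactly on the co-rated items, bob's rating of C carries more weight than cat's, and the prediction lies between the two ratings, closer to bob's.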

The teaching material (slides, lecture notes, application examples) is available on the e-learning website.
In addition, students may find the following books useful to supplement the material provided through e-learning:
- Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage
Authors: Zdravko Markov, Daniel T. Larose
Publisher: WILEY-INTERSCIENCE
- Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data - July 2011
Author: Bing Liu
Publisher: Springer Publishing Company, Incorporated
- Python Data Science Handbook: Essential Tools for Working with Data
Author: Jake VanderPlas
Publisher: O'Reilly

Conventional

48 hours of lectures.
Lectures are held in the classroom and alternate theory with practical exercises.
The analysis software used is Python, an open-source language that can be freely downloaded from the web.
IPython, an open-source, browser-based tool that lets students create and edit documents containing code, visualizations and text, will be used during lectures.
During the course, additional analysis packages needed for the topics discussed in the lectures will be downloaded and installed.
The teacher provides continuous assistance in the classroom.

The teacher receives students by appointment, on request sent by e-mail to name.surname@uninsubria.it. The teacher replies only to signed e-mails sent from the students.uninsubria.it domain.
