DATA MINING
Prerequisites: the basic contents of the Intelligent Systems course, delivered in the first year of the master's degree.
The course is recommended for students who know at least one programming or scripting language.
Students are advised to have a laptop (Windows, Mac or Linux) capable of running the Python interpreter.
The exam consists of a theory test and a project on one or more topics addressed in class.
The theory test is a set of questions presented to the student through the Moodle platform; it assesses how well the student has learned the topics covered in class.
The project is proposed by the teacher. In the project, students must implement either:
- an existing model applied to a dataset on which it has never been used, in which case they must try to improve on the state of the art for that dataset; or
- an innovative improvement to an existing model that achieves similar or better results on the datasets to which the original model has already been applied.
The project must be accompanied by a report that introduces the problem, analyzes the existing literature, describes the project in detail, presents and discusses the results obtained, and draws conclusions. The project can be presented and discussed once the theory test has been taken, regardless of the grade obtained.
There is no minimum mark for either test: any mark obtained allows the student to proceed to the next test or to receive a final mark. The theory test grade and the project grade are averaged to obtain a base score, which the teacher may raise slightly if the project presents original ideas or particularly interesting results. The exam is passed if the final grade is at least 18/30.
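The grading rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two grades are given equal weight (the text says only that they are "averaged together"), no rounding is applied because none is specified, and the names `final_grade`, `is_passed` and `bonus` are hypothetical.

```python
PASS_MARK = 18  # out of 30, as stated in the assessment rules


def final_grade(theory, project, bonus=0.0):
    """Average the two grades (equal weights assumed) and add the discretionary bonus."""
    for grade in (theory, project):
        if not 0 <= grade <= 30:
            raise ValueError("grades are expressed out of 30")
    return (theory + project) / 2 + bonus


def is_passed(grade):
    """The exam is passed when the final grade is at least 18/30."""
    return grade >= PASS_MARK
```

For example, a theory grade of 16 and a project grade of 22 give a base score of 19.0, which passes even though the theory grade alone is below 18.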
The term Data Mining refers to a set of techniques and tools used to explore large amounts of data, to identify and extract significant information and knowledge, and to make it available to decision-making processes.
This course provides the fundamentals of the discipline and then concentrates on Data Mining techniques of current applied and industrial interest, with an emphasis on Deep Learning.
The course combines theory with hands-on work using open-source software and the Python language.
Using the tools provided by Python, participants will learn to preprocess data (images and text) and to extract additional information from the data by applying some of the models analyzed in the course.
In summary, the educational objectives of the course and the expected learning outcomes are the following:
1) Acquire the knowledge and ability to use the Python language and the libraries typical of Data Mining, to process datasets and run machine learning algorithms.
2) Acquire basic knowledge of data pre-processing. At the end of the course, the student will be able to deal with the main data-related problems and to tackle a real problem independently and effectively.
3) Understand some machine learning algorithms of current applied and industrial interest, with particular focus on successful Deep Learning models, both supervised and unsupervised, for extracting information from data.
4) Analyze some real problems in which the models studied in the course can produce results that in many cases represent the state of the art.
In addition to the learning objectives described above, the course aims to develop transferable skills: a critical attitude in evaluating solutions obtained by the student or proposed by third parties, the ability to learn new machine learning approaches to data analysis independently, the ability to analyze existing literature, and the ability to communicate results in scientific language.
The lessons will address the following topics:
Languages, libraries, and tools for Data Mining (4 h, learning objective 1)
- IPython: introduction, help, magic commands, debugging.
- The Python language: introduction, data types, basic elements of the language.
- Libraries for data mining: NumPy as a data structure, pandas for data manipulation, Matplotlib for data visualization, scikit-learn for the use of Machine Learning algorithms, PyTorch to implement the neural models of Deep Learning.
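As an illustration of what "NumPy as a data structure" means in practice, the following minimal sketch stores a small dataset as a 2-D array and applies a few typical operations. The values are a hypothetical toy dataset invented for the example; it is not course material.

```python
import numpy as np

# Hypothetical toy dataset: 4 samples (rows) x 3 features (columns).
X = np.array([
    [1.0, 200.0, 0.5],
    [2.0, 180.0, 0.7],
    [3.0, 220.0, 0.2],
    [4.0, 240.0, 0.6],
])

n_samples, n_features = X.shape

# Column-wise statistics: axis=0 reduces over the samples.
feature_means = X.mean(axis=0)

# Broadcasting: the (3,) vector of means is subtracted from every row
# of the (4, 3) array without an explicit loop.
X_centered = X - feature_means

# Boolean masking: keep only the samples whose second feature exceeds 200.
selected = X[X[:, 1] > 200.0]
```

Here `feature_means` is `[2.5, 210.0, 0.5]` and `selected` contains the last two rows. The same array layout (samples by features) is the one scikit-learn estimators expect as input.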
Data pre-processing (10 h, learning objective 2)
- Solutions for missing data management
- Techniques for the management of unbalanced datasets
- Normalization
- Bigrams, stemming, lemmatization
- Data augmentation
- Text representation and word embeddings: one-hot encoding, TF-IDF, Word2Vec, GloVe, BERT
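Some of the pre-processing steps listed above can be sketched with the Python standard library alone. The code below is a minimal illustration, not the course's implementation: all function names are hypothetical, missing values are assumed to be marked with `None`, mean imputation and min-max scaling are just one common choice for each step, and the TF-IDF variant used (term count divided by document length, times the natural log of N/df) is one of several in use; libraries such as scikit-learn apply a smoothed formula that gives different numbers.

```python
import math
from collections import Counter


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(tokens):
    """Map each distinct token to a vector containing a single 1."""
    vocab = sorted(set(tokens))
    return {
        tok: [1 if i == pos else 0 for i in range(len(vocab))]
        for pos, tok in enumerate(vocab)
    }


def tf_idf(documents):
    """Return one {token: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each token occurs.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    weights = []
    for doc in tokenized:
        counts = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights
```

For example, `min_max_normalize(impute_mean([20, None, 40]))` returns `[0.0, 0.5, 1.0]`. With the two documents `"deep learning models"` and `"deep neural networks"`, the token `deep` gets weight 0 in both because it occurs in every document, while the other tokens get positive weights.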
Machine learning algorithms (34 h, learning objectives 3 and 4)
- Machine Learning and Deep Learning
- Deep neural networks
-- Convolutional Neural Networks (CNN)
-- AutoEncoders (AE)
-- Generative models (GAN)
-- Transformers (GPT)
-- Transfer learning
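To give a concrete idea of the operation at the core of a Convolutional Neural Network, the sketch below applies a single 2-D filter followed by a ReLU activation using NumPy. It is a simplified illustration under stated assumptions: stride 1, no padding, one channel, and a hand-written kernel (in a real CNN the kernel values are learned during training, and a framework such as PyTorch would be used instead of explicit loops). The function names and the toy image are hypothetical. As in most deep learning frameworks, the kernel is not flipped, so the operation is technically a cross-correlation.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def relu(x):
    """Rectified linear unit: negative responses are set to zero."""
    return np.maximum(x, 0.0)


# Toy 4x4 image with a vertical edge between the second and third columns.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)

feature_map = relu(conv2d_valid(image, kernel))
```

The resulting 3x3 feature map has the value 2 in its middle column and 0 elsewhere, that is, the filter fires exactly where the edge is located.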
48 hours of lectures.
Lectures take place in the classroom and alternate theory with practical exercises.
The software used will be Python, an open-source language that can be freely downloaded from the web.
IPython, used through its open-source, browser-based notebook interface (now distributed as Jupyter Notebook), lets students create and edit documents that combine code, visualizations and text; it will be used during lectures.
During the course, additional packages needed for the topics discussed in lectures will be downloaded and installed.
In the classroom, continuous assistance is provided by the teacher.
The teacher receives students by appointment, on request sent by e-mail to name.surname@uninsubria.it. The teacher responds only to signed e-mails sent from the students.uninsubria.it domain.
Borrowers
- Degree course in: COMPUTER SCIENCE