ECTS credits: 6
Hours (ECTS Rules/Memories): Student's work: 108 | Tutorials: 1 | Expository classes: 21 | Interactive classes: 20 | Total: 150
Languages used: Spanish, Galician, English
Type: Ordinary subject Master’s Degree RD 1393/2007 - 822/2021
Center Higher Technical Engineering School
Call: Second Semester
Teaching: With teaching
Enrolment: Enrollable | 1st year (Yes)
The growing volume of information available through the Internet calls for efficient processing of large amounts of data. This has led to the development of new storage and processing techniques, known as Big Data techniques, that adapt naturally to distributed systems.
The main goal of this subject is to learn suitable processing techniques for large amounts of information in the Big Data world, particularly using the Hadoop ecosystem, and compare these techniques with the traditional ones employed in HPC environments. This will allow the student to select the optimal tools to solve a particular problem.
1. Introduction to Data Engineering
1.1 HPC vs Big Data: similarities and differences in data management.
1.2 Hardware and Software Technologies for High Performance Data Engineering
1.3 Data Engineering in HPC infrastructures vs. Cloud environments
2. Data Engineering phases
2.1 Modeling (Formats, Compression, Designing Schemas)
2.2 Intake (Periodicity, Transformations, Tools)
2.3 Storage (HDFS and NoSQL DBs, HBase, MongoDB, Cassandra)
2.4 Processing (Batch, Real-Time)
2.5 Orchestration
2.6 Analysis (SQL, Machine Learning, Graphs, UI)
2.7 Governance
2.8 Integration with BI (Visualization)
3. Introduction to Data Analytics
3.1 Exploratory Data Analytics
3.2 Introduction to Machine Learning
4. Use cases
4.1 Applications to Internet of Things (Smart environments and Industry 4.0)
4.2 Applications to sciences and engineering
Basic bibliography
- T. White, "Hadoop: The Definitive Guide", 4th Edition, O'Reilly, 2015
- W. McKinney, "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython", 2nd Edition, O'Reilly, 2017
Additional bibliography
- A. Holmes, "Hadoop in Practice", 2nd Edition, Manning, 2014
- The student will be capable of installing, configuring, and managing the basic software for massive data processing.
- The student will be capable of coding massive data processing applications using domain-specific languages.
- The student will learn about Data Engineering tools (for Intake/Storage/Processing/Visualization).
- The student will acquire the skills to search for, select, and manage Big Data-related resources (bibliography, software, etc.).
Skills
- Basic: CG1, CG3, CG5, CB6, CB7.
- General: CT1, CT4.
- Specific: CE1, CE2
- Theory classes, in which the content of each topic is presented. Students will have all the necessary material before the class, and the professor will encourage active participation by asking questions that clarify specific aspects while leaving open questions for students to reflect on.
- All teaching materials will be available to students on a virtual platform, which for this course will be Aula Cesga, https://aula.cesga.es/.
- Practical classes in the laboratory and in the classroom, in which students perform directed tasks that allow them to become familiar, from a practical point of view, with the contents presented in the theory classes.
- Development of assignments, in which the students have to apply the knowledge acquired in order to solve different problems in an autonomous way.
- Directed discussion. Guidance to solve individual / group assignments, problem solving and continuous evaluation activities.
- Follow-up support: orientation for the development of the assignments, resolution of doubts, etc.
Face-to-face learning activities and their relation to the competences of the degree:
Theory classes CB6, CE1, CE2, CT4
Practical classes in laboratory CB6, CG3, CG5
Follow-up support CB6, CB7
Non-face-to-face learning activities and their relation to the competences of the degree:
Practical classes in laboratory CB6, CG3, CG5
Development of academically directed assignments CB6, CB7, CG3, CE1, CE2
Directed discussion CG1, CT1, CT4
Laboratory practice. Grading the assignments submitted by students: 50%
Supervised projects. Grading the supervised projects submitted by students: 50%
Not graded: Students who do not submit any practical exercise or guided project will not be graded.
Second opportunity (June/July): Students may submit laboratory practices or supervised projects not previously presented, or improved versions of previously presented practices/projects.
In cases of fraudulent performance of exercises or tests, the Normativa de avaliación do rendemento académico dos estudantes e de revisión de cualificacións will apply.
In application of the Normativa da ETSE sobre plaxio (approved by the ETSE Council on 12/19/2019), total or partial copying of any exercise will result in failure in both opportunities of the course, with a grade of 0.0 in both cases.
- Theory classes: 18h face-to-face + 0h autonomous work (total 18h)
- Practical classes in laboratory: 20h face-to-face + 60h autonomous work (total 80h)
- Directed discussion: 3h face-to-face + 3h autonomous work (total 6h)
- Follow-up support: 1h face-to-face + 0h autonomous work (total 1h)
- Development of assignments: 0h face-to-face + 45h autonomous work (total 45h)
TOTAL: 42h face-to-face + 108h autonomous work, for a total of 150h
Because the subject has a large practical component, students are advised to keep up to date with the practices and guided projects throughout the semester.
The course makes intensive use of online communication tools: video calls, chats, etc. In-person classes will be recorded for later viewing. An online learning management system will be used to distribute notes, create forums, etc.
The software tools used in this course are generally open-source or have free licenses for students.
The subject will be taught in English.