\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\begin{document}
\title{Infrastructures and Big Data processing as pillars in the XXXIII PhD course in Data Science and Computation}
%\address{Production Editor, \jpcs, \iopp, Dirac House, Temple Back, Bristol BS1~6BE, UK}
\author{D. Salomoni$^1$, A. Costantini$^1$, C. D. Duma$^1$, B. Martelli$^1$, D. Cesini$^1$, E. Fattibene$^1$, D. Michelotto $^1$
% etc.
}
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{davide.salomoni@cnaf.infn.it}
\begin{abstract}
During the Academic year 2017-2018 the Alma Mater Studiorum, University of Bologna (IT) activated the XXXIII PhD course in Data Science and Computation.
The course runs for four years and is devoted to students who graduated in the fields of Mathematical, Physical, Chemical and Astronomical Sciences.
The course builds upon fundamental data science disciplines to train candidates able to carry out academic and industrial research
at a higher level of abstraction, with final specializations in several different fields where data analysis and computation become prominent.
In this respect, INFN-CNAF was responsible for two courses: Infrastructure for Big Data processing Basic (IBDB) and Advanced (IBDA).
\end{abstract}
\section{Introduction}
During the Academic year 2017-2018 the Alma Mater Studiorum, University of Bologna (IT) activated the XXXIII PhD course in Data Science and Computation.
The PhD course is based on a joint collaboration of the University of Bologna with Politecnico di Milano, the Golinelli Foundation, the Italian Institute
of Technology, Cineca, the ISI Foundation and INFN. Even though they are all Italian, each of the aforementioned institutions has already achieved a renowned
international role in the emerging field of scientific data management and processing. In addition, during its lifetime the Course intends to discuss,
design and establish a series of international initiatives, including the possibility to reach agreements with foreign Universities and Research Institutions to
issue, for example, joint doctoral degrees, co-tutorships and student exchanges. These activities will also be carried out based on the contribution that each
member of the Course Board will provide.
The PhD course runs for four years and is aimed at training people to carry out academic and industrial research at a level of abstraction that
builds atop the individual scientific skills that lie at the basis of the field of ``Data Science''.
Drawing on this, students who graduated in the fields of Mathematical, Physical, Chemical and Astronomical Sciences should produce original and significant
research in terms of scientific publications and innovative applications, blending basic disciplines and finally specializing in specific fields among those
listed in the following ``Curricula and Research'' topics:
\begin{itemize}
\item Quantitative Finance and Economics
\item Materials and Industry 4.0
\item Genomics and bioinformatics
\item Personalised medicine
\item Hardware and Infrastructure
\item Machine learning and deep learning
\item Computational physics
\item Big Data, Smart Cities \& Society
\end{itemize}
In this respect, INFN-CNAF was responsible for two courses: Infrastructure for Big Data processing Basic (IBDB) and Advanced (IBDA).
Davide Salomoni was the person in charge of both courses.
\section{Activities to be carried out during the Course}
At the beginning of the course each student is supported by a supervisor, a member of the Collegio dei Docenti (Faculty Board), who guides him or her throughout
the PhD studies. The first 24 months are devoted to the integration and deepening of the student's expertise, according to a personalized learning plan
(drawn up by the student in agreement with the supervisor and then submitted to the Board for approval). The learning plan foresees reaching 40 CFU
(credits) by attending courses and passing the corresponding exams. By the 20th month (from the beginning of the course) the student must submit a
written thesis proposal to the Board for approval. By the end of the 24th month the student must have completed the personalized learning plan and
must report on the progress of the thesis draft. Admission from the first to the second year is considered by the Board (and approved in the positive case)
on the basis of the candidate having obtained an adequate number of CFU. Admission from the second to the third year is considered by the Board (and approved
in the positive case) if the candidate has obtained all the CFU and on the basis of a public presentation of his/her thesis proposal. The third and the fourth
years are entirely devoted to the thesis work. Admission from the third to the fourth year is considered by the Board (and approved in the positive case) on
the basis of a public presentation of the current status of his/her thesis. The Board finally approves the admission to the final exam, on the basis of the
reviewers' comments. The Board may
authorize a student to spend a period in Italy at universities, research centers or companies. It is mandatory for the student to spend a period of at
least 3 months abroad, during the 3rd/4th year of the course.
\section{Infrastructure for Big Data processing}
As already mentioned, the teaching units Infrastructure for Big Data processing Basic (IBDB) and Advanced (IBDA), headed by Davide Salomoni with the
support of the authors, have been an integral part of the PhD course and were included in the personalized learning plan of some PhD students.
In order to make the teaching material available and to enable an active interaction between teachers and students, a Content
Management System (CMS) has been deployed and made available. The CMS chosen for this activity was Moodle \cite{moodle}, and both courses
have been made available through it via a dedicated link (https://moodle.cloud.cnaf.infn.it/).
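
As an illustration of how such a service can be set up, the following minimal sketch (in Python, using the Docker SDK) shows one possible way to start a Moodle instance as a container. It assumes a locally running Docker Engine and the publicly available bitnami/moodle image; the actual deployment behind the link above is not described here and may differ.
\begin{verbatim}
# Illustrative sketch only: starting a Moodle instance as a container.
# Assumes a running Docker Engine and the public "bitnami/moodle" image;
# a database backend (e.g. MariaDB) would also be needed for a real setup.
import docker

client = docker.from_env()
moodle = client.containers.run(
    "bitnami/moodle:latest",      # assumed community image
    name="moodle-cloud",
    ports={"8080/tcp": 80},       # map Moodle's HTTP port to host port 80
    detach=True,
)
print(moodle.name, moodle.status)
\end{verbatim}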
\subsection{Infrastructure for Big Data processing Basic}
The course is aimed at providing basic concepts of Cloud computing at the Infrastructure-as-a-Service (IaaS) level. The course started with an introduction to
Big Data and continued with a description of the building blocks of modern data centers and how they are abstracted by the Cloud paradigm. A real-life
computational challenge was also given, and students had to create (during the course) a cloud-based computing model to solve this challenge. Access
to a limited set of Cloud resources and services was granted to the students in order to complete the exercises. A very brief introduction to High Performance
Computing (HPC) was also given. Notions about the emerging ``fog'' and ``edge'' computing paradigms and how they are linked to Cloud infrastructures concluded the course.
The course foresees an oral exam focusing on the presented topics. Students have been requested to prepare a small project to be discussed during the exam.
The IBDB course covered the following topics:
\begin{itemize}
\item Introduction to IBDB: an introduction to the course and its objectives was given to the students, together with a presentation of the computational challenge to be addressed during the course.
\item Datacenter building blocks: basic concepts related to batch systems, queues, allocation policies, quotas, etc. and a description of the different storage systems were provided.
Moreover, an overview of networking, monitoring and provisioning concepts was given.
\item Infrastructures for Parallel Computing: High Throughput versus High Performance Computing were described and analysed.
\item Cloud Computing: an introduction to Cloud IaaS was provided and public and private clouds were compared. Hands-on sessions showed
how to use the IaaS stack layer, deploy virtual resources and create different components (see the sketch after this list).
\item Creating a computing model in distributed infrastructures and multi-site Clouds: an overview of the common strategies for Job Submission, Data Management, Failover
and Disaster Recovery was given. Moreover, a discussion on computing model creation and an introduction to the projects requested for the examination were started.
\item Computing Continuum: an introduction to Low Power devices, Edge Computing, Fog Computing and the Computing Continuum for Big Data Infrastructures was presented.
\end{itemize}
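
As referenced in the Cloud Computing item above, the following minimal sketch (in Python, using the openstacksdk library) illustrates the kind of IaaS operation exercised in the hands-on sessions, i.e. deploying a virtual resource. The cloud entry, image, flavor and network names are placeholders, and the specific IaaS stack used in the course is not detailed here.
\begin{verbatim}
# Illustrative sketch only: creating a virtual machine on an
# OpenStack-like IaaS. Cloud entry, image, flavor and network are
# placeholders; credentials are read from clouds.yaml.
import openstack

conn = openstack.connect(cloud="my-iaas")     # placeholder cloud entry

server = conn.create_server(
    name="ibdb-exercise-vm",
    image="Ubuntu-18.04",                     # placeholder image name
    flavor="m1.small",                        # placeholder flavor name
    network="private-net",                    # placeholder network name
    wait=True,                                # block until the VM is active
)
print(server.id, server.status)
\end{verbatim}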
\subsection{Infrastructure for Big Data processing Advanced}
The course is aimed at discussing the foundations of Cloud computing and storage services beyond IaaS (PaaS and SaaS), leading the students to understand how to
exploit distributed infrastructures for Big Data processing.
The IBDA course is intended as an evolution of IBDB; therefore, students should have already completed IBDB, or be familiar with the topics it covers, before attending this course.
At the end of the course, the students had practical and theoretical knowledge of distributed computing and storage infrastructures, cloud computing and virtualization,
parallel computing and their application to Big Data Analysis.
The course foresees an oral exam focusing on the presented topics. Students have been requested to prepare a small project to be discussed during the exam.
The IBDA course covered the following topics:
\begin{itemize}
\item Introduction to IBDA: an introduction to the course and its objectives was given to the students. Moreover, a general presentation of Clouds beyond
IaaS, with the INDIGO-DataCloud architecture as a concrete example, was discussed.
\item Authentication and Authorization: principles of Cloud authentication and authorization (X.509, SAML, OpenID Connect, LDAP, Kerberos, username/password,
OAuth) were presented, with a focus on the INDIGO-IAM (Identity and Access Management) tool \cite{iam}. The session also included a set of hands-on exercises related to
1) connecting to INDIGO IAM, 2) adapting a web-based application to use IAM, 3) connecting multiple AuthN methods (a minimal token-request sketch is given after this list).
\item Cloud PaaS: an overview of PaaS and related examples was provided, together with a high-level description of the TOSCA \cite{tosca} standard for PaaS automation.
Hands-on sessions covered TOSCA templates and Alien4Cloud \cite{a4c}.
\item Non-POSIX Cloud Storage: lessons provided the students with the basic concepts of POSIX and Object storage, with practical examples and hands-on sessions on CEPH \cite{ceph}.
\item Containers: the origin of containers, Docker \cite{docker} and Dockerfiles, automation with Docker Swarm and security considerations about containers were presented. Moreover,
a description of how to run Docker containers in user space with udocker \cite{udocker} was given. Hands-on sessions on how to create a container, work with Docker versions and deploy a container
in a Cloud completed the session (see the build-and-run sketch after this list).
\item Resource orchestration: the local orchestration of resources with Kubernetes \cite{kubernetes} and Mesos \cite{mesos} was described, with a focus on how Information
Systems and the INDIGO Orchestrator \cite{orchestrator} can be used to orchestrate resources remotely. A hands-on session to create and deploy an HTCondor cluster over a Cloud was also provided to the students (a minimal Kubernetes sketch is also given after this list).
\item Distributed File Systems: the basic concepts of Storj, IPFS and Onedata were described. For the last topic, a hands-on session on how to store and replicate files at multiple sites with Onedata was provided.
\item Cloud automation: the basic concepts of configuration management automation were described, with the session focusing on the Ansible \cite{ansible} configuration manager and its relation to TOSCA templates.
\end{itemize}
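
For the Authentication and Authorization item, the following minimal sketch (in Python, using the requests library) shows a standard OAuth2 client-credentials token request against an OpenID Connect provider such as INDIGO IAM. The endpoint URL and the client credentials are placeholders for a client previously registered in IAM.
\begin{verbatim}
# Illustrative sketch only: obtaining an access token from an OpenID
# Connect provider such as INDIGO IAM via the standard OAuth2
# client-credentials grant. URL and credentials are placeholders.
import requests

TOKEN_ENDPOINT = "https://iam.example.org/token"    # placeholder IAM instance

resp = requests.post(
    TOKEN_ENDPOINT,
    data={"grant_type": "client_credentials"},
    auth=("my-client-id", "my-client-secret"),      # placeholder credentials
)
resp.raise_for_status()
token = resp.json()["access_token"]

# The token is then presented as a Bearer credential to protected services.
print(requests.get("https://api.example.org/protected",    # placeholder
                   headers={"Authorization": "Bearer " + token}).status_code)
\end{verbatim}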
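
For the Containers item, the following minimal sketch (in Python, using the Docker SDK) mirrors the build-and-run part of the hands-on: an image is built from a local Dockerfile and then executed as a container. The build directory, tag and command are placeholders.
\begin{verbatim}
# Illustrative sketch only: building an image from a local Dockerfile
# and running it as a container. Path, tag and command are placeholders.
import docker

client = docker.from_env()

# Build an image from a Dockerfile in the current directory.
image, build_log = client.images.build(path=".", tag="ibda/hello:latest")

# Run the freshly built image and print its output.
output = client.containers.run("ibda/hello:latest",
                               command="echo hello from IBDA",
                               remove=True)
print(output.decode())
\end{verbatim}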
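
For the Resource orchestration item, the following minimal sketch (in Python, using the official Kubernetes client) shows a basic interaction with a cluster by listing the pods in a namespace; the actual hands-on, which deployed an HTCondor cluster through the INDIGO Orchestrator, is not reproduced here.
\begin{verbatim}
# Illustrative sketch only: a basic interaction with a Kubernetes
# cluster. Assumes a kubeconfig file is already configured locally.
from kubernetes import client, config

config.load_kube_config()          # read credentials from ~/.kube/config

v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.status.phase)
\end{verbatim}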
\section{Conclusions}
Based on a joint collaboration of the University of Bologna with Politecnico di Milano, the Golinelli Foundation, the Italian Institute of Technology,
Cineca, the ISI Foundation and INFN, the XXXIII PhD course in Data Science and Computation has been activated.
The course is aimed at training people to carry out academic and industrial research at a level of abstraction
that builds atop the individual scientific skills that lie at the basis of the field of Data Science.
As part of the PhD course, the teaching units Infrastructure for Big Data processing Basic (IBDB) and Advanced (IBDA) have
been included in the personalized learning plan of some PhD students. The teaching units are aimed at providing the foundations of Cloud
computing and storage services, from IaaS to PaaS and SaaS, leading the students to understand how to exploit distributed infrastructures for Big Data processing.
As an expected result, original, relevant and significant research activities are due by the end of the Course; these can take different forms, including
for example: scientific publications, system and software design, realization and production, and any kind of innovative applications specializing on a
broad gamut of topics, such as for example: Quantitative Finance and Economics; Materials and Industry 4.0; Genomics and bioinformatics; Personalised
medicine; Hardware and Infrastructure; Machine learning and deep learning; Computational physics; Big Data, Smart Cities \& Society.
\section{References}
\begin{thebibliography}{}
\bibitem{moodle}
Web site: https://moodle.org
\bibitem{iam}
Web site: https://www.indigo-datacloud.eu/identity-and-access-management
\bibitem{tosca}
Web site: https://github.com/indigo-dc/tosca-types
\bibitem{a4c}
Web site: https://github.com/indigo-dc/alien4cloud-deep
\bibitem{ceph}
Web site: https://ceph.com
\bibitem{docker}
Web site: https://www.docker.com/
\bibitem{udocker}
Web site: https://github.com/indigo-dc/udocker
\bibitem{kubernetes}
Web site: https://kubernetes.io/
\bibitem{mesos}
Web site: https://mesos.apache.org
\bibitem{orchestrator}
Web site: https://www.indigo-datacloud.eu/paas-orchestrator
\bibitem{ansible}
Web site: https://www.ansible.com
\end{thebibliography}
\end{document}