Commit 0697b742 authored by Alessandro Costantini

Version 2, Added evolution and monitoring

parent 3a274f2b
contributions/ds_cloud_c/catc_monitoring.png

31.7 KiB

@@ -11,7 +11,7 @@
\begin{abstract}
Cloud@CNAF is the cloud infrastructure hosted at CNAF, based on open source solutions and aimed
at serving the different use cases present at CNAF. The infrastructure is the result of
the collaboration of a transversal group of people from all the CNAF
functional units: networking, storage, farming, national services and distributed systems.
@@ -22,7 +22,7 @@
aqueduct pipe located in the street near CNAF broke. This event caused the shutdown
of the entire data center, including the Cloud@CNAF infrastructure. This paper
presents the activities carried out throughout 2018 to ensure the functioning
of the center's cloud infrastructure, which saw its migration from CNAF to INFN-Ferrara:
from the re-design of the entire infrastructure to cope with the limited availability of
space and weight imposed by the new location, to the physical migration of the
racks and the remote management and operation of the infrastructure, in order to continue
to provide high-quality services for our users and communities.
@@ -56,48 +56,75 @@
Thanks to this enhancement, Cloud@CNAF was able to offer highly reliable services.
At the end of 2017, early in the morning of November 9th, an aqueduct pipe located in the street near CNAF broke, as documented in Ref. \cite{flood}.
As a result, a river of water and mud flowed towards the Tier1 data center. The level of the water did not exceed the
safety threshold of the waterproof doors but, due to the porosity of the external walls and the floor, it could find a way
into the data center. Both electric lines failed at about 7.10 AM CET. Access to the data center was possible only
in the afternoon, after all the water had been pumped out.
Consequently, the entire Tier1 data center went down, including the Cloud@CNAF infrastructure.
\section{The resource migration}
A few weeks after the flooding, we decided to move the Cloud@CNAF core services to a different location
in order to recover the services we provided for communities and experiments.
Thanks to strong existing relationships, both the University of Parma/INFN-Parma and INFN-Ferrara offered to host our core machinery and related services.
Due to the geographical proximity and the presence of a GARR Point of Presence (PoP), we decided to move the Cloud@CNAF core machinery to the INFN-Ferrara site.
Unfortunately, INFN-Ferrara was not able to host all the Cloud@CNAF resources, due to limited power availability and weight constraints.
For this reason, we carried out a substantial activity aimed at re-designing the infrastructure.
To do so, we selected the services and the related machinery to move to the new, temporary, location so as to fit the maximum power consumption and weight
estimated for each of the two rooms devoted to hosting our services (see Table \ref{table:1} for details).
\section{Re-design of the new infrastructure}
Due to the limitations described in Table \ref{table:1}, we were pushed to re-design the Cloud@CNAF infrastructure using (only) three racks to host the Cloud@CNAF core services (see Table \ref{table:1} for the list of services).
Among these three racks, the first one hosted the storage resources; the second hosted the OpenStack controller and network services, together
with the GPFS cluster and other services; the third hosted the oVirt and OpenStack compute nodes and some other ancillary services.
Racks 1 and 2 have been connected at 2x40 Gbps through our Brocade VDX switches, and Racks 1 and 3 have been connected at 2x10 Gbps through PowerConnect switches.
Moreover, Rack 1 is connected to the GARR PoP with a 1 Gbps fiber connection.
A complete overview of the new infrastructure and of the related resource location is shown in Figure \ref{new_c_at_c}.
As depicted in Figure \ref{new_c_at_c}, and taking into account the limitations described in Table \ref{table:1}, we were able to keep the power consumption at 13.79 kW for Room 1 (limit 15 kW)
and at 5.8 kW for Room 2 (limit 7 kW).
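This room-by-room budget check is simple enough to script. The snippet below is a minimal illustrative sketch in Python: the rack-to-room assignment and the per-rack power figures are placeholder assumptions (the actual estimates are those of Table \ref{table:1}); only the 15 kW and 7 kW room limits come from the text above.
\begin{verbatim}
# Illustrative per-room power budget check.
# Rack-to-room assignment and per-rack draws are placeholder assumptions;
# only the room limits (15 kW and 7 kW) come from the text above.
ROOM_LIMITS_KW = {"Room1": 15.0, "Room2": 7.0}

racks = {                      # rack: (room, estimated draw in kW)
    "Rack1": ("Room1", 7.00),  # hypothetical value
    "Rack2": ("Room1", 6.79),  # hypothetical value
    "Rack3": ("Room2", 5.80),  # hypothetical value
}

totals = {room: 0.0 for room in ROOM_LIMITS_KW}
for room, draw_kw in racks.values():
    totals[room] += draw_kw

for room, limit in ROOM_LIMITS_KW.items():
    status = "OK" if totals[room] <= limit else "over budget"
    print(f"{room}: {totals[room]:.2f} kW of {limit:.0f} kW -> {status}")
\end{verbatim}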
The whole migration process (from the design to the reconfiguration of the new infrastructure) took almost a business week, after which the Cloud@CNAF infrastructure and related services
were up and running, once again able to serve the different projects and communities.
\section{Cloud@CNAF evolution}
Starting from the activity carried out in 2016 on improvements at the infrastructure level \cite{catc}, in 2018 (after the return of the core infrastructure services displaced by the flooding)
the increase of the computing resources, in terms of both quality and quantity, continued in order to enhance both the services and the performance offered to users.
Thanks to this activity, during the last year Cloud@CNAF saw a growth in the number of users and of use cases implemented in the infrastructure; in particular,
the number of projects increased up to 87, corresponding to a total consumption of 1035 virtual CPUs and 1766 TB of RAM, with a total of 267 virtual machines (see Figure \ref{catc_monitor} for more details).
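Usage figures of this kind can be collected directly from the OpenStack APIs. The snippet below is a minimal sketch, not the actual accounting behind Figure \ref{catc_monitor}: it assumes the openstacksdk Python package and a clouds.yaml entry named cloud-at-cnaf (a hypothetical name), and the hypervisor usage fields it reads follow the classic compute API, so attribute names may differ on recent microversions.
\begin{verbatim}
# Minimal sketch: count projects and VMs and aggregate used vCPUs/RAM
# via openstacksdk. "cloud-at-cnaf" is a hypothetical clouds.yaml entry.
import openstack

conn = openstack.connect(cloud="cloud-at-cnaf")

projects = list(conn.identity.projects())
servers = list(conn.compute.servers(all_projects=True))

# Sum used vCPUs and RAM from the hypervisor statistics; these field
# names follow the classic compute API and may vary with microversion.
vcpus_used = 0
ram_used_mb = 0
for hyp in conn.compute.hypervisors(details=True):
    vcpus_used += hyp.vcpus_used or 0
    ram_used_mb += hyp.memory_used or 0

print(f"projects: {len(projects)}")
print(f"virtual machines: {len(servers)}")
print(f"vCPUs in use: {vcpus_used}, RAM in use: {ram_used_mb / 1024:.0f} GB")
\end{verbatim}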
Among others, some of the projects that used the cloud infrastructure are:
\begin{itemize}
\item HARMONY - under the TTLab coordination, a project aimed at finding resourceful medicines offensive against neoplasms in hematology;
\item EEE - Extreme Energy Events - Science inside Schools, a special research activity on the origin of cosmic rays, carried out with the essential contribution of students and teachers of high schools;
\item USER Support - for the development of the experiment dashboards and the hosting of the production instance of the dashboard displayed on the monitor in the CNAF hallway;
\item DODAS - for the elastic extension of computing centre batch resources on external clouds;
\item services devoted to EU projects such as DEEP-HDC \cite{deep}, XDC \cite{xdc} and many more.
\end{itemize}
\section{Conclusions and future work}
Due to damage to an aqueduct pipe located in the street near CNAF, a river of water and mud flowed towards the Tier1 data center, causing the
shutdown of the entire data center. For this reason, the services and related resources hosted by Cloud@CNAF went down.
To cope with this problem, we decided to temporarily migrate the core resources and services of Cloud@CNAF to INFN-Ferrara, and
to do this a complete re-design of the entire infrastructure was needed to tackle the limitations in terms of power consumption and
weight imposed by the new location.
Thanks to the joint effort of all the CNAF people and the INFN-Ferrara colleagues, we were able to re-design, migrate and make operational
the new Cloud@CNAF infrastructure and the related hosted services in less than a business week.
Drawing on this experience and on the documentation produced, in June 2018 - after the Tier1 returned to its production status - Cloud@CNAF
was migrated back in less than three business days.
Even taking into account the problems described above, we were able to maintain and evolve the Cloud@CNAF infrastructure, allowing
both existing and new users to continue their activities and obtain their results.
For next year, new and challenging activities are planned, in particular the migration to the OpenStack Rocky version.
\begin{table} [ht]
\centering
\begin{tabular}{ |c|c|c|c|c|c|c| }
\hline
@@ -113,7 +113,7 @@
Occupancy (U) & 9 & 12 & 10 & 21 & 10 & 31 \\
\end{table}
\begin{table} [ht]
\centering
\begin{tabular}{ |c|c|c|c| }
\hline
@@ -137,6 +164,13 @@
Powervault & Cloud networks & Compute nodes \\
\label{new_c_at_c}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[width=15cm,clip]{catc_monitoring.png}
\caption{Cloud@CNAF monitoring and status}
\label{catc_monitor}
\end{figure}
\section{References}
\begin{thebibliography}{}
@@ -147,10 +181,15 @@
Cloud@CNAF - maintenance and operation, C. Duma, R. Bucchi, A. Costantini, D. Mi
Web site: https://www.openstack.org/
\bibitem{flood}
The flood, L. dell’Agnello, CNAF Annual Report 2017, https://www.cnaf.infn.it/wp-content/uploads/2018/09/cnaf-annual-report-2017.pdf
\bibitem{deep}
Web site: https://deep-hybrid-datacloud.eu/
\bibitem{xdc}
Web site: www.extreme-datacloud.eu
\end{thebibliography}
\end{document}