\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\usepackage{url}
\usepackage{color, colortbl}
\definecolor{LightCyan}{rgb}{0.88,1,1}
\definecolor{LightYellow}{rgb}{1,1,0.88}
\definecolor{Red}{rgb}{1,0,0}
\definecolor{Green}{rgb}{0,1,0}
\definecolor{MediumSpringGreen}{rgb}{0,0.98,0.6} %rgb(0,250,154)
\definecolor{Gold}{rgb}{1,0.84,0}%rgb(255,215,0)
\definecolor{Gainsboro}{rgb}{0.86,0.86,0.86}%rgb(220,220,220)
\begin{document}
\title{The INFN Tier-1}
\author{Luca dell'Agnello}
\address{INFN-CNAF, Bologna, IT}
\ead{luca.dellagnello@cnaf.infn.it}
\section{Introduction}
CNAF hosts the Italian Tier-1 data center for WLCG: over the years, the Tier-1 has become the main computing facility for INFN.
Nowadays, besides the four LHC experiments, the INFN Tier-1 provides services and resources to 30 other scientific collaborations, including BELLE2 and several astro-particle experiments (Table~\ref{T1-pledge})\footnote{CSN 1, CSN 2 and CSN 3 are the National Scientific Committees of INFN for, respectively, experiments in high energy physics with accelerators, astro-particle experiments, and experiments in nuclear physics with accelerators.}. As shown in Fig.~\ref{pledge2018}, besides the LHC experiments, the main users are the astro-particle experiments.
\begin{figure}[h]
\begin{center}
\begin{minipage}{35pc}
\includegraphics[width=15pc]{t1-img/cpu2018.png}\hspace{2pc}%
\includegraphics[width=15pc]{t1-img/disk2018.png}\hspace{2pc}%
\vspace{2pc}%
\begin{center}
\includegraphics[width=15pc]{t1-img/tape2018.png}\hspace{2pc}%
\caption{\label{pledge2018}Relative requests of resources (CPU, disk and tape) at the INFN Tier-1 in 2018}
\end{center}
\end{minipage}
\end{center}
\end{figure}
Despite the flooding that occurred at the end of 2017, we were able to provide the resources pledged to the experiments for 2018 almost on schedule.
\begin{table}
\begin{center}
\begin{tabular}{l|rrr}
\br
\textbf{Experiment}&\textbf{CPU (HS06)}&\textbf{Disk (TB-N)}&\textbf{Tape (TB)}\\
\hline
\rowcolor{MediumSpringGreen}
ALICE&52020&5185&13497\\
\rowcolor{MediumSpringGreen}
ATLAS&85410&6480&17550\\
\rowcolor{MediumSpringGreen}
CMS&72000&7200&24440\\
\rowcolor{MediumSpringGreen}
LHCb&46805&5606&11400\\
\rowcolor{MediumSpringGreen}
\hline
\textbf{LHC Total}&\textbf{256235}&\textbf{24471}&\textbf{66887}\\
\hline
\rowcolor{LightYellow}
Belle2&13000&350&0\\
\rowcolor{LightYellow}
CDF&0&0&4000\\
\rowcolor{LightYellow}
Compass&40&10&40\\
\rowcolor{LightYellow}
KLOE&0&33&3075\\
\rowcolor{LightYellow}
LHCf&6000&90&0\\
\rowcolor{LightYellow}
NA62&3000&250&200\\
\rowcolor{LightYellow}
PADME&1500&10&500\\
\rowcolor{LightYellow}
LHCb Tier2&26085&0&0\\
\rowcolor{LightYellow}
\hline
\rowcolor{LightYellow}
\textbf{CSN 1 Total}&\textbf{49625}&\textbf{743}&\textbf{7815}\\
\hline
\rowcolor{LightCyan}
AMS&15800&1990&510\\
\rowcolor{LightCyan}
ARGO&0&120&1000\\
\rowcolor{LightCyan}
Auger&2000&615&0\\
\rowcolor{LightCyan}
BOREX&2000&185&41\\
\rowcolor{LightCyan}
CTA&4000&796&120\\
\rowcolor{LightCyan}
CUORE&1900&262&0\\
\rowcolor{LightCyan}
Cupid&100&15&10\\
\rowcolor{LightCyan}
DAMPE&8000&200&100\\
\rowcolor{LightCyan}
DARKSIDE&2000&980&300\\
\rowcolor{LightCyan}
ENUBET&500&10&0\\
\rowcolor{LightCyan}
EUCLID&1000&1042&0\\
\rowcolor{LightCyan}
Fermi&500&15&40\\
\rowcolor{LightCyan}
Gerda&40&45&40\\
\rowcolor{LightCyan}
Icarus&4000&500&1500\\
\rowcolor{LightCyan}
JUNO&3000&230&0\\
\rowcolor{LightCyan}
KM3&300&250&200\\
\rowcolor{LightCyan}
LHAASO&300&60&0\\
\rowcolor{LightCyan}
LIMADOU&400&8&0\\
\rowcolor{LightCyan}
LSPE&1000&14&0\\
\rowcolor{LightCyan}
MAGIC&296&65&150\\
\rowcolor{LightCyan}
NEWS&200&60&60\\
\rowcolor{LightCyan}
Opera&200&15&15\\
\rowcolor{LightCyan}
PAMELA&650&100&150\\
\rowcolor{LightCyan}
Virgo&30000&656&1368\\
\rowcolor{LightCyan}
Xenon100&1000&200&1000\\
\rowcolor{LightCyan}
\hline
\rowcolor{LightCyan}
\textbf{CSN 2 Total}&\textbf{79186}&\textbf{8433}&\textbf{6604}\\
\hline
\rowcolor{Gainsboro}
FOOT&200&20&0\\
\rowcolor{Gainsboro}
Famu&2250&15&187\\
\rowcolor{Gainsboro}
GAMMA/AGATA&0&0&1160\\
\rowcolor{Gainsboro}
NEWCHIM/FARCOS&0&10&300\\
\rowcolor{Gainsboro}
\hline
\rowcolor{Gainsboro}
\textbf{CSN 3 Total}&\textbf{2450}&\textbf{45}&\textbf{1460}\\
\hline \hline
\rowcolor{Green}
\textbf{Grand Total}&\textbf{387496}&\textbf{33692}&\textbf{82766}\\
\rowcolor{Green}
\textbf{Installed}&\textbf{340000}&\textbf{34000}&\textbf{71000}\\
\br
\end{tabular}
\end{center}
\caption{Pledged and installed resources at the INFN Tier-1 in 2018 (an overlap factor is applied to the CPU power)}
\label{T1-pledge}
\hfill
\end{table}
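For illustration only, a back-of-the-envelope reading of Table~\ref{T1-pledge} (our own estimate, not an official figure): the overlap factor mentioned in the caption can be gauged by comparing the total pledged CPU power with the installed one,
\[
  387496\ \mathrm{HS06} \;/\; 340000\ \mathrm{HS06} \simeq 1.14 ,
\]
i.e. the sum of the CPU pledges exceeds the installed capacity by roughly 14\%, under the assumption that the experiments do not all use their full share at the same time.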
\subsection{Out of the mud}
The year 2018 began with the recovery of the data center after the flooding of November 2017.
Despite the serious damage to the power plants (both power lines were compromised), we started the recovery of both the infrastructure and the IT equipment immediately after the flooding. The first mandatory intervention was to restore at least one of the two power lines (with a leased UPS in the first period); this goal was achieved during December 2017.
In January, after the chillers had also been restarted, we could proceed to reopen all services, including part of the farm (at the beginning only $\sim$50 kHS06, about 1/5 of the total capacity, were online, while 13\% had been lost) and, one by one, the storage systems.
The first experiments to resume operations at CNAF were ALICE, Virgo and DarkSide: in fact, the storage system used by Virgo and DarkSide had been easily recovered after the Christmas break, while ALICE is able to use computing resources relying on remote storage. During February and March, we were able to progressively reopen the services for all the other experiments. Meanwhile, we had set up a new partition of the farm hosted at the CINECA super-computing center premises (see Section~\ref{CINECAext}).
The final damage inventory shows the loss of $\sim$30 kHS06 of CPU power, 4 PB of data and 60 tapes; on the other hand, it was possible to repair all the other systems, recovering $\sim$20 PB of data, and, as for the infrastructure, the second power line was also restored (see \cite{FLOODCHEP} for details).
%\begin{figure}[h]
% \begin{center}
% \includegraphics[width=40pc]{t1-img/farm2018.png}\hspace{2pc}%
% \caption{\label{farm2018}Farm usage in 2018}
% \end{center}
%\end{figure}
\subsection{The long-term consequences of the flooding}
The data center was designed taking into account all foreseeable accidents (e.g. fires, power outages, ...), but not an event of this kind.
In fact, it was believed that the only threat from water could come from very heavy rain and, indeed, waterproof doors had been installed some years ago (after a heavy rain).
The post-mortem analysis showed that, besides the breaking of the pipe, the causes are to be found in the unfavorable position of the data center (two underground levels) and in the excessive permeability of its perimeter (the anti-flood doors, on the other hand, worked). Therefore, an intervention has been carried out to improve the waterproofing of the data center and, moreover, work is planned for summer 2019 to strengthen the perimeter of the building and to build a second water collection tank.
Even if the search for a new location for the data center had started before the flooding (the main driver being its limited expandability, unable to cope with the foreseen requirements of the HL-LHC era, when we should scale up to 10 MW of power for IT), the flooding gave us a second strong reason to move.
An opportunity is offered by the new ECMWF center, which will be hosted in Bologna, in a new Technopole area, starting from 2019. The same area can also host the INFN Tier-1 and the CINECA computing centers: funding for this has been granted to INFN and CINECA by the Italian Government. The goal is to have the new data center for the INFN Tier-1 fully operational by the end of 2021.
\section{INFN Tier-1 extension at CINECA}\label{CINECAext}
As mentioned in the previous Section, part of the farm is hosted at CINECA\footnote{CINECA is the Italian supercomputing center, also located near Bologna ($\sim$17 km from CNAF). See \url{http://www.cineca.it/}.}.
Out of the 400 kHS06 of CPU power (340 kHS06 pledged) of the CNAF farm, $\sim$180 kHS06 are provided by servers installed in the CINECA data center.
%Each server is equipped with a 10 Gbit uplink connection to the rack switch while each of them, in turn, is connected to the aggregation router with 4x40 Gbit links.
The logical network of the farm partition at CINECA is set up as an extension of the INFN Tier-1 LAN: a dedicated fiber pair interconnects the aggregation router at CINECA with the core switch at the INFN Tier-1 (see the Farm and Network chapters for more details).
%The transmission on the fiber is managed by a couple of Infinera DCI, allowing to have a logical channel up to 1.2 Tbps (currently it is configured to transmit up to 400 Gbps).
%\begin{figure}
% % \begin{minipage}[b]{0.45\textwidth}
% \begin{center}
% \includegraphics[width=30pc]{t1-img/cineca-t1.png}
% \caption{\label{cineca-t1}Schematic view of the CINECA - INFN Tier-1 interconnection}
% \end{center}
% % \end{minipage}
%\end{figure}
These nodes, in production since March 2018 for the WLCG experiments, have been gradually opened to all the other collaborations. %Due the low latency (the RTT is 0.48 ms vs. 0.28 ms measured on the CNAF LAN), there is no need of a disk cache on the CINECA side and the WNs directly access the storage located at CNAF; in fact, the
The efficiency of the jobs\footnote{The efficiency of a job is defined as the ratio between its CPU time and its wall-clock time.} is comparable to the one measured on the farm partition at CNAF.
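Purely as an illustration, the definition in the footnote can be restated in formula form (the symbols are ours, not taken from any official document): denoting by $t_{\mathrm{CPU}}$ the CPU time consumed by a job and by $t_{\mathrm{wall}}$ its wall-clock time, the efficiency is
\[
  \varepsilon = \frac{t_{\mathrm{CPU}}}{t_{\mathrm{wall}}} ,
\]
so that, for a single-core job, $\varepsilon \le 1$, with values close to 1 indicating that the job is not stalled waiting for input/output, e.g. for remote storage access.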
Since this partition has been installed from the beginning with CentOS 7, legacy applications requiring a different flavour of operating system can use it through the Singularity container technology~\cite{singularity}.
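As a purely illustrative sketch of this mechanism (the container image path and the job script below are hypothetical, not the ones actually used in production), a legacy payload built for an older operating system could be wrapped in a command of the form:
\begin{verbatim}
# Hypothetical example: run a legacy job script inside a
# Singularity container image on a CentOS 7 worker node.
singularity exec /cvmfs/example.org/images/sl6.img ./legacy_job.sh
\end{verbatim}
Here \texttt{singularity exec} runs the given command inside the specified container image, so the job sees the legacy environment while the host keeps running CentOS 7.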
%Moreover, this partition has undergone several reconfigurations due to both the hardware and the type of workflow of the experiments. In April we had to upgrade the BIOS to overcome a bug which was preventing the full resource usage, limiting at $\sim$~78\% of the total what we were getting from the nodes. Moreover a reconfiguration of the local RAID configuration of disks is ongoing\footnote{The initial choice of using RAID-1 for local disks instead of RAID-0 has been proven to slow down the system even if safer from an operational point of view.} as well as tests to choose the best number of computing slots.
\section*{References}
\begin{thebibliography}{9}
\bibitem{FLOODCHEP} L. dell'Agnello, ``Disaster recovery of the INFN Tier-1 data center: lesson learned'', to be published in the Proceedings of the 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP), EPJ Web of Conferences.
\bibitem{singularity} \url{http://singularity.lbl.gov}
\end{thebibliography}
\end{document}