\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{hyperref}
\usepackage{eurosym}
%%%%%%%%%% Start TeXmacs macros
\newcommand{\tmem}[1]{{\em #1\/}}
\newcommand{\tmop}[1]{\ensuremath{\operatorname{#1}}}
\newcommand{\tmtextit}[1]{{\itshape{#1}}}
\newcommand{\tmtt}[1]{\texttt{#1}}
%%%%%%%%%% End TeXmacs macros

\begin{document}
\title{The INFN-Tier1: the computing farm}

\author{Andrea Chierici$^1$, S. Dal Pra$^1$, D. Michelotto$^1$}
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{andrea.chierici@cnaf.infn.it, stefano.dalpra@cnaf.infn.it, diego.michelotto@cnaf.infn.it}

%\begin{abstract}
%\end{abstract}

\section{Introduction}
The farming group is responsible for the management of the computing resources of the centre. This implies deploying installation and configuration services and monitoring facilities, and fairly distributing the resources among the experiments that have agreed to run at CNAF.

%\begin{figure}
%\centering
%\includegraphics[keepaspectratio,width=10cm]{ge_arch.pdf}
%\caption{Grid Engine instance at INFN-T1}
%\label{ge_arch}
%\end{figure}


\section{Farming status update}
During 2018 the group was reorganized: Antonio Falabella left the group and Diego Michelotto took over his duties. The turnover was quite painless, since Diego was already familiar with many of the procedures adopted in the farming group as well as with the collaborative tools used internally.

\subsection{Computing}
In November 2017 our data center suffered a flood, so a large part of 2018 was dedicated to restoring the facility and assessing how much of the computing power was damaged and how much was recoverable. We were rather lucky with the blade servers (2015 tender), while for the 2016 tender most of the nodes we initially thought reusable eventually failed and proved unrecoverable. We salvaged working parts from the broken servers (such as RAM, CPUs and disks) and used them to assemble some machines to be employed as service nodes: the parts were thoroughly tested by a system integrator, which guaranteed the stability and reliability of the resulting platform.
As a result of the flood, approximately 24K HS06 were lost.

In spring we finally installed the new tender, composed of AMD EPYC nodes providing more than 42K HS06, each with 256 GB of RAM, 2x1 TB SSDs and a 10 Gbit Ethernet connection. This is the first time we adopted a 10 Gbit connection for the WNs, and we think it will be a basic requirement from now on: modern CPUs provide many cores, enabling us to pack more jobs on a single node, where a 1 Gbit link may become a significant bottleneck. The same applies to HDDs vs.\ SSDs: we believe modern computing nodes can deliver 100\% of their capabilities only with SSDs.
The general job execution trend can be seen in figure~\ref{farm-jobs}.

\begin{figure}
\centering
\includegraphics[keepaspectratio,width=15cm]{farm-jobs.png}
\caption{Farm job trend during 2018}
\label{farm-jobs}
\end{figure}

\subsubsection{CINECA extension}
Thanks to an agreement between INFN and CINECA\cite{ref:cineca}, we were able to integrate a portion of the Marconi cluster (3 racks, for a total of 216 servers providing $\sim$180 kHS06) into our computing farm, reaching a total computing power of 400 kHS06, almost doubling what we provided last year. Each server is connected to the rack switch with a 10 Gbit uplink, while each rack switch, in turn, is connected to the aggregation router with 4x40 Gbit links.

Due to the proximity of CINECA we set up a highly reliable fiber connection between the two computing centers, with very low latency (the RTT\footnote{Round-trip time (RTT) is the time it takes for a network request to go from a starting point to a destination and back again.} is 0.48 ms, vs.\ 0.28 ms measured on the CNAF LAN), and could therefore avoid setting up cache storage on the CINECA side: all the remote nodes access the storage resources hosted at CNAF in exactly the same manner as the local nodes do. This greatly simplifies the setup and increases the overall farm reliability (see figure~\ref{cineca} for details on the setup).

\begin{figure}
    \centering
    \includegraphics[keepaspectratio,width=12cm]{cineca.png}
    \caption{INFN-T1 farm extension to CINECA}
    \label{cineca}
\end{figure}

These nodes have undergone several reconfigurations, due both to the hardware and to the type of workflow of the experiments. In April we had to upgrade the BIOS to overcome a bug that was preventing full resource usage, limiting what we obtained from the nodes to $\sim$78\% of the total.
Moreover, since the nodes at CINECA are equipped with standard HDDs and so many cores are available per node, we hit an I/O bottleneck.
To mitigate this limitation, the local RAID configuration of the disks was changed\footnote{The initial choice of RAID-1 for the local disks, although safer from an operational point of view, proved to slow down the system compared to RAID-0.} and the number of jobs per node was slightly reduced (normally it equals the number of logical cores). It is worth noting that we did not hit this limit with the latest tender we purchased, since those nodes come with two enterprise-class SSDs.

During 2018 we also kept using the Bari ReCaS farm extension, with a reduced set of nodes providing approximately 10 kHS06 (see the 2017 Annual Report for details on the setup).

\subsection{Hardware resources}
The hardware resources of the farming group are quite new and no refresh was foreseen for this year. The main concern was the two different virtualization infrastructures, which only required a warranty renewal. Since we were able to recover a few parts from the flood-damaged nodes, we could acquire a 2U, 4-node enclosure to be used as the main resource provider for the forthcoming HTCondor instance.

\subsection{Software updates} 
During 2018 we completed the migration from SL6 to CentOS7 on all the farming nodes. The configurations are stored in our provisioning system: for the WNs the migration process was rather simple, while for CEs and UIs we took extra care and proceeded one at a time, in order to guarantee service continuity. The same configurations were used to upgrade LHCb-T2 and INFN-BO-T3, with minimal modifications. All the modules produced for our site can easily be exported to other sites willing to perform the same update.
As already said, the update involved all the services with only a few exceptions: the CMS experiment uses PhEDEx\cite{ref:phedex}, the system providing data placement and file transfers, which is incompatible with CentOS7. Since PhEDEx will be phased out in mid 2019, we agreed with the experiment not to perform the update. The same applies to a few legacy UIs and to some services of the CDF experiment involved in an LTDP project (more details in next year's report).

In any case, if an experiment needs a legacy OS such as SL6, on all the Worker Nodes we provide a container solution based on the Singularity\cite{ref:singu} software.
Singularity enables users to have full control of their environment through containers: it can be used to package entire scientific workflows, software and libraries, and even data. This spares T1 users from asking the farming sysadmins to install additional software, since everything can be put in a container and run. Users control the extent to which a container interacts with its host: there can be seamless integration, or little to no communication at all.
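
As a purely illustrative example, the following Python sketch shows how a job wrapper could launch its payload inside an SL6 Singularity image on a CentOS7 Worker Node; the image path, bind mount and payload are hypothetical placeholders and depend on the experiment.

\begin{verbatim}
# Minimal sketch of a job wrapper running its payload inside an SL6
# Singularity container on a CentOS7 WN. Image path and payload are
# hypothetical placeholders, not actual production values.
import subprocess
import sys

SL6_IMAGE = "/path/to/sl6-image.img"        # hypothetical image location
PAYLOAD = ["sh", "-c", "./run_payload.sh"]  # hypothetical job payload

cmd = ["singularity", "exec",
       "--bind", "/cvmfs",   # expose CVMFS inside the container
       SL6_IMAGE] + PAYLOAD

# Propagate the payload exit code back to the batch system.
sys.exit(subprocess.call(cmd))
\end{verbatim}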

Year 2018 was terrible from a security point of view. Several critical vulnerabilities were discovered, affecting data-center CPUs and major software stacks: the most notable were Meltdown and Spectre~\cite{ref:meltdown} (see figures~\ref{meltdown} and~\ref{meltdown2}). These discoveries required us to intervene promptly in order to mitigate and/or fix the vulnerabilities, applying software updates (mostly Linux kernel and firmware updates) that in most cases required rebooting the whole farm. This has a great impact in terms of resource availability, but it is mandatory in order to prevent security incidents and possible disclosures of sensitive data. Thanks to our internally developed dynamic update procedure, patch application is smooth and almost automatic, saving the farm staff a considerable amount of time.

\begin{figure}
\centering
\includegraphics[keepaspectratio,width=12cm]{meltdown.jpg}
\caption{Meltdown and Spectre comparison}
\label{meltdown}
\end{figure}
\begin{figure}
\centering
\includegraphics[keepaspectratio,width=12cm]{meltdown2.jpg}
\caption{Meltdown attack description}
\label{meltdown2}
\end{figure}

\subsection{HTCondor update}
INFN-T1 decided to migrate from LSF to HTCondor for several reasons. The main one is that HTCondor has proved to be extremely scalable and ready to stand the challenges that High Luminosity LHC will pose to our research community in the near future. Moreover, many of the other T1s involved in LHC have announced the transition to HTCondor or have already completed it. Finally, our current batch system, LSF, is no longer under support, since INFN decided not to renew the contract with IBM (the vendor of the software, now re-branded ``Spectrum LSF'') in order to save money, considering the alternative offered by HTCondor.

\section{DataBase service: Highly available PostgreSQL}
In 2013 INFN-T1 switched the job accounting system~\cite{DGAS} to a
custom solution based on a PostgreSQL backend. The database was made
more robust over time, introducing redundancy and reliable hardware
and storage. This architecture proved powerful enough to also host
other database schemas, or even independent instances, to meet the
requirements of user communities (CUORE, CUPID) for their computing
models. A MySQL-based solution is also in place to accommodate the
needs of the AUGER experiment.

\subsection{Hardware setup}
A High Availability PostgreSQL instance has been deployed on two
identical SuperMicro hosts, ``dbfarm-1'' and ``dbfarm-2'', each
equipped as follows:

\begin{itemize}
  \item Intel(R) Xeon(R) CPU E5-2603 v2 @ 1.80 GHz
  \item 32 GB of RAM
  \item two Fiber Channel controllers
  \item a 2 TB Storage Area Network volume
  \item two redundant power supplies
\end{itemize}

The path to the SAN storage is also fully redundant, since each Fiber Channel
controller is connected to two independent SAN switches.

One node also hosts two 1.8 TB HDDs configured with software RAID-1, used as a service storage
area for supplementary database backups and other maintenance tasks.

\subsection{Software setup}
PostgreSQL 11.1 has been installed on both hosts: dbfarm-1 is set up
to work as the master and dbfarm-2 as a hot standby replica. With this
configuration the master is the main database, while the replica can
be accessed in read-only mode. This instance hosts the accounting
database of the farm, the hardware inventory of the T1 centre
(docet\cite{DOCET}) and a database used by the CUPID experiment. The
content of the latter is updated directly by authorized users of the
experiment, while jobs running on our worker nodes access its data
from the standby replica.
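
As a minimal sketch of how a worker node job can read from the standby, the following Python fragment opens a read-only connection towards the replica; the database name, account and query are hypothetical placeholders, not the production ones.

\begin{verbatim}
# Minimal sketch: read-only access to the hot standby replica from a WN.
# The host name matches the setup described above; database name,
# credentials and query are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="dbfarm-2",      # hot standby replica (read-only)
    dbname="cupid_db",    # placeholder database name
    user="cupid_reader",  # placeholder read-only account
    password="********",
)
conn.set_session(readonly=True)  # the standby only serves read-only queries

with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM some_table;")  # placeholder query
    print(cur.fetchone()[0])
conn.close()
\end{verbatim}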

A second independent instance has also been installed on dbfarm-2,
working as a hot standby replica of a remote master instance managed
by the CUORE collaboration and located at INFN-LNGS. The continuous
synchronization with the master database happens through a VPN channel.
Local read access from our Worker Nodes to this
instance can be quite intense: the standby server has been
sustaining up to 500 connections without any evident problem.

\subsection{Distributed MySQL service}
A different solution for the AUGER experiment has been put in place for several
years now, and has been recently redesigned when moving our Worker
Nodes to CentOS7.  Several jobs of the Auger experiment need
concurrent read-only access to a MySQL (actually MariaDB, with CentOS7
and later) data base. A single server instance cannot sustain the
overall load generated by the clients. For this reason we have
configured a reasonable subset of Worker Nodes (two racks) to host a
local binary copy of the AUGER data base. The ``master'' copy of this database 
is available from a dedicated User Interface and
users can update its content when they need to.
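
As an illustration of this access pattern, a job simply queries the MariaDB instance running on the Worker Node it landed on; in the following sketch the database name, account and query are hypothetical placeholders.

\begin{verbatim}
# Minimal sketch: an AUGER job reading from the local binary copy of the
# database hosted on the Worker Node itself. Database name, credentials
# and query are hypothetical placeholders.
import pymysql

conn = pymysql.connect(
    host="localhost",     # local copy of the AUGER database on this WN
    user="auger_reader",  # placeholder read-only account
    password="********",
    database="auger",     # placeholder database name
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM some_table")  # placeholder query
        print(cur.fetchone()[0])
finally:
    conn.close()
\end{verbatim}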

The copy on the Worker Nodes is updated every few months, upon request
from the experiment. To do so we must, in order (a hypothetical
automation of this sequence is sketched below):
\begin{itemize}
\item drain any running job accessing the database
\item shut down every MariaDB instance
\item update the binary copy using rsync
\item restart the database
\item re-enable normal AUGER activity
\end{itemize}
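
The following Python fragment is a purely illustrative sketch of how this sequence could be automated on a single Worker Node; host names, paths and the drain/re-enable steps are hypothetical placeholders and depend on the batch system in use.

\begin{verbatim}
# Illustrative sketch of the AUGER database refresh on one Worker Node.
# Host names, paths and the drain/re-enable steps are placeholders.
import subprocess

MASTER = "auger-ui.example:/var/lib/mysql/"  # hypothetical master copy
LOCAL = "/var/lib/mysql/"                    # local binary copy on the WN

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)          # abort the refresh on failure

# 1. drain jobs accessing the database (batch-system dependent placeholder)
run(["touch", "/etc/auger-drain"])
# 2. stop the local MariaDB instance
run(["systemctl", "stop", "mariadb"])
# 3. update the binary copy from the master
run(["rsync", "-a", "--delete", MASTER, LOCAL])
# 4. restart the database
run(["systemctl", "start", "mariadb"])
# 5. re-enable normal AUGER activity (placeholder)
run(["rm", "-f", "/etc/auger-drain"])
\end{verbatim}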

\section{Helix Nebula Science Cloud}
During the first part of 2018 the farming group was directly involved in the pilot phase of the Helix Nebula Science Cloud project~\cite{ref:hnsc}, whose aim was to allow research institutes like INFN to test commercial clouds against HEP use cases, identifying strong and weak points.
The pilot phase saw very intense interaction between the public procurers and both commercial and public service providers.

\subsection{Pilot Phase}
The pilot phase of the HNSciCloud PCP is the final step in the implementation of the hybrid cloud platform proposed by the selected contractors. From January to June 2018 the technical activities of the project focused on the scalability of the platforms and on the training of the new users who will access the pilot at the end of this phase.
Farming members guided the contractors throughout the first part of the pilot phase,
testing the scalability of the proposed platforms, organizing the events hosted by the procurers and assessing, together with the other partners of the project, the deliverables produced by the contractors.

\subsection{Conclusions of the Pilot Phase}
Improvements to the platforms were implemented during this phase and, even though some R\&D activities still had to be completed, the general evaluation of the first part of the pilot phase is positive.
In particular, the Buyers Group reiterated the need for a fully functioning cloud storage service and highlighted the commercial advantage such a transparent data service represents for the Contractors. Coupled with a flexible voucher scheme, such an offering would encourage a greater uptake within the Buyers Group and the wider public research sector. The demand for GPUs, even though not originally considered critical during the design phase, has increased and highlighted a weak point in the current offering.


\section{References}
\begin{thebibliography}{9}
\bibitem{ref:cineca} CINECA website: \url{https://www.cineca.it/}
\bibitem{ref:phedex} PhEDEx website: \url{https://cmsweb.cern.ch/phedex/about.html}
\bibitem{ref:singu} Singularity website: \url{https://singularity.lbl.gov/}
\bibitem{ref:meltdown} Meltdown and Spectre attacks website: \url{https://meltdownattack.com/}
\bibitem{ref:hnsc} Helix Nebula Science Cloud website: \url{https://www.hnscicloud.eu/}
\bibitem{DGAS} Dal Pra, Stefano, ``Accounting Data Recovery. A Case Report from INFN-T1'', Nota interna, Commissione Calcolo e Reti dell'INFN, {\tt CCR-48/2014/P}
\bibitem{DOCET} Dal Pra, Stefano, and Alberto Crescente, ``The data operation centre tool. Architecture and population strategies'', Journal of Physics: Conference Series, Vol.~396, No.~4, IOP Publishing, 2012
\end{thebibliography}

\end{document}