The farming group is responsible for the management of the computing resources of the centre.
This entails the deployment of installation and configuration services and monitoring facilities, and the fair distribution of the resources among the experiments that have agreed to run at CNAF.
%\begin{figure}
%\centering
...
...
We were rather lucky with the blade servers (2015 tender), while for the 2016 tender most of the nodes that we initially thought were reusable broke down after some time and could not be recovered. We managed to salvage working parts (such as RAM, CPUs and disks) from the broken servers and used them to assemble some nodes to be deployed as service nodes: the parts were carefully tested by a system integrator, which guaranteed the stability and reliability of the resulting platform.
As a result of the flooding, approximately 24 kHS06 were damaged.
In spring we finally installed the new tender, composed of AMD EPYC nodes providing more than 42 kHS06, with 256 GB of RAM, 2x1 TB SSDs and 10 Gbit Ethernet connectivity. This is the first time we adopt a 10 Gbit connection for the WNs, and we think that from now on it will be a basic requirement: modern CPUs provide many cores, enabling us to pack more jobs into a single node, where a 1 Gbit network link may become a significant bottleneck. The same applies to HDDs versus SSDs: we think that modern computing nodes can deliver 100\% of their capabilities only with SSDs.
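To make the network argument more concrete, the following back-of-the-envelope sketch (the slot counts and link speeds are purely illustrative assumptions, not measurements on our nodes) estimates the average bandwidth left to each job when all the slots of a node share its uplink:
\begin{verbatim}
# Rough estimate of the per-job network bandwidth on a worker node.
# Slot counts and link speeds are illustrative assumptions only.
def per_job_bandwidth_mbit(link_gbit, job_slots):
    """Average bandwidth per job if all slots share the uplink
    evenly, ignoring protocol overhead."""
    return link_gbit * 1000.0 / job_slots

for slots in (16, 32, 64):           # hypothetical jobs per node
    for link_gbit in (1, 10):        # 1 Gbit vs 10 Gbit uplink
        mbit = per_job_bandwidth_mbit(link_gbit, slots)
        print(f"{slots:3d} slots, {link_gbit:2d} Gbit/s uplink: "
              f"{mbit:6.1f} Mbit/s per job")
\end{verbatim}
With 64 slots on a 1 Gbit uplink each job is left with roughly 15 Mbit/s on average, while a 10 Gbit uplink keeps the per-job share above 150 Mbit/s, which is why we consider 10 Gbit a basic requirement for the new high-density nodes.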
The general job execution trend can be seen in Figure~\ref{farm-jobs}.
\begin{figure}
...
...
To mitigate this limitation, a reconfiguration of the local RAID setup of the worker nodes was carried out\footnote{The initial choice of using RAID-1 for the local disks instead of RAID-0 proved to slow down the system, even though it was safer from an operational point of view.} and the number of jobs per node (which generally equals the number of logical cores) was slightly reduced. It is important to note that we did not reach this limit with the latest tender we purchased, since it comes with two enterprise-class SSDs.
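The same reasoning can be sketched for the local disks (again with purely assumed, order-of-magnitude throughput figures rather than benchmarks of our hardware): with one job slot per logical core, the aggregate write bandwidth of the node is divided among all running jobs, so moving from RAID-1 to RAID-0, or from HDDs to SSDs, directly increases the per-job I/O budget.
\begin{verbatim}
# Illustrative per-job local I/O budget for different disk layouts.
# All throughput values are assumed, order-of-magnitude numbers.
LAYOUTS_MB_S = {
    "2x HDD, RAID-1": 150,    # writes limited to a single disk
    "2x HDD, RAID-0": 300,    # writes striped over both disks
    "2x SSD, RAID-0": 1000,   # enterprise SATA SSDs, striped
}

def per_job_io(aggregate_mb_s, logical_cores):
    # One job slot per logical core, sharing the disks evenly.
    return aggregate_mb_s / logical_cores

cores = 48  # hypothetical worker node
for layout, bandwidth in LAYOUTS_MB_S.items():
    print(f"{layout}: {per_job_io(bandwidth, cores):6.1f} MB/s per job")
\end{verbatim}
Even under these optimistic assumptions, spinning disks leave only a few MB/s per job, which is easily saturated by I/O-intensive workloads.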
During 2018 we also kept using the Bari ReCaS farm extension,
with a reduced set of nodes providing approximately 10 kHS06\cite{ref:ar17farming}.
\subsection{Hardware resources}
The hardware resources of the farming group are quite new, and no refresh was foreseen during 2018. The main concern is the two different virtualization infrastructures, which only required a warranty renewal. Since we were able to recover a few parts from the flood-damaged nodes, we could acquire a 2U, 4-node enclosure to be used as the main resource provider for the forthcoming HTCondor instance.
...
...
Singularity enables users to have full control of their environment through containers.
The year 2018 was a terrible one from a security point of view.
Several critical vulnerabilities were discovered, affecting data centre CPUs and major software stacks:
the major ones were Meltdown and Spectre~\cite{ref:meltdown} (see Figures~\ref{meltdown} and~\ref{meltdown2}).
These discoveries required us to intervene promptly in order to mitigate or correct the vulnerabilities,
applying software updates (which mostly means updating the Linux kernel and the firmware) that in most cases required a reboot of the whole farm.
This has a great impact in terms of resource availability, but it is mandatory in order to prevent security issues and possible disclosure of sensitive data.
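After such interventions, the mitigation status of each node can be verified directly from the kernel: recent Linux kernels expose one file per known vulnerability under /sys/devices/system/cpu/vulnerabilities. The sketch below (only the sysfs path is standard; the reporting around it is just an example) prints the status reported by a single node:
\begin{verbatim}
# Print the CPU vulnerability/mitigation status reported by the kernel.
# The sysfs directory is provided by Linux >= 4.15; the rest of the
# script is only an illustrative wrapper around it.
import pathlib

VULN_DIR = pathlib.Path("/sys/devices/system/cpu/vulnerabilities")

def mitigation_report():
    if not VULN_DIR.is_dir():
        return {"status": "kernel too old to report vulnerability status"}
    return {f.name: f.read_text().strip() for f in sorted(VULN_DIR.iterdir())}

if __name__ == "__main__":
    for name, status in mitigation_report().items():
        print(f"{name:20s} {status}")
\end{verbatim}
Collecting this output across the farm, for example through the existing monitoring or a configuration management tool, makes it easy to spot the nodes that still need a kernel or microcode update and, therefore, a reboot.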
...
...