\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\begin{document}
\title{The CMS Experiment at the INFN CNAF Tier1}

\author{Giuseppe Bagliesi$^1$}

\address{$^1$ INFN Sezione di Pisa, Pisa, IT}

\ead{giuseppe.bagliesi@cern.ch}

\begin{abstract}
A brief description of the CMS computing operations during LHC Run~II and their recent developments is given. The CMS utilization of the CNAF Tier-1 is described.
\end{abstract}

\section{Introduction}
The CMS Experiment \cite{CMS-descr} at CERN collects and analyses data from pp collisions at the LHC.
The first physics Run, at a centre-of-mass energy of 7--8 TeV, started in late March 2010 and ended in February 2013; more than 25~fb$^{-1}$ of collisions were collected during the Run. Run~II, at 13 TeV, started in 2015 and finished at the end of 2018.

During the first two years of Run~II, the LHC was able to largely exceed its design parameters: already in 2016 the instantaneous luminosity reached $1.5\times 10^{34}\mathrm{cm^{-2}s^{-1}}$, 50\% more than planned for the ``high luminosity'' LHC phase. The most remarkable achievement, however, is the huge improvement in the fraction of time the LHC can deliver physics collisions, which increased from about 35\% in Run~I to more than 80\% in some months of 2016.
The most visible effect, computing-wise, is a large increase in the data to be stored, processed and analysed offline, with more than 40~fb$^{-1}$ of physics data collected in 2016.

In 2017 CMS recorded more than 46~fb$^{-1}$ of pp collisions, in addition to the data collected during 2016. These data were taken under considerably higher-than-expected pileup conditions, forcing CMS to request luminosity levelling at a pileup of about 55 for the first hours of each LHC fill; this challenged both the computing system and CMS analysts, with more complex events to process than assumed in the modelling. From the computing operations side, higher pileup meant larger events and longer processing times than anticipated in the 2017 planning. As these data-taking conditions affected only the second part of the year, the average 2017 pileup was in line with that used during the CMS resource planning.

2018 was another excellent year for LHC operations and luminosity delivered to the experiments. CMS recorded 64~fb$^{-1}$ of pp collisions during 2018, in addition to the 84~fb$^{-1}$ collected during 2016 and 2017. This brings the total luminosity delivered in Run~II to more than 150~fb$^{-1}$, and the total Run~I + Run~II dataset to more than 190~fb$^{-1}$.



\section{Run II computing operations}
During Run~II, the 2004 computing model designed for Run~I has greatly evolved. The MONARC hierarchical division of sites into Tier-0, Tier-1s and Tier-2s is still present, but less relevant during operations. All simulation, analysis and processing workflows can now be executed at virtually any site, with a full transfer mesh allowing for point-to-point data movement outside the rigid hierarchy.

Remote access to data, using WAN-aware protocols like XRootD and data federations, is used more and more instead of planned data movement, allowing for an easier exploitation of CPU resources.
Opportunistic computing is becoming a key component, with CMS having explored access to HPC systems and commercial clouds, and with the capability of running its workflows on virtually any (sizeable) resource it has access to.
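
As an illustration of this access mode, the following minimal sketch reads a file over the wide-area network through a federation redirector instead of staging it locally. It assumes the Python bindings shipped with the XRootD client are available; the redirector host and the file path are placeholders, not actual CMS production values.

\begin{verbatim}
# Minimal sketch of remote data access through an XRootD federation.
# Redirector host and file path below are illustrative placeholders.
from XRootD import client

# Open the file via the federation redirector; the read is served by
# whichever site holds a copy, with no planned transfer beforehand.
url = "root://redirector.example.org//store/data/example.root"
f = client.File()
status, _ = f.open(url)
if not status.ok:
    raise RuntimeError(status.message)

# Read the first kilobyte as a trivial access check, then close.
status, data = f.read(offset=0, size=1024)
print("read %d bytes remotely" % len(data))
f.close()
\end{verbatim}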

In 2018 CMS deployed Singularity \cite{singu} at all sites supporting the CMS VO. Singularity is a container solution which allows CMS to select the OS on a per-job basis and decouples the OS of the worker nodes from the one required by the experiment. Sites can set up worker nodes with any Singularity-supported OS, and CMS will choose the appropriate OS image for each job.
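
As a rough illustration of the mechanism, the sketch below shows how a job wrapper could pick a container image matching the OS a payload requires and launch the payload with \texttt{singularity exec}. This is not the actual CMS pilot logic; the image paths, the mapping and the payload command are illustrative assumptions.

\begin{verbatim}
# Sketch of per-job OS selection with Singularity; not the actual
# CMS glidein wrapper. Image paths are illustrative placeholders.
import subprocess

# Map the OS a payload declares to a container image providing it.
OS_IMAGES = {
    "rhel6": "/cvmfs/some-repo.example/images/rhel6",
    "rhel7": "/cvmfs/some-repo.example/images/rhel7",
}

def run_payload(required_os, command):
    """Run `command` inside a container providing `required_os`,
    independently of the OS installed on the worker node."""
    image = OS_IMAGES[required_os]
    return subprocess.run(
        ["singularity", "exec", "--bind", "/cvmfs", image] + command,
        check=True)

# Example: a payload built for RHEL6 running on any worker node.
run_payload("rhel6", ["cmsRun", "job_config.py"])
\end{verbatim}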

CMS deployed a new version of the prompt reconstruction software in July 2018, during LHC MD2. This software is adapted to the detector upgrades and data-taking conditions, and reaches the production level of the alignment and calibration algorithms. Data collected before this point have since been reprocessed, providing a fully consistent dataset for analysis in time for the Moriond 2019 conference. Production and distributed analysis activities continued at a very high level throughout 2018. The MC17 campaign, to be used for the Winter and Summer 2018 conferences, continued throughout the year with decreasing utilization of resources; overall, more than 15B events were available by the summer. The equivalent simulation campaign for 2018 data, MC18, started in October 2018 and is now almost completed.

Developments to increase CMS throughput and disk usage efficiency continue. Of particular interest is the development of the NanoAOD data tier as a new alternative for analysis users.
The NanoAOD size per event is approximately 1~kB, 30--50 times smaller than the MiniAOD data tier, and it relies only on simple data types rather than the hierarchical data format structure of the CMS MiniAOD (and AOD) data tiers. NanoAOD samples for the 2016, 2017 and 2018 data and the corresponding Monte Carlo simulation have been produced, and are being used in many analyses. NanoAOD is now automatically produced in all the central production campaigns, and fast reprocessing campaigns from MiniAOD to NanoAOD have been tested and are able to achieve more than 4B events per day using only a fraction of the CMS resources.
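
To give a concrete idea of why the flat structure matters for analysis, the sketch below reads a couple of columns directly as arrays, with no CMS framework libraries needed. It assumes the uproot Python package; the file name is a placeholder and the branch names are only indicative of the NanoAOD naming scheme.

\begin{verbatim}
# Sketch of reading a NanoAOD-style flat ntuple; file name and branch
# names are illustrative, not taken from a real sample.
import uproot

# NanoAOD stores plain columns (numbers and arrays of numbers) in a
# single "Events" tree, so a generic ROOT I/O library is sufficient.
events = uproot.open("nanoaod_sample.root")["Events"]

# Load two columns into arrays; each event costs about 1 kB on disk,
# 30-50 times less than MiniAOD.
columns = events.arrays(["Muon_pt", "Muon_eta"])
print("events read:", len(columns))
\end{verbatim}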


\section{CMS WLCG Resources and expected increase}
The CMS computing model has been used to request resources for the 2018--19 Run~II data taking and reprocessing, with total requests (Tier-0 + Tier-1s + Tier-2s) exceeding 2073 kHS06 of CPU, 172 PB of disk, and 320 PB of tape.
However, the actual pledged resources have been substantially lower than the requests due to budget restrictions from the funding agencies. To reduce the impact of this issue, CMS was able to achieve and deploy several technological advancements, including reducing the amount of AOD(SIM) kept on disk and the amount of simulated RAW events kept on tape. In addition, some computing resource providers were able to provide more than their pledged level of resources to CMS during 2018.
Thanks to the optimizations and technological improvements described above, it has been possible to tune the CMS computing model accordingly. Year-by-year increases, which would have been large with the reference computing model, have been reduced substantially.

Italy contributes to CMS computing with 13\% of the Tier-1 and Tier-2 resources. The increase of the CNAF pledges for 2019 has been reduced by a factor of two with respect to the original request, due to INFN budget limitations, and the remaining increase has been postponed to 2021.
The 2019 pledges are therefore 78 kHS06 of CPU, 8020 TB of disk, and 26 PB of tape.

CMS usage of CNAF is very intense: it is one of the largest Tier-1s in CMS in terms of processed hours, second only to the US Tier-1; the same holds for the total number of processed jobs, as shown in Fig.~\ref{cms-jobs}.


\begin{figure}
\begin{center}
\includegraphics[width=0.8\textwidth,bb=0 0 900 900]{tier1-jobs-2018.pdf}
\end{center}
\caption{\label{cms-jobs}Jobs processed at the CMS Tier-1s during 2018.}
\end{figure}



\section{The CNAF flood incident}
On November 9th, 2017, a major incident occurred when the CNAF computing centre was flooded.
This caused an interruption of all CNAF services and damaged many disk arrays and servers, as well as the tape library. About 40 of the damaged tapes (out of a total of 150) belonged to CMS; they contained the unique copy of MC and RECO data, while six of them contained a second custodial copy of RAW data.
A special recovery procedure was adopted by the CNAF team through a specialized company, and no data have been permanently lost.

The impact of this incident on CMS, although serious, was mitigated thanks to the intrinsic redundancy of the CMS distributed computing model. Other Tier-1s temporarily increased their share to compensate for the CPU loss, deploying their 2018 pledges as soon as possible.
A full recovery of the CMS services at CNAF was achieved by the beginning of March 2018.

It is important to point out that, despite the incident affecting the first months of 2018, the integrated site readiness of CNAF in 2018 was very good, at the same level as or better than that of the other CMS Tier-1s, see Fig.~\ref{tier1-cms-sr}.

\begin{figure}
\begin{center}
\includegraphics[width=0.8\textwidth,bb=0 0 900 900]{tier1-readiness-2018.pdf}
\end{center}
\caption{\label{tier1-cms-sr}Site readiness of the CMS Tier-1s in 2018.}
\end{figure}


\section{Conclusions}
CNAF is an important asset for the CMS Collaboration, being the second-largest Tier-1 in terms of resource utilization, pledges and availability.
The unfortunate incident at the end of 2017 was managed professionally and efficiently by the CNAF staff, guaranteeing the fastest possible recovery, with minimal data losses, at the beginning of 2018.


\section*{References}
\begin{thebibliography}{9}
\bibitem{CMS-descr}CMS Collaboration, The CMS experiment at the CERN LHC, JINST 3 (2008) S08004,
doi:10.1088/1748-0221/3/08/S08004.
\bibitem{singu} Singularity, http://singularity.lbl.gov/
\end{thebibliography}

\end{document}