\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\usepackage{url}
\usepackage{color, colortbl}
\definecolor{LightCyan}{rgb}{0.88,1,1}
\definecolor{LightYellow}{rgb}{1,1,0.88}
\definecolor{Red}{rgb}{1,0,0}
\definecolor{Green}{rgb}{0,1,0}
\definecolor{MediumSpringGreen}{rgb}{0,0.98,0.6} %rgb(0,250,154)
\definecolor{Gold}{rgb}{1,0.84,0}%rgb(255,215,0)
\definecolor{Gainsboro}{rgb}{0.86,0.86,0.86}%rgb(220,220,220)
\begin{document}
\title{The INFN Tier 1}
\author{Luca dell'Agnello$^1$}
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{luca.dellagnello@cnaf.infn.it}
\section{Introduction}
CNAF hosts the Italian Tier 1 data center for WLCG: over the years, Tier 1 has become the main computing facility for INFN.
Nowadays, besides the four LHC experiments, the INFN Tier 1 provides services and resources to 30 other scientific collaborations,
including BELLE2 and several astro-particle experiments (see Table \ref{T1-pledge}).
As shown in Fig.~\ref{pledge2018}, besides LHC, the main users are the astro-particle experiments.
\begin{figure}[h]
\begin{center}
\includegraphics[keepaspectratio,width=15cm]{pledge.png}
\caption{\label{pledge2018}Relative requests of resources at INFN Tier 1}
\end{center}
\end{figure}
Despite the flooding that occurred at the end of 2017, we were able to provide the resources committed to the experiments for 2018 almost on schedule.
\begin{table}
\begin{center}
\begin{tabular}{l|rrr}
\br
\textbf{Experiment}&\textbf{CPU (kHS06)}&\textbf{Disk (PB-N)}&\textbf{Tape (PB)}\\
\hline
\rowcolor{MediumSpringGreen}
ALICE&52020&5185&13497\\
\rowcolor{MediumSpringGreen}
ATLAS&85410&6480&17550\\
\rowcolor{MediumSpringGreen}
CMS&72000&7200&24440\\
\rowcolor{MediumSpringGreen}
LHCB&46805&5606&11400\\
\rowcolor{MediumSpringGreen}
\hline
\textbf{LHC Total}&\textbf{256235}&\textbf{24471}&\textbf{66887}\\
\hline
\rowcolor{LightYellow}
Belle2&13000&350&0\\
\rowcolor{LightYellow}
CDF&0&0&4000\\
\rowcolor{LightYellow}
Compass&40&10&40\\
\rowcolor{LightYellow}
KLOE&0&33&3075\\
\rowcolor{LightYellow}
LHCf&6000&90&0\\
\rowcolor{LightYellow}
NA62&3000&250&200\\
\rowcolor{LightYellow}
PADME&1500&10&500\\
\rowcolor{LightYellow}
LHCb Tier2&26085&0&0\\
\rowcolor{LightYellow}
\hline
\rowcolor{LightYellow}
\textbf{CSN 1 Total}&\textbf{49625}&\textbf{743}&\textbf{7815}\\
\hline
\rowcolor{LightCyan}
AMS&15800&1990&510\\
\rowcolor{LightCyan}
ARGO&0&120&1000\\
\rowcolor{LightCyan}
Auger&2000&615&0\\
\rowcolor{LightCyan}
BOREX&2000&185&41\\
\rowcolor{LightCyan}
CTA&4000&796&120\\
\rowcolor{LightCyan}
CUORE&1900&262&0\\
\rowcolor{LightCyan}
Cupid&100&15&10\\
\rowcolor{LightCyan}
DAMPE&8000&200&100\\
\rowcolor{LightCyan}
DARKSIDE&2000&980&300\\
\rowcolor{LightCyan}
ENUBET&500&10&0\\
\rowcolor{LightCyan}
EUCLID&1000&1042&0\\
\rowcolor{LightCyan}
Fermi&500&15&40\\
\rowcolor{LightCyan}
Gerda&40&45&40\\
\rowcolor{LightCyan}
Icarus&4000&500&1500\\
\rowcolor{LightCyan}
JUNO&3000&230&0\\
\rowcolor{LightCyan}
KM3&300&250&200\\
\rowcolor{LightCyan}
LHAASO&300&60&0\\
\rowcolor{LightCyan}
LIMADOU&400&8&0\\
\rowcolor{LightCyan}
LSPE&1000&14&0\\
\rowcolor{LightCyan}
MAGIC&296&65&150\\
\rowcolor{LightCyan}
NEWS&200&60&60\\
\rowcolor{LightCyan}
Opera&200&15&15\\
\rowcolor{LightCyan}
PAMELA&650&100&150\\
\rowcolor{LightCyan}
Virgo&30000&656&1368\\
\rowcolor{LightCyan}
Xenon100&1000&200&1000\\
\rowcolor{LightCyan}
\hline
\rowcolor{LightCyan}
\textbf{CSN 2 Total}&\textbf{79186}&\textbf{8433}&\textbf{6604}\\
\hline
\rowcolor{Gainsboro}
FOOT&200&20&0\\
\rowcolor{Gainsboro}
Famu&2250&15&187\\
\rowcolor{Gainsboro}
GAMMA/AGATA&0&0&1160\\
\rowcolor{Gainsboro}
NEWCHIM/FARCOS&0&10&300\\
\rowcolor{Gainsboro}
\hline
\rowcolor{Gainsboro}
\textbf{CSN 3 Total}&\textbf{2450}&\textbf{45}&\textbf{1460}\\
\hline \hline
\rowcolor{Green}
\textbf{Grand Total}&\textbf{387496}&\textbf{33692}&\textbf{82766}\\
\rowcolor{Green}
\textbf{Installed}&\textbf{340000}&\textbf{34000}&\textbf{71000}\\
\br
\end{tabular}
\end{center}
\caption{Pledged and installed resources at INFN Tier 1 in 2018 (for the CPU power an overlap factor is applied). CSN 1, CSN 2 and CSN 3 are the National Scientific Committees of the INFN, respectively, for experiments in high energy physics with accelerators, astro-particle experiments and experiments in nuclear physics with accelerators.}
\label{T1-pledge}
\hfill
\end{table}
\subsection{Out of the mud}
The year 2018 began with the recovery procedures of the data center after the flooding of November 2017.
Despite the serious damage to the power plants (both power lines were compromised), immediately after the flooding we started the recovery of both the infrastructure and the IT equipment. The first mandatory intervention was to restore at least one of the two power lines (with a leased UPS in the first period). This goal was achieved during December 2017.
In January, after the restart of the chillers, we could proceed to re-open all services, including part of the farm (at the beginning only $\sim$50 kHS06, one fifth of the total power capacity, was online, while 13\% had been lost) and, one by one, the storage systems.
The first experiments to resume operations at CNAF were Alice, Virgo and Darkside:
in fact, the storage system used by Virgo and Darkside had been easily recovered after the Christmas break, while Alice is able to use computing resources relying on remote storage. During February and March, we progressively re-opened the services for all the other experiments.
%(Fig.\ref{farm2018} shows the restart of the farm). Meanwhile, we had setup a new partition of the farm hosted at CINECA super-computing center premises (see Par.~\ref{CINECAext}).
The final damage inventory shows the loss of $\sim$30 kHS06,
1.4 PB of data and 60 tapes; on the other hand, it was possible to repair all the other systems, recovering $\sim$20 PB of data.
As for the infrastructure, the second power line was also recovered (see \cite{FLOODCHEP} for details).
%\begin{figure}[h]
% \begin{center}
% \includegraphics[width=40pc]{t1-img/farm2018.png}\hspace{2pc}%
% \caption{\label{farm2018}Farm usage in 2018}
% \end{center}
%\end{figure}
\subsection{The long-term consequences of the flooding}
The data center was designed taking into account all foreseeable accidents (e.g. fires, power outages), but not very unlikely events
such as the breaking of one of the main water pipelines of Bologna, located in a road next to CNAF,
which is precisely what happened in November 2017.
In fact, it was believed that the only threat from water could come from very heavy rain and, indeed,
waterproof doors had been installed some years earlier, after a heavy rain.
The post-mortem analysis showed that the causes, besides the breaking of the pipe, are to be found in the unfavorable position of the data center (two underground levels) and in the excessive permeability of its perimeter (while the anti-flood doors worked). An intervention has therefore been carried out to improve the waterproofing of the data center and, moreover, work is planned for summer 2019 to strengthen the perimeter of the building and to build a second water collection tank.
Even if the search for a new location for the data center had started before the flooding (the main driver being its limited expandability, unable to cope with the foreseen requirements of the HL-LHC era, when we should scale up to 10 MW of power for IT), the flooding gave us a second strong reason to move.
An opportunity is given by the new ECMWF center, which will be hosted in Bologna, in the new Technopole area, starting from 2019.
In the same area the INFN Tier 1 and CINECA\footnote{CINECA is the Italian supercomputing center, also located near Bologna ($\sim$17 km from CNAF). See \url{http://www.cineca.it/}} computing centers can be hosted too: funding for this has been guaranteed to INFN and CINECA by the Italian Government. The goal is to have the new data center for the INFN Tier 1 fully operational by the end of 2021.
\section{INFN Tier 1 extension at CINECA}\label{CINECAext}
Out of the 400 kHS06 of CPU power (340 kHS06 pledged) of the CNAF farm, $\sim$180 kHS06 are provided by servers installed in the CINECA data center.
%Each server is equipped with a 10 Gbit uplink connection to the rack switch while each of them, in turn, is connected to the aggregation router with 4x40 Gbit links.
The logical network of the farm partition at CINECA is set as an extension of INFN Tier 1 LAN: a dedicated fiber couple interconnects the aggregation router at CINECA with the core switch at the INFN Tier 1 (see Farm and Network Chapters for more details). %Fig.~\ref{cineca-t1}).
%The transmission on the fiber is managed by a couple of Infinera DCI, allowing to have a logical channel up to 1.2 Tbps (currently it is configured to transmit up to 400 Gbps).
%\begin{figure}
% % \begin{minipage}[b]{0.45\textwidth}
% \begin{center}
% \includegraphics[width=30pc]{t1-img/cineca-t1.png}
% \caption{\label{cineca-t1}Schematic view of the CINECA - INFN Tier-1 interconnection}
% \end{center}
% % \end{minipage}
%\end{figure}
These nodes, in production since March 2018 for the WLCG experiments, have been gradually opened to all the other collaborations. %Due the low latency (the RTT is 0.48 ms vs. 0.28 ms measured on the CNAF LAN), there is no need of a disk cache on the CINECA side and the WNs directly access the storage located at CNAF; in fact, the
The efficiency of the jobs\footnote{The efficiency of a job is defined as the ratio between its CPU time and its wall-clock time.} is comparable to the one measured on the farm partition at CNAF.
Since this partition has been installed from the beginning with CentOS 7, legacy applications requiring a different flavour of operating system can use it through the Singularity container technology~\cite{singularity}.
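As an illustration, the efficiency metric defined in the footnote can be computed as follows (a minimal sketch; the function name and the sample numbers are ours, not part of any CNAF tooling):

```python
def job_efficiency(cpu_time_s: float, wall_time_s: float) -> float:
    """Job efficiency: CPU time divided by wall-clock time."""
    return cpu_time_s / wall_time_s

# Hypothetical example: a job that accumulated 6 h of CPU time
# over 8 h of wall-clock time ran at 75% efficiency.
eff = job_efficiency(6 * 3600, 8 * 3600)
print(eff)  # 0.75
```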
%Moreover, this partition has undergone several reconfigurations due to both the hardware and the type of workflow of the experiments. In April we had to upgrade the BIOS to overcome a bug which was preventing the full resource usage, limiting at $\sim$~78\% of the total what we were getting from the nodes. Moreover a reconfiguration of the local RAID configuration of disks is ongoing\footnote{The initial choice of using RAID-1 for local disks instead of RAID-0 has been proven to slow down the system even if safer from an operational point of view.} as well as tests to choose the best number of computing slots.
\section*{References}
\begin{thebibliography}{9}
\bibitem{FLOODCHEP} L. dell'Agnello, ``Disaster recovery of the INFN Tier 1 data center: lesson learned'', to be published in Proceedings of the 23rd International Conference on Computing in High Energy and Nuclear Physics, EPJ Web of Conferences
\bibitem{singularity} \url{http://singularity.lbl.gov}
\end{thebibliography}
\end{document}
\usepackage{graphicx}
\begin{document}
\title{User and Operational Support at CNAF}
\author{D. Cesini$^1$, E. Corni$^1$, F. Fornari$^1$, L. Morganti$^1$, C. Pellegrino$^1$, M. V. P. Soares$^1$, M. Tenti$^1$, L. Dell'Agnello$^1$}
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{user-support@lists.cnaf.infn.it}
\begin{abstract}
Many different research groups, typically organized in Virtual Organizations (VOs),
exploit the Tier 1 data center facilities for computing and/or data storage and management. Moreover, CNAF hosts two small HPC farms and a Cloud infrastructure. The User Support unit provides the users of all CNAF facilities with direct operational support, and promotes common technologies and best practices to access the ICT resources, in order to facilitate the usage of the center and maximize its efficiency.
\end{abstract}
\section{Current status}
Born in April 2012, the User Support team in 2018 was composed of one coordinator and up to five fellows with post-doctoral education or equivalent work experience in scientific research or computing.
The main activities of the team include:
\begin{itemize}
\item providing a prompt feedback to VO-specific issues via ticketing systems or official mail channels;
\item forwarding to the appropriate Tier 1 units those requests which cannot be autonomously satisfied, and taking care of answers and fixes, e.g. via the tracker JIRA, until a solution is delivered to the experiments;
\item supporting the experiments in the definition and debugging of computing models in distributed and Cloud environments;
\item helping the supported experiments by developing code, monitoring frameworks and writing guides and documentation for users (see e.g. https://www.cnaf.infn.it/en/users-faqs/);
\item solving issues on experiment software installation, access problems, new accounts creation and any other daily usage problems;
\item porting applications to new parallel architectures (e.g. GPUs and HPC farms);
\item providing the Tier 1 Run Coordinator, who represents CNAF at the Daily WLCG calls, and reports about resource usage and problems at the monthly meeting of the Tier 1 management body (Comitato di Gestione del Tier 1).
\end{itemize}
People belonging to the User Support team represent INFN Tier 1 inside the VOs.
In some cases, they are directly integrated in the supported experiments. Moreover, they can play the role of a member of any VO for debugging purposes.
The User Support staff is also involved in different CNAF internal projects, notably the Computing on SoC Architectures (COSA) project (www.cosa-project.it) dedicated to the technology tracking and benchmarking of the modern low-power architectures for computing applications.
\section{Supported experiments}
The LHC experiments represent the main users of the data center, handling more than 80\% of the total computing and storage resources funded at CNAF. Besides the four LHC experiments (ALICE, ATLAS, CMS, LHCb), for which CNAF acts as a Tier 1 site, the data center also supports an ever-increasing number of experiments from the Astrophysics, Astroparticle physics and High Energy Physics domains, specifically Agata, AMS-02, Auger, Belle II, Borexino, CDF, Compass, COSMO-WNEXT, CTA, Cuore, Cupid, Dampe, DarkSide-50, Enubet, Famu, Fazia, Fermi-LAT, Gerda, Icarus, LHAASO, LHCf, Limadou, Juno, Kloe, KM3Net, Magic, NA62, Newchim, NEWS, NTOP, Opera, Padme, Pamela, Panda, Virgo, and XENON.
Clearly, a bigger effort from the User Support team is needed to answer the varied and diverse needs of these non-LHC experiments and to encourage them to adopt more modern technologies, e.g. FTS, Dirac, token-based authorization.
\begin{figure}[ht]
The following figures show resources pledged and used by the supported experiments.
Unfortunately, the accounting data for storage, both disk and tape statistics, are available only after summer 2018, since the restoration of the complex system of accounting sensors after the 2017 flooding had a lower priority than the complete recovery of the storage resources involved in the flood.
\section{Support to HPC and cloud-based experiment}
Apart from Tier 1 facilities, CNAF hosts two small HPC farms and a cloud infrastructure. The first HPC cluster, in production since 2015, is composed of 27 nodes, some of them also equipped with one or more GPUs (NVIDIA Tesla K20, K40 and K1). All nodes are infiniband interconnected and equipped with 2 Intel CPUs, 8 physical cores each, HyperThread enabled. The cluster is accessible via the LSF batch system. It is open to various INFN communities, but the main users are theoretical physicists dealing with plasma laser acceleration simulations. The cluster is used as a testing infrastructure to prepare the high resolution runs to be submitted afterwards to supercomputers.
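A quick back-of-the-envelope count of the computing slots offered by this cluster, assuming the configuration quoted above (27 nodes, 2 CPUs of 8 physical cores each, HyperThreading enabled):

```python
# Illustrative arithmetic only; the numbers are taken from the text above.
nodes = 27
cpus_per_node = 2
cores_per_cpu = 8          # physical cores per CPU
ht_factor = 2              # HyperThreading doubles the logical cores
logical_cores = nodes * cpus_per_node * cores_per_cpu * ht_factor
print(logical_cores)       # 864 logical cores in total
```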
A second HPC cluster entered production in 2017 to serve the CERN accelerator R\&D groups. The cluster consists of 12 OmniPath-interconnected nodes. It can be accessed through batch queues managed by the IBM LSF system.
Support is provided on a daily basis for software installation, access problems, new account creation and any other usage problems.
The User Support team manages an OpenStack-based tenant hosted within the Cloud@CNAF. This tenant, provided with 300 vCPUs, is mostly devoted to supporting particular use cases which require unusual software configurations, and only for a limited amount of time. The most important of these use cases is the FAZIA experiment, for which 256 vCPUs were provided, distributed over 16 worker nodes with 8 GB of RAM each, where the Debian 8.4 operating system has been installed and configured with LDAP and Kerberos for user authentication and authorization, and NFS 4 for network storage sharing.
Recently, other experiments started accessing the Cloud infrastructure: AMS, EEE, Icarus and NTOF.
\end{document}
%\author{P. Astone$^1$, F. Badaracco$^{2,3}$, S. Bagnasco$^4$, S. Caudill$^5$, F. Carbognani$^6$, A. Cirone$^{7,8}$, G. Fronz\'e$^{4}$, J. Harms$^{2,3}$, I. LaRosa$^1$, C. Lazzaro$^9$, P. Leaci$^1$, S. Lusso$^4$, C. Palomba$^1$, R. DePietri$^{11,12}$, M. Punturo$^{10}$, L. Rei$^8$, L. Salconi$^6$, S. Vallero$^{4}$, on behalf of the Virgo collaboration}
\author{P. Astone$^1$, F. Badaracco$^{2,3}$, S. Bagnasco$^4$, S. Caudill$^5$, F. Carbognani$^6$, A. Cirone$^{7,8}$, M. Drago$^{2,3}$, G. Fronz\'e$^{4}$, J. Harms$^{2,3}$, I. LaRosa$^1$, C. Lazzaro$^9$, P. Leaci$^1$, S. Lusso$^4$, C. Palomba$^1$, R. DePietri$^{11,12}$, M. Punturo$^{10}$, L. Rei$^8$, L. Salconi$^6$, S. Vallero$^{4}$, on behalf of the Virgo collaboration}
\address{$^1$ INFN Sezione di Roma, Roma, IT}
\address{$^2$ Gran Sasso Science Institute (GSSI), L'Aquila, IT}
\address{$^3$ INFN Laboratori Nazionali del Gran Sasso, L'Aquila, IT}
\address{$^4$ INFN Sezione di Torino, Torino, IT}
\address{$^5$ Nikhef, Amsterdam, NL}
\address{$^6$ EGO-European Gravitational Observatory, Cascina (PI), IT}
\address{$^7$ Universit\`a degli Studi di Genova, Genova, IT}
\address{$^8$ INFN Sezione di Genova, Genova, IT}
\address{$^9$ INFN Sezione di Padova, Padova, IT}
\address{$^{10}$ INFN Sezione di Perugia, Perugia, IT}
\address{$^{11}$ Universit\`a degli Studi di Parma, Parma, IT}
\address{$^{12}$ INFN Gruppo Collegato Parma, Parma, IT}
%\address{Production Editor, \jpcs, \iopp, Dirac House, Temple Back, Bristol BS1~6BE, UK}
\section{Advanced Virgo computing model}
\subsection{Data production and data transfer}
The Advanced Virgo data acquisition system writes about 35 MB/s of data (so-called ``bulk data'') during O3. CNAF and CC-IN2P3 are the Virgo Tier 0: during the science runs, bulk data are stored in a circular buffer located at the Virgo site, and simultaneously transferred to the remote computing centers, where they are archived in tape libraries. The transfer is realized through an ad-hoc procedure based on GridFTP (at CNAF) and iRods (at CC-IN2P3). Other data fluxes reach CNAF during science runs:
\begin{itemize}
\item trend data (a few GB/day), periodically transferred using the system described above;
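The bulk-data rate quoted above translates into the following approximate daily volume (illustrative arithmetic only; the rate is taken from the text):

```python
# 35 MB/s of bulk data, sustained over a full day.
rate_mb_per_s = 35
seconds_per_day = 86400
daily_tb = rate_mb_per_s * seconds_per_day / 1e6  # MB -> TB
print(round(daily_tb, 1))  # about 3.0 TB of bulk data per day
```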
\subsection{Data Analysis at CNAF}
%The analysis of the LIGO and Virgo data was made jointly by the two collaborations; the analysis pipelines are distributed among the worldwide network of computing facilities offering computing resources to the GW experiments. CNAF was mainly used for CW analysis, looking for continuous gravitational wave signals, developed by INFN–Roma people (see hereafter more details). But at CNAF is also running part of the pyCBC pipeline, submitted via OSG, looking for compact binaries signals. pyCBC has a crucial role in the detection of the coalescence of BBH and BNS. CNAF contributed to the computation performed through pyCBC for the analysis of the events GW170814, the first BBH coalescence detected also by Virgo, and GW170817, the BNS coalescence. During the last month a new extension of CVMFS, \emph{big cvmfs} was mounted at cnaf to support another OSG pipeline, \emph{BayesWave}. The big cvmfs is able to export, in a posix fashion, big file of data from nearby cache in Amsterdam instead of accessing data directly from Nebraska. BayesWave is a Bayesian algorithm designed to robustly distinguish gravitational wave signals from noise and instrumental glitches without relying on any prior assumptions of waveform morphology. In the last year coherent WaveBurst \emph{cwb} was ported to cnaf and made available to run. cwb is a pipeline based on coherent algorithm for detection and reconstruction of modelled and unmodelled GW bursts. A new newtonian noise cancellation algoritmh, developed by the group of Gran Sasso Science Institute (\emph{GSSI}) was made available very recently. The increased number of LVC pipelines running at cnaf has led to saturate advance virgo pledge at cnaf, cnaf promptly rensponded to advance virgo needed enlargin our quota and giving experimental access to gpu.
LIGO-Virgo data analysis is organized jointly, meaning that the analysis pipelines are made available to the computing facilities related to the LVC network, ready to be distributed to each GW detector. CNAF has been mainly used for Continuous Wave (\emph{CW}) analysis, led by the Roma INFN group, and for the Compact Binary Coalescence python-based analysis (\emph{pyCBC}), submitted via OSG. In particular CNAF computationally contributed to GW170814 and GW170817 events, respectively the first BBH coalescence detected by Virgo and the first BNS merger ever observed. During the last month a new extension of CVMFS, so-called ``big cvmfs'', was mounted at CNAF to support another OSG-based pipeline, Bayes Wave. The former is able to make available, in a POSIX-like fashion, big data files from a cache in Amsterdam, instead of accessing the data directly from Nebraska. The latter is a Bayesian algorithm, designed to robustly distinguish GW signals from noise and instrumental glitches, without relying on any prior assumptions on the waveform shape. During the last year, coherent WaveBurst (\emph{cWB}), an algorithm dedicated to the detection and reconstruction of GW Bursts, was also ported to CNAF. Furthermore, new Newtonian Noise cancellation algorithms, which are currently being developed by the GSSI group, were made recently available. The increasing number of LVC pipelines running at CNAF has led to resource saturation, and consequently to a demand for enlarged computing power, together with access to GPUs.
\subsubsection{CW pipeline}
In 2018 CNAF has been the main computing center for the Virgo all-sky continuous wave (CW) searches. The search for this kind of signal, emitted by spinning neutron stars, covers a large portion of the source parameter space and consists of several steps organized in a hierarchical analysis pipeline. CNAF has been mainly used for the ``incoherent'' stage, based on a particular implementation of the Hough transform, which is the computationally heaviest part of the analysis. The code implementing the Hough transform has been written in such a way that the exploration of the parameter space can be split into several independent jobs, each covering a range of signal frequencies and a portion of the sky. This is an embarrassingly parallel problem, very well suited to a distributed computing environment. The analysis jobs have been run using the EGI UMD grid middleware, with input and output files stored in a StoRM-based Storage Element at CNAF. Candidate post-processing, consisting of clusterization, coincidences and ranking, and parts of the candidate follow-up analysis have also been carried out at CNAF. A typical Hough transform job needs about 4 GB of memory (with a fraction requiring more, up to 8 GB). During the past year most of the resources have been used to analyze Advanced LIGO O2 data. Overall, in 2018 more than 10 million CPU hours have been used at CNAF for CW searches, running O($10^5$) jobs with durations from a few hours to $\sim$3 days.
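The splitting of the parameter space into independent jobs described above can be sketched as follows (a hypothetical illustration: the function name, frequency range and sky-patch count are our assumptions, chosen only to reproduce a job count of order $10^5$, not the actual pipeline parameters):

```python
def make_jobs(f_min_hz, f_max_hz, band_hz, n_sky_patches):
    """Split the (frequency, sky) parameter space into independent jobs."""
    jobs = []
    f = f_min_hz
    while f < f_max_hz:
        for patch in range(n_sky_patches):
            jobs.append({"f_lo": f, "f_hi": min(f + band_hz, f_max_hz),
                         "sky_patch": patch})
        f += band_hz
    return jobs

# Illustrative numbers: a 20-2048 Hz range in 1 Hz slices, 50 sky patches.
jobs = make_jobs(20, 2048, 1, 50)
print(len(jobs))  # 101400 independent jobs, i.e. of order 1e5
```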
\subsubsection{cWB pipeline}
Starting in 2019, the coherent WaveBurst (cWB) based pipelines have been ported and adapted to run at CNAF, reproducing the cWB environment setup on the worker nodes without the constraint of reading the user home account during running. It is planned to run at CNAF all Virgo offline long-duration all-sky searches on the data collected during the Observational Run 3 (O3), which started April 1, 2019. cWB is a data-analysis tool to search for a broad range of gravitational-wave (GW) transients. The pipeline identifies coincident events in the GW data from Earth-based interferometric detectors and reconstructs the gravitational-wave signal by using a constrained maximum likelihood approach. The algorithm performs a time-frequency analysis of the data, using a wavelet representation, and identifies events by clustering time-frequency pixels with significant excess coherent power. The likelihood statistic is built as a coherent sum over the responses of the different detectors and estimates the total signal-to-noise ratio of the GW signal in the network. The pipeline splits the total analysis time into sub-periods to be analyzed in parallel jobs, using HTCondor tools, and it is expected to use a considerable amount of CPU hours during 2019.
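The splitting of the total analysis time into sub-periods can be illustrated with a minimal sketch (the segment length and the overlap between consecutive segments are illustrative assumptions, not the actual cWB settings):

```python
def split_run(t_start, t_stop, segment_s, overlap_s=0):
    """Split a total analysis time span into sub-periods for parallel jobs.
    A small overlap between consecutive segments avoids losing candidate
    events sitting exactly at a segment boundary."""
    if overlap_s >= segment_s:
        raise ValueError("overlap must be shorter than a segment")
    segments = []
    t = t_start
    while t < t_stop:
        segments.append((t, min(t + segment_s, t_stop)))
        t += segment_s - overlap_s
    return segments

# one week of data in 600 s segments with 60 s overlap (illustrative numbers)
segs = split_run(0, 7 * 86400, 600, 60)
```

Each (start, stop) pair would then correspond to one HTCondor job.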
\subsubsection{Newtonian noise pipeline}
The cancellation of gravitational noise from seismic fields will be a major challenge both from the theoretical and the computational point of view, since the involved simulations are very demanding. This activity requires the accurate positioning of a large number of seismometers. A cluster at CNAF was used to run position optimisations of the seismic arrays used for cancellation, and to determine the cancellation performance as a function of the number of sensors and its robustness with respect to sensor-positioning accuracy.
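As an illustration of this kind of study, a toy random-search optimisation of sensor positions might look as follows (the objective function below is a stand-in for the real Newtonian-noise cancellation model, and all parameters are invented):

```python
import random

def optimize_array(n_sensors, n_trials, residual, seed=0):
    """Toy random-search optimisation of seismometer positions: sample
    candidate layouts and keep the one with the lowest residual noise."""
    rng = random.Random(seed)
    best_pos, best_res = None, float("inf")
    for _ in range(n_trials):
        pos = [(rng.uniform(-50.0, 50.0), rng.uniform(-50.0, 50.0))
               for _ in range(n_sensors)]
        r = residual(pos)
        if r < best_res:
            best_pos, best_res = pos, r
    return best_pos, best_res

# stand-in objective: favour sensors close to the test mass at the origin
def toy_residual(pos):
    return sum(x * x + y * y for x, y in pos)

layout, res = optimize_array(5, 200, toy_residual)
```

In the real study the residual is given by the seismic-field simulation, which is what makes the problem computationally demanding; the search structure, however, parallelises trivially over trials.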
\subsection{Outlook}
The first detection of gravitational waves (GW) and the birth of multi-messenger astrophysics have opened a new field of scientific research. With the possibility to detect GW from various kinds of sources we can probe new physical phenomena in regions of the Universe we could not explore before, with new perspectives on our knowledge of how it works.
Indeed, so far only signals from the coalescence of compact objects have been detected, while one of the most interesting and promising classes of continuous GW signals, coming from asymmetric rotating neutron stars, is still missing. Wide searches for this kind of signal require a huge amount of computational power, because the Doppler effect due to the Earth's motion disrupts the incoming signal and dramatically increases the parameter space. This means that it is necessary to develop complex algorithms to reduce the computational power needed, at the price of significantly reducing the sensitivity of the search.
The development of new algorithms, which use the high efficiency and computational power of modern GPUs, showed that the new codes on a single GPU can run with a factor of ten speed-up with respect to the older ones on a ten times more expensive multi-core CPU.
For the CW case, using real data from the 9-month-long run of the LIGO detectors, we have estimated that on a cluster of about 200 GPUs a complete search can be done in about a couple of months, to be compared with the several months required by the older code on a 2000-CPU cluster.\\ A GPU cluster would also be extremely useful to test and train Machine Learning algorithms, which in recent years were shown to be able to face very complex analyses with high efficiency and speed.\\
Advanced Virgo and Advanced LIGO are also exploring different technologies to face the new challenges of GW physics. The growing number of computing centers involved in GW research forces us to rethink our approach to computing, searching for a way to uniformly run different pipelines on complex and heterogeneous infrastructures. For example, the de-supporting of GridFTP pushes towards the use of Rucio, a well-supported and flexible tool for data transfer and management, while the de-supporting of the CREAM-CE suggests a redesign of the job submission strategy, possibly under the control of an overall management system like DIRAC. \\ The CNAF staff is intensively supporting Virgo members in all these tests.
\documentclass[a4paper,12pt]{jpconf}
\usepackage[american]{babel}
\usepackage{geometry}
%\usepackage{fancyhdr}
\usepackage{graphicx}
\geometry{a4paper,top=4.0cm,left=2.5cm,right=2.5cm,bottom=2.7cm}
%\usepackage[mmm]{fncychap}
%\fancyhf{} % azzeriamo testatine e piedino
%\fancyhead[L]{\thepage}
%\renewcommand{\sectionmark}[1]{\markleft{\thesection.\ #1}}
%\fancyhead[R]{\bfseries\leftmark}
%\rhead{XENON computing activities}
\begin{document}
\title{XENON computing model}
%\pagestyle{fancy}
\author{Marco Selvi$^1$}
\address{$^1$ INFN Sezione di Bologna, Bologna, IT}
\ead{marco.selvi@bo.infn.it}
\begin{abstract}
The XENON project is dedicated to the direct search of dark matter at LNGS.
XENON1T was the largest double-phase TPC ever built and operated, with 2 t of active xenon; it was decommissioned in December 2018. It successfully set the best worldwide limit on the interaction cross-section of WIMPs with nucleons. In the context of rare-event search detectors, the amount of data (in the form of raw waveforms) was significant: of the order of 1 PB/year, including both Science and Calibration runs. The next phase of the experiment, XENONnT, is under construction at LNGS, with a 3 times larger TPC and a correspondingly increased data rate. Its commissioning is foreseen by the end of 2019.
We describe the computing model of the XENON project, with details of the data transfer and management, the massive raw data processing, and the production of Monte Carlo simulation.
All these topics are addressed by making the most efficient use of the computing resources, spread mainly across the US and the EU, thanks to the OSG and EGI facilities, including those available at CNAF.
\end{abstract}
\section{The XENON project}
\thispagestyle{empty}
The matter composition of the universe has been a debate topic
among scientists for centuries. In the last couple of decades a series
of astronomical and astrophysical measurements have corroborated
the hypothesis that ordinary matter (e.g. electrons, quarks,
neutrinos) represents only 15\% of the total matter in the universe.
The remaining 85\% is thought to be made of a
new, yet-undiscovered exotic species of elementary particles called
dark matter. This indirect evidence of its existence
has triggered a world-wide effort to try to observe its interaction with
ordinary matter in extremely sensitive detectors, but its nature is
still a mystery.
The XENON experimental program \cite{225, mc, instr-1T} is searching
for weakly interacting massive particles (WIMPs), hypothetical
particles that, if existing, could account for dark matter and
that might interact with ordinary matter through nuclear recoil.
XENON1T is the third generation of the experimental
program; it completed data taking at the end of 2018, setting the best worldwide limit on the interaction cross-section of WIMPs with nucleons.
The experiment employs a dual-phase (liquid-gas) xenon
time projection chamber (TPC) featuring two tonnes of ultrapure
liquid xenon as a target for WIMPs. The detector is designed
in such a way to be sensitive to rare nuclear recoils of xenon
nuclei possibly induced by WIMPs scattering within the detector.
The TPC is surrounded by a water-based muon veto (MV). Each
sub-detector is read out by its own data acquisition system (DAQ).
The detector is located underground at the INFN Laboratori Nazionali
del Gran Sasso in Italy to shield the experiment from cosmic rays.
XENON1T is an order of magnitude larger than any of its predecessor
experiments. This upscaling in detector size produced a
proportional increase in the data rate and computing needs of
the collaboration. The size of the data set required the collaboration
to transition from a centralized computing model, in which the entire
dataset is stored at a single local facility, to one in which the data
are distributed across the collaboration's resources. Similarly,
the computing requirements called for incorporating distributed
resources, such as the Open Science Grid (OSG) \cite{osg} and the European
Grid Infrastructure (EGI) \cite{egi}, for main computing tasks,
e.g. initial data processing and Monte Carlo production.
\section{XENON1T}
As far as the data flow is concerned, the XENON1T experiment uses a DAQ machine, hosted in the XENON1T service
building underground, to acquire data. The DAQ rate in DM mode is $\sim$1.3 TB/day, while in calibration mode it can be significantly larger: up to
$\sim$13 TB/day.
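Taking the two DAQ rates above, a back-of-the-envelope estimate of the yearly raw-data volume can be sketched as follows (the fraction of days spent in calibration mode is an assumption made purely for illustration):

```python
def yearly_volume_tb(dm_tb_per_day, calib_tb_per_day, calib_fraction, days=365):
    """Estimate the yearly raw-data volume (TB) from the two DAQ rates,
    weighting dark-matter and calibration days by calib_fraction."""
    daily = (1.0 - calib_fraction) * dm_tb_per_day + calib_fraction * calib_tb_per_day
    return daily * days

# assuming ~10% of the days in calibration mode (an assumption, not a
# number from the experiment), the quoted rates give roughly 0.9 PB/year
vol = yearly_volume_tb(1.3, 13.0, 0.10)
```

With that assumed split, the result is of the order of 1 PB/year, consistent with the overall figure quoted for XENON1T.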
A significant challenge for the collaboration has been that there is
no single institution that has the capacity to store the entire data set.
This requires the data to either be stored in a cloud environment
or be distributed across various collaboration institutions. Storing
the data in a cloud environment is prohibitively expensive at this
point. The data set size and the network traffic charges would
consume the entire computing budget several times over.
The only feasible option was to distribute the data across several
computing facilities associated with collaboration institutions.
The raw data are copied into {\it Rucio}, a data handling system. There are several Rucio endpoints, or Rucio
storage elements (RSE), around the world, including LNGS, NIKHEF, Lyon and Chicago. The raw data are replicated in at
least two locations, and there are two mirrored tape backups, at CNAF and in Stockholm, with 5.6 PB in total.
When the data have to be processed, they are first copied onto the Chicago storage and then processed using the OSG. The processed data are
then copied back to Chicago and become available for analysis.
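The ``at least two replicas'' policy can be illustrated with a toy placement routine (RSE names and usage figures below are invented; in the real system this bookkeeping is delegated to Rucio's replication rules):

```python
def place_replicas(dataset, rse_usage, n_copies=2):
    """Pick storage elements for a new dataset so that it has at least
    n_copies disk replicas, filling the least-used RSEs first.
    rse_usage maps RSE name -> currently used space in TB."""
    if n_copies > len(rse_usage):
        raise ValueError("not enough RSEs for the requested number of copies")
    chosen = sorted(rse_usage, key=rse_usage.get)[:n_copies]
    for rse in chosen:
        rse_usage[rse] += dataset["size_tb"]  # account for the new replica
    return chosen

# invented RSE names and usage figures, for illustration only
usage = {"LNGS": 40.0, "NIKHEF": 10.0, "LYON": 25.0, "CHICAGO": 5.0}
sites = place_replicas({"name": "raw_run_000123", "size_tb": 1.3}, usage)
```

Tape archival (at CNAF and in Stockholm) would be handled by separate rules on top of this disk placement.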
In addition, each user has a home space of 100 GB, available on a 10 TB disk. A dedicated server will take
care of the data transfer to/from remote facilities. A high-memory, 32-core machine is used to host several virtual
machines, each one running a dedicated service: code (data processing and Monte Carlo) and document repositories on
SVN/Git, the run database, the on-line monitoring web interface, the XENON wiki and the GRID UI.
In fig. \ref{fig:xenonCM} we show a sketch of the XENON computing model and data management scheme.
\begin{figure}[t]
\begin{center}
\includegraphics[width=15cm]{xenon-computing-model.pdf}
\end{center}
\caption{Overview of the XENON1T Job and Data Management Scheme.}
\label{fig:xenonCM}
\end{figure}
The resources at CNAF (CPU and disk) have been used so far mainly for the Monte Carlo simulation of the
detector (GEANT4 model of the detector and waveform generator), and for the ``real-data'' storage and processing.
Some improvements have recently been made by the Computing Working Group of the experiment. At the beginning the CNAF disk was not integrated into the Rucio framework, because it was not large enough to justify the amount of work needed for the integration (it was 60 TB up to 2016). For this reason we requested an additional 90 TB for 2018, to reach a total of 200 TB, an amount that the collaboration considers large enough to justify a full integration of the disk space.\\
The second improvement has been to perform the data processing on both the US and EU grids (previously it was done in the US only). Some software tools were successfully developed and tested during 2017, and they are now used for a fully distributed massive data processing. To fulfil this goal, we requested an additional 300 HS06 of CPU power, for a total of 1000 HS06, equivalent to the resources available on the US OSG.\\
The request for tape (1000 TB) in 2018 was made to fulfil the INFN requirement of keeping a copy of all the XENON1T data in Italy, as discussed within the INFN Astroparticle Committee. A dedicated automatic data transfer to tape has been developed by CNAF.
The computing model described in this report allowed for a fast and effective processing and analysis of the first XENON1T data in 2017, and of the final ones in 2018, which led to the best limits so far in the search for WIMPs \cite{sr0, sr1}.
\section{XENONnT}
The planning and initial implementation of the data and job management
for the next generation experiment, XENONnT, has already
begun. The experiment is currently under construction at LNGS, and it is scheduled to start taking data by the end of 2019. The current plan is to increase the TPC volume by a factor of 3,
to have 6 t of active liquid xenon. The new experimental setup will
also have an additional veto layer called Neutron Veto.
The larger detector will require modifications to the current data
and job management. The processing chain and its products will
undergo significant changes. The larger data volume
and improved knowledge about data access patterns have informed
changes to the data organization. Rather than storing the full raw
dataset for later re-processing, the data coming from the detector
will be filtered to only include interesting events. The full raw
dataset will only be stored on tape at one or two sites, where one
of these sites is for long-term archival. The filtered raw dataset will
be stored at OSG/EGI sites for later reprocessing. The overall data
volume of the reduced dataset will be similar to the current data
volume of XENON1T.
\section{References}
\begin{thebibliography}{9}
\bibitem{225} Aprile E. et al (XENON Collaboration), {\it Dark Matter Results from 225 Live Days of XENON100 Data}, Phys. Rev. Lett. {\bf 109} (2012), 181301
\bibitem{mc} Aprile E. et al (XENON Collaboration), {\it Physics reach of the XENON1T dark matter experiment}, JCAP {\bf 04} (2016), 027
\bibitem{instr-1T} Aprile E. et al (XENON Collaboration), {\it The XENON1T Dark Matter Experiment}, Eur. Phys. J. C77 {\bf 12} (2017), 881
\bibitem{osg} Ruth Pordes et al., {\it The open science grid}, Journal of Physics: Conference Series 78, 1 (2007), 012057.
\bibitem{egi} D. Kranzlmüller et al., {\it The European Grid Initiative (EGI)}, Remote Instrumentation and Virtual Laboratories. Springer US, Boston, MA, 61–66 (2010).
\bibitem{sr0} Aprile E. et al (XENON Collaboration), {\it First Dark Matter Search Results from the XENON1T Experiment }, Phys. Rev. Lett. {\bf 119} (2017), 181301
\bibitem{sr1} Aprile E. et al (XENON Collaboration), {\it Dark Matter Search Results from a One Ton-Year Exposure of XENON1T}, Phys. Rev. Lett. {\bf 121} (2018), 111302
\end{thebibliography}
\end{document}
#!/bin/bash
# Convert every JPEG in the current directory to PDF, keeping the base name.
for file in *.jpg; do sudo convert "$file" "${file%.jpg}.pdf"; done