\section{Computing model and R\&D activity in Italy}
The ALICE computing model is still heavily based on Grid distributed computing; since the very beginning, the underlying principle has been that every physicist should have equal access to the data and computing resources~\cite{ALICE:2005aa}. Following this principle, the ALICE peculiarity has always been to operate its Grid as a “cloud” of computing resources (both CPU and storage) with no specific role assigned to any given center, the only difference between centers being the Tier level to which they belong. All resources have to be made available to all ALICE members, according only to experiment policy and not to the physical location of the resources, and data is distributed according to network topology and availability of resources rather than in pre-defined datasets. The only peculiarities of Tier1s are their size and the availability of tape custodial storage, which holds a collective second copy of the raw data and allows the collaboration to run event reconstruction tasks there. In the ALICE model, though, tape recall is almost never performed: all useful data reside on disk, and the custodial tape copy is used only for safekeeping. All data access is done through the xrootd protocol, either through “native” xrootd storage or, as in many large deployments, through xrootd servers in front of a distributed parallel filesystem such as GPFS.\\
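As a purely illustrative sketch of this access pattern, the short Python snippet below reads a file through the official XRootD Python bindings; the redirector host, port and file path are placeholders rather than real ALICE endpoints, and in production such access is brokered by the AliEn/JAliEn catalogue and typically requires the corresponding authorization tokens.
\begin{verbatim}
# Minimal sketch: remote read over the xrootd protocol with the
# XRootD Python bindings. Host and path below are placeholders.
from XRootD import client
from XRootD.client.flags import OpenFlags

url = "root://some-alice-se.example.org:1094//alice/data/some_file.root"

f = client.File()
status, _ = f.open(url, OpenFlags.READ)
if not status.ok:
    raise RuntimeError("open failed: " + status.message)

# Read the first kilobyte as a simple smoke test.
status, data = f.read(offset=0, size=1024)
if status.ok:
    print("read", len(data), "bytes")

f.close()
\end{verbatim}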
The model has not changed significantly for Run2, except for the scavenging of some extra computing power by opportunistically using the HLT farm when it is not needed for data taking. All raw data collected in 2017 has been passed through the calibration stages, including the newly developed track distortion calibration for the TPC, and has been validated by the offline QA process before entering the final reconstruction phase. The ALICE software build system has been extended with additional functionality to validate the AliRoot release candidates with a large set of raw data from different years as well as with various MC generators and configurations. It uses the CERN elastic cloud infrastructure, thus allowing for dynamic provisioning of resources as needed. The Grid utilization in the accounting period remained high, with no major incidents. The CPU/Wall efficiency remained constant at about 85\% across all Tiers, similar to the previous year. The much higher data rate foreseen for Run3, though, will require a major rethinking of the current computing model in all its components, from the software framework and algorithms to the distributed infrastructure. The design of the new computing framework for Run3, started in 2013 and mainly based on the concept of Online-Offline integration (“\OO\ Project”), has been finalized with the corresponding Technical Design Report~\cite{Buncic:2015ari}: development and implementation, as well as performance tests, are currently ongoing.\\
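For reference, the CPU/Wall efficiency quoted above is understood here in the usual sense of the ratio between the CPU time consumed by the jobs and their elapsed wall-clock time (weighted by the number of allocated cores for multi-core slots):
\[
\varepsilon_{\mathrm{CPU/Wall}} \;=\; \frac{\sum_{\mathrm{jobs}} T_{\mathrm{CPU}}}{\sum_{\mathrm{jobs}} T_{\mathrm{wall}}\, n_{\mathrm{cores}}} \;\simeq\; 0.85 .
\]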
The Italian share of the ALICE distributed computing effort (currently about 17\%) includes resources both from the Tier1 at CNAF and from the Tier2s in Bari, Catania, Torino and Padova-LNL, plus some extra resources in Trieste. The contribution of the Italian community to ALICE computing in 2018 has mainly covered the usual items, such as the development and maintenance of the (AliRoot) software framework, the management of the computing infrastructure (Tier1 and Tier2 sites) and the participation in the Grid operations of the experiment.\\
\section{Role and contribution of the INFN Tier1 at CNAF}
CNAF is a full-fledged ALICE Tier1 center, having been one of the first to join the production infrastructure years ago. According to the ALICE cloud-like computing model, it has no special assigned task or reference community, but provides computing and storage resources to the whole collaboration, along with offering valuable support staff for the experiment’s computing activities. It provides reliable xrootd access both to its disk storage and to the tape infrastructure, through a TSM plugin that was developed by CNAF staff specifically for ALICE use.\\
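To illustrate the disk-versus-tape distinction mentioned above, an xrootd stat call can report whether a given replica is online on disk or only resident on the tape backend; in the sketch below the endpoint and path are again placeholders.
\begin{verbatim}
# Minimal sketch: checking the online/offline status of a replica via
# an xrootd stat call. Host and path below are placeholders.
from XRootD import client
from XRootD.client.flags import StatInfoFlags

fs = client.FileSystem("root://some-tape-se.example.org:1094")
status, info = fs.stat("/alice/raw/some_raw_chunk.root")

if status.ok:
    tape_only = bool(info.flags & StatInfoFlags.OFFLINE)
    print("size:", info.size, "bytes; offline (tape only):", tape_only)
else:
    print("stat failed:", status.message)
\end{verbatim}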
As a result of a flood, the CNAF computing center stopped operation on November 8th, 2017; tape access was made available again on January 31st, 2018, and the ALICE Storage Element was fully recovered by February 23rd. The loss of CPU resources during the Tier1 shutdown was partially mitigated by the reallocation of the Tier1 worker nodes located in Bari to the Tier2 Bari queue. At the end of February 2018 the CNAF local farm was powered on again, ramping up gradually from 50 kHS06 to 140 kHS06. In addition, on March 15th, 170 kHS06 at CINECA became available thanks to a 500 Gb/s dedicated link.
Since March, running at CNAF has been remarkably stable: for example, both the disk and tape storage availabilities have been better than 98\%, ranking CNAF among the top 5 most reliable sites for ALICE. The computing resources provided for ALICE at the CNAF Tier1 center were fully used throughout the year, matching and often exceeding the pledged amounts thanks to access to resources left unused by other collaborations. Overall, about 64\% of the ALICE computing activity was Monte Carlo simulation, 14\% raw data processing (which takes place at the Tier0 and Tier1 centers only) and 22\% analysis activities: Fig.~\ref{fig:runjobsusers} illustrates the share among the different activities in the ALICE running job profile over the last 12 months.\\
\begin{figure}[!ht]
\begin{center}
\includegraphics[width=0.75\textwidth]{running_jobs_per_users_2018}
\end{center}
\caption{Running job profile in ALICE during the last 12 months, showing the share among the different activities.}
\label{fig:runjobsusers}
\end{figure}
In terms of Wall Time, the INFN Tier1 has provided about 4.9\% since March 2018 and about 4.20\% over the whole year, as shown in Fig.~\ref{fig:walltimesharet1}.\\
\begin{center}
\includegraphics[width=0.75\textwidth]{wall_time_tier1_2018}
\end{center}
\caption{Ranking of CNAF among ALICE Tier1 centers in 2018.}
\label{fig:walltimesharet1}
\end{figure}
This amounts to about 44\% of the total Wall Time of the INFN contribution: CNAF successfully completed nearly 10.5 million jobs, for a total of more than 44 million CPU hours. The running job profile at CNAF in 2018 is shown in Fig.~\ref{fig:rjobsCNAFunov}.\\
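As a rough cross-check of these figures, the average CPU time per completed job at CNAF was of the order of
\[
\frac{44\times 10^{6}\ \mathrm{CPU\ hours}}{10.5\times 10^{6}\ \mathrm{jobs}} \;\approx\; 4.2\ \mathrm{CPU\ hours\ per\ job}.
\]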