\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\begin{document}
\title{INFN CNAF log analysis: a first experience with summer students}
\author{D. Bonacorsi$^1$, A. Ceccanti$^2$, T. Diotalevi$^1$, A. Falabella$^2$, L. Giommi$^2$, B. Martelli$^2$, D. Michelotto$^2$, L. Morganti$^2$, E. Ronchieri$^2$, S. Rossi Tisbeni$^1$, E. Vianello$^2$}
\address{$^1$ University of Bologna, Bologna, IT}
\address{$^2$ INFN-CNAF, Bologna, IT}
\ead{barbara.martelli@cnaf.infn.it}
\begin{abstract}
In 2018 the INFN CNAF computing center started to investigate predictive and preventive maintenance solutions, with the aim of improving fault diagnosis by applying machine learning techniques to hardware and service logs. Three students dedicated three summer months to collecting logs of the StoRM services and of the resources that host them, preprocessing these logs in order to remove biasing information, and performing an initial data analysis. We present the activities carried out by these students, the initial outcome and the ongoing work at the INFN CNAF data center.
\end{abstract}
\section{Introduction}
In recent years INFN CNAF has put great effort into defining and implementing a common monitoring infrastructure based on Sensu, InfluxDB and Grafana, and into centralizing logs from the most relevant services \cite{bovina2015, bovina2017}. This unified infrastructure is now fully integrated in the data center \cite{fattibene2018}, and the next challenge, and opportunity, is to correlate this vast volume of data and extract actionable insights.
During summer 2018 a first investigation was carried out with the help of three summer students \cite{seminario}. Once a specific system to analyze had been identified, i.e. StoRM, the following activities were addressed:
\begin{itemize}
\item Log collection and harmonization
\item Log parsing of various services, such as StoRMfrontend, StoRMbackend, heartbeat, messages, GridFTP and GPFS (not covered in our study, but potentially interesting)
\item Addition of metrics data (from the Tier 1 InfluxDB)
\end{itemize}
However, to provide a first proof of concept for predictive and preventive maintenance, data categorization and the application of machine learning techniques are two key steps, which were carried out between the end of 2018 and mid 2019.
\section{Log collection and harmonization}
The first part of the work consisted of collecting StoRM logs from the StoRM servers dedicated to the ATLAS experiment.
Subsequently, the most relevant information was extracted from the logs using the ELK Stack suite \cite{elk}. The ELK stack consists of four components: Beats, used for data collection from multiple sources; Logstash, used for data aggregation and processing; Elasticsearch, used to store and index data; and Kibana, used for data analysis and visualization. In particular, Logstash was used to ingest data from Beats as a continuous live stream, filter relevant entries, parse each event by identifying named fields to build a user-defined structure, and ship the parsed data to the Elasticsearch engine. Most data was filtered using a \textit{grok} filter, which is based on regular expressions and provides predefined patterns together with the ability to define customized ones.
Finally, several dashboards were created with Kibana in order to show, in a human-friendly way, a summary of the most relevant information derived from the StoRM logs (see for example Figure~\ref{fig3}).
\begin{figure}[h]
\includegraphics[width=20pc]{kibana.png}\hspace{2pc}
\begin{minipage}[b]{14pc}\caption{\label{fig3}An example of a Kibana dashboard created from StoRM logs.}
\end{minipage}
\end{figure}
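The grok-style parsing described in this section can be illustrated with a minimal Python sketch, in which a regular expression with named groups turns a log line into named fields. The line layout, the field names and the pattern are hypothetical: they reproduce neither the actual StoRM log format nor the Logstash configuration used at CNAF.

```python
import re

# Hypothetical line layout: "<date> <time> <LEVEL> <request>: <result>".
# Each named group plays the role of a named field in a grok pattern.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<request>[^:]+):\s+"
    r"(?P<result>.+)"
)


def parse_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Invented example line, for illustration only.
example = "2018-07-15 10:32:07 INFO PTG status: SRM_SUCCESS"
fields = parse_line(example)
```

Lines that do not match the pattern yield None, so they can be counted or inspected separately rather than silently dropped.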
\section{Log parsing}
The INFN Tier 1 services hosted at the INFN CNAF computing center include storage systems such as StoRM, a grid Storage Resource Manager (SRM) solution. Figure \ref{fig1} shows the StoRM architecture: the frontend service manages user authentication and stores request data, while the backend service executes SRM functionalities and takes care of space and authorization.
The log files contain essentially three types of information: timestamps, metrics, and messages.
\begin{figure}[h]
\includegraphics[width=20pc]{StoRM-full-picture.png}\hspace{2pc}%
\begin{minipage}[b]{14pc}\caption{\label{fig1}The StoRM architecture.}
\end{minipage}
\end{figure}
At the beginning of this work (mid 2018), StoRM at Tier 1 was monitored with InfluxDB and Grafana. The monitored metrics included CPU, RAM, network and disk usage; the number of sync SRM requests per minute per host; and the average duration of async PTG and PTP requests per host. We wanted to add information derived from the analysis of StoRM logs to the monitoring information already available, in order to derive new insights potentially useful to enhance service availability and efficiency, with the long-term intent of implementing a global predictive maintenance solution for Tier 1. In order to build a Machine Learning model for anomaly prediction, logs from two different periods were analyzed: a normal behavior period and a critical behavior period (due to a wrong configuration of the file system and a wrong configuration of the queues coming from the farm).
A four-step activity was carried out:
\begin{enumerate}
\item Parsing: log files were parsed and deconstructed, converting them to CSV format
\item Feature selection: messages were grouped according to their common content (the core part of the message). The grouping phase resulted in 20 \textit{Request Types} (Connection, Run, Ping, Ls, Check permission, PTG, PTG status, Get space tokens, PTP, PTP status, BOL status, Put done, Release files, Mv, Mkdir, BOL, Abort request, Abort files, Get space metadata, nan) and 15 \textit{Result Types} (SRM\_SUCCESS, SRM\_FAILURE, SRM\_NOT\_SUPPORTED, SRM\_REQUEST\_QUEUED, SRM\_REQUEST\_INPROGRESS, Protocol check failed, Received 4 protocols, Some protocols supported, SRM\_DUPLICATION\_ERROR, rpcResponseHandler\_AbortFiles, SRM\_INVALID\_REQUEST, SRM\_INVALID\_PATH, Received 5 protocols, SRM\_INTERNAL\_ERROR, nan). A first data exploration phase was performed by counting the occurrences of messages in each group.
The techniques used for the feature selection procedure were: SelectKBest with the chi-squared statistical test, Recursive Feature Elimination, Principal Component Analysis (PCA) and Feature Importance from ensembles of decision trees.
\item One-hot encoding: CSV rows were encoded as binary vectors (feature vectors). Each vector summarizes 15 minutes of log content.
\item Labelling: an operation specific to StoRM log files, done manually by discriminating between normal and critical periods on the basis of help-desk tickets.
\end{enumerate}
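The encoding step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the category list is a hypothetical subset of the Request and Result Types, timestamps are assumed to be integer UNIX epoch seconds, and a vector entry is assumed to record whether a category occurred at least once in the 15-minute window.

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # each feature vector summarizes 15 minutes of logs

# Hypothetical subset of the Request/Result Types, for illustration only.
CATEGORIES = ["PTG", "PTP", "Ls", "SRM_SUCCESS", "SRM_FAILURE"]


def encode_windows(rows):
    """Map (epoch_seconds, category) rows to one binary vector per window.

    The result is a dict {window_start_epoch: [0/1 per entry of CATEGORIES]},
    ordered by window start.
    """
    seen = defaultdict(set)
    for epoch, category in rows:
        window_start = epoch - epoch % WINDOW_SECONDS
        seen[window_start].add(category)
    return {
        start: [1 if name in present else 0 for name in CATEGORIES]
        for start, present in sorted(seen.items())
    }


# Invented rows: two fall in the first window, three in the second.
rows = [
    (0, "PTG"),
    (60, "SRM_SUCCESS"),
    (900, "Ls"),
    (1000, "SRM_FAILURE"),
    (1700, "Ls"),
]
vectors = encode_windows(rows)
```

Replacing the presence flag with a per-window count would be a small change, if occurrence counts rather than binary indicators were wanted.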
The feature vectors obtained in step (iii) and the labeled datasets built in step (iv) were used to train several ML algorithms and to test their accuracy. Figure \ref{fig2} depicts the results of the tests performed on the following algorithms: LogisticRegression (LR), LinearDiscriminantAnalysis (LDA), KNeighborsClassifier (KNN), GaussianNB (GNB), DecisionTreeClassifier (CART), BaggingClassifier (BgDT), RandomForestClassifier (RF), ExtraTreesClassifier (ET), AdaBoostClassifier (AB), GradientBoostingClassifier (GB), XGBClassifier (XGB), MLPClassifier (MLP).
\begin{center}
\begin{figure}[h]
\includegraphics[width=20pc]{MLalgorithms.png}\hspace{2pc}
\begin{minipage}[b]{14pc}\caption{\label{fig2}Machine Learning Algorithms Comparison (scorer=accuracy).}
\end{minipage}
\end{figure}
\end{center}
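To make the accuracy comparison concrete, the following sketch estimates the k-fold accuracy of a one-nearest-neighbour classifier on binary feature vectors. It is only an illustration: it uses the Python standard library rather than the libraries employed in the study, the fold assignment and the Hamming distance are choices made here, and the vectors and labels are invented toy data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(train_x, train_y, sample):
    """Label of the training vector closest to `sample` (first one on ties)."""
    nearest = min(range(len(train_x)), key=lambda i: hamming(train_x[i], sample))
    return train_y[nearest]


def kfold_accuracy(vectors, labels, k):
    """Mean accuracy over k folds; fold f holds samples f, f+k, f+2k, ..."""
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(len(vectors)) if i % k == fold]
        train_idx = [i for i in range(len(vectors)) if i % k != fold]
        train_x = [vectors[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        hits = sum(
            predict_1nn(train_x, train_y, vectors[i]) == labels[i]
            for i in test_idx
        )
        scores.append(hits / len(test_idx))
    return sum(scores) / k


# Toy data: invented vectors for "normal" and "critical" 15-minute windows.
vectors = [
    [1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 1, 1],
    [0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 0],
]
labels = ["normal", "critical", "normal", "critical", "normal", "critical"]
accuracy = kfold_accuracy(vectors, labels, k=3)
```

On real data the folds would normally be shuffled or stratified, and the same scoring loop would be repeated for each candidate algorithm.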
\section{Addition of metrics data}
This activity was mainly focused on collecting metric data from InfluxDB in order to correlate them with the StoRM logs obtained through the activities described in the previous sections and to extract new insights.
Key components of the log files were identified, parsed and structured in a CSV file with the following columns: timestamp, metric, message, descriptive keys and separators. All timestamps were converted to UNIX epoch time in order to be comparable. On one side, InfluxDB stores information with a granularity that depends on the age of the collected data; on the other side, StoRM front-end and back-end logs are produced with different frequencies (one line each minute for heartbeat logs, multiple lines every minute for metrics logs, one line every five minutes for the more recent InfluxDB data, and so on). Therefore, some concatenation rules were implemented in order to relate all data sources correctly on the basis of the time of occurrence of the event: backend metrics are split by type, timestamps are rounded to one-minute precision, in case of overlap the more recent entry is kept, and all CSV files are concatenated and ordered by timestamp.
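The concatenation rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure actually implemented: records are assumed to be tuples of integer epoch seconds, metric name and value; rounding is to the nearest minute; and an overlap is taken to mean two records of the same metric falling in the same rounded minute. The metric names and values are invented.

```python
def round_to_minute(epoch):
    """Round an integer UNIX epoch (seconds) to the nearest minute."""
    return ((epoch + 30) // 60) * 60


def merge_sources(*sources):
    """Concatenate lists of (epoch_seconds, metric, value) records.

    Records are keyed by (rounded timestamp, metric). When two records
    collide, the one with the later original timestamp is kept. The
    result is ordered by timestamp, then by metric name.
    """
    latest = {}
    for source in sources:
        for epoch, metric, value in source:
            key = (round_to_minute(epoch), metric)
            if key not in latest or epoch >= latest[key][0]:
                latest[key] = (epoch, value)
    return sorted(
        (minute, metric, value)
        for (minute, metric), (_, value) in latest.items()
    )


# Invented records: the two metric samples round to the same minute.
heartbeat = [(125, "heartbeat", 1), (185, "heartbeat", 1)]
backend = [(100, "ptg_avg", 0.5), (130, "ptg_avg", 0.7)]
merged = merge_sources(heartbeat, backend)
```

Keying on the metric name as well as on the rounded timestamp keeps different metrics that share a minute as separate rows, which corresponds to splitting the backend metrics by type.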
\section{Conclusion}
This experience is a good example of a mutually beneficial collaboration between university students and INFN CNAF. The outcome has allowed the master students (i.e. Diotalevi T. and Giommi L.) to publish papers at international conferences \cite{diotalevi, giommi20191}, to win the Giulia Vita Finzi award \cite{giommi20192}, and to start their PhD courses with success. Furthermore, the undergraduate student (i.e. Rossi Tisbeni S.) will obtain a master's degree in Physics in July 2019. On the other hand, the INFN CNAF data center managers have decided to continue exploring predictive and preventive maintenance, in order to establish where and when to use it to keep services running optimally.
\section*{References}
\begin{thebibliography}{9}
\bibitem{seminario} Martelli B, Giommi L, Rossi Tisbeni S, Diotalevi T, https://agenda.infn.it/event/17430/, 2018.
\bibitem{bovina2015} Bovina S, Michelotto D, Misurelli G, \emph{CNAF Annual Report}, pp. 111--114, 2015.
\bibitem{bovina2017} Bovina S, Michelotto D, In Proc of CHEP 2017.
\bibitem{fattibene2018} Fattibene E, Dal Pra S, Falabella A, De Cristofaro T, Cincinelli G, Ruini M, In Proc of CHEP 2018.
\bibitem{diotalevi} Diotalevi T, Bonacorsi D, Michelotto D, Falabella A, In Proc of International Symposium on Grids \& Clouds (ISGC), Taipei, Taiwan, 2019 (under review).
\bibitem{giommi20191} Giommi L, Bonacorsi D, Diotalevi T, Rossi Tisbeni S, Rinaldi L, Morganti L, Falabella A, Ronchieri E, Ceccanti A, Martelli B, In Proc of International Symposium on Grids \& Clouds (ISGC), Taipei, Taiwan, 2019 (under review).
\bibitem{giommi20192} Giommi L, In INFN CCR Workshop, La Biodola, 3-7 June 2019.
\bibitem{elk} https://www.elastic.co/, site visited in June 2019.
\end{thebibliography}
\end{document}
\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\usepackage{hyperref}
\begin{document}
\title{The INFN Information System}
\author{
S. Bovina$^1$,
M. Canaparo$^1$,
E. Capannini$^1$,
F. Capannini$^1$,
C. Galli$^1$,
G. Guizzunti$^1$,
B. Demin$^1$
}
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{
stefano.bovina@cnaf.infn.it,
marco.canaparo@cnaf.infn.it,
enrico.capannini@cnaf.infn.it,
fabio.capannini@cnaf.infn.it,
claudio.galli@cnaf.infn.it,
guido.guizzunti@cnaf.infn.it,
barbara.demin@cnaf.infn.it
}
\begin{abstract}
The mission of the Information System Service is the implementation, management and optimization of all the infrastructural and application components of the administrative services of the Institute. In order to guarantee high reliability and redundancy, the same systems are replicated in an analogous infrastructure at the National Laboratories of Frascati (LNF).
The Information System's team manages all the administrative services of the Institute,
both from the hardware and the software point of view, and it is in charge of carrying out several software projects.
The core of the Information System is made up of the salary and HR systems.
Connected to the core, there are several other systems reachable from a single web portal:
firstly, the organizational chart system (GODiVA); secondly, the accounting, the time and attendance,
the trip and purchase order and the business intelligence systems.
Finally, there are other systems which manage the training of the employees, their subsidies, their timesheet, the official documents,
the computer protocol, the recruitment, the user support etc.
\end{abstract}
\section{Introduction}
The INFN Information System project was set up in 2001 with the purpose of digitizing and managing all the administrative and accounting processes of the INFN Institute,
and of carrying out a gradual dematerialization of documents.\\
In 2010, INFN decided to transfer the accounting system, based on the Oracle Business Suite (EBS) and the SUN Solaris operating system,
from the National Laboratories of Frascati (LNF) to CNAF, where the SUN Solaris platform was migrated to a RedHat Linux Cluster and implemented on commodity hardware.\\
The ``Information System'' Service was officially established at CNAF in 2013 with the aim of developing, maintaining and coordinating many IT services which are critical
for INFN. Together with the corresponding office at the National Laboratories of Frascati, it is actively involved in fields related to INFN management and administration, developing tools for business intelligence and research quality assurance; it is also involved in the dematerialization process and in the provisioning of interfaces between users and INFN administration.\\
Over the years, other services have been added, leading to a complex infrastructure that covers all aspects of working life at INFN.
In 2018, the Information System service team at CNAF was composed of 8 people, both developers and system engineers.\\
\section{Infrastructure}
In 2018, the infrastructure-related activity was composed of various tasks that can be summarized as follows:
firstly, the consolidation of the Disaster Recovery site in Bari and the restore of CNAF as primary site;
secondly, the finalization of Puppet 3 phase out and related Foreman upgrades;
thirdly, the improvement of our ELK (Elasticsearch/Logstash/Kibana) and monitoring infrastructure and finally, several ``Misure Minime'' AGID and GDPR compliance adjustments.
\newline
After the complete overhaul and upgrade of the ELK stack to version 5 last year,
much work has been done to enhance system and application monitoring with this set of tools.
To improve the discovery and resolution of problems, several views and dashboards (see Figure~\ref{fig:presenze_kibana}) have been created on Kibana,
and application logs have been analyzed in depth and customized to include useful information.
\begin{figure}[htbp]
\begin{center}
\includegraphics[scale=0.5]{presenze_kibana.png}
\end{center}
\caption{\label{fig:presenze_kibana} Time and attendance system manual squaring statistics on Kibana (ELK).}
\end{figure}
In order to improve the management and monitoring of our cronjobs, avoid cronjob overlap and identify ``dead-man-switches'',
a new cronjob management tool has been adopted.
Cronjob executions are available both on Kibana and on Grafana (as annotations),
so they can be correlated with system events (see Figure~\ref{fig:cronjob_annotation}). In the same way, software releases are also displayed on Grafana.
\begin{figure}[htbp]
\begin{center}
\includegraphics[scale=0.5]{cronjob_annotation.png}
\end{center}
\caption{\label{fig:cronjob_annotation} Annotations for cronjobs on Grafana.}
\end{figure}
\newpage
Because of the recent regulations that came into force (``Misure Minime'' AGID and GDPR), many audits and related adjustments were made, relying on both the official Center for Internet Security (CIS) guides and OpenSCAP scans with the Payment Card Industry Data Security Standard (PCI-DSS) profile.
Afterwards, we introduced a proactive security model on some pilot projects, adopting tools for static code analysis and dependency scanning (see Figure~\ref{fig:deps_scan}).
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=1.0\textwidth]{deps_scan.png}
\end{center}
\caption{\label{fig:deps_scan} Dependencies scan tool in action on Gitlab-CI.}
\end{figure}
In addition to this, the Platform as a Service (PaaS) infrastructure based on RedHat Openshift Origin (3.x) was upgraded to release 3.11
and a signature/scan service was deployed at the container registry level for all container-based projects (see Figure~\ref{fig:container_ci}).
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=1.0\textwidth]{container_ci.png}
\end{center}
\caption{\label{fig:container_ci} Container registry details and related Gitlab-CI pipeline.}
\end{figure}
\newpage
In 2018, the activities on the Oracle databases concerned their maintenance,
an initial analysis of the steps needed to upgrade to later versions, and a study of how to achieve real-time replication
between the Oracle databases of the Accounting application. Periodic recovery tests were also conducted on the Bari Disaster Recovery site.
\section{Time and attendance system improvements}
The time and attendance system allows employees to clock in and out electronically via swipe card.
The data is instantly transferred into a database and shown in a web-based application.
This system tracks the working hours and offers employees self-service that allows them to handle many time-tracking tasks on their own,
all subjected to customizable approval workflows, which include reviewing the hours they have worked, the current and future schedule and requests of paid or unpaid leaves.
In 2018, the Time and Attendance system related activities concerned both the introduction of new features and the modifications of the existing ones. Furthermore, developers focused on the performance improvement of the system through the optimization of some common procedures.
The Time and Attendance system was enabled to ``read'' codes introduced together with the clock in/out: through this mechanism, employees can specify the reasons for their leave of absence without using the web-based application.
Some modifications were made to implement changes that occurred in the national collective agreement. This activity included two new leaves of absence and an extension from three to four months of the period over which the average weekly working hours are checked.
As concerns performance, the development team optimized the procedure that manages clock in/out via the web portal, and the report that shows paid overtime aggregated by sector, employee and month.
\section{Oracle EBS improvements}
In 2018, a new Electronic Payments and Receipts (EPR) Framework was introduced,
in compliance with the standard set by the Agency for Digital Italy (Agenzia per l'Italia Digitale, AgID) and transmitted through SIOPE+.
SIOPE+ is the new infrastructure that enables general government entities and banks that provide treasury services
to exchange information, with the aim of improving the quality of the data used for monitoring government expenditure and tracking the payment times to firms that supply general government entities.
SIOPE+ responds to the following needs:
\begin{itemize}
\item availability of detailed information on payments made by general government bodies without burdening the entities involved in the flow of outlays and collections. This will make it easier to obtain information on the payments of trade receivables and, more broadly, to monitor public sector financial flows in real time.
\item standardization of information exchange between government bodies and treasury service providers by adopting a single digital standard OPI (Ordinativo di Pagamento e Incasso) in place of the previous local standard OIL (Ordinativo Informatico Locale), with the aim of raising the quality of treasury services, facilitating further integration between the accounting systems of the entities and between payment processes, and supporting the development of electronic payments services.
\end{itemize}
\section{Business Intelligence improvements}
In 2018, the main task was investigating alternative technical solutions to the current Business Intelligence installation,
with the aim of reducing licensing costs, while remaining on an open source solution and preserving functionalities and compatibility with other INFN tools and platforms.
At the end of this activity, the current solution, based on the TIBCO platform, was confirmed as the best one.
%At present, we are converting reports that are using deprecated features. Once all reports are converted, the Business Intelligence infrastructure will be upgraded to the last version.
\section{Contratti}
Contratti (previously named Repertorio Contratti) is a new Java application (in test phase) for long term preservation of contracts made between INFN and an external supplier, based on Alfresco and mDM protocol.
Each contract is enriched with a full set of metadata which describe the contract in its relevant parts, and suppliers are extracted automatically from the central supplier registry, together with details of the contract signer.
Last year, several bug fixes and improvements were made in order to meet our customers' requirements. The improvements can be summarized as follows:
\begin{enumerate}
\item integration with mDM protocol:
\begin{itemize}
\item it is now possible to manage a set of folders in which to store the contract file, as in a complete folder explorer;
\item before the contract file is stored in mDM, a protocol signature is written onto the document, without invalidating PAdES (PDF Advanced Electronic Signatures) signature of the issuer.
\end{itemize}
\item complete refactoring of the ACLs mechanism used to manage document and app permissions;
\item added email notification in order to send a contract link to a set of recipients, extracted automatically from GODiVA;
\item it is now possible to print a label containing the relevant characteristics of the contract;
\item complete UI restyling in order to improve both readability and usability of the product.
\end{enumerate}
\end{document}
\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\usepackage{url}
\usepackage{color, colortbl}
\definecolor{LightCyan}{rgb}{0.88,1,1}
\definecolor{LightYellow}{rgb}{1,1,0.88}
\definecolor{Red}{rgb}{1,0,0}
\definecolor{Green}{rgb}{0,1,0}
\definecolor{MediumSpringGreen}{rgb}{0,0.98,0.6} %rgb(0,250,154)
\definecolor{Gold}{rgb}{1,0.84,0}%rgb(255,215,0)
\definecolor{Gainsboro}{rgb}{0.86,0.86,0.86}%rgb(220,220,220)
\begin{document}
\title{The INFN Tier 1}
\author{Luca dell'Agnello$^1$}
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{luca.dellagnello@cnaf.infn.it}
\section{Introduction}
CNAF hosts the Italian Tier 1 data center for WLCG: over the years, Tier 1 has become the main computing facility for INFN.
Nowadays, besides the four LHC experiments, the INFN Tier 1 provides services and resources to 30 other scientific collaborations,
including BELLE2 and several astro-particle experiments (see Table \ref{T1-pledge}).
As shown in Fig.~\ref{pledge2018}, besides LHC, the main users are the astro-particle experiments.
\begin{figure}[h]
\begin{center}
\includegraphics[keepaspectratio,width=15cm]{pledge.png}
\caption{\label{pledge2018}Relative requests of resources at INFN Tier 1}
\end{center}
\end{figure}
Despite the flooding that occurred at the end of 2017, we were able to provide the resources committed to the experiments for 2018 almost on time.
\begin{table}
\begin{center}
\begin{tabular}{l|rrr}
\br
\textbf{Experiment}&\textbf{CPU (HS06)}&\textbf{Disk (TB-N)}&\textbf{Tape (TB)}\\
\hline
\rowcolor{MediumSpringGreen}
ALICE&52020&5185&13497\\
\rowcolor{MediumSpringGreen}
ATLAS&85410&6480&17550\\
\rowcolor{MediumSpringGreen}
CMS&72000&7200&24440\\
\rowcolor{MediumSpringGreen}
LHCb&46805&5606&11400\\
\rowcolor{MediumSpringGreen}
\hline
\textbf{LHC Total}&\textbf{256235}&\textbf{24471}&\textbf{66887}\\
\hline
\rowcolor{LightYellow}
Belle2&13000&350&0\\
\rowcolor{LightYellow}
CDF&0&0&4000\\
\rowcolor{LightYellow}
Compass&40&10&40\\
\rowcolor{LightYellow}
KLOE&0&33&3075\\
\rowcolor{LightYellow}
LHCf&6000&90&0\\
\rowcolor{LightYellow}
NA62&3000&250&200\\
\rowcolor{LightYellow}
PADME&1500&10&500\\
\rowcolor{LightYellow}
LHCb Tier2&26085&0&0\\
\rowcolor{LightYellow}
\hline
\rowcolor{LightYellow}
\textbf{CSN 1 Total}&\textbf{49625}&\textbf{743}&\textbf{7815}\\
\hline
\rowcolor{LightCyan}
AMS&15800&1990&510\\
\rowcolor{LightCyan}
ARGO&0&120&1000\\
\rowcolor{LightCyan}
Auger&2000&615&0\\
\rowcolor{LightCyan}
BOREX&2000&185&41\\
\rowcolor{LightCyan}
CTA&4000&796&120\\
\rowcolor{LightCyan}
CUORE&1900&262&0\\
\rowcolor{LightCyan}
Cupid&100&15&10\\
\rowcolor{LightCyan}
DAMPE&8000&200&100\\
\rowcolor{LightCyan}
DARKSIDE&2000&980&300\\
\rowcolor{LightCyan}
ENUBET&500&10&0\\
\rowcolor{LightCyan}
EUCLID&1000&1042&0\\
\rowcolor{LightCyan}
Fermi&500&15&40\\
\rowcolor{LightCyan}
Gerda&40&45&40\\
\rowcolor{LightCyan}
Icarus&4000&500&1500\\
\rowcolor{LightCyan}
JUNO&3000&230&0\\
\rowcolor{LightCyan}
KM3&300&250&200\\
\rowcolor{LightCyan}
LHAASO&300&60&0\\
\rowcolor{LightCyan}
LIMADOU&400&8&0\\
\rowcolor{LightCyan}
LSPE&1000&14&0\\
\rowcolor{LightCyan}
MAGIC&296&65&150\\
\rowcolor{LightCyan}
NEWS&200&60&60\\
\rowcolor{LightCyan}
Opera&200&15&15\\
\rowcolor{LightCyan}
PAMELA&650&100&150\\
\rowcolor{LightCyan}
Virgo&30000&656&1368\\
\rowcolor{LightCyan}
Xenon100&1000&200&1000\\
\rowcolor{LightCyan}
\hline
\rowcolor{LightCyan}
\textbf{CSN 2 Total}&\textbf{79186}&\textbf{8433}&\textbf{6604}\\
\hline
\rowcolor{Gainsboro}
FOOT&200&20&0\\
\rowcolor{Gainsboro}
Famu&2250&15&187\\
\rowcolor{Gainsboro}
GAMMA/AGATA&0&0&1160\\
\rowcolor{Gainsboro}
NEWCHIM/FARCOS&0&10&300\\
\rowcolor{Gainsboro}
\hline
\rowcolor{Gainsboro}
\textbf{CSN 3 Total}&\textbf{2450}&\textbf{45}&\textbf{1460}\\
\hline \hline
\rowcolor{Green}
\textbf{Grand Total}&\textbf{387496}&\textbf{33692}&\textbf{82766}\\
\rowcolor{Green}
\textbf{Installed}&\textbf{340000}&\textbf{34000}&\textbf{71000}\\
\br
\end{tabular}
\end{center}
\caption{Pledged and installed resources at INFN Tier 1 in 2018 (for the CPU power an overlap factor is applied). CSN 1, CSN 2 and CSN 3 are the National Scientific Committees of the INFN, respectively, for experiments in high energy physics with accelerators, astro-particle experiments and experiments in nuclear physics with accelerators.}
\label{T1-pledge}
\hfill
\end{table}
\subsection{Out of the mud}
The year 2018 began with the recovery procedures of the data center after the flooding of November 2017.
Despite the serious damage to the power plants (both power lines were compromised), we started the recovery procedures of both the infrastructure and the IT equipment immediately after the flooding. The first mandatory intervention was to restore at least one of the two power lines (with a leased UPS in the first period). This goal was achieved during December 2017.
In January, after the restart of the chillers, we could proceed to re-open all services, including part of the farm (at the beginning only $\sim$ 50 kHS06, 1/5 of the total power capacity, were online, while 13\% was lost) and, one by one, the storage systems.
The first experiments to resume operations at CNAF were ALICE, Virgo and Darkside:
the storage system used by Virgo and Darkside was easily recovered after the Christmas break, while ALICE is able to use computing resources relying on remote storage. During February and March, we were able to progressively re-open the services for all other experiments.
%(Fig.\ref{farm2018} shows the restart of the farm). Meanwhile, we had setup a new partition of the farm hosted at CINECA super-computing center premises (see Par.~\ref{CINECAext}).
The final damage inventory shows the loss of $\sim$ 30 kHS06,
1.4 PB of data and 60 tapes: on the other hand, it was possible to repair all the other systems recovering $\sim$ 20 PB of data;
with respect to the infrastructure, the second line was recovered (see \cite{FLOODCHEP} for details).
%\begin{figure}[h]
% \begin{center}
% \includegraphics[width=40pc]{t1-img/farm2018.png}\hspace{2pc}%
% \caption{\label{farm2018}Farm usage in 2018}
% \end{center}
%\end{figure}
\subsection{The long-term consequences of the flooding}
The data center was designed taking into account all possible accidents, e.g. fires, power outages... except very unlikely events
such as the breaking of one of the main water pipelines in Bologna, located in a road next to CNAF,
which is precisely what happened in November 2017.
In fact, it was believed that the only threat due to water could come from a very heavy rain and, indeed,
waterproof doors were installed some years ago, after a heavy rain.
The post-mortem analysis showed that the causes, besides the breaking of the pipe, are to be found in the unfavorable position (2 underground levels) and in the excessive permeability of the perimeter (while the anti-flood doors worked). Therefore, an intervention has been carried out to increase the waterproofing of the data center and, moreover, work is planned for summer 2019 to strengthen the perimeter of the building and build a second water collection tank.
Although the search for a new location for the data center had started before the flooding (the main driver being the limited expandability of the current site, which cannot cope with the requirements foreseen for the HL-LHC era, when we should scale up to 10 MW of power for IT), the flooding gave us a second strong reason to move.
An opportunity is given by the new ECMWF center which will be hosted in Bologna, in a new Technopole area, starting from 2019.
In the same area the INFN Tier 1 and the CINECA\footnote{CINECA is the Italian Supercomputing center, also located near Bologna ($\sim17$ km from CNAF). See \url{http://www.cineca.it/}} computing centers can be hosted too: funding for this has been guaranteed to INFN and CINECA by the Italian Government. The goal is to have the new data center for the INFN Tier 1 fully operational by the end of 2021.
\section{INFN Tier 1 extension at CINECA}\label{CINECAext}
Out of the 400 kHS06 CPU power (340 kHS06 pledged) of the CNAF farm, $\sim180$ kHS06 are provided by servers installed in the CINECA data center.
%Each server is equipped with a 10 Gbit uplink connection to the rack switch while each of them, in turn, is connected to the aggregation router with 4x40 Gbit links.
The logical network of the farm partition at CINECA is set as an extension of INFN Tier 1 LAN: a dedicated fiber couple interconnects the aggregation router at CINECA with the core switch at the INFN Tier 1 (see Farm and Network Chapters for more details). %Fig.~\ref{cineca-t1}).
%The transmission on the fiber is managed by a couple of Infinera DCI, allowing to have a logical channel up to 1.2 Tbps (currently it is configured to transmit up to 400 Gbps).
%\begin{figure}
% % \begin{minipage}[b]{0.45\textwidth}
% \begin{center}
% \includegraphics[width=30pc]{t1-img/cineca-t1.png}
% \caption{\label{cineca-t1}Schematic view of the CINECA - INFN Tier-1 interconnection}
% \end{center}
% % \end{minipage}
%\end{figure}
These nodes, in production since March 2018 for WLCG experiments, have been gradually opened to all other collaborations. %Due the low latency (the RTT is 0.48 ms vs. 0.28 ms measured on the CNAF LAN), there is no need of a disk cache on the CINECA side and the WNs directly access the storage located at CNAF; in fact, the
The efficiency of the jobs\footnote{The efficiency of a job is defined as the ratio between its CPU time and its wall-clock time.} is comparable to the one measured on the farm partition at CNAF.
Since this partition has been installed from the beginning with CentOS 7, legacy applications requiring a different flavour of Operating System can use it through the container technology Singularity~\cite{singularity}.
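
For reference, the job efficiency defined in the footnote above can be written explicitly. The symbols $\varepsilon$, $T_{\mathrm{CPU}}$ and $T_{\mathrm{wall}}$ are introduced here only for illustration, and the numerical values are hypothetical:
\begin{equation}
\varepsilon = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{wall}}}, \qquad \mbox{e.g.} \quad \varepsilon = \frac{9\,\mathrm{h}}{10\,\mathrm{h}} = 0.9 .
\end{equation}
For a single-core job, a value close to 1 indicates that the job spends little time waiting, for instance on access to remote storage.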
%Moreover, this partition has undergone several reconfigurations due to both the hardware and the type of workflow of the experiments. In April we had to upgrade the BIOS to overcome a bug which was preventing the full resource usage, limiting at $\sim$~78\% of the total what we were getting from the nodes. Moreover a reconfiguration of the local RAID configuration of disks is ongoing\footnote{The initial choice of using RAID-1 for local disks instead of RAID-0 has been proven to slow down the system even if safer from an operational point of view.} as well as tests to choose the best number of computing slots.
\section*{References}
\begin{thebibliography}{9}
\bibitem{FLOODCHEP} L. dell'Agnello, "Disaster recovery of the INFN Tier 1 data center: lesson learned" to be published in Proceedings of the 23rd International Conference on Computing in High Energy and Nuclear Physics - EPJ Web of Conferences
\bibitem{singularity} \url{http://singularity.lbl.gov}
\end{thebibliography}
\end{document}
File added
......@@ -2,34 +2,34 @@
\usepackage{graphicx}
\begin{document}
\title{User and Operational Support at CNAF}
\author{D. Cesini, E. Corni, F. Fornari, L. Morganti, C. Pellegrino, M. V. P. Soares, M. Tenti, L. Dell'Agnello}
\address{INFN-CNAF, Bologna, IT}
\author{D. Cesini$^1$, E. Corni$^1$, F. Fornari$^1$, L. Morganti$^1$, C. Pellegrino$^1$, M. V. P. Soares$^1$, M. Tenti$^1$, L. Dell'Agnello$^1$}
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{user-support@lists.cnaf.infn.it}
\begin{abstract}
Many different research groups, typically organized in Virtual Organizations (VOs),
exploit the Tier-1 Data center facilities for computing and/or data storage and management. Moreover, CNAF hosts two small HPC farms and a Cloud infrastructure. The User Support unit provides to the users of all CNAF facilities with a direct operational support, and promotes common technologies and best-practices to access the ICT resources in order to facilitate the usage of the center and maximize its efficiency.
exploit the Tier 1 Data center facilities for computing and/or data storage and management. Moreover, CNAF hosts two small HPC farms and a Cloud infrastructure. The User Support unit provides the users of all CNAF facilities with direct operational support, and promotes common technologies and best practices to access the ICT resources in order to facilitate the usage of the center and maximize its efficiency.
\end{abstract}
\section{Current status}
Born in April 2012, the User Support team in 2018 was composed of one coordinator and up to five fellows with post-doctoral education or equivalent work experience in scientific research or computing.
The main activities of the team include:
\begin{itemize}
\item providing a prompt feedback to VO-specific issues via ticketing systems or official mail channels;
\item forwarding to the appropriate Tier-1 units those requests which cannot be autonomously satisfied, and taking care of answers and fixes, e.g. via the tracker JIRA, until a solution is delivered to the experiments;
\item forwarding to the appropriate Tier 1 units those requests which cannot be autonomously satisfied, and taking care of answers and fixes, e.g. via the tracker JIRA, until a solution is delivered to the experiments;
\item supporting the experiments in the definition and debugging of computing models in distributed and Cloud environments;
\item helping the supported experiments by developing code, monitoring frameworks and writing guides and documentation for users (see e.g. https://www.cnaf.infn.it/en/users-faqs/);
\item solving issues on experiment software installation, access problems, new accounts creation and any other daily usage problems;
\item porting applications to new parallel architectures (e.g. GPUs and HPC farms);
\item providing the Tier-1 Run Coordinator, who represents CNAF at the Daily WLCG calls, and reports about resource usage and problems at the monthly meeting of the Tier-1 management body (Comitato di Gestione del Tier-1).
\item providing the Tier 1 Run Coordinator, who represents CNAF at the Daily WLCG calls, and reports about resource usage and problems at the monthly meeting of the Tier 1 management body (Comitato di Gestione del Tier 1).
\end{itemize}
People belonging to the User Support team represent INFN Tier-1 inside the VOs.
People belonging to the User Support team represent INFN Tier 1 inside the VOs.
In some cases, they are directly integrated in the supported experiments. Moreover, they can play the role of a member of any VO for debugging purposes.
The User Support staff is also involved in different CNAF internal projects, notably the Computing on SoC Architectures (COSA) project (www.cosa-project.it) dedicated to the technology tracking and benchmarking of the modern low-power architectures for computing applications.
\section{Supported experiments}
The LHC experiments represent the main users of the data center, handling more than 80\% of the total computing and storage resources funded at CNAF. Besides the four LHC experiments (ALICE, ATLAS, CMS, LHCb) for which CNAF acts as Tier-1 site, the data center also supports an ever increasing number of experiments from the Astrophysics, Astroparticle physics and High Energy Physics domains, and specifically Agata, AMS-02, Argo-YBJ, Auger, Belle II, Borexino, CDF, Compass, COSMO-WNEXT CTA, Cuore, Cupid, Dampe, DarkSide-50, Enubet, Famu, Fazia, Fermi-LAT, Gerda, Icarus, LHAASO, LHCf, Limadou, Juno, Kloe, KM3Net, Magic, NA62, Newchim, NEWS, NTOP, Opera, Padme, Pamela, Panda, Virgo, and XENON.
The LHC experiments represent the main users of the data center, handling more than 80\% of the total computing and storage resources funded at CNAF. Besides the four LHC experiments (ALICE, ATLAS, CMS, LHCb) for which CNAF acts as Tier 1 site, the data center also supports an ever increasing number of experiments from the Astrophysics, Astroparticle physics and High Energy Physics domains, and specifically Agata, AMS-02, Auger, Belle II, Borexino, CDF, Compass, COSMO-WNEXT CTA, Cuore, Cupid, Dampe, DarkSide-50, Enubet, Famu, Fazia, Fermi-LAT, Gerda, Icarus, LHAASO, LHCf, Limadou, Juno, Kloe, KM3Net, Magic, NA62, Newchim, NEWS, NTOP, Opera, Padme, Pamela, Panda, Virgo, and XENON.
Clearly, a bigger effort from the User Support team is needed to answer the varied and diverse needs of these non-LHC experiments and to encourage them to adopt more modern technologies, e.g. FTS, Dirac, token-based authorization.
\begin{figure}[ht]
......@@ -60,12 +60,13 @@ The following figures show resources pledged and used by the supported experimen
Unfortunately, the accounting data for storage, both disk and tape statistics, are available only after summer 2018, given that the restoration of the complex system of sensors for accounting after the 2017 flooding had a lower priority than the activities needed for a complete recovery of the storage resources involved in the flood.
\section{Support to HPC and cloud-based experiments}
Apart from Tier-1 facilities, CNAF hosts two small HPC farms and a cloud infrastructure. The first HPC cluster, in production since 2015, is composed of 27 nodes, some of them also equipped with one or more GPUs (NVIDIA Tesla K20, K40 and K1). All nodes are infiniband interconnected and equipped with 2 Intel CPUs, 8 physical cores each, HyperThread enabled. The cluster is accessible via the LSF batch system. It is open to various INFN communities, but the main users are theoretical physicist dealing with plasma laser acceleration simulations. The cluster serves as testing infrastructure to prepare the high resolution runs submitted to supercomputers.
Apart from Tier 1 facilities, CNAF hosts two small HPC farms and a cloud infrastructure. The first HPC cluster, in production since 2015, is composed of 27 nodes, some of them also equipped with one or more GPUs (NVIDIA Tesla K20, K40 and K1). All nodes are interconnected via InfiniBand and equipped with 2 Intel CPUs, 8 physical cores each, with Hyper-Threading enabled. The cluster is accessible via the LSF batch system. It is open to various INFN communities, but the main users are theoretical physicists dealing with plasma laser acceleration simulations. The cluster is used as a testing infrastructure to prepare the high resolution runs to be submitted afterwards to supercomputers.
A second HPC cluster entered into production in 2017 to serve the CERN accelerators R/D groups. The cluster consists of 12 nodes OmniPath interconnected. Can be access through batch queues managed by the IBM LSF system.
A second HPC cluster entered into production in 2017 to serve the CERN accelerators R/D groups. The cluster consists of 12 nodes interconnected via OmniPath. It can be accessed through batch queues managed by the IBM LSF system.
Support is provided on a daily basis for software installation, access problems, new accounts creation and any other usage problems.
The User Support team manages an OpenStack-based tenant hosted within the Cloud@CNAF. This tenant, provided with 300 vCPUs, is mostly devoted to support peculiar use cases which require unusual software configurations and only for a limited amount of time. The most important of these use cases is the FAZIA experiment, for which 256 vCPUs were provided, distributed over 16 worker nodes with 8GB of RAM each, where the Debian 8.4 operating system has been installed and configured with LDAP+Kerberos for user authentication and authorization, and NFS 4 for network storage sharing. Recently, other experiments started accessing the Cloud infrastructure: AMS, EEE, FAZIA, Icarus and NTOF.
The User Support team manages an OpenStack-based tenant hosted within the Cloud@CNAF. This tenant, provided with 300 vCPUs, is mostly devoted to support peculiar use cases which require unusual software configurations and only for a limited amount of time. The most important of these use cases is the FAZIA experiment, for which 256 vCPUs were provided, distributed over 16 worker nodes with 8GB of RAM each, where the Debian 8.4 operating system has been installed and configured with LDAP and Kerberos for user authentication and authorization, and NFS 4 for network storage sharing.
Recently, other experiments started accessing the Cloud infrastructure: AMS, EEE, Icarus and NTOF.
\end{document}
......
......@@ -5,18 +5,18 @@
%\author{P. Astone$^1$, F. Badaracco$^{2,3}$, S. Bagnasco$^4$, S. Caudill$^5$, F. Carbognani$^6$, A. Cirone$^{7,8}$, G. Fronz\'e$^{4}$, J. Harms$^{2,3}$, I. LaRosa$^1$, C. Lazzaro$^9$, P. Leaci$^1$, S. Lusso$^4$, C. Palomba$^1$, R. DePietri$^{11,12}$, M. Punturo$^{10}$, L. Rei$^8$, L. Salconi$^6$, S. Vallero$^{4}$, on behalf of the Virgo collaboration}
\author{P. Astone$^1$, F. Badaracco$^{2,3}$, S. Bagnasco$^4$, S. Caudill$^5$, F. Carbognani$^6$, A. Cirone$^{7,8}$, M. Drago$^{2,3}$, G. Fronz\'e$^{4}$, J. Harms$^{2,3}$, I. LaRosa$^1$, C. Lazzaro$^9$, P. Leaci$^1$, S. Lusso$^4$, C. Palomba$^1$, R. DePietri$^{11,12}$, M. Punturo$^{10}$, L. Rei$^8$, L. Salconi$^6$, S. Vallero$^{4}$, on behalf of the Virgo collaboration}
\address{$^1$ INFN, Roma, IT}
\address{$^2$ Gran Sasso Science Institute (GSSI), IT}
\address{$^3$ INFN, Laboratori Nazionali del Gran Sasso, IT}
\address{$^4$ INFN, Torino, IT}
\address{$^5$ Nikhef, Science Park, NL}
\address{$^6$ EGO-European Gravitational Observatory, Cascina, Pisa, IT}
\address{$^7$ Universit\`a degli Studi di Genova, IT}
\address{$^8$ INFN, Genova, IT}
\address{$^9$ INFN, Padova, IT}
\address{$^{10}$ INFN, Perugia, IT}
\address{$^{11}$ Universit\`a degli Studi di Parma, IT}
\address{$^{12}$ INFN, Gruppo Collegato Parma, IT}
\address{$^1$ INFN Sezione di Roma, Roma, IT}
\address{$^2$ Gran Sasso Science Institute (GSSI), L'Aquila, IT}
\address{$^3$ INFN Laboratori Nazionali del Gran Sasso, L'Aquila, IT}
\address{$^4$ INFN Sezione di Torino, Torino, IT}
\address{$^5$ Nikhef, Amsterdam, NL}
\address{$^6$ EGO-European Gravitational Observatory, Cascina (PI), IT}
\address{$^7$ Universit\`a degli Studi di Genova, Genova, IT}
\address{$^8$ INFN Sezione di Genova, Genova, IT}
\address{$^9$ INFN Sezione di Padova, Padova, IT}
\address{$^{10}$ INFN Sezione di Perugia, Perugia, IT}
\address{$^{11}$ Universit\`a degli Studi di Parma, Parma, IT}
\address{$^{12}$ INFN Gruppo Collegato Parma, Parma, IT}
%\address{Production Editor, \jpcs, \iopp, Dirac House, Temple Back, Bristol BS1~6BE, UK}
......@@ -32,7 +32,7 @@ The amount of data processed during the last few years has emphasized the fact t
\section{Advanced Virgo computing model}
\subsection{Data production and data transfer}
The Advanced Virgo data acquisition system is writing about 35MB/s of data (so-called ``bulk data'') during O3. CNAF and CC-IN2P3 are the Virgo Tier-0: during the science runs, bulk data is stored in a circular buffer located at the Virgo site, and simultaneously transferred to the remote computing centres where they are archived in tape libraries. The transfer is realized through an ad-hoc procedure based on GridFTP (at CNAF) and iRods (at CC-IN2P3). Other data fluxes reach CNAF during science runs:
The Advanced Virgo data acquisition system is writing about 35MB/s of data (so-called ``bulk data'') during O3. CNAF and CC-IN2P3 are the Virgo Tier 0: during the science runs, bulk data is stored in a circular buffer located at the Virgo site, and simultaneously transferred to the remote computing centers where they are archived in tape libraries. The transfer is realized through an ad-hoc procedure based on GridFTP (at CNAF) and iRods (at CC-IN2P3). Other data fluxes reach CNAF during science runs:
\begin{itemize}
\item trend data (few GB/day), periodically transferred using the system described above;
......@@ -42,23 +42,23 @@ The Advanced Virgo data acquisition system is writing about 35MB/s of data (so-c
\subsection{Data Analysis at CNAF}
%The analysis of the LIGO and Virgo data was made jointly by the two collaborations; the analysis pipelines are distributed among the worldwide network of computing facilities offering computing resources to the GW experiments. CNAF was mainly used for CW analysis, looking for continuous gravitational wave signals, developed by INFN–Roma people (see hereafter more details). But at CNAF is also running part of the pyCBC pipeline, submitted via OSG, looking for compact binaries signals. pyCBC has a crucial role in the detection of the coalescence of BBH and BNS. CNAF contributed to the computation performed through pyCBC for the analysis of the events GW170814, the first BBH coalescence detected also by Virgo, and GW170817, the BNS coalescence. During the last month a new extension of CVMFS, \emph{big cvmfs} was mounted at cnaf to support another OSG pipeline, \emph{BayesWave}. The big cvmfs is able to export, in a posix fashion, big file of data from nearby cache in Amsterdam instead of accessing data directly from Nebraska. BayesWave is a Bayesian algorithm designed to robustly distinguish gravitational wave signals from noise and instrumental glitches without relying on any prior assumptions of waveform morphology. In the last year coherent WaveBurst \emph{cwb} was ported to cnaf and made available to run. cwb is a pipeline based on coherent algorithm for detection and reconstruction of modelled and unmodelled GW bursts. A new newtonian noise cancellation algoritmh, developed by the group of Gran Sasso Science Institute (\emph{GSSI}) was made available very recently. The increased number of LVC pipelines running at cnaf has led to saturate advance virgo pledge at cnaf, cnaf promptly rensponded to advance virgo needed enlargin our quota and giving experimental access to gpu.
LIGO-Virgo data analysis is organized jointly, meaning that the analysis pipelines are made available to the computing facilities related to the LVC network, ready to be distributed to each GW detector. CNAF has been mainly used for Continuous Wave(\emph{CW}) analysis, led by the Roma INFN group, and for the Compact Binary Coalescence python-based analysis (\emph{pyCBC}), submitted via OSG. In particular CNAF computationally contributed to GW170814 and GW170817 events, respectively the first BBH coalescence detected by Virgo and the first BNS merger ever observed. During the last month a new extension of CVMFS, so-called ``big cvmfs'', was mounted at CNAF to support another OSG-based pipeline, Bayes Wave. The former is able to make available, in a POSIX-like fashion, big data files from a cache in Amsterdam, instead of accessing the data directly from Nebraska. The latter is a Bayesian algorithm, designed to robustly distinguish GW signals from noise and instrumental glitches, without relying on any prior assumptions on the waveform shape. During the last year, coherent WaveBurst(\emph{cWB}), an algorithm dedicated to the detection and reconstruction of GW Bursts, was also ported to CNAF. Furthermore, new Newtonian Noise cancellation algorithms, which are currently being developed by the GSSI group, were made recently available. The increasing number of LVC pipelines running at CNAF has led to resource saturation, and consequently to a demand for enlarged computing power, together with access to GPUs.
LIGO-Virgo data analysis is organized jointly, meaning that the analysis pipelines are made available to the computing facilities related to the LVC network, ready to be distributed to each GW detector. CNAF has been mainly used for Continuous Wave (\emph{CW}) analysis, led by the Roma INFN group, and for the Compact Binary Coalescence python-based analysis (\emph{pyCBC}), submitted via OSG. In particular, CNAF contributed computationally to the analysis of the GW170814 and GW170817 events, respectively the first BBH coalescence detected by Virgo and the first BNS merger ever observed. During the last month a new extension of CVMFS, so-called ``big cvmfs'', was mounted at CNAF to support another OSG-based pipeline, BayesWave. The former is able to make available, in a POSIX-like fashion, big data files from a cache in Amsterdam, instead of accessing the data directly from Nebraska. The latter is a Bayesian algorithm, designed to robustly distinguish GW signals from noise and instrumental glitches, without relying on any prior assumptions on the waveform shape. During the last year, coherent WaveBurst (\emph{cWB}), an algorithm dedicated to the detection and reconstruction of GW Bursts, was also ported to CNAF. Furthermore, new Newtonian Noise cancellation algorithms, which are currently being developed by the GSSI group, were recently made available. The increasing number of LVC pipelines running at CNAF has led to resource saturation, and consequently to a demand for enlarged computing power, together with access to GPUs.
\subsubsection{CW pipeline}
CNAF has been in 2018 the main computing center for Virgo all-sky continuous wave (CW) searches. The search for this kind of signals, emitted by spinning neutron stars, covers a large portion of the source parameter space and consists of several steps organized in a hierarchical analysis pipeline. CNAF has been mainly used for the ``incoherent'' stage, based of a particular implementation of the Hough transform, which is the heaviest part of the analysis from a computational point of view. The code implementing the Hough transform has been written in such a way that the exploration of the parameter space can be split in several independent jobs, each covering a range of signal frequencies and a portion of the sky. This is an embarrassingly parallel problem, very well suited to be run in a distributed computing environment. The analysis jobs have been run using the EGI UMD grid middleware, with input and output files stored in a StoRM-based Storage Element at CNAF. Candidate post-processing, consisting of clusterisation, coincidences and ranking, and parts of the candidate follow-up analysis have been also carried on at CNAF. Typical Hough transform jobs needs about 4GB of memory (with a fraction requiring more, up to 8GB). Past year most of the resources have been used to analyze Advanced LIGO O2 data. Overall, in 2018 more than 10M CPU hours have been used at CNAF for CW searches, by running O($10^5$) jobs, with duration from a few hours to ~3 days.
In 2018 CNAF has been the main computing center for Virgo all-sky continuous wave (CW) searches. The search for this kind of signal, emitted by spinning neutron stars, covers a large portion of the source parameter space and consists of several steps organized in a hierarchical analysis pipeline. CNAF has been mainly used for the ``incoherent'' stage, based on a particular implementation of the Hough transform, which is the heaviest part of the analysis from a computational point of view. The code implementing the Hough transform has been written in such a way that the exploration of the parameter space can be split into several independent jobs, each covering a range of signal frequencies and a portion of the sky. This is an embarrassingly parallel problem, very well suited to be run in a distributed computing environment. The analysis jobs have been run using the EGI UMD grid middleware, with input and output files stored in a StoRM-based Storage Element at CNAF. Candidate post-processing, consisting of clusterisation, coincidences and ranking, and parts of the candidate follow-up analysis have also been carried out at CNAF. A typical Hough transform job needs about 4GB of memory (with a fraction requiring more, up to 8GB). Last year most of the resources have been used to analyze Advanced LIGO O2 data. Overall, in 2018 more than 10M CPU hours have been used at CNAF for CW searches, by running O($10^5$) jobs, with duration from a few hours to $\sim$3 days.
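
As a rough sketch of the job splitting described above, assume that the searched frequency band is divided into $N_f$ sub-bands and the sky into $N_s$ patches (both symbols are hypothetical and are not taken from the pipeline itself). Since each job covers one sub-band and one patch independently of all the others, the number of jobs is
\begin{equation}
N_{\mathrm{jobs}} = N_f \times N_s ,
\end{equation}
and the jobs can be submitted in any order, with no communication among them.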
\subsubsection{cWB pipeline}
Starting in 2019, the coherent WaveBurst based pipelines have been ported and adapted to run at CNAF to reproduce the cWB environment setup on the worker nodes, without the constraint to read the user home account during running. It is planned to run at CNAF all Virgo offline long duration all-sky searches on the data that will be collected during the Observational Run 3 (03) that started April 1st, 2019. cWB is a data-analysis tool to search for a broad range of gravitational-wave (GW) transients. The pipeline identifies coincident events in the GW data from earth-based interferometric detectors and reconstructs the gravitational wave signal by using a constrained maximum likelihood approach. The algorithm performs a time-frequency analysis of the data, using wavelet representation, and identifies the events by clustering time-frequency pixels with significant excess coherent power. The likelihood statistics is built as a coherent sum over the responses of different detectors and estimates the total signal to noise ratio of the GW signal in the network. The pipeline splits the total analysis time into sub-periods to be analyzed in parallel jobs, using HTCondor tools and it is expected to use a consistent amount of CPU hours during 2019.
Starting in 2019, the coherent WaveBurst based pipelines have been ported and adapted to run at CNAF to reproduce the cWB environment setup on the worker nodes, without the constraint to read the user home account during running. It is planned to run at CNAF all Virgo offline long duration all-sky searches on the data that will be collected during the Observational Run 3 (O3) that started April 1, 2019. cWB is a data-analysis tool to search for a broad range of gravitational-wave (GW) transients. The pipeline identifies coincident events in the GW data from earth-based interferometric detectors and reconstructs the gravitational wave signal by using a constrained maximum likelihood approach. The algorithm performs a time-frequency analysis of the data, using wavelet representation, and identifies the events by clustering time-frequency pixels with significant excess coherent power. The likelihood statistic is built as a coherent sum over the responses of different detectors and estimates the total signal to noise ratio of the GW signal in the network. The pipeline splits the total analysis time into sub-periods to be analyzed in parallel jobs, using HTCondor tools, and it is expected to use a substantial amount of CPU hours during 2019.
\subsubsection{Newtonian noise pipeline}
The cancellation of gravitational noise from seismic fields will be a major challenge from both the theoretical and the computational point of view, since the involved simulations are very demanding. This activity requires the accurate positioning of a large number of seismometers. A cluster at CNAF was used to run position optimisations of the seismic arrays used for cancellation and to determine the cancellation performance as a function of the number of sensors and its robustness with respect to sensor-positioning accuracy.
\subsection{outlook}
The first detection of gravitational waves (GW) and the birth of multi-messenger astrophysics have opened a new field of scientific research. With the possibility to detect GW from various kind of sources we can probe new physical phenomena in regions of the Universe we couldn't explore before, with new perspectives on our knowledge about how it works.
Indeed, so far only signals from the coalescence of compact objects have been detected, while one of the most interesting and promising class of continuous GW signals, coming from asymmetrical rotating neutron stars, is still missing. Wide searches of this kind of signals require a huge amount of computational power due to the Doppler effect of the Earth motion, which disrupts the incoming signal dramatically increases the parameters space. This means that it is necessary to develop complex algorithms to reduce the computational power needed, at the price of significantly reducing the sensitivity of the search.
\subsection{Outlook}
The first detection of gravitational waves (GW) and the birth of multi-messenger astrophysics have opened a new field of scientific research. With the possibility to detect GW from various kinds of sources we can probe new physical phenomena in regions of the Universe we couldn't explore before, with new perspectives on our knowledge about how it works.
Indeed, so far only signals from the coalescence of compact objects have been detected, while one of the most interesting and promising classes of continuous GW signals, coming from asymmetrical rotating neutron stars, is still missing. Wide searches for this kind of signal require a huge amount of computational power due to the Doppler effect of the Earth motion, which disrupts the incoming signal and dramatically increases the parameter space. This means that it is necessary to develop complex algorithms to reduce the computational power needed, at the price of significantly reducing the sensitivity of the search.
The development of new algorithms, which use the high efficiency and computational power of modern GPUs, showed that the new codes on a single GPU can run with a factor of ten speed-up with respect to the older ones on a ten times more expensive multi-core CPU.
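
Taken at face value, the two factors quoted above combine multiplicatively. Assuming that both factors of ten are exact, and denoting by $S$ the speed-up and by $R$ the CPU-to-GPU cost ratio (symbols introduced here for illustration only), the gain in performance per unit cost is roughly
\begin{equation}
G = S \times R \approx 10 \times 10 = 100 .
\end{equation}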
For the CW case, using real data from the 9 months long run of the LIGO detectors we have estimated that on a cluster of about 200 GPUs a complete search can be done in about a couple of months, to be confronted with the several months required by the older code on a 2000 CPUs cluster.\\ A GPU cluster would be also extremely useful to test and train Machine Learning algorithms, which in the recent years were shown to be able to face very complex analyses with high efficiency and speed.\\
For the CW case, using real data from the 9-month-long run of the LIGO detectors we have estimated that on a cluster of about 200 GPUs a complete search can be done in about a couple of months, to be compared with the several months required by the older code on a 2000-CPU cluster.\\ A GPU cluster would also be extremely useful to test and train Machine Learning algorithms, which in recent years were shown to be able to face very complex analyses with high efficiency and speed.\\
Advanced Virgo and Advanced LIGO are also exploring different technologies to face the new challenges of GW physics. The growing number of computing centers involved in GW research forces us to relax our idea of computing, searching for a way to uniformly run different pipelines in complex and heterogeneous infrastructures. For example, the de-supporting of GridFTP pushes towards the use of Rucio, a well supported and flexible tool for data-transfer and management, while the de-supporting of the Cream-CE suggests a redesign of the job submission strategy, possibly under the control of an overall management system like DIRAC. \\ CNAF staff is intensively supporting Virgo members in all these tests.
......