
\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\begin{document}
\title{INFN CNAF log analysis: a first experience with summer students}

\author{Daniele Bonacorsi$^1$, Andrea Ceccanti$^2$, Tommaso Diotalevi$^1$, Antonio Falabella$^2$, Luca Giommi$^2$, Barbara Martelli$^2$, Diego Michelotto$^2$, Lucia Morganti$^2$, Elisabetta Ronchieri$^2$, Simone Rossi Tisbeni$^1$, Enrico Vianello$^2$}

\address{$^1$University of Bologna, $^2$INFN CNAF}

\ead{barbara.martelli@cnaf.infn.it}

\begin{abstract}
In 2018 the INFN CNAF computing center started to investigate predictive and preventive maintenance solutions, applying machine learning techniques to hardware and service logs in order to improve fault diagnosis. A first, very positive experience was carried out by three students, who dedicated three summer months to collecting logs of the StoRM services and of the resources that host them, preprocessing these logs in order to remove all bias information, and performing an initial data analysis. Here we present the activities carried out by these students, the initial outcome and the ongoing work at the INFN CNAF data center.
\end{abstract}

\section{Introduction}
In recent years INFN CNAF has put great effort into defining and implementing a common monitoring infrastructure based on Sensu, InfluxDB and Grafana, and into centralizing logs from the most relevant services \cite{bovina2015, bovina2017}. This unified infrastructure is now fully integrated in the data center \cite{fattibene2018}, and the next challenge and opportunity is to correlate this vast volume of data and extract actionable insights.

During the summer of 2018 a first investigation was carried out with the help of three summer students \cite{seminario}. Once a specific system to analyze had been identified, i.e.\ StoRM, the following activities were addressed:
\begin{itemize}
\item Log collection and harmonization
\item Log parsing of various services, such as StoRMfrontend, StoRMbackend, heartbeat, messages, GridFTP and GPFS (not covered in our study, but potentially interesting)
\item Addition of metrics data (from the Tier 1 InfluxDB)
\end{itemize}

However, to provide a first proof of concept for predictive and preventive maintenance, two further key steps are needed: data categorization and the application of machine learning techniques. These were carried out between the end of 2018 and the middle of 2019.

\section{Log collection and harmonization}
The first part of the work consisted of collecting StoRM logs from the StoRM servers dedicated to the ATLAS experiment.

Subsequently, the most relevant information was extracted from the logs using the ELK Stack suite \cite{elk}. The ELK stack consists of four components: Beats, used for data collection from multiple sources; Logstash, used for data aggregation and processing; Elasticsearch, used to store and index data; and Kibana, used for data analysis and visualization. In particular, Logstash was used to ingest data from Beats as a continuous live stream, filter relevant entries, parse each event by identifying named fields to build a user-defined structure, and ship the parsed data to the Elasticsearch engine. Most data was filtered using a \textit{grok} filter, which is based on regular expressions and provides predefined patterns together with the ability to define custom ones.
Finally, several dashboards were created using Kibana in order to show, in a human-friendly way, a summary of the most relevant information derived from StoRM logs (see for example Figure~\ref{fig3}).
\begin{figure}[h]
\includegraphics[width=20pc]{kibana.png}\hspace{2pc}
\begin{minipage}[b]{14pc}\caption{\label{fig3}An example of the Kibana dashboards created.}
\end{minipage}
\end{figure}
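
As an illustration of the regular-expression parsing that a grok filter performs, the following Python sketch extracts named fields from a single log line. It is not the Logstash configuration used in this work: the log line layout, the field names (\texttt{timestamp}, \texttt{level}, \texttt{request}, \texttt{message}) and the helper \texttt{parse\_line} are hypothetical and chosen only for the example.

\begin{verbatim}
```python
import re

# Hypothetical log line layout; the real StoRM log format may differ.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<request>\w+):\s+"
    r"(?P<message>.*)"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up sample line, used only to exercise the pattern.
sample = "2018-07-15 10:32:01 INFO ptg: Request succeeded with SRM_SUCCESS"
event = parse_line(sample)
print(event)
```
\end{verbatim}

Lines that do not match the pattern return \texttt{None}, which mirrors the way unparsed events can be filtered out before indexing.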


\section{Log parsing}
Among the INFN Tier 1 services hosted at the INFN CNAF computing center are efficient storage systems such as StoRM, a grid Storage Resource Manager (SRM) solution. Figure~\ref{fig1} shows the StoRM architecture: the frontend service manages user authentication and stores request data, while the backend service executes SRM functionalities and takes care of space and authorization.
The log files contain essentially three types of information: timestamps, metrics and messages.
\begin{figure}[h]
\includegraphics[width=20pc]{StoRM-full-picture.png}\hspace{2pc}%
\begin{minipage}[b]{14pc}\caption{\label{fig1}The StoRM architecture.}
\end{minipage}
\end{figure}

At the beginning of this work (mid 2018), StoRM at Tier 1 was monitored with InfluxDB and Grafana. The monitored metrics included CPU, RAM, network and disk usage; the number of synchronous SRM requests per minute per host; and the average duration of asynchronous PTG and PTP requests per host. We wanted to add information derived from the analysis of StoRM logs to the monitoring information already available, in order to derive new insights potentially useful to enhance service availability and efficiency, with the long-term intent of implementing a global predictive maintenance solution for Tier 1. In order to build a machine learning model for anomaly prediction, logs from two different periods were analyzed: a period of normal behavior and a period of critical behavior (due to a wrong configuration of the file system and of the queues coming from the farm).
A four-step activity was carried out:
\begin{enumerate}
\item Parsing: log files were parsed and deconstructed, converting them to CSV format.
\item Feature selection: messages were grouped based on their common content (the core part of the message). The grouping phase resulted in 20 \textit{Request Types} (Connection, Run, Ping, Ls, Check permission, PTG, PTG status, Get space tokens, PTP, PTP status, BOL status, Put done, Release files, Mv, Mkdir, BOL, Abort request, Abort files, Get space metadata, nan) and 15 \textit{Result Types} (SRM\_SUCCESS, SRM\_FAILURE, SRM\_NOT\_SUPPORTED, SRM\_REQUEST\_QUEUED, SRM\_REQUEST\_INPROGRESS, Protocol check failed, Received 4 protocols, Some protocols supported, SRM\_DUPLICATION\_ERROR, rpcResponseHandler\_AbortFiles, SRM\_INVALID\_REQUEST, SRM\_INVALID\_PATH, Received 5 protocols, SRM\_INTERNAL\_ERROR, nan). A first data exploration phase was performed by counting the occurrences of messages in each group.
The techniques used for the feature selection procedure were: SelectKBest with the chi-squared statistical test, Recursive Feature Elimination, Principal Component Analysis (PCA) and Feature Importance from ensembles of decision tree methods.
\item One-hot encoding: CSV rows were encoded as binary vectors (feature vectors). Each vector summarizes 15 minutes of log content.
\item Labelling: an operation specific to StoRM log files, done manually by discriminating between the normal and the critical period on the basis of help-desk tickets.
\end{enumerate}
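
The one-hot encoding step can be sketched as follows, using only the Python standard library. This is a minimal illustration under stated assumptions, not the code used in this work: the CSV column names (\texttt{timestamp}, \texttt{request\_type}, \texttt{result\_type}), the sample rows and the helper \texttt{one\_hot\_windows} are hypothetical. Each 15-minute window is mapped to a binary vector with one entry per observed category.

\begin{verbatim}
```python
import csv
import io
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # each feature vector summarizes 15 minutes

# Hypothetical parsed log in CSV form (epoch timestamp, request, result).
CSV_TEXT = """timestamp,request_type,result_type
1531649700,PTG,SRM_SUCCESS
1531650000,Ls,SRM_SUCCESS
1531650650,PTP,SRM_FAILURE
"""

FIELDS = ("request_type", "result_type")


def one_hot_windows(csv_text):
    """Map each 15-minute window start to a binary vector of categories."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # One column per distinct (field, value) pair, in a stable sorted order.
    columns = sorted({(f, row[f]) for row in rows for f in FIELDS})
    index = {col: i for i, col in enumerate(columns)}
    vectors = defaultdict(lambda: [0] * len(columns))
    for row in rows:
        # Floor the timestamp to the start of its 15-minute window.
        window = int(row["timestamp"]) // WINDOW_SECONDS * WINDOW_SECONDS
        for f in FIELDS:
            vectors[window][index[(f, row[f])]] = 1
    return columns, dict(vectors)


columns, vectors = one_hot_windows(CSV_TEXT)
for window, vector in sorted(vectors.items()):
    print(window, vector)
```
\end{verbatim}

In this sketch an entry is set to 1 when the category appears at least once in the window; a count-based summary would be an equally plausible choice.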
Feature vectors obtained in (iii) and labeled datasets built in (iv) were used to train several ML algorithms and to test their accuracy. Figure \ref{fig2} depicts the results of tests performed on the following algorithms: LogisticRegression (LR), LinearDiscriminantAnalysis (LDA), KNeighborsClassifier (KNN), GaussianNB (GNB), DecisionTreeClassifier (CART), BaggingClassifier (BgDT), RandomForestClassifier (RF), ExtraTreesClassifier (ET), AdaBoostClassifier (AB), GradientBoostingClassifier (GB), XGBoostClassifier (XGB), MultiLayerPerceptronClassifier (MLP).

\begin{center}
\begin{figure}[h]
\includegraphics[width=20pc]{MLalgorithms.png}\hspace{2pc}
\begin{minipage}[b]{14pc}\caption{\label{fig2}Machine Learning Algorithms Comparison (scorer=accuracy).}
\end{minipage}
\end{figure}
\end{center}
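
A comparison of this kind can be sketched with scikit-learn as shown below. The example is illustrative only: it uses a synthetic dataset in place of the labelled StoRM feature vectors, covers just four of the algorithms listed above, and assumes a 5-fold stratified cross-validation, which is not necessarily the validation scheme adopted in this work.

\begin{verbatim}
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labelled feature vectors (not the StoRM data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# A subset of the classifiers named in the text, with default settings.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "GNB": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=0),
}

# Assumed validation scheme: 5-fold stratified cross-validation.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean accuracy over the folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean accuracy = {score:.3f}")
```
\end{verbatim}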

\section{Addition of metrics data}
This activity focused mainly on collecting metric data from InfluxDB in order to relate them to the StoRM logs obtained through the activities described in the previous sections, and to extract new insights.
Key components of the log files were identified, parsed and structured in a CSV file with the following columns: timestamp, metric, message, descriptive keys and separators. All timestamps were converted to UNIX epoch time in order to be comparable. InfluxDB stores information with a granularity that depends on the age of the data collected, and StoRM frontend and backend logs are themselves produced with different frequencies (one line per minute for heartbeat logs, multiple lines per minute for metrics logs, one line every five minutes for the most recent InfluxDB data, and so on). Therefore, some concatenation rules were implemented in order to correctly relate all data sources based on the time of occurrence of each event: backend metrics are split by type, timestamps are rounded to one-minute precision, in case of overlap the more recent entry is kept, and all CSV files are concatenated and ordered by timestamp.
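
The concatenation rules above can be sketched in Python as follows. This is a minimal illustration, not the code used in this work: the row layout (epoch seconds, source name, value), the sample values and the helpers \texttt{round\_to\_minute} and \texttt{merge\_sources} are hypothetical. It also assumes rounding to the nearest minute, and that an overlap means two rows of the same source falling in the same minute.

\begin{verbatim}
```python
def round_to_minute(epoch_seconds):
    """Round a UNIX epoch timestamp to the nearest minute (assumed rule)."""
    return int((epoch_seconds + 30) // 60 * 60)


def merge_sources(*sources):
    """Concatenate rows from all sources, resolve overlaps, sort by time.

    Each row is an (epoch_seconds, source_name, value) tuple.
    """
    latest = {}
    for rows in sources:
        for epoch, name, value in rows:
            key = (round_to_minute(epoch), name)
            # On overlap keep the row with the most recent original timestamp.
            if key not in latest or epoch > latest[key][0]:
                latest[key] = (epoch, value)
    return sorted(
        (minute, name, value)
        for (minute, name), (_, value) in latest.items()
    )


# Hypothetical sample rows from two sources with different frequencies.
heartbeat = [(1531650001, "heartbeat", "ok"), (1531650062, "heartbeat", "ok")]
backend = [(1531650010, "ptg_avg_ms", 120), (1531650020, "ptg_avg_ms", 135)]

merged = merge_sources(heartbeat, backend)
for row in merged:
    print(row)
```
\end{verbatim}

Here the two backend rows fall in the same minute, so only the later one (value 135) survives the merge.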

\section{Conclusion}
This experience is a good example of a mutually beneficial collaboration between university students and INFN CNAF. Its outcome has allowed the master students (i.e.\ Diotalevi T. and Giommi L.) to publish papers at international conferences \cite{diotalevi, giommi20191}, to win the Giulia Vita Finzi award \cite{giommi20192}, and to successfully start their PhD courses. Furthermore, the undergraduate student (i.e.\ Rossi Tisbeni S.) is expected to obtain a master's degree in Physics in July 2019. For their part, the INFN CNAF data center managers have decided to continue investigating predictive and preventive maintenance, in order to establish where and when to use it to keep services running optimally.

\section*{References}
\begin{thebibliography}{9}
\bibitem{seminario} Martelli B, Giommi L, Rossi Tisbeni S, Diotalevi T, https://agenda.infn.it/event/17430/, 2018.
\bibitem{bovina2015} Bovina S, Michelotto D, Misurelli G, \emph{CNAF Annual Report}, pp. 111--114, 2015.
\bibitem{bovina2017} Bovina S, Michelotto D, In Proc of CHEP 2017.
\bibitem{fattibene2018} Fattibene E, Dal Pra S, Falabella A, De Cristofaro T, Cincinelli G, Ruini M, In Proc of CHEP 2018.
\bibitem{diotalevi} Diotalevi T, Bonacorsi D, Michelotto D, Falabella A, In Proc of International Symposium on Grids \& Clouds (ISGC), Taipei, Taiwan, 2019 (under review).
\bibitem{giommi20191} Giommi L, Bonacorsi D, Diotalevi T, Rossi Tisbeni S, Rinaldi L, Morganti L, Falabella A, Ronchieri E, Ceccanti A, Martelli B, In Proc of International Symposium on Grids \& Clouds (ISGC), Taipei, Taiwan, 2019 (under review).
\bibitem{giommi20192} Giommi L, In INFN CCR Workshop, La Biodola, 3--7 June 2019.
\bibitem{elk} https://www.elastic.co/, accessed June 2019.
\end{thebibliography}

\end{document}