@@ -20,7 +20,7 @@ During the summer 2018 a first investigation has been exploited with the help of
\begin{itemize}
\item Log collection and harmonization
\item Log parsing of various services, such as StoRMfrontend, StoRMbackend, heartbeat, messages, GridFTP and GPFS (not covered in our study, but potentially interesting); a parsing sketch is given after this list
\item Addition of metrics data (from the Tier1 InfluxDB)
\end{itemize}
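As a minimal illustration of the log parsing point above, the sketch below deconstructs a plain-text service log into CSV rows. The regular expression, the file names and the assumed timestamp/level/message layout are hypothetical and would need to be adapted to the actual StoRM log formats.
\begin{verbatim}
import csv
import re

# Hypothetical layout: timestamp, log level, free-text message.
# The real StoRM frontend/backend formats may differ.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}[.,]?\d*)\s*"
    r"[-\s]*(?P<level>[A-Z]+)\s*[-:]\s*(?P<message>.*)$"
)

def log_to_csv(in_path, out_path):
    """Deconstruct one log file into CSV rows (timestamp, level, message)."""
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["timestamp", "level", "message"])
        for line in src:
            match = LINE_RE.match(line.strip())
            if match:
                writer.writerow([match["timestamp"],
                                 match["level"],
                                 match["message"]])

log_to_csv("storm-backend.log", "storm-backend.csv")
\end{verbatim}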
However, to provide a first proof of concept for predictive and preventive maintenance, two key activities, data categorization and the application of machine learning techniques, were carried out between the end of 2018 and mid-2019.
...
...
@@ -46,7 +46,7 @@ The log files contains basically three types of information: timestamp, metrics,
\end{minipage}
\end{figure}
At the beginning of this work (mid 2018), StoRM at Tier1 was monitored with InfluxDB and Grafana. The monitored metrics included CPU, RAM, network and disk usage; the number of synchronous SRM requests per minute per host; and the average duration of asynchronous PTG and PTP requests per host. We wanted to add information derived from the analysis of StoRM logs to the monitoring information already available, in order to derive new insights potentially useful to enhance service availability and efficiency, with the long-term intent of implementing a global predictive maintenance solution for Tier1. In order to build a Machine Learning model for anomaly prediction, logs from two different periods were analyzed: a normal behavior period and a critical behavior period (the latter caused by a wrong configuration of the file system and of the queues coming from the farm).
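A minimal sketch of how the two periods could feed a supervised anomaly-prediction model is given below. The file names, the feature layout and the choice of a random forest classifier are illustrative assumptions, not the model actually used at Tier1.
\begin{verbatim}
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical per-minute feature tables (numeric columns only) built from
# parsed logs and InfluxDB metrics; the actual features are not specified here.
normal = pd.read_csv("features_normal_period.csv")
critical = pd.read_csv("features_critical_period.csv")
normal["anomaly"] = 0      # label: normal behavior period
critical["anomaly"] = 1    # label: critical behavior period

data = pd.concat([normal, critical], ignore_index=True)
X = data.drop(columns=["anomaly"])
y = data["anomaly"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Illustrative classifier choice; any supervised model could be plugged in.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
\end{verbatim}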
A four-step activity was carried out:
\begin{enumerate}
\item Parsing: log files were parsed and deconstructed, converting them to CSV format