Commit 92c6ff01 authored by Lucia Morganti's avatar Lucia Morganti

Update summerstudent.tex

During the summer of 2018 a first investigation was carried out with the help of:
\begin{itemize}
\item Log collection and harmonization
\item Log parsing of various services, such as StoRMfrontend, StoRMbackend, heartbeat, messages, GridFTP and GPFS (not covered in our study, but potentially interesting)
\item Addition of metrics data (from the Tier 1 InfluxDB)
\end{itemize}
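The parsing and harmonization steps above can be sketched in a few lines of Python. This is a minimal illustration only: the regular expression assumes a simplified ``timestamp - level [component] message'' line layout, which is an assumption for illustration, not the actual format of the StoRM frontend/backend logs.

```python
import csv
import io
import re

# Hypothetical line layout (an assumption, not the real StoRM log format):
# "2018-06-01 12:00:00.123 - INFO [srmPtG] request queued"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
    r" - (?P<level>\w+)"
    r" \[(?P<component>[^\]]+)\]"
    r" (?P<message>.*)$"
)

def parse_lines(lines):
    """Deconstruct raw log lines into field dicts; unmatched lines are skipped."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            yield m.groupdict()

def to_csv(lines, out):
    """Write the parsed records to CSV, one harmonized row per log line."""
    writer = csv.DictWriter(
        out, fieldnames=["timestamp", "level", "component", "message"]
    )
    writer.writeheader()
    writer.writerows(parse_lines(lines))

raw = [
    "2018-06-01 12:00:00.123 - INFO [srmPtG] request queued",
    "not a log line",  # silently dropped by the parser
]
buf = io.StringIO()
to_csv(raw, buf)
print(buf.getvalue().splitlines()[1])
# -> 2018-06-01 12:00:00.123,INFO,srmPtG,request queued
```

The same per-line dictionary structure lends itself to harmonizing logs from different services (frontend, backend, heartbeat, ...) into one common CSV schema.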
However, to provide a first proof of concept of predictive and preventive maintenance, data categorization and the application of machine learning techniques are the two key activities that were carried out from the end of 2018 to the middle of 2019.
The log files contain basically three types of information: timestamp, metrics,
\end{minipage}
\end{figure}
At the beginning of this work (mid 2018), StoRM at Tier 1 was monitored by InfluxDB and Grafana. The monitored metrics included CPU, RAM, network and disk usage; the number of sync SRM requests per minute per host; and the average duration of async PTG and PTP requests per host. We wanted to add information derived from the analysis of the StoRM logs to the already available monitoring information, in order to derive new insights potentially useful to enhance service availability and efficiency, with the long-term intent of implementing a global predictive maintenance solution for Tier 1. In order to build a machine learning model for anomaly prediction, logs from two different periods were analyzed: a normal-behavior period and a critical-behavior period (the latter due to a wrong configuration of the file system and of the queues coming from the farm).
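Combining log-derived information with the per-minute metrics already collected in InfluxDB amounts to bucketing log events by minute and joining them onto the metric samples. A minimal sketch, where the metric names and the `srm_requests` feature name are illustrative assumptions:

```python
from collections import Counter
from datetime import datetime

def requests_per_minute(timestamps):
    """Bucket log events (e.g. sync SRM requests) into one counter per minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)

def enrich_metrics(metrics, log_counts):
    """Attach the log-derived counter to each per-minute metrics sample.

    `metrics` maps a minute to the values already monitored (CPU, RAM, ...);
    minutes with no matching log events get a count of 0.
    """
    return {
        minute: {**values, "srm_requests": log_counts.get(minute, 0)}
        for minute, values in metrics.items()
    }

events = [datetime(2018, 6, 1, 12, 0, 10), datetime(2018, 6, 1, 12, 0, 45)]
metrics = {datetime(2018, 6, 1, 12, 0): {"cpu": 0.42}}
enriched = enrich_metrics(metrics, requests_per_minute(events))
print(enriched[datetime(2018, 6, 1, 12, 0)]["srm_requests"])  # -> 2
```

The enriched per-minute records are exactly the kind of feature vectors a downstream anomaly-prediction model can consume.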
A four-step activity was carried out:
\begin{enumerate}
\item Parsing: log files were parsed and deconstructed, then converted to CSV format
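Training on the two periods mentioned above requires labelling each parsed record by the period its timestamp falls into. A minimal sketch; the period boundaries below are purely hypothetical, since the actual normal and critical windows analyzed at Tier 1 are not given here:

```python
from datetime import datetime

# Hypothetical period boundaries (assumptions for illustration only):
NORMAL = (datetime(2018, 6, 1), datetime(2018, 6, 15))
CRITICAL = (datetime(2018, 7, 1), datetime(2018, 7, 15))

def label(ts):
    """Return 0 for the normal-behavior period, 1 for the critical one,
    and None for timestamps outside both windows."""
    if NORMAL[0] <= ts < NORMAL[1]:
        return 0
    if CRITICAL[0] <= ts < CRITICAL[1]:
        return 1
    return None

print(label(datetime(2018, 7, 3)))  # -> 1
```

Records labelled this way form the supervised training set for the anomaly-prediction model.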