\documentclass[a4paper]{jpconf}

\usepackage{url}
\usepackage[]{color}
\usepackage{graphicx}
\usepackage{makecell}
\usepackage{booktabs}
\usepackage{subfig}
\usepackage{float}
\usepackage{tikz}
\usepackage[binary-units=true,per-mode=symbol]{siunitx}

% \usepackage{pgfplots}
% \usepgfplotslibrary{patchplots}
% \usepackage[binary-units=true,per-mode=symbol]{siunitx}

\begin{document}
\title{Data management and storage systems}
\author{A. Cavalli$^1$, D. Cesini$^1$, A. Falabella$^1$, E. Fattibene$^1$, L. Morganti$^1$, A. Prosperini$^1$, V. Sapunenko$^1$}
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{vladimir.sapunenko@cnaf.infn.it}



\section{Introduction}
The Data management group, composed of 7 people (5 of whom permanent), is responsible for the installation, configuration and operation of all data storage systems, including the Storage Area Network (SAN) infrastructure, the disk servers, the Mass Storage System (MSS) and the tape library, as well as of the data management services such as GridFTP, XRootD, WebDAV and the SRM interfaces (StoRM in our case) to the storage systems.

The installed capacity at the end of 2018 was around 37 PB of net disk space and around 72 PB of tape space.

The storage infrastructure is based on industry standards, allowing the implementation of a data access system that is completely redundant from the hardware point of view and capable of very high performance.

The structure is the following: 
\begin{itemize}
\item Hardware level: mid-range or enterprise-level storage systems interconnected to the disk servers via a Storage Area Network (Fibre Channel or InfiniBand).
\item File system level: IBM Spectrum Scale (formerly GPFS) to manage the storage systems. The GPFS file systems, one for each main experiment, are directly mounted on the compute nodes, so that user jobs have direct POSIX access to all data. Since the latency between the two sites is low (comparable to the one experienced on the LAN), the CNAF file systems are directly mounted on the CINECA compute nodes as well.
\item Hierarchical Storage Manager (HSM): GEMSS (Grid Enabled Mass Storage System, see Section 4 below), a thin software layer developed in house, and IBM Spectrum Protect (formerly TSM) to manage the tape library. In the current setup, a total of 5 HSM nodes (one for each LHC experiment and one for all the other experiments) are used in production to handle all the data movements between disks and tapes.

\item Data access: users can manage data via StoRM and access them via POSIX, GridFTP, XRootD and WebDAV/HTTP (a minimal access sketch is given right after this list). Virtualizing the data management servers (StoRM FrontEnd and BackEnd) within a single VM for each major experiment allowed us to consolidate hardware and increase the availability of the services; the data movers, instead, run on dedicated high-performance hardware.
\end{itemize}
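
For illustration only, the minimal sketch below contrasts the two access modes described above: direct POSIX access from a compute node where the experiment file system is mounted, and access through the data management layer. All paths and endpoints in the sketch are hypothetical placeholders, not actual CNAF locations.

\begin{verbatim}
# Minimal sketch of the two access modes described above; the paths
# and endpoints are hypothetical examples, not real CNAF locations.

def read_header(posix_path, nbytes=1024):
    """Direct POSIX access: the experiment GPFS file system is
    mounted on the compute node, so a job simply opens the file."""
    with open(posix_path, "rb") as f:
        return f.read(nbytes)

# The same dataset can also be reached through the data management
# layer, e.g. via XRootD or WebDAV/HTTP (placeholder endpoints):
POSIX_PATH = "/storage/gpfs_example/run001/data.raw"
XROOTD_URL = "root://xrootd.example.infn.it//example/run001/data.raw"
WEBDAV_URL = "https://webdav.example.infn.it/example/run001/data.raw"
\end{verbatim}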

By the end of March 2018 we had completed the installation of the last part of the 2018 tender. The new storage
consisted of 3 Huawei OceanStor 18800v5 systems, for a total of 11.52 PB of usable space, and 12 I/O servers equipped with 2x100 GbE and 2x56 Gbps InfiniBand cards.

By the end of 2018 we decommissioned older storage systems, which had been in service for more than 6 years, and migrated about 9 PB of data to the recently installed storage. The data migration was performed without any service interruption, thanks to the Spectrum Scale functionality that permits on-the-fly file system reconfiguration.

A list of storage systems in production as of 31.12.2018 is given in Table \ref{table:3}.
 
\begin{table}[h!]
    \centering
    \begin{tabular}{|c|c|c|}
    \hline
    Storage system & Quantity & Net capacity (TB) \\
    \hline
        DDN SFA 12K & 2 & 10240 \\
        DELL MD3860 & 4 & 2304 \\
        Huawei OS6800v5  & 1 & 5521\\
        Huawei OS18000v5  & 5 & 19320\\
        \hline
        Total (On-line) &&37385 \\
        \hline
    \end{tabular}
    \caption{Storage systems in production as of 31.12.2018.}
    \label{table:3}
\end{table}

\section{Recovery from the flood of November 9, 2017}

The first three months of 2018 were entirely dedicated to recovering the hardware and restoring the services after the flood that occurred on November $9^{th}$, 2017.
At that time, the Tier 1 storage at CNAF consisted of the resources listed in Table \ref{table:1}. Almost all storage resources were damaged or contaminated by dirty water. 
 
\begin{table}[h!]
    \centering
    \begin{tabular}{|c|c|c|c|}
    \hline
    System & Quantity & Net capacity (TB) & Usage (\%) \\
    \hline
        DDN SFA 12K & 2 & 10240 & 95 \\
        DDN SFA 10K & 1 & 2500 & 96 \\
        DDN S2A 9900 & 6 & 5700 & 80 \\
        DELL MD3860 & 4 & 2200 & 97 \\
        Huawei OS6800v5  & 1 & 4480 & 48 \\
        \hline
        Total (On-line) && 25120 &\\
        \hline
    \end{tabular}
    \caption{Storage systems in production as of November $9^{th}$ 2017.}
    \label{table:1}
\end{table}

The recovery started as soon as the flooded halls became accessible. As a first step, we extracted from the tape library and from the disk enclosures all the tape cartridges and hard disks that had come into contact with water. After extraction, all of them were labeled with their original position, cleaned, dried and stored in a secure place.

\subsection{Recovery of disk storage systems}
The strategy for recovering the disk storage systems varied depending on the redundancy configuration and on the availability of technical support.

\subsubsection{DDN}
All DDN storage systems consisted of a pair of controllers and 10 disk enclosures, and were configured with RAID6 (8+2) data protection in such a way that every RAID group was distributed over all 10 enclosures. Thus, having one disk enclosure damaged in each DDN storage system meant a reduced level of redundancy. In this case we decided either to operate the systems with reduced redundancy for the time needed to evacuate the data to the newly installed storage, or to substitute the damaged enclosures and the corresponding disks with new ones and rebuild the missing parity.
For the most recent and still maintained systems, we decided to replace all potentially damaged parts, namely 3 disk enclosures and 3x84 disks of 8 TB.
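
The impact of losing a single enclosure can be made concrete with a small worked example based on the 8+2 layout described above (an illustration only, not a description of the actual DDN tooling): since each RAID6 group has exactly one member disk per enclosure, the loss of one enclosure degrades every group to 8+1, which can still survive one further disk failure.

\begin{verbatim}
# Worked example for the RAID layout described above (illustrative).
disks_per_group = 10      # RAID6: 8 data + 2 parity
enclosures = 10           # one member disk per enclosure per group
parity_disks = 2

# Losing one enclosure removes one member from every RAID group:
members_left = disks_per_group - 1           # 9 of 10
extra_failures_tolerated = parity_disks - 1  # one more disk may fail

print(f"Each group keeps {members_left} members and tolerates "
      f"{extra_failures_tolerated} further failure(s).")
\end{verbatim}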

After cleaning and drying, we tested several disk drives in our lab and found that helium-filled HDDs, being well sealed, are mostly immune to water contamination.
The only sensitive parts of such drives are the electronic board and the connectors, which can easily be cleaned even without special equipment.
Cleaning the disk enclosures, on the other hand, is much more complicated or even impossible.

For this reason, we decided to replace only the disk enclosures (DAEs) and populate them with the old but cleaned HDDs, start up the systems, and then replace and rebuild the old disks one by one while in production. In this way we were able to start using the largest part of our storage immediately after the restoration of our power plant.

For the older DDN systems, namely the SFA 10000 and the S2A 9900, we decided to disconnect the contaminated enclosures (one in each system) and to run them with reduced redundancy (RAID5 8+1 instead of RAID6 8+2) while moving the data to the new storage systems.
    
\subsubsection{Dell}
Air-filled disks, after cleaning and drying, demonstrated limited operability (up to 2-3 weeks), usually enough for data evacuation.
For the Dell MD3860f storage systems the situation was quite different, since each system had only 3 DAEs of 60 HDDs each, with 24 contaminated disks, and data protection was based on Distributed RAID technology.

In this case, working in close contact with the Dell Support Service and trying to minimize costs, we decided to replace only the contaminated elements, such as electronic boards, backplanes and chassis, leaving the original (cleaned and dried) disks in place; after powering on each system, the old disks were replaced with new ones one by one, so as to allow the rebuild of the missing parity. Replacement and rebuild took about 3 weeks for each MD3860f system. During this time we observed only 3 failures (spread over time) of ``wet'' HDDs, which were successfully recovered by automated rebuild using reserved capacity.

\subsubsection{Huawei}
The Huawei OceanStor 6800v5 storage system, consisting of 12 disk enclosures of 75 HDDs each, was installed in 2 cabinets, with two disk enclosures ending up at the lowest level; these were therefore contaminated by water. The two contaminated disk enclosures belonged to two different storage pools.

The data protection in this case was similar to that adopted for the Dell MD3860, i.e. three Distributed RAID groups built on top of storage pools of four disk enclosures each. For the recovery we followed the procedure described above and replaced the two disk enclosures. The spare parts were delivered and installed, and the disks were cleaned and put back in their original places. However, when powered on, the system did not recognize the new enclosures: it turned out that the delivered enclosures were incompatible at the firmware level with the controllers. While this issue was being debugged, the system remained powered on and the disks kept deteriorating. By the time the compatibility issue was solved, two weeks later, the number of failed disks had exceeded the supported redundancy. Hence, two out of three RAID sets were permanently damaged, and two thirds of all the data stored on this system were permanently lost.
The total volume of lost data amounts to 1.4 PB, out of the 22 PB stored at the CNAF data center at the time of the flood.

    \subsection{Recovery of tapes and tape library}
    
    The SL8500 tape library was contaminated by water in its lowest 20 cm, enough to damage several components and 166 tape cartridges stored in the first two levels of slots (out of a total of 5500 cartridges in the library).
    Some of the damaged tapes (16) were still empty.
    As a first intervention, the wet tapes were removed and placed in a safe location, so as to let them dry and to start evaluating the potential data loss.
 
 The TSM database was restored from a backup copy saved on a separate storage system that had been evacuated to the CNR site. This operation allowed us to identify the content of all the wet tapes.

We communicated the content of each wet tape to the experiments, asking them whether the data on those tapes could be recovered from other sites or possibly be reproduced.
It turned out that the data contained in 75 tapes were unique and non-reproducible, so those cartridges were sent to the laboratory of an external company to be recovered.
The recovery process lasted 6 months, and 6 tapes turned out to be partially unrecoverable (20 TB lost out of a total of 630 TB).

    In parallel, non-trivial work started to clean, repair and re-certify the library, finally reinstating the maintenance contract that we still had in place (though temporarily suspended) with Oracle. External technicians disassembled and cleaned the whole library and its modules, which also allowed the underlying damaged floating floor to be replaced. The main power supplies and two robot hands were replaced, and one T10kD tape drive was lost.
    
    When the SL8500 was finally ready and turned on again, a control board placed in the front door panel, clearly damaged by the moisture, burned out and was therefore replaced.
    
Once the tape system was put back in production, we audited a sample of non-wet cartridges in order to understand whether the humidity had damaged the tapes in the period immediately after the flood. 500 cartridges (4.2 PB), heterogeneous in experiment and age, were chosen. As a result, 90 files on 2 tapes turned out to be unreadable, which is a normal error rate compared to production, so no issue related to the exposure to water was observed.

The flood also affected several tapes (of 8 GB each) containing data taken during RUN1 of the CDF experiment, which ran at Fermilab starting from 1990. When the flood happened, the CNAF team had been working to replicate the CDF data stored on those old media to modern and reliable storage technologies, in order to keep them accessible for further usage. Those tapes were dried in the hours immediately after the flood, but their readability was not verified afterwards.



    \subsection{Recovery of servers, switches, etc.}
In total, 15 servers were damaged by contact with water, mainly through acid leaking from on-board batteries, which happens after prolonged exposure to moisture. Server recovery was, in fact, not our priority, and all contaminated servers remained untouched for about a month. Only one server was recovered; 6 servers were replaced by already decommissioned ones still in working condition, and 8 servers were purchased new.
Three Fibre Channel switches were also affected by the flood: one Brocade 48000 (384 ports) and two Brocade 5300 (96 ports each). All three switches were successfully recovered after cleaning and replacement of the power supply modules.

\subsection{Results of hardware recovery}
In the end, after the restart of the Tier 1 data center, we had completely recovered all services and most of the hardware, as summarized in Table \ref{table:2}.

\begin{table}[h!]
    
    \begin{tabular}{|c|c|c|c|c|p{4cm}|}
    \hline

Category & Device & Qty & Tot. capacity & Status & Comment \\
\hline
SAN & Brocade 48000 & 1 & 384 ports & recovered & repaired power distribution board \\
SAN & Brocade 5300 & 2 & 192 ports & recovered & replaced power supply units \\
Storage & DDN S2A 9900 & 6 & 5.7 PB & recovered & repaired 6 controllers, replaced 30 disks and 6 JBODs using an already decommissioned system, all data preserved \\
Storage & DDN SFA 10000 & 1& 2.5 PB & recovered & with reduced redundancy, all data moved to a new storage, then dismissed \\
Storage & DDN SFA 12K & 3 & 11.7 PB & recovered & replaced 4 JBODs, 240 disks of 8 TB (new) and 60 disks of 3 TB (the latter from a decommissioned system), all data preserved \\
Storage &Dell MD3860 &2&1.1PB& recovered & replaced 2 enclosures and 48 disks, all data preserved\\
Storage & Huawei OS6800 & 1 & 4.5PB & recovered & replaced 2 enclosures and 150 disks, 1.4PB of user data lost\\
Servers & & 15 && recovered/replaced & 1 recovered and 14 replaced\\
\hline
   \end{tabular}
    \caption{2017 flood: summary of damages.}
    \label{table:2}
\end{table}


\section{Storage infrastructure resiliency}

Considering the increase in single-disk capacity, we have moved from RAID6 data protection to Distributed RAID in order to speed up the rebuild of failed disks. Moreover, given the foreseen (huge) increase of the installed disk capacity, we are consolidating the disk server infrastructure with a sharp decrease in the number of servers: in the last two tenders, each server was configured with 2x100 Gbps Ethernet and 2x56 Gbps (FDR) InfiniBand connections, while the disk density has been increased from about 200 TB-N per server to about 1000 TB-N per server.

Currently, we have about 45 disk servers to manage ~37 PB of storage capacity.

The SAN is also being migrated from Fibre Channel to InfiniBand, which is cheaper and offers better performance, whereas the part dedicated to the tape drives and to the TSM servers (the Tape Area Network, or TAN) will remain based on Fibre Channel.

We are trying to keep all our infrastructure redundant: the dual-path connection from the servers to the SAN, coupled with the path-failover mechanism (which also implements load balancing), eliminates several single points of failure (server connections, SAN switches, controllers of the disk storage boxes) and allows a robust and high-performance implementation of clustered file systems like GPFS.

The StoRM instances have been virtualized as well, allowing the implementation of High Availability (HA).

\section{GEMSS}
GEMSS \cite{ref:GEMSS} is the Mass Storage System used at the Tier 1: a full HSM integration of the General Parallel File System (GPFS) and the Tivoli Storage Manager (TSM), both from IBM, with StoRM (developed at INFN); its primary advantages are high reliability and the low effort needed for its operation.

The interaction between GPFS and TSM is the main component of the GEMSS system: a thin software layer has been developed in order to optimize the migration (disk-to-tape data flow) and, in particular, the recall (tape-to-disk data flow) operations.

While the native GPFS and TSM implementation of HSM performs recalls file by file, GEMSS collects all the requests over a configurable time window and then reorders them so as to minimize the number of mount/dismount operations in the tape library and unnecessary tape ``seek'' operations on a single tape.
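
A minimal sketch of this reordering idea is shown below. It does not reproduce the actual GEMSS implementation: the request format (file, tape, position on tape) is an assumption made for illustration, but the logic, grouping the queued recalls by tape and sorting them by position, is what reduces mounts and seeks.

\begin{verbatim}
from collections import defaultdict

# Illustrative sketch of the recall reordering described above.
# A recall request is modelled as (file, tape, position_on_tape);
# the real GEMSS metadata and interfaces are not reproduced here.
def reorder_recalls(requests):
    by_tape = defaultdict(list)
    for path, tape, position in requests:
        by_tape[tape].append((position, path))

    plan = []
    for tape, files in by_tape.items():          # one mount per tape
        ordered = [p for _, p in sorted(files)]  # minimize seeks
        plan.append((tape, ordered))
    return plan

# Example: requests collected during one time window.
queued = [("f1", "T001", 830), ("f2", "T002", 12),
          ("f3", "T001", 101), ("f4", "T001", 412)]
for tape, ordered in reorder_recalls(queued):
    print(tape, ordered)
\end{verbatim}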

Migrations from disk to tape are driven by configurable GPFS policies.
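
As a purely illustrative sketch of such a policy (written in Python rather than in the GPFS policy language, with hypothetical thresholds), migration candidates can be selected, for example by last access time, whenever the pool occupancy exceeds a high watermark, until a low watermark is reached.

\begin{verbatim}
# Illustrative threshold-driven migration selection; the real setup
# uses GPFS/Spectrum Scale policy rules, not this Python code.
def select_for_migration(files, used_tb, capacity_tb,
                         high=0.90, low=0.80):
    """files: list of (path, size_tb, last_access_ts)."""
    if used_tb / capacity_tb < high:
        return []                        # nothing to do yet
    target = low * capacity_tb
    candidates = sorted(files, key=lambda f: f[2])  # oldest first
    selected = []
    for path, size_tb, _ in candidates:
        if used_tb <= target:
            break
        selected.append(path)
        used_tb -= size_tb   # space freed once migrated and purged
    return selected
\end{verbatim}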

The TSM core component is the TSM server (with a ``warm'' standby machine ready), which relies on a database, replicated and backed up every 6 hours over the SAN, holding all the metadata information.

StoRM \cite{ref:storm} implements the SRM interface and is designed to support guaranteed space reservation and direct access to the storage using native POSIX I/O calls.

\section{Tape library}
At present, a single SL8500 tape library is installed. The library has undergone various upgrades and is now fully populated with tape cartridges of 8.4 TB capacity each. In the period 2014-2016 a complete repack was performed, moving all the data to the current tape technology. After the 2017 flood, one tape drive and several tapes were damaged; the library is now equipped with 16 T10kD drives, all interconnected to the TAN via 16 Gbps FC.

Since the present library is expected to be completely filled during 2019, a tender for a new one is ongoing. In the meantime, the TAN infrastructure has been upgraded to 16 Gbps FC.

The 16 T10kD tape drives are shared among the file systems handling the scientific data. Currently, there is no way to dynamically allocate more or fewer drives to the recall or migration activities on the different file systems:
the HSM system administrators can only manually set the maximum number of migration or recall threads for each file system by modifying the GEMSS configuration file. Due to this static setup, we frequently observe that some drives are idle while, at the same time, a number of pending recall threads could become running by using those free drives. In order to overcome this inefficiency, we designed a software solution, namely a GEMSS extension, to automatically assign free tape drives to pending recalls and to administrative tasks on tape storage pools, such as space reclamation or repack.
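
A simplified sketch of the allocation idea is shown below; the data structures and names are hypothetical and only illustrate the principle of the GEMSS extension, namely assigning each idle drive to the file system with the largest backlog of pending recall threads.

\begin{verbatim}
# Hypothetical sketch of dynamic drive assignment; it illustrates
# the idea of the GEMSS extension, not its real interfaces.
def assign_free_drives(free_drives, pending_recalls):
    """pending_recalls: {filesystem: pending recall threads}."""
    assignments = {}
    for drive in free_drives:
        # pick the file system with the most pending work
        fs = max(pending_recalls, key=pending_recalls.get)
        if pending_recalls[fs] == 0:
            break                    # nothing left to schedule
        assignments[drive] = fs
        pending_recalls[fs] -= 1
    return assignments

print(assign_free_drives(["drive3", "drive7"],
                         {"atlas": 40, "cms": 0, "virgo": 5}))
\end{verbatim}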

\section{Backup and recovery service}
The Data Management group is also responsible for the backup and recovery service protecting the data of various CNAF IT services (mail servers, repositories, service configurations, logs, documents, etc.).

This service was re-designed during 2016 after a couple of data loss episodes that required restoring backed-up data from the system. In those cases the data were recovered successfully, but the experience convinced the system administrators to make the service more efficient and secure. Data are stored as multiple copies on both disk and tape, with different retention times.

\section{Data preservation}

CNAF provides the Long Term Data Preservation of the CDF RUN-2 dataset ($\sim$4 PB), collected between 2001 and 2011 and stored on CNAF tapes since 2015. 140 TB of CDF data were unfortunately lost because of the flood that occurred at CNAF in November 2017; however, all these data have now been successfully re-transferred from Fermilab via the GridFTP protocol. The CDF database (based on Oracle), containing information about the CDF datasets such as their structure, file locations and metadata, has been imported from FNAL to CNAF.

The Sequential Access via Metadata (SAM) station, a data-handling tool specific to CDF data management and developed at Fermilab, has been installed on a dedicated SL6 server at CNAF. This is a fundamental step in the perspective of a complete decommissioning of the CDF services at Fermilab. The SAM station allows data transfers to be managed and information to be retrieved from the CDF database; it also provides the SAMWeb tool, which uses the HTTP protocol to access the CDF database.

Work is ongoing to verify the availability and correctness of all the CDF data stored on CNAF tapes: we are reading all the files back from tape, calculating their checksums and comparing them with those stored in the database and retrieved through the SAM station. Recent tests showed that CDF analysis jobs, using the CDF software distributed via CVMFS and requesting the delivery of CDF files stored on CNAF tapes, work properly. Once some minor issues regarding the use of X.509 certificates for authentication on the CNAF farm are completely solved, CDF users will be able to access the CNAF nodes and submit their jobs via the LSF or HTCondor batch systems.
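
The verification loop can be summarized by the following sketch, where the file list, the checksum algorithm and the lookup in the SAM station are placeholder assumptions.

\begin{verbatim}
import hashlib

# Illustrative verification loop; the file list, the checksum
# algorithm and the SAM station query are placeholders for the
# real CDF bookkeeping.
def verify_file(path, expected):
    h = hashlib.md5()          # checksum algorithm assumed here
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest() == expected

# Usage (expected values would come from the CDF database via SAM):
#   bad = [p for p, c in catalog.items() if not verify_file(p, c)]
\end{verbatim}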



\section{Third Party Copy activities in DOMA}

At the end of the summer, we joined the TPC (Third Party Copy) subgroup of the WLCG DOMA\footnote{Data Organization, Management, and Access, see \url{https://twiki.cern.ch/twiki/bin/view/LCG/DomaActivities}.} project, dedicated to improving bulk transfers between WLCG sites using non-GridFTP protocols. In particular, the INFN Tier 1 is involved in these activities for what concerns StoRM WebDAV.

In October, the two StoRM WebDAV servers used in production by the ATLAS experiment were upgraded to a version implementing basic support for Third Party Copy, and both endpoints entered the distributed TPC testbed of volunteer sites.
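
For reference, in the WLCG HTTP-TPC approach a third-party copy is triggered by an HTTP COPY request carrying a Source (pull mode) or Destination (push mode) header; the sketch below is a hedged example of a pull-mode copy with hypothetical endpoints, token and certificate paths.

\begin{verbatim}
import requests

# Hypothetical endpoints and credentials; header names follow the
# WLCG HTTP-TPC convention (COPY with a Source header = pull mode).
def pull_copy(destination, source, token):
    resp = requests.request(
        "COPY",
        destination,
        headers={
            "Source": source,
            # credential forwarded so the destination can read
            # the source endpoint
            "TransferHeaderAuthorization": "Bearer " + token,
        },
        cert=("usercert.pem", "userkey.pem"),  # X.509 client auth
    )
    return resp.status_code

# pull_copy("https://webdav-dst.example.infn.it/atlas/file.root",
#           "https://webdav-src.example.org/atlas/file.root",
#           "<token>")
\end{verbatim}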




\section{References}
\begin{thebibliography}{9}
	\bibitem{ref:GEMSS} Ricci P P et al. 2012 The Grid Enabled Mass Storage System (GEMSS): the Storage and Data management system used at the INFN Tier 1 at CNAF {\it J. Phys.: Conf. Ser.} {\bf 396} 042051
	\bibitem{ref:storm} Carbone A, dell'Agnello L, Forti A, Ghiselli A, Lanciotti E, Magnoni L et al. 2007 Performance studies of the StoRM storage resource manager {\it Proc. IEEE Int. Conf. on e-Science and Grid Computing} pp 423--430
	\bibitem{ref:puppet} ``CNAF Provisioning system'', CNAF Annual Report 2015
\end{thebibliography}




\end{document}