\documentclass[a4paper]{jpconf}
\usepackage{url}
\usepackage{graphicx}
\usepackage{float}
\newcommand{\quotes}[1]{``#1''}
\begin{document}
\title{StoRM 2: initial design and development activities}
\author{ A.~Ceccanti$^1$, F.~Giacomini$^1$, E.~Vianello$^1$, E.~Ronchieri$^1$ }
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{ andrea.ceccanti@cnaf.infn.it }
\begin{abstract}
StoRM is the storage element solution that powers the CNAF Tier 1 data center as well as more than 30 other sites. Experience in developing, maintaining and operating it at scale suggests that a significant refactoring of the codebase is necessary to improve StoRM maintainability, reliability, scalability and ease of operation, in order to meet the data management requirements coming from HL-LHC and the other communities served by the CNAF Tier 1 data center. In this contribution we highlight the initial StoRM 2 design and development activities.
\end{abstract}
\section{Introduction}
\label{sec:introduction}
StoRM was first developed by a joint collaboration between INFN-CNAF, CERN and ICTP to provide a lightweight storage element solution implementing the SRM~\cite{ref:srm} interface on top of a POSIX filesystem.
StoRM has a layered architecture (Figure~\ref{fig:storm-arch}), split between two main components: the StoRM frontend and backend services. The StoRM frontend service implements the SRM interface exposed to client applications and frameworks. The StoRM backend service implements the actual storage management logic by interacting directly with the underlying file system. Communication between the frontend and the backend services happens in two ways:
\begin{itemize}
\item via an XML-RPC API, for synchronous requests;
\item via a database, for asynchronous requests.
\end{itemize}
Data transfers are provided by GridFTP, HTTP and XRootD services that directly access the file system underlying the StoRM deployment. StoRM is interfaced with the IBM Tivoli Storage Manager (TSM) via GEMSS~\cite{ref:gemss}, a component also developed at INFN, to provide optimized data archiving and tape recall functionality. The StoRM WebDAV service provides an alternative data management interface complementary to the SRM functionality, albeit without supporting tape operations yet.
Over the past years StoRM has powered the CNAF Tier 1 data center as well as dozens of other sites and has proved to be a reliable SRM implementation. However, ten years of experience in developing and operating the service at scale has also exposed several limitations:
\begin{itemize}
\item The StoRM code base is not unit-tested; this means that there is no quick feedback loop to verify that functionality is not broken when a change or a refactoring is introduced. Integration and load test suites exist and can be used to assess that functionality is not broken, but they are more complex to instantiate, require a full service deployment and do not provide coverage information.
\item Data management responsibilities are scattered among several components without a clear rationale, increasing maintenance and development costs.
\item The StoRM backend cannot be horizontally replicated; this causes operational problems in production and limits scalability and the ability to adapt dynamically to load changes.
\item Logging is not harmonized among the StoRM services and only limited tracing is provided, so that it is not trivial to trace the history of an incoming request across the services.
\item Core StoRM communication and authentication functionality relies on dated technologies and libraries (e.g., XML-RPC, CGSI-gSOAP).
\item The codebase is significantly more complex than needed, due to years of unstructured growth and the lack of periodic quality assessment of the code base.
\end{itemize}
To address these shortcomings, a redesign of the StoRM service has been planned and started this year, in parallel with the main StoRM maintenance and development activities.
\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{storm-arch.png}
\caption{\label{fig:storm-arch}The StoRM 1 architecture.}
\end{figure}
\section{StoRM 2 high-level architecture}
The StoRM 2 architecture is depicted in Figure~\ref{fig:storm2-arch}.
\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{high-level-arch.png}
\caption{\label{fig:storm2-arch}The StoRM 2 high-level architecture.}
\end{figure}
The layered architecture approach is maintained, so that service logic is again split between frontend and backend service components.
The frontend responsibility is to implement the interfaces towards the outside world. In practice, the frontend is implemented by multiple microservices, each responsible for a specific interface (SRM, WebDAV, etc.). TLS termination and client authentication are implemented at the edge of the service perimeter by one (or more) Nginx reverse proxy instances. This approach has several advantages:
\begin{itemize}
\item The TLS handling load is decoupled from the request management load.
\item VOMS-related configuration and handling is centralized in a single component, leading to simplified service operation and troubleshooting.
\item The TLS terminator becomes a natural place to implement load balancing for the frontend services.
\end{itemize}
VOMS authorization support is provided by an Nginx VOMS module~\cite{ref:nginx-voms} developed for this purpose and described in more detail in another contribution in this report.
Besides implementing the management protocol endpoints, the frontends expose other management and monitoring interfaces that can be consumed by internal services, and may use a relational or in-memory database to persist state information in support of request management and accounting. Frontends do not directly interact with the storage, but delegate the interaction to a backend service.
The backend is a stateless service that implements basic management operations on the storage. The operations implemented are the minimum set needed to support the data management interfaces exposed by the frontends. These operations are typically either data object lifecycle operations (e.g., create or remove a file or a directory, list directory contents) or metadata operations (e.g., get the size of a file, manage ACLs).
The communication between the frontend and the backend services is implemented on top of gRPC~\cite{ref:grpc}, a remote procedure call system initially developed at Google. The actual messages exchanged between them are synthesized from a description expressed in an interface description language called \textit{Protocol Buffers}~\cite{ref:protocol-buffers}; from the same message description, language-specific client and server stubs are generated. As an example, the following listing shows the description of the messages and of the service involved in the simple case of the \textit{version} command.
{\small
\begin{verbatim}
message VersionRequest {
  // The version of the client calling the service.
  string version = 1;
}

message VersionResponse {
  // The version of the service answering the call.
  string version = 1;
}

service VersionService {
  rpc getVersion(VersionRequest) returns (VersionResponse);
}
\end{verbatim}
}
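From this description the gRPC tooling generates client and server stubs for the languages of interest. As a purely illustrative sketch (this is not the actual StoRM 2 code; the generated header name, the listening endpoint and the version string are placeholders), a backend-side C++ implementation of this service could look as follows:
{\small
\begin{verbatim}
// Illustrative only: a C++ server implementing the VersionService
// defined above through the stubs generated by the gRPC tooling.
#include <memory>
#include <grpcpp/grpcpp.h>
#include "version.grpc.pb.h"  // assumed name of the generated header

class VersionServiceImpl final : public VersionService::Service {
public:
  // Invoked by the gRPC runtime for each incoming getVersion call.
  grpc::Status getVersion(grpc::ServerContext* /*context*/,
                          VersionRequest const* /*request*/,
                          VersionResponse* response) override {
    response->set_version("2.0.0");  // placeholder version string
    return grpc::Status::OK;
  }
};

int main() {
  VersionServiceImpl service;
  grpc::ServerBuilder builder;
  builder.AddListeningPort("0.0.0.0:9999",  // placeholder endpoint
                           grpc::InsecureServerCredentials());
  builder.RegisterService(&service);
  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
  server->Wait();
}
\end{verbatim}
}
The frontend services invoke the same service through the corresponding client stubs, generated for Java from the same description.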
\section{Principles guiding the development work}
The following principles have driven the StoRM 2 development work.
\begin{itemize}
\item The source code will be kept in a Git repository hosted on the INFN Gitlab service; the development will follow a branching model inspired by git-flow~\cite{ref:gitflow} and already successfully used for other components developed by the team (e.g., VOMS, INDIGO IAM, StoRM).
\item The code for all main components (frontend and backend services, CLIs, etc.) will be hosted in a single repository and a single version number will be shared by all the components.
\item A test-driven development approach will be followed, using tools that allow measuring the test coverage of the codebase. The objective is to ensure high coverage ($>90\%$) of all code.
\item Whenever possible, the code should be self-documenting; the source code folder structure will be documented with README.md files describing the contents of each folder; a CHANGELOG file will provide information about new features and bug fixes, following established industry best practices~\cite{ref:keep-a-changelog}.
\item The development and testing environment will be containerized, in order to ensure a consistent environment definition and avoid \quotes{works on my machine} issues.
\item Services should provide monitoring and metrics endpoints to enable the collection of status information and performance metrics.
\item Services should support graceful shutdown and draining.
\item A CI pipeline will be in place, to continuously build and test the code.
\item A consistent configuration and logging format will be adopted across all the components, to make service operations easier and simplify log file interpretation, aggregation and management.
\item Support for request traceability will be part of the system from its inception.
\end{itemize}
The development of StoRM 2 will be organized in SCRUM-like sprints, each roughly 4-5 weeks long. The output of each sprint should be a deployable instance of the services implementing a subset of the whole foreseen StoRM 2 functionality.
\section{The build and test environment}
The build environment heavily relies on container technology~\cite{ref:docker}, both to guarantee full build and test reproducibility and to offer a common reference platform for development. Since the code for all components is kept in a single Git repository, we have also opted for a single Docker image to build everything, containing all the needed build tools (compilers, unit testing frameworks, static and dynamic analyzers, external dependencies, etc.). The resulting image is large but still manageable, and having a single image simplifies operations.
There are also a couple of other Docker images: one is a specialization of the build image mentioned above and is dedicated to building the Nginx VOMS module; the other is an image with client tools used during integration testing. All the image Dockerfiles are kept in a single repository, under continuous integration, so that the images are rebuilt every time there is a change.
\section{The StoRM 2 frontend component}
The StoRM 2 frontend is composed of a set of stateless Spring Boot 2 applications written in Java that implement the management protocol endpoints, such as SRM~\cite{ref:srm} and WebDAV~\cite{ref:webdav}. The frontend services maintain state in an external database. The main frontend responsibilities are to:
\begin{itemize}
\item implement consistent authorization, taking as input the authentication information exposed by the Nginx TLS terminator and matching this information with a common authorization policy;
\item implement request validation and management, i.e., protocol-specific management of request queuing as well as conflict handling;
\item translate protocol-specific requests to a set of basic storage management operations executed by the backend and exposed via a set of gRPC services;
\item provide service management and observability endpoints, to allow administrators to get information about the requests currently being serviced by the system, drain the service or manually force request status transitions.
\end{itemize}
The first frontend service developed in StoRM 2 focuses on the SRM interface and, at the time of this writing, implements support for the SRM \textit{ping} and \textit{ls} methods. In the initial development sprints, significant work has been devoted to ensuring the testability of the frontend component in isolation, by leveraging the powerful testing support provided by the Spring~\cite{ref:spring} and gRPC frameworks.
\section{The StoRM 2 backend component}
The StoRM 2 backend is a gRPC server that provides multiple services. One service responds to \textit{version} requests. Another service responds to storage-related requests, which represent the main scope of StoRM. In general there is no direct, one-to-one mapping between SRM requests arriving at the frontend and requests addressed to the backend; rather, the backend requests represent building blocks that the frontend can compose in order to prepare the responses to SRM clients.
Among the storage requests addressed to the backend, at the moment only a couple are implemented: \textit{ls}, in its multiple variations (for a file or a directory, recursive, up to a given depth, etc.), returns information about files and directories; \textit{pin}, \textit{unpin} and \textit{pin status} manage the \verb|user.storm.pinned| attribute of filesystem entities, which is essential for the implementation of the more complex \textit{srmPrepareToGet} SRM request. All the backend requests are currently blocking: a response is sent back to the frontend only when the request has been fully processed.
The backend also incorporates sub-components of more general utility to operate on filesystem extended attributes and POSIX Access Control Lists~\cite{p1003.1e}, adding a layer of safety and expressivity on top of the native C APIs. They make it possible to define attributes and ACLs and to apply them to, or read them from, filesystem entities.
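As a purely illustrative sketch of what this layer could look like (the names \verb|StormXAttrName|, \verb|XAttrValue| and \verb|set_xattr| come from the examples below, the \verb|user.storm| attribute namespace is inferred from the \verb|user.storm.pinned| attribute mentioned above, and the actual StoRM 2 types and error handling are richer), a typed wrapper around the native \verb|setxattr(2)| call might be structured as follows:
{\small
\begin{verbatim}
// Illustrative sketch only, not the actual StoRM 2 code.
// Assumes Linux extended attributes and Boost Filesystem paths.
#include <sys/xattr.h>
#include <stdexcept>
#include <string>
#include <boost/filesystem.hpp>

struct StormXAttrName {
  // Attribute names are placed in the "user.storm." namespace.
  explicit StormXAttrName(std::string name)
      : value{"user.storm." + std::move(name)} {}
  std::string value;
};

struct XAttrValue {
  std::string value;
};

inline void set_xattr(boost::filesystem::path const& path,
                      StormXAttrName const& name,
                      XAttrValue const& value) {
  // Create or replace the attribute; throw on failure instead of
  // exposing the errno-based C error handling to callers.
  if (::setxattr(path.c_str(), name.value.c_str(),
                 value.value.data(), value.value.size(), 0) == -1) {
    throw std::runtime_error("cannot set " + name.value
                             + " on " + path.string());
  }
}
\end{verbatim}
}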
For example, the following sets the attribute \verb|user.storm.pinned| of the file \verb|myFile.txt| to the pin duration:
{\small
\begin{verbatim}
set_xattr(
    storage_dir / "myFile.txt",
    StormXAttrName{"pinned"},
    XAttrValue{duration}
);
\end{verbatim}
}
The following instead extends the ACL currently assigned to \verb|myFile.txt| with some additional entries:
{\small
\begin{verbatim}
add_to_access_acl(
    storage_dir / "myFile.txt",
    {
      {User{"storm"}, Perms::Read | Perms::Write},
      {Group{"storm"}, Perms::Read},
      {other, Perms::None}
    }
);
\end{verbatim}
}
The backend is implemented in C++, in the latest standard version supported by the toolset installed on the reference platform (currently C++17). The build system is based on CMake. The backend relies on a few other third-party dependencies, the most important being Boost Filesystem~\cite{ref:boost.fs} for interaction with the filesystem, Boost Log~\cite{ref:boost.log} for logging and yaml-cpp~\cite{ref:yaml-cpp} for handling configuration.
\section{Test suite and continuous integration}
The test suite is based on the Robot Framework~\cite{ref:rf} and is typically run in a Docker container. A deployment test pipeline~\cite{ref:glcip} runs on our Gitlab-based continuous integration (CI) system every night (and after any commit on the master branch) to instantiate the main StoRM 2 services and execute the SRM test suite. The reports of the test suite execution are archived and published on the Gitlab CI dashboard.
Services and the test suite are orchestrated using Docker Compose~\cite{ref:dc}. This approach provides an intuitive, self-contained testing environment deployable on the CI system and on the developers' workstations. The test deployment mirrors the architecture shown in Figure~\ref{fig:storm2-arch}, with clients and services placed in different Docker networks to mimic a real-life deployment scenario.
\section{Conclusions and future work}
In this contribution we have described the initial design and development activities performed during 2018 on StoRM 2, the next incarnation of the StoRM storage management system. The main objective of the StoRM refactoring is to improve the service scalability and manageability in order to meet the data management requirements of HL-LHC. The initial work of this year focused on choosing tools, methodologies and approaches, with a strong emphasis on software quality. In the future we will build on this groundwork to provide a full replacement for the existing StoRM implementation. The lack of dedicated manpower for this activity makes it hard to estimate when StoRM 2 will be ready to be deployed in production.
\section*{References}
\bibliographystyle{iopart-num}
\bibliography{biblio}
\end{document}