\documentclass[a4paper]{jpconf}
\usepackage{url}
\usepackage{graphicx}
\usepackage{float}
\newcommand{\quotes}[1]{``#1''}
\begin{document}
\title{StoRM 2: initial design and development activities}
\author{ A.~Ceccanti$^1$, F.~Giacomini$^1$, E.~Vianello$^1$, E.~Ronchieri$^1$ }
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{ andrea.ceccanti@cnaf.infn.it }
\begin{abstract}
StoRM is the storage element solution that powers the CNAF Tier 1 data center as well as more than 30 other sites. Experience in developing, maintaining and operating it at scale suggests that a significant refactoring of the codebase is necessary to improve StoRM maintainability, reliability, scalability and ease of operation, in order to meet the data management requirements coming from HL-LHC and the other communities served by the CNAF Tier 1 data center. In this contribution we highlight the initial StoRM 2 design and development activities.
\end{abstract}
\section{Introduction}
\label{sec:introduction}
StoRM was first developed by a joint collaboration between INFN-CNAF, CERN and ICTP to provide a lightweight storage element solution implementing the SRM~\cite{ref:srm} interface on top of a POSIX filesystem.
StoRM has a layered architecture (Figure~\ref{fig:storm-arch}), split between two main components: the StoRM frontend and backend services. The StoRM frontend service implements the SRM interface exposed to client applications and frameworks. The StoRM backend service implements the actual storage management logic by interacting directly with the underlying file system. Communication between the frontend and the backend services happens in two ways:
\begin{itemize}
\item via an XML-RPC API, for synchronous requests;
\item via a database, for asynchronous requests.
\end{itemize}
Data transfers are provided by GridFTP, HTTP and XRootD services that directly access the file system underlying the StoRM deployment. StoRM is interfaced with the IBM Tivoli Storage Manager (TSM) via GEMSS~\cite{ref:gemss}, a component also developed at INFN, to provide optimized data archiving and tape recall functionality. The StoRM WebDAV service provides an alternative data management interface complementary to the SRM functionality, albeit without supporting tape operations yet.
Over the past years StoRM has powered the CNAF Tier 1 data center as well as dozens of other sites and has proved to be a reliable SRM implementation. However, ten years of experience in developing and operating the service at scale has also exposed several limitations:
\begin{itemize}
\item The StoRM code base is not unit-tested; this means that there is no quick feedback loop to verify that functionality is not broken when a change or a refactoring is introduced. Integration and load test suites exist and can be used to assess that functionality is not broken, but they are more complex to instantiate, require a full service deployment and do not provide coverage information.
\item Data management responsibilities are scattered among several components without a clear rationale, increasing maintenance and development costs.
\item The StoRM backend cannot be horizontally replicated; this causes operational problems in production and limits scalability and the ability to adapt dynamically to load changes.
\item Logging is not harmonized among the StoRM services and only limited tracing is provided, so that it is not trivial to trace the history of an incoming request across the services.
\item Core StoRM communication and authentication functionality relies on dated technologies and libraries (e.g., XML-RPC, CGSI-gSOAP).
\item The codebase is significantly more complex than needed, due to years of unstructured growth and the lack of periodic quality assessment of the code base.
\end{itemize}
To address these shortcomings, a redesign of the StoRM service has been planned and started this year, in parallel with the main StoRM maintenance and development activities.
\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{storm-arch.png}
\caption{\label{fig:storm-arch}The StoRM 1 architecture.}
\end{figure}
\section{StoRM 2 high-level architecture}
The StoRM 2 architecture is depicted in Figure~\ref{fig:storm2-arch}.
\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{high-level-arch.png}
\caption{\label{fig:storm2-arch}The StoRM 2 high-level architecture.}
\end{figure}
The layered architecture approach is maintained, so that service logic is again split between frontend and backend service components.
The frontend responsibility is to implement the interfaces towards the outside world. In practice, the frontend is implemented by multiple microservices, each responsible for a specific interface (SRM, WebDAV, etc.). TLS termination and client authentication are implemented at the edge of the service perimeter by one (or more) Nginx reverse proxy instances. This approach has several advantages:
\begin{itemize}
\item The TLS handling load is decoupled from the request management load.
\item VOMS-related configuration and handling is centralized in a single component, leading to simplified service operation and troubleshooting.
\item The TLS terminator becomes a natural place to implement load balancing for the frontend services.
\end{itemize}
VOMS authorization support is provided by an Nginx VOMS module~\cite{ref:nginx-voms} developed for this purpose and described in more detail in another contribution in this report.
Besides implementing the management protocol endpoints, the frontends expose other management and monitoring interfaces that can be consumed by internal services, and may use a relational or in-memory database to persist state information in support of request management and accounting. Frontends do not directly interact with the storage, but delegate the interaction to a backend service.
The backend is a stateless service that implements basic management operations on the storage. The operations implemented are the minimum set needed to support the data management interfaces exposed by the frontends. These operations are typically either data object lifecycle operations (e.g., create or remove a file or a directory, list directory contents) or metadata operations (e.g., get the size of a file, manage ACLs).
The communication between the frontend and the backend services is implemented on top of gRPC~\cite{ref:grpc}, a remote procedure call system initially developed at Google. The actual messages exchanged between them are synthesized from a description expressed in an interface description language called \textit{Protocol Buffers}~\cite{ref:protocol-buffers}; from the same message description, language-specific client and server stubs are generated. As an example, the following listing shows the description of the messages and of the service involved in the simple case of the \textit{version} command.
{\small
\begin{verbatim}
message VersionRequest {
  // The version of the client calling the service.
  string version = 1;
}

message VersionResponse {
  // The version of the service answering the call.
  string version = 1;
}

service VersionService {
  rpc getVersion(VersionRequest) returns (VersionResponse);
}
\end{verbatim}
}
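From this description the gRPC tooling generates client and server stubs for the languages of interest. As a purely illustrative sketch (this is not the actual StoRM 2 code; the generated header name, the listening endpoint and the version string are placeholders), a backend-side C++ implementation of this service could look as follows:
{\small
\begin{verbatim}
// Illustrative only: a C++ server implementing the VersionService
// defined above through the stubs generated by the gRPC tooling.
#include <memory>
#include <grpcpp/grpcpp.h>
#include "version.grpc.pb.h"  // assumed name of the generated header

class VersionServiceImpl final : public VersionService::Service {
public:
  // Invoked by the gRPC runtime for each incoming getVersion call.
  grpc::Status getVersion(grpc::ServerContext* /*context*/,
                          VersionRequest const* /*request*/,
                          VersionResponse* response) override {
    response->set_version("2.0.0");  // placeholder version string
    return grpc::Status::OK;
  }
};

int main() {
  VersionServiceImpl service;
  grpc::ServerBuilder builder;
  builder.AddListeningPort("0.0.0.0:9999",  // placeholder endpoint
                           grpc::InsecureServerCredentials());
  builder.RegisterService(&service);
  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
  server->Wait();
}
\end{verbatim}
}
The frontend services invoke the same service through the corresponding client stubs, generated for Java from the same description.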
\section{Principles guiding the development work}
The following principles have driven the StoRM 2 development work.
\begin{itemize}
\item The source code will be kept in a Git repository hosted on the INFN Gitlab service; the development will follow a branching model inspired by git-flow~\cite{ref:gitflow} and already successfully used for other components developed by the team (e.g., VOMS, INDIGO IAM, StoRM).
\item The code for all main components (frontend and backend services, CLIs, etc.) will be hosted in a single repository and a single version number will be shared by all the components.
\item A test-driven development approach will be followed, using tools that allow measuring the test coverage of the codebase. The objective is to ensure high coverage ($>90\%$) of all code.
\item Whenever possible, the code should be self-documenting; the source code folder structure will be documented with README.md files describing the contents of each folder; a CHANGELOG file will provide information about new features and bug fixes, following established industry best practices~\cite{ref:keep-a-changelog}.
\item The development and testing environment will be containerized, in order to ensure a consistent environment definition and avoid \quotes{works on my machine} issues.
\item Services should provide monitoring and metrics endpoints to enable the collection of status information and performance metrics.
\item Services should support graceful shutdown and draining.
\item A CI pipeline will be in place, to continuously build and test the code.
\item A consistent configuration and logging format will be adopted across all the components, to make service operations easier and simplify log file interpretation, aggregation and management.
\item Support for request traceability will be part of the system from its inception.
\end{itemize}
The development of StoRM 2 will be organized in SCRUM-like sprints, each roughly 4-5 weeks long. The output of each sprint should be a deployable instance of the services implementing a subset of the whole foreseen StoRM 2 functionality.
\section{The build and test environment}
The build environment heavily relies on container technology~\cite{ref:docker}, both to guarantee full build and test reproducibility and to offer a common reference platform for development. Since the code for all components is kept in a single Git repository, we have also opted for a single Docker image to build everything, containing all the needed build tools (compilers, unit testing frameworks, static and dynamic analyzers, external dependencies, etc.). The resulting image is large but still manageable, and having a single image simplifies operations.
There are also a couple of other Docker images: one is a specialization of the build image mentioned above and is dedicated to building the Nginx VOMS module; the other is an image with client tools used during integration testing. All the image Dockerfiles are kept in a single repository, under continuous integration, so that the images are rebuilt every time there is a change.
\section{The StoRM 2 frontend component}
The StoRM 2 frontend is composed of a set of stateless Spring Boot 2 applications written in Java that implement the management protocol endpoints, such as SRM~\cite{ref:srm} and WebDAV~\cite{ref:webdav}. The frontend services maintain state in an external database. The main frontend responsibilities are to:
\begin{itemize}
\item implement consistent authorization, taking as input the authentication information exposed by the Nginx TLS terminator and matching this information with a common authorization policy;
\item implement request validation and management, i.e., protocol-specific management of request queuing as well as conflict handling;
\item translate protocol-specific requests to a set of basic storage management operations executed by the backend and exposed via a set of gRPC services;
\item provide service management and observability endpoints, to allow administrators to get information about the requests currently being serviced by the system, drain the service or manually force request status transitions.
\end{itemize}
The first frontend service developed in StoRM 2 focuses on the SRM interface and, at the time of this writing, implements support for the SRM \textit{ping} and \textit{ls} methods. In the initial development sprints, significant work has been devoted to ensuring the testability of the frontend component in isolation, by leveraging the powerful testing support provided by the Spring~\cite{ref:spring} and gRPC frameworks.
\section{The StoRM 2 backend component}
The StoRM 2 backend is a gRPC server that provides multiple services. One service responds to \textit{version} requests. Another service responds to storage-related requests, which represent the main scope of StoRM. In general there is no direct, one-to-one mapping between SRM requests arriving at the frontend and requests addressed to the backend; rather, the backend requests represent building blocks that the frontend can compose in order to prepare the responses to SRM clients.
Among the storage requests addressed to the backend, at the moment only a couple are implemented: \textit{ls}, in its multiple variations (for a file or a directory, recursive, up to a given depth, etc.), returns information about files and directories; \textit{pin}, \textit{unpin} and \textit{pin status} manage the \verb|user.storm.pinned| attribute of filesystem entities, which is essential for the implementation of the more complex \textit{srmPrepareToGet} SRM request. All the backend requests are currently blocking: a response is sent back to the frontend only when the request has been fully processed.
The backend also incorporates sub-components of more general utility to operate on filesystem extended attributes and POSIX Access Control Lists~\cite{p1003.1e}, adding a layer of safety and expressivity on top of the native C APIs. They make it possible to define attributes and ACLs and to apply them to, or read them from, filesystem entities.
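As a purely illustrative sketch of what this layer could look like (the names \verb|StormXAttrName|, \verb|XAttrValue| and \verb|set_xattr| come from the examples below, the \verb|user.storm| attribute namespace is inferred from the \verb|user.storm.pinned| attribute mentioned above, and the actual StoRM 2 types and error handling are richer), a typed wrapper around the native \verb|setxattr(2)| call might be structured as follows:
{\small
\begin{verbatim}
// Illustrative sketch only, not the actual StoRM 2 code.
// Assumes Linux extended attributes and Boost Filesystem paths.
#include <sys/xattr.h>
#include <stdexcept>
#include <string>
#include <boost/filesystem.hpp>

struct StormXAttrName {
  // Attribute names are placed in the "user.storm." namespace.
  explicit StormXAttrName(std::string name)
      : value{"user.storm." + std::move(name)} {}
  std::string value;
};

struct XAttrValue {
  std::string value;
};

inline void set_xattr(boost::filesystem::path const& path,
                      StormXAttrName const& name,
                      XAttrValue const& value) {
  // Create or replace the attribute; throw on failure instead of
  // exposing the errno-based C error handling to callers.
  if (::setxattr(path.c_str(), name.value.c_str(),
                 value.value.data(), value.value.size(), 0) == -1) {
    throw std::runtime_error("cannot set " + name.value
                             + " on " + path.string());
  }
}
\end{verbatim}
}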
For example, the following sets the attribute \verb|user.storm.pinned| of the file \verb|myFile.txt| to the pin duration:
{\small
\begin{verbatim}
set_xattr(
    storage_dir / "myFile.txt",
    StormXAttrName{"pinned"},
    XAttrValue{duration}
);
\end{verbatim}
}
The following instead extends the ACL currently assigned to \verb|myFile.txt| with some additional entries:
{\small
\begin{verbatim}
add_to_access_acl(
    storage_dir / "myFile.txt",
    {
      {User{"storm"}, Perms::Read | Perms::Write},
      {Group{"storm"}, Perms::Read},
      {other, Perms::None}
    }
);
\end{verbatim}
}
The backend is implemented in C++, in the latest standard version supported by the toolset installed on the reference platform (currently C++17). The build system is based on CMake. The backend relies on a few other third-party dependencies, the most important being Boost Filesystem~\cite{ref:boost.fs} for interaction with the filesystem, Boost Log~\cite{ref:boost.log} for logging and yaml-cpp~\cite{ref:yaml-cpp} for handling configuration.
\section{Test suite and continuous integration}
The test suite is based on the Robot Framework~\cite{ref:rf} and is typically run in a Docker container. A deployment test pipeline~\cite{ref:glcip} runs on our Gitlab-based continuous integration (CI) system every night (and after any commit on the master branch) to instantiate the main StoRM 2 services and execute the SRM test suite. The reports of the test suite execution are archived and published on the Gitlab CI dashboard.
Services and the test suite are orchestrated using Docker Compose~\cite{ref:dc}. This approach provides an intuitive, self-contained testing environment deployable on the CI system and on the developers' workstations. The test deployment mirrors the architecture shown in Figure~\ref{fig:storm2-arch}, with clients and services placed in different Docker networks to mimic a real-life deployment scenario.
\section{Conclusions and future work}
In this contribution we have described the initial design and development activities performed during 2018 on StoRM 2, the next incarnation of the StoRM storage management system. The main objective of the StoRM refactoring is to improve the service scalability and manageability in order to meet the data management requirements of HL-LHC. The initial work of this year focused on choosing tools, methodologies and approaches, with a strong emphasis on software quality. In the future we will build on this groundwork to provide a full replacement for the existing StoRM implementation. The lack of dedicated manpower for this activity makes it hard to estimate when StoRM 2 will be ready to be deployed in production.
\section*{References}
\bibliographystyle{iopart-num}
\bibliography{biblio}
\end{document}