\documentclass[a4paper]{jpconf}
\usepackage{url}
\usepackage{graphicx}
\usepackage{float}
\newcommand{\quotes}[1]{``#1''}
\begin{document}
\title{StoRM 2: initial design and development activities}
\author{
A~Ceccanti,
F~Giacomini,
E~Vianello and
E~Ronchieri
}
\address{INFN-CNAF, Bologna, IT}
\ead{
andrea.ceccanti@cnaf.infn.it
}
\begin{abstract}
StoRM is the storage element solution that powers the CNAF Tier-1
data center as well as more than 30 other sites. Experience in
developing, maintaining and operating it at scale suggests that a
significant refactoring of the codebase is necessary to improve
StoRM maintainability, reliability, scalability and ease of
operation in order to meet the data management requirements coming
from HL-LHC and other communities served by the CNAF Tier-1 data
center. In this contribution we highlight the initial StoRM 2
design and development activities.
\end{abstract}
\section{Introduction}
\label{sec:introduction}
StoRM was first developed jointly by INFN-CNAF, CERN and Trieste's ICTP to provide a
lightweight storage element solution implementing the SRM~\cite{ref:srm} interface
on top of a POSIX filesystem.
StoRM has a layered architecture (Figure~\ref{fig:storm-arch}), split between
two main components: the StoRM frontend and backend services.
The StoRM frontend service implements the SRM interface exposed to client
applications and frameworks.
The StoRM backend service implements the actual storage management logic by
interacting directly with the underlying file system. Communication between the
frontend and the backend happens in two ways:
\begin{itemize}
\item via an XML-RPC API, for synchronous requests;
\item via a database, for asynchronous requests.
\end{itemize}
Data transfer is provided by GridFTP, HTTP and XRootD services that directly
access the file system underlying the StoRM deployment.
StoRM is interfaced with the IBM Tivoli Storage Manager (TSM) via
GEMSS~\cite{ref:gemss}, a component also developed at INFN, to provide optimized
data archiving and tape recall functionality.
The StoRM WebDAV service provides an alternative data management interface
complementary to the SRM functionality, but which does not yet support tape
operations.
In the past years, StoRM has powered the CNAF Tier-1 data center as well as
dozens of other sites and proved to be a reliable SRM implementation.
However, 10 years of experience in developing and operating the service at
scale has also revealed several limitations:
\begin{itemize}
\item the StoRM code base is not unit-tested; as a consequence, there is no
quick feedback loop confirming that functionality is not broken whenever a
change or a refactoring is introduced; integration and load test suites
exist, but they are more complex to instantiate, require a full service
deployment and do not provide coverage information;
\item data management responsibilities are scattered among several
components without clear reasons, increasing maintenance and development
costs;
\item the StoRM backend cannot be horizontally replicated; this causes
operational problems in production and limits scalability and the ability
to adapt dynamically to load changes;
\item logging among the StoRM services is not harmonized and only limited
tracing is provided, so that it is not trivial to follow the history of an
incoming request across the services;
\item core StoRM communication and authentication functionality relies on dated technologies and libraries
(e.g., XML-RPC, CGSI-gSOAP);
\item the codebase is significantly more complex than needed, due to
inorganic growth and the lack of periodic quality assessments.
\end{itemize}
To address these shortcomings, a redesign of the StoRM service has been planned
and started this year in parallel with the main StoRM maintenance and
development activities.
\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{storm-arch.png}
\caption{\label{fig:storm-arch}The StoRM 1 architecture.}
\end{figure}
The StoRM 2 architecture is depicted in Figure~\ref{fig:storm2-arch}.
\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{high-level-arch.png}
\caption{\label{fig:storm2-arch}The StoRM 2 high-level architecture.}
\end{figure}
The layered architecture approach is maintained, so that service logic is again
split between frontend and backend service components.
The frontend's responsibility is to implement the interfaces towards the outside
world. In practice, the frontend is implemented by multiple microservices,
each one responsible for a specific interface (SRM, WebDAV, etc.).
TLS termination and client authentication are implemented at the edge of the service
perimeter by one (or more) Nginx reverse proxy instances.
There are several advantages in this approach:
\begin{itemize}
\item the TLS handling load is decoupled from request management load;
\item VOMS-related configuration and handling is centralized to a single
component, leading to simplified service operation and troubleshooting;
\item the TLS terminator becomes a natural place to implement load balancing
for the frontend services.
\end{itemize}
VOMS authorization support is provided by the Nginx VOMS
module~\cite{ref:nginx-voms}, developed for this purpose and described in more
detail in another contribution in this report.
Besides implementing the management protocol endpoints, the frontends expose
other management and monitoring interfaces that can be consumed by internal
services, and may use a relational or in-memory database to persist state
information in support of request management and accounting.
Frontends do not directly interact with the storage, but delegate the interaction to a backend service.
The backend is a stateless service that implements basic management operations on the
storage. The storage management operations implemented are the minimum set of
operations needed to support the data management interfaces exposed by the
frontends. These operations are typically either data object lifecycle
operations (e.g., create or remove a file or a directory, list directory contents) or
metadata operations (e.g., get the size of a file, manage ACLs).
The communication between the frontend and the backend services is implemented
on top of gRPC~\cite{ref:grpc}, a remote procedure call system initially
developed at Google. The actual messages exchanged between them are
synthesized from a description expressed in an interface description language
called \textit{Protocol Buffers}~\cite{ref:protocol-buffers}; from the same
message description, language-specific client and server stubs are generated. As
an example, the following listing shows the description of the messages and of
the service involved in the simple case of the \textit{version} command.
{\small
\begin{verbatim}
message VersionRequest {
// The version of the client calling the service.
string version = 1;
}
message VersionResponse {
// The version of the service answering the call
string version = 1;
}
service VersionService {
rpc getVersion(VersionRequest) returns (VersionResponse);
}
\end{verbatim}
}
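As an illustration of how the generated stubs can be used, the following sketch
performs a synchronous \textit{version} call through the generated C++ client
stub; the header name, the endpoint address and the error handling are
assumptions made for the example and are not taken from the StoRM 2 codebase.
{\small
\begin{verbatim}
// Illustrative sketch of a synchronous call through the generated
// C++ stub; header name, address and error handling are assumptions.
#include <stdexcept>
#include <grpcpp/grpcpp.h>
#include "version.grpc.pb.h"

VersionResponse get_backend_version()
{
  auto channel = grpc::CreateChannel(
      "backend:9999", grpc::InsecureChannelCredentials());
  auto stub = VersionService::NewStub(channel);

  VersionRequest request;
  request.set_version("2.0.0");  // version of the calling client

  VersionResponse response;
  grpc::ClientContext context;
  grpc::Status status = stub->getVersion(&context, request, &response);

  if (!status.ok()) {
    throw std::runtime_error(status.error_message());
  }
  return response;
}
\end{verbatim}
}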
\section{Principles guiding the development work}
The following principles have driven the StoRM 2 development work:
\begin{itemize}
\item the source code will be hosted in a Git repository on the INFN
Gitlab service, and the development will follow the git-flow~\cite{ref:gitflow} inspired
branching model used with success for other components developed by the
team (e.g., VOMS, INDIGO IAM, StoRM);
\item the code for all main components (frontend and backend services,
CLIs, etc.) will be hosted in a single repository and a single version number
will be shared by all the components;
\item a test-driven development approach will be followed, using tools that
allow measuring the test coverage of the codebase; the objective is to ensure
high coverage ($>90\%$) on all code;
\item whenever possible, the code should be self-documenting; the source code folder
structure will be documented with README.md files providing a
description of each folder's contents; a CHANGELOG file will provide
information on new features and bug fixes, following established
industry best practices~\cite{ref:keep-a-changelog};
\item the development and testing environment will be containerized, in order to ensure
a consistent environment definition and avoid \quotes{works on my machine} issues;
\item services should provide monitoring and metrics endpoints to enable the collection
of status information and performance metrics;
\item services should support graceful shutdown and draining;
\item a CI pipeline will be in place to continuously build and test the code;
\item a consistent configuration and logging format will be adopted across all the components, to
make service operations easier and simplify log file interpretation, aggregation and management;
\item support for request traceability will be part of the system since its inception.
\end{itemize}
The development of StoRM 2 will be organized in SCRUM-like sprints, each roughly 4--5 weeks long.
The output of each sprint should be a deployable instance of the services implementing a subset of the whole
foreseen StoRM 2 functionality.
\section{The build and test environment}
The build environment heavily relies on container technology~\cite{ref:docker}, both to guarantee full build and test reproducibility and to offer a common reference platform for development.
Since the code for all components is kept in a single git repository, we have also opted for a single Docker image to build everything, containing all the needed build tools (compilers, unit testing frameworks, static and dynamic analyzers, external dependencies, etc.). The resulting image is large but still manageable, and having a single image simplifies operations.
There are also a couple of other Docker images: one is a specialization of the build image mentioned above and is dedicated to the build of the Nginx VOMS module; the other is an image with client tools used during integration testing.
All the image Dockerfiles are kept in a single repository, under continuous integration, so that every time there is a change the images are rebuilt.
\section{The StoRM 2 frontend component}
The StoRM 2 frontend is composed of a set of stateless Spring Boot 2 applications, written in Java, that implement
the management protocol endpoints, such as SRM~\cite{ref:srm} and WebDAV~\cite{ref:webdav}.
The frontend services maintain state in an external database.
The main frontend responsibilities are:
\begin{itemize}
\item implement consistent authorization, taking as input the authentication information exposed
by the Nginx TLS terminator and matching this information with a common authorization policy;
\item implement request validation and management, i.e., protocol-specific management of request
queuing as well as conflict handling;
\item translate protocol-specific requests to a set of basic storage management operations executed
by the backend and exposed via a set of gRPC services;
\item provide service management and observability endpoints, to allow administrators to get information
about the requests currently being serviced by the system, drain the service or manually force request status
transitions.
\end{itemize}
The first frontend service developed in StoRM 2 focuses on the SRM interface, and at the time of this writing
implements support for the SRM \textit{ping} and \textit{ls} methods.
In the initial development sprints, significant work has been devoted to ensuring the testability of the frontend component in isolation,
by leveraging the testing support provided by the Spring~\cite{ref:spring} and gRPC frameworks.
\section{The StoRM 2 backend component}
The StoRM 2 backend is a gRPC server that provides multiple
services. One service responds to \textit{version} requests. Another
service responds to storage-related requests, which represent the main
scope of StoRM. In general there is no direct, one-to-one mapping
between SRM requests arriving at the frontend and requests addressed
to the backend; rather, the latter represent building blocks that the
frontend can compose in order to prepare the responses to SRM clients.
Among the storage requests addressed to the backend, at the moment
only a couple are implemented: \textit{ls}, in its multiple variations
(for a file or a directory, recursive, up to a given depth, etc.),
returns information about files and directories; \textit{pin},
\textit{unpin} and \textit{pin status} manage the
\verb|user.storm.pinned| attribute of filesystem entities, which is
essential for the implementation of the more complex
\textit{srmPrepareToGet} SRM request.
All the backend requests are currently blocking: a response is sent
back to the frontend only when the request has been fully processed.
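As a minimal sketch of this synchronous model, a backend-side implementation of
the \textit{version} service introduced above could look as follows; the class
name and the server setup are simplified assumptions and do not reflect the
actual StoRM 2 code.
{\small
\begin{verbatim}
// Illustrative synchronous gRPC service implementation; names and
// server setup are simplified assumptions, not the StoRM 2 code.
#include <grpcpp/grpcpp.h>
#include "version.grpc.pb.h"

class VersionServiceImpl final : public VersionService::Service
{
  // The response is filled in before returning, so the frontend
  // receives it only when the request has been fully processed.
  grpc::Status getVersion(grpc::ServerContext* /*context*/,
                          const VersionRequest* /*request*/,
                          VersionResponse* response) override
  {
    response->set_version("2.0.0");
    return grpc::Status::OK;
  }
};

int main()
{
  VersionServiceImpl service;
  grpc::ServerBuilder builder;
  builder.AddListeningPort("0.0.0.0:9999",
                           grpc::InsecureServerCredentials());
  builder.RegisterService(&service);
  auto server = builder.BuildAndStart();
  server->Wait();
}
\end{verbatim}
}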
The backend also incorporates sub-components of more general utility
to operate on Filesystem Extended Attributes and POSIX Access Control
Lists~\cite{p1003.1e}, adding a layer of safety and expressivity on
top of the native C APIs. They allow defining attributes and ACLs,
respectively, and applying them to or reading them from filesystem
entities.
For example, the following sets the attribute \verb|user.storm.pinned|
of file \verb|myFile.txt| to the pin duration:
{\small
\begin{verbatim}
set_xattr(
storage_dir / "myFile.txt",
StormXAttrName{"pinned"},
XAttrValue{duration}
);
\end{verbatim}
}
The following instead extends the ACL currently assigned to
\verb|myFile.txt| with some additional entries:
{\small
\begin{verbatim}
add_to_access_acl(
storage_dir / "myFile.txt",
{
{User{"storm"}, Perms::Read | Perms::Write},
{Group{"storm"}, Perms::Read},
{other, Perms::None}
}
);
\end{verbatim}
}
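Reading an attribute back follows the same pattern; in the sketch below,
\verb|get_xattr| is a hypothetical counterpart of \verb|set_xattr|, shown only
for illustration:
{\small
\begin{verbatim}
// Hypothetical read-back of the pin attribute set above; the
// get_xattr helper and its return type are assumptions.
XAttrValue pinned = get_xattr(
    storage_dir / "myFile.txt",
    StormXAttrName{"pinned"}
);
\end{verbatim}
}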
The backend is implemented in C++, in the latest standard version
supported by the toolset installed in the reference platform
(currently C++17). The build system is based on CMake.
The backend relies on some other third-party dependencies, the most
important being those for filesystem interaction (Boost
Filesystem~\cite{ref:boost.fs}), logging (Boost
Log~\cite{ref:boost.log}) and configuration handling
(yaml-cpp~\cite{ref:yaml-cpp}).
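As a rough illustration of the configuration handling, a yaml-cpp based loader
could look like the following sketch; the keys and defaults are invented for
the example and do not reflect the actual StoRM 2 configuration schema.
{\small
\begin{verbatim}
// Illustrative yaml-cpp usage; keys and defaults are invented for
// the example, not the actual StoRM 2 configuration schema.
#include <string>
#include <yaml-cpp/yaml.h>

struct BackendConfig
{
  std::string storage_root;
  int port;
};

BackendConfig load_config(std::string const& path)
{
  YAML::Node config = YAML::LoadFile(path);
  BackendConfig result;
  result.storage_root = config["storage_root"].as<std::string>();
  result.port = config["port"].as<int>(9999);  // default if missing
  return result;
}
\end{verbatim}
}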
\section{The test suite}
The test suite is based on the Robot Framework~\cite{ref:rf} and is typically
run in a Docker container. A deployment test pipeline~\cite{ref:glcip} runs on
our Gitlab-based continuous integration (CI) system every night (and after any
commit on the master branch) to instantiate the main StoRM 2 services and
execute the SRM testsuite. The reports of the test suite execution are archived
and published on the Gitlab CI dashboard. Services and the test suite are
orchestrated using Docker Compose~\cite{ref:dc}. This approach provides an
intuitive, self-contained testing environment deployable on the CI system and
on the developers' workstations.
The test deployment mirrors the architecture shown in
Figure~\ref{fig:storm2-arch}, with clients and services placed in different
Docker networks to mimic a real-life deployment scenario.
\section{Conclusions}
In this contribution we have described the initial design and development
activities performed during 2018 on StoRM 2, the next incarnation of the StoRM
storage management system.
The main objective of the StoRM refactoring is to improve the service
scalability and manageability in order to meet the data management requirements
of HL-LHC. The initial work of this year focused on choosing tools,
methodologies and approaches, with a strong emphasis on software quality.
In the future we will build on this groundwork to provide a full replacement
for the existing StoRM implementation. The lack of dedicated manpower for this
activity makes it hard to estimate when StoRM 2 will be ready to be deployed in
production.
\section*{References}
\bibliographystyle{iopart-num}
\bibliography{biblio}
\end{document}