\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\begin{document}
\title{eXtreme-DataCloud project: Advanced data management services for distributed e-infrastructures}
%\address{Production Editor, \jpcs, \iopp, Dirac House, Temple Back, Bristol BS1~6BE, UK}
\author{A. Costantini$^1$, D. Cesini$^1$, D.C. Duma$^1$, G. Donvito$^2$ \dots
% etc.
}
\address{$^1$ INFN-CNAF, Bologna, Italy}
\address{$^2$ INFN Bari, Bari, Italy}
\address{$^3$ EGI, Netherlands}
\address{$^4$ ECRIN, France}
\address{$^5$ CNRS, France}
\address{$^6$ CERN, Switzerland}
\address{$^7$ IN2P3, France}
\address{$^8$ CSIC, Spain}
\address{$^9$ AGH, Poland}
\address{$^{10}$ DESY, Germany}
\address{$^{11}$ Univ. de Cantabria, Spain}
\ead{alessandro.costantini@cnaf.infn.it}
\begin{abstract}
The development of new data management services able to cope with very large data resources is becoming a key challenge. Such a capability, in fact, will allow future e-infrastructures to address the needs of the next generation of extreme scale scientific experiments.
To face this challenge, in November 2017 the H2020 eXtreme DataCloud - XDC project has been launched. Lasting for 27 months and combining the expertise of 8 large European research organisations, the project aims at developing scalable technologies for federating storage resources and managing data in highly distributed computing environments. The targeted platforms are the current and next generation e-Infrastructures deployed in Europe, such as the European Open Science Cloud (EOSC), the European Grid Infrastructure (EGI), and the Worldwide LHC Computing Grid (WLCG).
The project is use-case driven with a multidisciplinary approach, addressing requirements from research communities belonging to a wide range of scientific domains: High Energy Physics, Astronomy, Photon and Life Science, Medical research.
XDC will implement scalable data management services, combining already established data management and orchestration tools, to address the following high-level topics: policy-driven data management based on Quality of Service, data life-cycle management, smart placement of data with caching mechanisms to reduce access latency, handling of metadata with no predefined schema, execution of pre-processing applications during ingestion, data management and protection of sensitive data in distributed e-infrastructures, and intelligent data placement based on access patterns.
This contribution introduces the project and presents the foreseen overall architecture and the developments being carried out to implement the requested functionalities.
\end{abstract}
\section{Introduction}
Led by INFN-CNAF, the eXtreme DataCloud (XDC) project \cite{xdc} develops scalable technologies for federating storage resources and managing data in highly distributed computing environments. The provided services are capable of operating at the unprecedented scale required by the most demanding, data-intensive research experiments in Europe and worldwide. The targeted platforms for the released products are the already existing and the next generation e-Infrastructures deployed in Europe, such as the European Open Science Cloud (EOSC) \cite{EOSC}, the European Grid Infrastructure (EGI) \cite{EGI}, the Worldwide LHC Computing Grid (WLCG) \cite{wlcg} and the computing infrastructures that will be funded by the upcoming H2020 EINFRA-12 call. XDC is funded by the H2020 EINFRA-21-2017 Research and Innovation action under the topic Platform-driven e-Infrastructure innovation \cite{einfracall21}. It is carried out by a Consortium that brings together technology providers with proven, long-standing experience in software development and large research communities belonging to diverse disciplines: Life Science, Biodiversity, Clinical Research, Astrophysics, High Energy Physics and Photon Science. XDC started on 1st November 2017 and will run for 27 months until January 2020. The EU contribution for the project is 3.07 million euros.
XDC is a use-case driven development project and the Consortium has been built as a combination of technology providers, Research Communities and Infrastructure providers. New developments will be tested against real-life applications and use cases. Among the high-level requirements collected from the Research Communities, the Consortium identified those that are most general (and hence exploitable by other communities), those with the greatest impact on the user base, and those that can be implemented in a timespan compatible with the project duration and funding.
\section{Project Objectives}
The XDC project develops open, production-quality, interoperable and manageable software that can be easily plugged into the target European e-Infrastructures and adopts state-of-the-art standards in order to ensure interoperability. The building blocks of the high-level architecture foreseen by the project are organized so as to avoid duplication of development effort. All the interfaces and links to implement the XDC architecture are developed exploiting the most advanced techniques for authorization and authentication. Services are scalable to cope with the most demanding, extreme scale scientific experiments, such as those run at the Large Hadron Collider at CERN and the Cherenkov Telescope Array (CTA), both represented in the project consortium.
The project will enrich already existing data management services by adding missing functionalities as requested by the user communities. The project will continue the effort invested by the now-ended INDIGO-DataCloud project \cite{indigo} in the direction of providing friendly, web-based user interfaces and mobile access to the infrastructure data management services. The project will build on the INDIGO-DataCloud achievements in the field of Quality of Service and data lifecycle management, developing smart orchestration tools to easily realize effective policy-driven data management.
One of the main objectives of the project is to provide data management solutions for the following use cases:
\begin{itemize}
\item Dynamic extension of a computing center to a remote site providing transparent bidirectional access to the data stored in both locations.
\item Dynamic inclusion of sites with limited storage capacity in a distributed infrastructure, providing transparent access to the data stored remotely.
\item Federation of distributed storage endpoints, i.e. a so-called WLCG Data Lake, enabling fast and transparent access to their data without an a-priori copy.
\end{itemize}
These use cases will be addressed by implementing intelligent, automatic and hierarchical caching mechanisms.
\section{Overall project structure}
The project is structured in five Work Packages (WPs): two devoted to networking activities (NA), one to service activities (SA) and two to development activities (JRA).
The relationships among the WPs are represented in Figure \ref{fig-WP}: NA1 will supervise the activities of all the Work Packages and will deal with the management aspects, ensuring smooth progress of the project activities. NA2, representing the communities, will provide requirements that will guide the development activities carried out by JRA1 and JRA2. JRA1 and JRA2 will be responsible for providing new and integrated solutions to address the user requirements provided by NA2. SA1 will provide Quality Assurance policies and procedures and the Maintenance and Support coordination, including Release Management. NA2 will close the cycle by validating the released components on the pilot testbeds made available for exploitation by SA1.
%\begin{center}
%\begin{tabular}{ c c c }
%\hline
% WP No & Work Package Title & Lead Part. Name & Lead Part. Short Name\\
%\hline
%1 & Project Management (NA1) & Istituto Nazionale di Fisica Nucleare & INFN \\
% 2 & New functionalities definition and Research Communities support (NA2) & Universidad de Cantabria & UC \\
% 3 & Software and release management, exploitation and pilot testbeds (SA1) & INFN \\
% 4 & Orchestration and policy driven data management (JRA1) & DESY \\
% 5 & Unified cross federations data management (JRA2) & AGH-UST
%\hline
%\end{tabular}
%\end{center}
\begin{figure}[h]
\centering
\includegraphics[width=10cm,clip]{XDC-WP.png}
\caption{Relationships between the WPs in the XDC project.}
\label{fig-WP}
\end{figure}
{\bf Work Package 1 (NA1 - led by INFN)} brings the project towards the successful achievement of its objectives, efficiently
coordinating the Consortium and ensuring that the activities progress smoothly.
WP1 is responsible for the financial administration of the project; it controls the effort reporting and the
progress of the work ensuring that they adhere to the work plan and to the Grant Agreement.
WP1 defines the Quality Assurance plan and reports periodically to the EC about the overall project progress.
It is responsible for the resolution of internal conflicts and for the creation of a fruitful spirit of collaboration,
ensuring that all the partners are engaged to fulfil the project objectives. WP1 communicates the project
vision and mission at relevant international events and to interested institutions.
{\bf Work Package 2 (NA2 - led by UC)} identifies new functionalities for the management of huge data
volumes from the different Research Communities' use cases, providing requirements to the existing tool
developers to enhance the user experience. WP2 is also responsible for testing the new developments and for
providing adequate feedback about the user experience of the different services. WP2 analyzes the scalability
requirements for the infrastructure, taking into account the challenges expected in the next years and the new
frameworks of scientific data.
INFN, together with CERN, participates in WP2 activities by representing the WLCG community and the LHC experiments,
whose needs are connected with scalability challenges and with novel solutions supporting the infrastructure
evolution towards new computing and storage solutions such as the data lake and smart caching.
{\bf Work Package 3 (SA1 - led by INFN)} provides software lifecycle management services together with pilot e-Infrastructures:
\begin{itemize}
\item For the project developers in order to ensure the maintenance of the high quality of the
released software components and services while adding new features
\item Through Continuous Integration and Delivery, followed by deployment and
operation/monitoring in production environment
\item For the User communities, in order to ensure that the delivered software consistently passes the
customer acceptance criteria and continually improves in quality
\item For the e-Infrastructure providers, in order to ensure an easy and smooth delivery of
improved software services and components while guaranteeing the stability of their
production infrastructures.
\end{itemize}
The approach envisaged by WP3 aims to increase the collaboration between the development teams
(Developers) and the e-Infrastructure resource managers (Operators), finding the right solutions to the
challenges faced by both groups.
{\bf Work Package 4 (JRA1 - led by DESY)} provides semi- or fully-automated placement of scientific data in
the Exabyte region, at the site (IaaS) level as well as at the federated storage level. In the context of this Work
Package, placement may either refer to the media the data is stored on, to guarantee a requested Quality of
Service, or the geographical location, to move data as close to the compute facility as possible to overcome
latency issues in geographically distributed infrastructures. In the latter case, data might either be
permanently moved, or temporarily cached.
INFN, in close collaboration with DESY, works on the components related to data management orchestration based on
policies and Quality of Service. The two partners are the main developers of the storage systems and storage resource managers (dCache, StoRM) that, together with the components brought into the project by
CERN (FTS, EOS, Dynafed), are fundamental to design and implement the solutions envisioned in WP4.
In WP4, INFN also contributes to the development of new caching mechanisms allowing transparent data access for the
dynamic extension of data centers, in support of alternative deployment models for large data centers.
{\bf Work Package 5 (JRA2 - led by AGH and co-led by INFN)} provides enriched high-level data management services that unify
access to heterogeneous storage resources and services, enable extreme scale data processing on both private
and public Cloud computing resources using established interfaces, and allow the usage of legacy
applications without the need to rewrite them from scratch. These functionalities will be provided mainly
by the Onedata distributed virtual file-system platform, which will be extended with novel features in order to
support the requirements from use cases identified within WP2 and will be integrated with lower-level storage
and data management solutions provided by WP4. All developments within this Work Package will be
released as open source software and according to the guidelines for Quality Assurance provided by WP3.
\section{General architecture}
The XDC project aims at providing advanced data management capabilities that require the execution of several tasks and the interaction among several components and services. Those capabilities include, but are not limited to, QoS management, preprocessing at ingestion and automated data transfers. Therefore, a global orchestration layer is needed to take care of the execution of these complex workflows.
Figure~\ref{fig-comp} highlights the main components and their role among the three different levels: Storage, Federation, and Orchestration.
\begin{figure}[h]
\centering
\includegraphics[width=8cm,clip]{XDC-comp.png}
\caption{Main components of the XDC architecture and their roles across the Storage, Federation and Orchestration levels.}
\label{fig-comp}
\end{figure}
Figure~\ref{fig-HLA} highlights the high level architecture of the XDC project by describing the components and the related connections.
\begin{figure}[h]
\centering
\includegraphics[width=12cm,clip]{XDC-HLA.png}
\caption{High level architecture of the XDC project.}
\label{fig-HLA}
\end{figure}
\subsection{XDC Orchestration system}
In the XDC project, the global orchestration layer takes care of the execution of these complex workflows. The orchestration covers two essential aspects:
\begin{itemize}
\item The overall control, steering and bookkeeping, including the connection to compute resources
\item The orchestration of the data management activities, such as data transfers and data federation.
\end{itemize}
Consequently we have decided to split the responsibilities between two different components: the INDIGO Orchestrator \cite{paasorch} and Rucio \cite{rucio}.
The INDIGO PaaS Orchestrator, the system-wide orchestration engine, is a component of the PaaS layer that allows resources to be instantiated on Cloud Management Frameworks (such as OpenStack and OpenNebula) and Mesos clusters. It takes deployment requests, expressed through templates written in the TOSCA Simple Profile in YAML 1.0 \cite{tosca}, and deploys them on the best cloud site available.
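As a purely illustrative example, the sketch below shows how a deployment request carrying a TOSCA template could be submitted to the Orchestrator REST interface from Python; the endpoint path, payload layout and token handling are assumptions made for illustration and may differ from the actual API.
\begin{verbatim}
# Minimal sketch: submitting a TOSCA deployment request to the
# INDIGO PaaS Orchestrator. Endpoint path, payload layout and token
# handling are assumptions for illustration only.
import requests

ORCHESTRATOR_URL = "https://orchestrator.example.org/deployments"  # hypothetical
IAM_TOKEN = "<OIDC access token obtained from the IAM service>"

with open("deployment.yaml") as f:        # TOSCA Simple Profile in YAML template
    tosca_template = f.read()

payload = {
    "template": tosca_template,                  # the TOSCA template as a string
    "parameters": {"input_dataset": "example"},  # hypothetical template inputs
}

resp = requests.post(ORCHESTRATOR_URL, json=payload,
                     headers={"Authorization": "Bearer " + IAM_TOKEN})
resp.raise_for_status()
print("Deployment submitted, id:", resp.json().get("uuid"))
\end{verbatim}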
The Rucio project, the data management orchestration subsystem, is the new version of the ATLAS Distributed Data Management (DDM) system, which allows the ATLAS collaboration to manage the large volumes of data, both taken by the detector and generated or derived, in the ATLAS distributed computing system. Rucio is used to manage accounts, files, datasets and distributed storage systems.
Those two components, the PaaS Orchestrator and Rucio, provide different capabilities and can complement each other to offer a full set of features to meet the XDC requirements.
Rucio implements the data management functionalities missing in the INDIGO Orchestrator: the Orchestrator will make use of those capabilities to orchestrate the data movement based on policies. Rucio will be integrated in the INDIGO Orchestrator as a plugin used to steer the data movement. The Orchestrator will be the main entry point for user requests and, in particular, it will also interact directly with the storage backends to get information about data availability and to trigger the right processing flow.
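As an illustration of how a data placement policy can be expressed, the following sketch creates a Rucio replication rule asking for two replicas of a dataset on a given class of storage elements; a configured Rucio client environment is assumed, and the scope, dataset name and RSE expression are placeholders.
\begin{verbatim}
# Minimal sketch: asking Rucio to keep two replicas of a dataset on
# Tier-1 storage elements for 30 days. Scope, dataset name and RSE
# expression are placeholders; a configured Rucio client environment
# (rucio.cfg, valid credentials) is assumed.
from rucio.client import Client

client = Client()
rule_ids = client.add_replication_rule(
    dids=[{"scope": "user.jdoe", "name": "example.dataset"}],
    copies=2,                   # number of replicas to maintain
    rse_expression="tier=1",    # where the replicas may be placed
    lifetime=30 * 24 * 3600,    # rule lifetime in seconds
)
print("Created rules:", rule_ids)
\end{verbatim}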
\subsection{XDC Quality-of-Service implementation}
The idea of providing scientific communities or individuals with the ability to specify a particular quality of service when storing data, e.g. the maximum access latency or a minimum retention policy, was introduced within the INDIGO-DataCloud project. In XDC, the QoS concept is envisioned to consistently complement all data-related activities. In other words, whenever storage space is requested, either manually by a user or programmatically by a framework, the quality of that space can be negotiated between the requesting entity and the storage provider.
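A possible interaction pattern, loosely inspired by the CDMI-based QoS interface prototyped in INDIGO-DataCloud, is sketched below: the client first discovers the QoS classes offered by a storage endpoint and then requests the transition of a file to a different class. The endpoint paths and JSON fields are assumptions for illustration only and do not represent the final XDC interface.
\begin{verbatim}
# Minimal sketch of QoS negotiation against a CDMI-like interface.
# Endpoint paths and JSON fields are assumptions for illustration.
import requests

STORAGE = "https://storage.example.org"   # hypothetical storage endpoint
HEADERS = {"X-CDMI-Specification-Version": "1.1.1"}

# 1. Discover which QoS classes (capabilities) the endpoint offers.
caps = requests.get(STORAGE + "/cdmi_capabilities/dataobject/",
                    headers=HEADERS).json()
print("Available QoS classes:", caps.get("children", []))

# 2. Ask for a transition of a file to another QoS class, e.g. from
#    'disk' to 'tape' (lower cost, higher access latency).
requests.put(STORAGE + "/data/experiment/run42/file.root",
             headers=HEADERS,
             json={"capabilitiesURI":
                   "/cdmi_capabilities/dataobject/tape"}).raise_for_status()
\end{verbatim}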
\subsection{Caching within XDC}
In this section we consider how the XDC architecture treats the storage and access of data, building a hierarchy of components whose goal is to maximise the accessibility of data to clients while minimising global infrastructure costs. The architecture considers a set of multi-site storage systems, potentially accessed through caches, both of which are aggregated globally through a federation.
While large, multi-site storage systems may hold the majority of a community's custodial data, XDC does not foresee that they will necessarily host all the compute capacity. In particular, CPU-only resource centres or cloud procurements must be supported. Provision must therefore be made to ensure that wide-area data access is as efficient as possible.
Such resource centres may access custodial data through a standalone proxy cache. The simplest cache envisaged by XDC, a kind of ``minimum viable product'' (a minimal sketch is given after the lists below), would have the following characteristics:
\begin{itemize}
\item Read-only operation
\item Fetch data on miss with service credentials
\item Data can be chunked or full-file
\item Manage cache residency, evicting data when necessary
\item HTTP frontend with group-level authorisation
\end{itemize}
Beyond this scenario, certain extensions will be investigated:
\begin{itemize}
\item Write support
\item Synchronisation and enforcement of ACLs
\item Namespace synchronisation (allowing discovery of non-resident data)
\item Call-out to FTS for data transport
\item QoS support
\item Integration with notifications, such as primary storage notifying a file deletion
\end{itemize}
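To make the ``minimum viable product'' concrete, the following sketch implements the read-only, fetch-on-miss behaviour with a simple size-bounded eviction policy based on least-recent access. It is an illustrative sketch rather than one of the project components; the origin URL, cache location and credential handling are placeholders.
\begin{verbatim}
# Illustrative sketch of the "minimum viable product" cache:
# read-only, full-file fetch on miss using service credentials,
# eviction of least-recently-accessed files when over budget.
# Origin URL, cache location and credentials are placeholders.
import os
import requests

ORIGIN = "https://custodial-storage.example.org"   # remote custodial storage
CACHE_DIR = "/var/cache/xdc"
CACHE_LIMIT = 500 * 1024**3                        # 500 GB cache budget
SERVICE_CERT = ("/etc/grid-security/hostcert.pem",
                "/etc/grid-security/hostkey.pem")

def cached_path(name):
    return os.path.join(CACHE_DIR, name.lstrip("/"))

def evict_if_needed():
    """Remove least-recently-accessed files until under the limit."""
    files = [os.path.join(d, f)
             for d, _, fs in os.walk(CACHE_DIR) for f in fs]
    files.sort(key=os.path.getatime)               # oldest access first
    total = sum(os.path.getsize(f) for f in files)
    while files and total > CACHE_LIMIT:
        victim = files.pop(0)
        total -= os.path.getsize(victim)
        os.remove(victim)

def get(name):
    """Return the local path of a file, fetching it on a cache miss."""
    local = cached_path(name)
    if not os.path.exists(local):                  # cache miss
        os.makedirs(os.path.dirname(local), exist_ok=True)
        with requests.get(ORIGIN + name, cert=SERVICE_CERT,
                          stream=True) as r:
            r.raise_for_status()
            with open(local, "wb") as out:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    out.write(chunk)
        evict_if_needed()
    return local
\end{verbatim}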
For this purpose, various technologies are available to the project to serve as the basis of an implementation:
\begin{itemize}
\item The system runs native dCache \cite{dcache} or EOS, but operates in a "caching mode" staging data in when a cache miss occurs.
\item A service such as Dynafed \cite{dynafed} will be augmented to initiate data movement. While it would hold only metadata, it would use a local storage system for holding the data.
\item A standalone HTTP cache could be built from existing web technology, such as nginx \cite{nginx}, modified for horizontal scalability and relevant AAI support.
\end{itemize}
\subsection{XDC data management and new developments}
Data management functionality for end users will also be available via the Onezone component of the Onedata platform \cite{onedata}. Onezone will provide single sign-on authentication and authorization for users, who will be able to create access tokens to perform data access activities via the Web browser, the REST API or the Onedata POSIX virtual filesystem. Onezone will enable the federation of multiple storage sites by deploying Oneprovider services on top of the actual storage resources provisioned by the sites.
For the purpose of job scheduling and orchestration, Onedata will communicate with the INDIGO Orchestrator component by means of a message bus, allowing the orchestrator to subscribe to events related to data transfers and data access. This will allow the Orchestrator to react to changes in the overall system state (e.g. a new file in a specific directory or space, data distribution changes initiated by manual transfers, cache invalidation or on-the-fly block transfers).
In order to ensure federation-level Quality of Service, Onedata will expose an interface providing information on data access latency and block-based file location.
Onedata will also be responsible for the definition of the federation-level authentication and authorization aspects of data access, based on OpenID Connect \cite{oidc}. Upon login, each user will be able to generate access tokens, which can be used for accessing and managing data using the REST API or for mounting the virtual POSIX filesystems on computational nodes. Furthermore, Onedata will provide a federation-level encryption key management service, allowing users to securely upload symmetric encryption keys (e.g. AES-256).
On the data access layer, Onedata will provide a WebDAV \cite{webdav} storage interface to enable the integration of other HTTP-transfer-based components such as FTS \cite{fts} or EOS \cite{eos}, making the data managed by these components accessible in a unified manner via the POSIX virtual filesystem provided by Onedata.
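Since WebDAV is a standard HTTP extension, data exposed in this way can also be browsed and retrieved with generic clients. The sketch below lists a collection with PROPFIND and downloads a single file with GET; the endpoint URL, paths and access token are placeholders.
\begin{verbatim}
# Minimal sketch of accessing a WebDAV storage interface: list a
# collection with PROPFIND and download one file with GET.
# Endpoint URL, paths and token are placeholders.
import xml.etree.ElementTree as ET
import requests

ENDPOINT = "https://oneprovider.example.org/webdav"   # hypothetical
AUTH = {"Authorization": "Bearer <access token>"}

# List the contents of a directory (WebDAV collection), depth 1.
resp = requests.request("PROPFIND", ENDPOINT + "/experiment/run42/",
                        headers={**AUTH, "Depth": "1"})
resp.raise_for_status()
tree = ET.fromstring(resp.content)
print("Entries:", [e.text for e in tree.iter("{DAV:}href")])

# Download a single file via a plain HTTP GET.
data = requests.get(ENDPOINT + "/experiment/run42/file.root", headers=AUTH)
data.raise_for_status()
open("file.root", "wb").write(data.content)
\end{verbatim}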
Furthermore, Onezone, the entry point to the data management aspects of the platform, will allow for the semi-automated creation of data discovery portals, based on metadata stored in the federated Oneprovider instances and on a centralized ElasticSearch engine indexing the metadata. This solution will allow the communities to create custom indexes on the data and metadata, to provide customizable styles and icons for their users and to define custom authorization rights based on user classes (public access, access on login, group access, etc.).
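As a simple illustration of this data discovery mechanism, the sketch below indexes a metadata record and queries it back using the Elasticsearch Python client; the index name and metadata fields are hypothetical.
\begin{verbatim}
# Minimal sketch: indexing and searching community metadata with the
# Elasticsearch Python client. Index name and fields are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a metadata record harvested from a federated Oneprovider.
es.index(index="xdc-metadata", id="space1/run42/file.root", body={
    "space": "space1",
    "path": "/run42/file.root",
    "instrument": "CTA-North",          # hypothetical metadata fields
    "acquisition_date": "2019-05-10",
})

# Build a simple discovery query over the custom index.
hits = es.search(index="xdc-metadata",
                 body={"query": {"match": {"instrument": "CTA-North"}}})
for h in hits["hits"]["hits"]:
    print(h["_id"], h["_source"]["acquisition_date"])
\end{verbatim}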
\section{Conclusions}
In the present contribution the XDC objectives, starting from the technology gaps that currently prevent effective exploitation of distributed computing and storage resources by many scientific communities, have been discussed and presented.
Those objectives are the real driver of the project and derive directly from use cases, and the related needs, presented by the scientific communities involved in the project itself, covering areas such as Physics, Astrophysics, Bioinformatics, and others.
Starting from the above assumptions, the overall structure of the project has been presented, emphasizing its components, typically based upon or extending established open source solutions, and the relations among them.
Moreover, the XDC project can complement and integrate with other running projects and communities and with existing multi-national, multi-community infrastructures. As an example, XDC is collaborating with the Designing and Enabling E-Infrastructures for intensive Processing in Hybrid Data Clouds (DEEP-Hybrid-DataCloud) \cite{deep} project, aimed at promoting the integration of specialized, and expensive, hardware under a Hybrid Cloud platform and targeting the evolution of the corresponding Cloud services supporting these intensive computing techniques to production level.
As an added value both projects (XDC and DEEP-HDC) have the common objective to open new possibilities to scientific research communities in Europe by supporting the evolution of e-Infrastructure services for exascale computing. Those services are expected to become a reliable part of the final solutions for the research communities available in the European Open Science Cloud Service Catalogue.
\section{References}
\begin{thebibliography}{}
\bibitem{xdc}
Web site: www.extreme-datacloud.eu
\bibitem{EOSC}
Web site: https://ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud
\bibitem{EGI}
Web site: https://www.egi.eu/
\bibitem{wlcg}
Web site: wlcg.web.cern.ch/
\bibitem{einfracall21}
Web site: http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/h2020/topics/einfra-21-2017.html
\bibitem{indigo}
Web site: https://www.indigo-datacloud.eu
\bibitem{lifewatch}
Web site: https://www.lifewatch.eu
\bibitem{cta}
Web site: https://www.cta-observatory.org
\bibitem{ecrin}
Web site: https://www.ecrin.org
\bibitem{xfel}
Web site: https://www.xfel.eu
\bibitem{paasorch}
Web site: www.indigo-datacloud.eu/paas-orchestrator
\bibitem{rucio}
Web site: https://rucio.cern.ch/
\bibitem{tosca}
TOSCA Simple Profile in YAML Version 1.0. Edited by Derek Palma, Matt Rutkowski, and Thomas Spatzier. 27 August 2015. OASIS Committee Specification Draft 04 / Public Review Draft 01
\bibitem{dcache}
Web site: www.dcache.org
\bibitem{dynafed}
Web site: lcgdm.web.cern.ch/dynafed-dynamic-federation-project
\bibitem{nginx}
Web site: https://www.nginx.com/
\bibitem{onedata}
Web site: onedata.org
\bibitem{oidc}
Web site: https://openid.net/connect/
\bibitem{webdav}
Web site: www.webdav.org/
\bibitem{fts}
Web site: information-technology.web.cern.ch/services/file-transfer
\bibitem{eos}
Web site: eos.web.cern.ch
\bibitem{deep}
Web site: https://deep-hybrid-datacloud.eu/
\end{thebibliography}
\end{document}