\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\begin{document}
\title{eXtreme-DataCloud project: Advanced data management services for distributed e-infrastructures}
%\address{Production Editor, \jpcs, \iopp, Dirac House, Temple Back, Bristol BS1~6BE, UK}
\author{A. Costantini$^1$, D. Cesini$^1$, D.C. Duma$^1$, D. Michelotto$^1$, A. Falabella$^1$, L. Dell'Agnello$^1$, D. Salomoni$^1$, L. Morganti$^1$, G. Grandi$^2$
% etc.
}
\address{$^1$ INFN-CNAF, Bologna, Italy}
\address{$^2$ INFN Bologna, Bologna, Italy}
\ead{alessandro.costantini@cnaf.infn.it}

\begin{abstract}
The development of new data management services able to cope with very large data resources is becoming a key challenge: such a capability will allow future e-infrastructures to address the needs of the next generation of extreme scale scientific experiments. To face this challenge, the H2020 eXtreme DataCloud (XDC) project was launched in November 2017. Lasting 27 months and combining the expertise of 8 large European research organisations, the project aims at developing scalable technologies for federating storage resources and managing data in highly distributed computing environments. The project is use-case driven with a multidisciplinary approach, addressing requirements from research communities belonging to a wide range of scientific domains: High Energy Physics, Astronomy, Photon and Life Science, and Medical research. XDC will implement scalable data management services, combining already established data management and orchestration tools, to address the following high-level topics: policy-driven data management based on Quality of Service, data life-cycle management, smart placement of data with caching mechanisms to reduce access latency, handling of metadata with no predefined schema, execution of pre-processing applications during ingestion, data management and protection of sensitive data in distributed e-infrastructures, and intelligent data placement based on access patterns. This contribution introduces the project, presents the foreseen overall architecture and describes the activities carried out by INFN-CNAF personnel to achieve the project goals and objectives.
\end{abstract}

\section{Introduction}
Led by INFN-CNAF, the eXtreme DataCloud (XDC) project \cite{xdc} develops scalable technologies for federating storage resources and managing data in highly distributed computing environments. The provided services will be capable of operating at the unprecedented scale required by the most demanding, data intensive research experiments in Europe and worldwide. The targeted platforms for the released products are the already existing and the next generation e-Infrastructures deployed in Europe, such as the European Open Science Cloud (EOSC) \cite{EOSC}, the European Grid Infrastructure (EGI) \cite{EGI}, the Worldwide LHC Computing Grid (WLCG) \cite{wlcg} and the computing infrastructures that will be funded by the upcoming H2020 EINFRA-12 call.
XDC is funded by the H2020 EINFRA-21-2017 Research and Innovation action under the topic Platform-driven e-Infrastructure innovation \cite{einfracall21}. It is carried out by a Consortium that brings together technology providers with a proven long-standing experience in software development and large research communities belonging to diverse disciplines: Life Science, Biodiversity, Clinical Research, Astrophysics, High Energy Physics and Photon Science.
XDC started on 1st November 2017 and will run for 27 months, until January 2020. The EU contribution to the project is 3.07 million euros.
XDC is a use-case driven development project and the Consortium has been built as a combination of technology providers, Research Communities and Infrastructure providers. New developments will be tested against real-life applications and use cases. Among the high-level requirements collected from the Research Communities, the Consortium identified those considered more general (and hence exploitable by other communities), with the greatest impact on the user base, and that can be implemented in a timespan compatible with the project duration and funding.

\section{Project Objectives}
The XDC project develops open, production quality, interoperable and manageable software that can be easily plugged into the target European e-Infrastructures and adopts state-of-the-art standards in order to ensure interoperability. The building blocks of the high-level architecture foreseen by the project are organised so as to avoid duplication of development effort. All the interfaces and links needed to implement the XDC architecture are developed exploiting the most advanced techniques for authorisation and authentication. Services are scalable in order to cope with the most demanding, extreme scale scientific experiments like those run at the Large Hadron Collider at CERN and at the Cherenkov Telescope Array (CTA), both of them represented in the project Consortium.
The project will enrich already existing data management services by adding the missing functionalities requested by the user communities. The project will continue the effort invested by the now-ended INDIGO-DataCloud project \cite{indigo} in the direction of providing friendly, web-based user interfaces and mobile access to the infrastructure data management services. The project will build on the INDIGO-DataCloud achievements in the fields of Quality of Service and data life-cycle management, developing smart orchestration tools to easily realise effective policy-driven data management.
One of the main objectives of the project is to provide data management solutions for the following use cases:
\begin{itemize}
\item Dynamic extension of a computing centre to a remote site, providing transparent bidirectional access to the data stored in both locations.
\item Dynamic inclusion of sites with limited storage capacity in a distributed infrastructure, providing transparent access to the data stored remotely.
\item Federation of distributed storage endpoints, i.e. a so-called WLCG Data Lake, enabling fast and transparent access to their data without an a-priori copy.
\end{itemize}
These use cases will be addressed by implementing intelligent, automatic and hierarchical caching mechanisms.

\section{Overall project structure}
The project is structured in five Work Packages (WPs). In detail, there are two WPs devoted to networking activities (NA), one to service activities (SA) and two to development activities (JRA). The relationships among the WPs are represented in Figure \ref{fig-WP}: NA1 supervises the activities of all the Work Packages and deals with the management aspects in order to ensure a smooth progress of the project activities. NA2, representing the communities, provides the requirements that guide the development activities carried out by JRA1 and JRA2. JRA1 and JRA2 are responsible for providing new and integrated solutions addressing the user requirements provided by NA2.
SA1 provides Quality Assurance policies and procedures and the Maintenance and Support coordination, including Release Management. NA2 closes the cycle by validating the released components on the pilot testbeds made available for exploitation by SA1.
\begin{figure}[h]
\centering
\includegraphics[width=10cm,clip]{XDC-WP.png}
\caption{Relationships between the WPs in the XDC project.}
\label{fig-WP}
\end{figure}
{\bf Work Package 1 (NA1)} brings the project towards the successful achievement of its objectives, efficiently coordinating the Consortium and ensuring that the activities progress smoothly. Coordinated by INFN-CNAF, WP1 is responsible for the financial administration of the project; it controls the effort reporting and the progress of the work, ensuring that they adhere to the work plan and to the Grant Agreement. WP1 defines the Quality Assurance plan and reports periodically to the EC about the overall project progress. It is responsible for the resolution of internal conflicts and for the creation of a fruitful spirit of collaboration, ensuring that all the partners are engaged to fulfil the project objectives. WP1 communicates the project vision and mission at the relevant international events and to interested institutions. In particular, INFN-CNAF has been in charge of the organisation of the joint eXtreme-DataCloud (XDC) and DEEP-HybridDataCloud kickoff meeting \cite{xdc-ko}, hosted in Bologna (Italy) on 23-25 January 2018.
{\bf Work Package 2 (NA2)} identifies new functionalities for the management of huge data volumes starting from the use cases of the different Research Communities, providing requirements to the developers of the existing tools in order to enhance the user experience. WP2 is also responsible for testing the new developments and for providing adequate feedback about the user experience of the different services. WP2 analyses the scalability requirements for the infrastructure, taking into account the challenges expected in the next years and the new frameworks of scientific data. INFN-CNAF mainly participates in WP2 activities by supporting the WLCG community and the related LHC experiments through the figure of the {\bf Champion}, a member of the Research Communities/Infrastructures who understands very well the needs of the use case and has a general understanding of the available solutions and their features. In particular, the WLCG Champion at INFN-CNAF has the role of harmonising the XDC development activities with those carried out within the WLCG community, with the objective of improving the software solutions in terms of scalability, usability, maintainability, interoperability and cost effectiveness, looking forward to the highly demanding HL-LHC data taking conditions.
{\bf Work Package 3 (SA1)} provides software lifecycle management services together with pilot e-Infrastructures:
\begin{itemize}
\item for the project developers, in order to ensure the maintenance of the high quality of the released software components and services while adding new features;
\item through Continuous Integration and Delivery, followed by deployment and operation/monitoring in a production environment;
\item for the user communities, in order to ensure that the delivered software consistently passes the customer acceptance criteria and to continually improve its quality;
\item for the e-Infrastructure providers, in order to ensure an easy and smooth delivery of improved software services and components while guaranteeing the stability of their production infrastructures.
\end{itemize}
INFN-CNAF is coordinating WP3 and its main activities are related to:
\begin{itemize}
\item Software Lifecycle Management: expressed in terms of Software Quality Assurance and Software Release and Maintenance, CNAF is coordinating the management of the software products that officially became part of the first XDC release, codenamed Pulsar \cite{xdc-pulsar}, foreseen for late 2018 and effectively released in January 2019. CNAF is also coordinating the implementation of the continuous software improvement process, following a DevOps approach, through the definition and realisation of an innovative Continuous Integration (CI) and Continuous Delivery (CD) system.
\item Pilot Infrastructure Services: CNAF, with the support of the project partners, is providing and maintaining the testbeds dedicated to developers, software integration and software preview. In particular, the activities focused on implementing the services needed to support the software development and release management, including, among others, the source code repository and the continuous integration system.
\item Exploitation activities: focused on bridging with the infrastructure providers that are the targets for the XDC software, together with the user communities (WP2). Among the exploitation activities, INFN-CNAF actively participated in a task devoted to creating a Service Providers Board and establishing communication channels with the providers.
\end{itemize}
Moreover, INFN-CNAF is hosting and maintaining the XDC collaborative tools put in place for an effective project communication among partners (web site, INDICO agenda, mailing lists, document repository, video conference, issue tracking system, project management and content collaboration).
{\bf Work Package 4 (JRA1)} provides the semi- or fully-automated placement of scientific data in the Exabyte region at the site (IaaS) level as well as at the federated storage level. In the context of this Work Package, placement may refer either to the media the data is stored on, to guarantee a requested Quality of Service, or to the geographical location, in order to move data as close to the compute facility as possible and overcome latency issues in geographically distributed infrastructures. In the latter case, data might either be permanently moved or temporarily cached.
In WP4, INFN-CNAF contributes to the development of an HTTP caching system based on the NGINX \cite{nginx} web server. Serving as a content cache, the XDC-HTTP caching service can be deployed in several WLCG data management workflows, given that many of the software solutions support the HTTP protocol for data operations. In particular, this activity, carried out in WP4, aims at adding support for VOMS proxy certificates by exploiting the modularity of NGINX to develop an additional module that inspects the VOMS proxy certificate attributes. INFN-CNAF also contributed to the geographical scalability tests of the XCache system, developed within the WP4 activities, in particular with the deployment and related support of the software in a national testbed.
{\bf Work Package 5 (JRA2)} provides enriched high-level data management services that unify access to heterogeneous storage resources and services, enable extreme scale data processing on both private and public Cloud computing resources using established interfaces, and allow the usage of legacy applications without the need to rewrite them from scratch. These functionalities will be provided mainly by the Onedata \cite{onedata} distributed virtual file-system platform.
In particular, INFN-CNAF is in charge of deploying and testing the new features released by WP5, adopting the services and solutions provided by WP3.

\section{General architecture}
The XDC project aims at providing advanced data management capabilities that require the execution of several tasks and the interaction among several components and services. These capabilities include, but are not limited to, QoS management, pre-processing at ingestion and automated data transfers. Therefore, a global orchestration layer is needed to take care of the execution of such complex workflows. Figure~\ref{fig-comp} highlights the main components and their roles across the three different levels: Storage, Federation and Orchestration.
\begin{figure}[h]
\centering
\includegraphics[width=8cm,clip]{XDC-comp.png}
\caption{XDC main components and related roles.}
\label{fig-comp}
\end{figure}
Figure~\ref{fig-HLA} shows the high-level architecture of the XDC project, describing the components and the related connections.
\begin{figure}[h]
\centering
\includegraphics[width=12cm,clip]{XDC-HLA.png}
\caption{High level architecture of the XDC project.}
\label{fig-HLA}
\end{figure}

\subsection{XDC PaaS Orchestration system}
The global orchestration layer takes care of the execution of the complex workflows mentioned above and covers two essential aspects:
\begin{itemize}
\item the overall control, steering and bookkeeping, including the connection to compute resources;
\item the orchestration of the data management activities, such as data transfers and data federation.
\end{itemize}
Consequently, it was decided to split the responsibilities between two different components: the INDIGO Orchestrator \cite{paasorch} and Rucio \cite{rucio}.
The INDIGO PaaS Orchestrator, the system-wide orchestration engine, is a component of the PaaS layer that allows resources to be instantiated on Cloud Management Frameworks (such as OpenStack and OpenNebula) and Mesos clusters. It takes the deployment requests, expressed through templates written according to the TOSCA Simple Profile in YAML 1.0 \cite{tosca}, and deploys them on the best cloud site available.
The Rucio project, the data management orchestration subsystem, is the new version of the ATLAS Distributed Data Management (DDM) system, which manages the large volumes of data, both produced by the detector and generated or derived, in the ATLAS distributed computing system. Rucio is used to manage accounts, files, datasets and distributed storage systems.
These two components, the PaaS Orchestrator and Rucio, provide different capabilities and complement each other to offer a full set of features meeting the XDC requirements. Rucio implements the data management functionalities missing in the INDIGO Orchestrator: the Orchestrator will make use of those capabilities to orchestrate the data movement based on policies.
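As a purely illustrative example of how a deployment request can reach the PaaS Orchestrator, the following minimal sketch submits a small TOSCA template over HTTP; the Orchestrator URL, the IAM token and the template content are assumptions made for the sake of the example and do not refer to an actual XDC deployment.
\begin{verbatim}
import requests

# Hypothetical Orchestrator endpoint and IAM access token (assumptions).
ORCHESTRATOR_URL = "https://orchestrator.example.org/orchestrator"
IAM_TOKEN = "eyJhbGciOi..."  # obtained from the IAM service

# Minimal, illustrative TOSCA Simple Profile in YAML 1.0 template.
TOSCA_TEMPLATE = """
tosca_definitions_version: tosca_simple_yaml_1_0
topology_template:
  node_templates:
    worker_node:
      type: tosca.nodes.Compute
      capabilities:
        host:
          properties:
            num_cpus: 2
            mem_size: 4 GB
"""

# Submit the deployment request: the Orchestrator chooses the best
# available cloud site and instantiates the described resources.
response = requests.post(
    ORCHESTRATOR_URL + "/deployments",
    json={"template": TOSCA_TEMPLATE, "parameters": {}},
    headers={"Authorization": "Bearer " + IAM_TOKEN},
)
response.raise_for_status()
print("Deployment submitted:", response.json().get("uuid"))
\end{verbatim}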
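On the data management side, policy-driven placement in Rucio is expressed through replication rules. The following minimal sketch, which assumes an already configured Rucio Python client and uses purely illustrative scope, dataset and RSE-expression names, asks Rucio to keep two replicas of a dataset on Tier-2 storage endpoints for thirty days; the required transfers are then scheduled by Rucio itself.
\begin{verbatim}
from rucio.client import Client

# Assumes a valid rucio.cfg and authentication credentials.
client = Client()

# Illustrative data identifier (scope/name pair).
dids = [{"scope": "user.jdoe", "name": "dataset_2019_reco"}]

# Create a replication rule: two copies on Tier-2 RSEs, kept 30 days.
rule_ids = client.add_replication_rule(
    dids=dids,
    copies=2,
    rse_expression="tier=2",
    lifetime=30 * 24 * 3600,  # seconds
)
print("Created rule(s):", rule_ids)
\end{verbatim}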
\subsection{XDC Quality-of-Service implementation}
The idea of providing scientific communities or individual users with the ability to specify a particular quality of service when storing data, e.g. the maximum access latency or the minimum retention policy, was introduced within the INDIGO-DataCloud project. In XDC, the QoS concept is envisioned to consistently complement all data related activities. In other words, whenever storage space is requested, either manually by a user or programmatically by a framework, the quality of that space can be negotiated between the requesting entity and the storage provider.

\subsection{Caching within XDC}
In this section we consider how the XDC architecture treats the storage and access of data, building a hierarchy of components whose goal is to maximise the accessibility of data to clients while minimising global infrastructure costs. The architecture considers a set of multi-site storage systems, potentially accessed through caches, both of which are aggregated globally through a federation. To this purpose, various technologies are available to the project to serve as the basis of an implementation:
\begin{itemize}
\item The system runs native dCache \cite{dcache} or EOS, but operates in a ``caching mode'', staging data in when a cache miss occurs (a toy sketch of this stage-in-on-miss behaviour is given at the end of this section).
\item A service such as Dynafed \cite{dynafed} will be augmented to initiate data movement. While it would hold only metadata, it would use a local storage system for this purpose.
\item A standalone HTTP cache could be built from existing web technology, such as NGINX, modified for horizontal scalability and relevant AAI support.
\end{itemize}

\subsection{XDC data management and new developments}
Data management functionality for end users will also be available via the Onedata platform \cite{onedata}. Onezone will provide single sign-on authentication and authorisation for users, who will be able to create access tokens and perform data access activities via the web browser, the REST API or the Onedata POSIX virtual filesystem. Onezone will enable federating multiple storage sites through the deployment of Oneprovider services on top of the actual storage resources provisioned by the sites.
For the purpose of job scheduling and orchestration, Onedata will communicate with the INDIGO Orchestrator component by means of a message bus, allowing the Orchestrator to subscribe to events related to data transfers and data access. This will allow the Orchestrator to react to changes in the overall system state (e.g. a new file in a specific directory or space, data distribution changes initiated by manual transfers, cache invalidation or on-the-fly block transfers). Onedata will also be responsible for the definition of the federation-level authentication and authorisation aspects of data access, based on OpenID Connect \cite{oidc}.
On the data access layer, Onedata will provide a WebDAV \cite{webdav} storage interface, enabling the integration of other HTTP-based transfer components such as FTS \cite{fts} or EOS \cite{eos} and making the data managed by these components accessible in a unified manner via the POSIX virtual filesystem provided by Onedata.
Furthermore, Onezone, the entry point to the data management aspects of the platform, will allow for the semi-automated creation of data discovery portals, based on the metadata stored in the federated Oneprovider instances and on a centralised Elasticsearch engine indexing that metadata. This solution will allow the communities to create custom indexes on the data and metadata, provide customisable styles and icons for their users and define custom authorisation rights based on user classes (public access, access on login, group access, etc.).
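As a concrete illustration of the token-based data access described above, the following sketch uploads and then reads back a file through a WebDAV endpoint using plain HTTP requests; the host name, space path, file name and token are purely illustrative assumptions and do not refer to an actual Oneprovider deployment.
\begin{verbatim}
import requests

# Hypothetical WebDAV endpoint of a federated space and access token.
WEBDAV_URL = "https://oneprovider.example.org/webdav/my_space"
TOKEN = "MDAxY2xvY2F0aW9u..."  # access token issued to the user

headers = {"Authorization": "Bearer " + TOKEN}

# Upload a local file into the space over WebDAV (HTTP PUT).
with open("results.root", "rb") as data:
    r = requests.put(WEBDAV_URL + "/results.root", data=data,
                     headers=headers)
    r.raise_for_status()

# Read the same file back (HTTP GET); tools such as FTS can use the
# same endpoint for third-party transfers.
r = requests.get(WEBDAV_URL + "/results.root", headers=headers)
r.raise_for_status()
print("Downloaded", len(r.content), "bytes")
\end{verbatim}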
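The metadata-driven discovery portals mentioned above rely on an Elasticsearch index. The fragment below, which assumes the official Elasticsearch Python client (8.x API) and uses purely illustrative index and field names, shows how a file-level metadata record could be indexed and later queried by such a portal.
\begin{verbatim}
from elasticsearch import Elasticsearch

# Illustrative Elasticsearch endpoint, index name and metadata schema.
es = Elasticsearch("http://localhost:9200")

# Index a metadata record describing a file stored in a federated space.
es.index(
    index="xdc-metadata",
    id="my_space/results.root",
    document={
        "space": "my_space",
        "path": "/results.root",
        "experiment": "CTA",
        "run_number": 4242,
        "tags": ["reco", "2019"],
    },
)

# Query the discovery index, e.g. for all reconstructed CTA data.
result = es.search(
    index="xdc-metadata",
    query={"bool": {"must": [{"match": {"experiment": "CTA"}},
                             {"match": {"tags": "reco"}}]}},
)
print(result["hits"]["total"])
\end{verbatim}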
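Finally, to make the stage-in-on-miss behaviour discussed in the caching subsection concrete, the following toy sketch (not an XDC component) serves files from a local cache directory and fetches them from an origin storage endpoint only when a cache miss occurs; the origin URL and cache directory are assumptions made purely for illustration.
\begin{verbatim}
import os
import requests
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative origin storage endpoint and local cache directory.
ORIGIN = "https://storage.example.org/data"
CACHE_DIR = "/tmp/xdc-cache"

class CachingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        local = os.path.join(CACHE_DIR, self.path.lstrip("/"))
        if not os.path.exists(local):
            # Cache miss: stage the file in from the origin storage.
            r = requests.get(ORIGIN + self.path)
            if r.status_code != 200:
                self.send_error(r.status_code)
                return
            os.makedirs(os.path.dirname(local), exist_ok=True)
            with open(local, "wb") as f:
                f.write(r.content)
        # Cache hit (or freshly staged copy): serve the local file.
        with open(local, "rb") as f:
            payload = f.read()
        self.send_response(200)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("", 8080), CachingHandler).serve_forever()
\end{verbatim}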
\section{Conclusions}
In the present contribution the XDC objectives, starting from the technology gaps that currently prevent an effective exploitation of distributed computing and storage resources by many scientific communities, have been discussed and presented together with the activities (and related contributions) carried out by INFN-CNAF in each WP of the project.
Those objectives are the real driver of the project and derive directly from the use cases, and the related needs, presented by the scientific communities involved in the project itself, covering areas such as Physics, Astrophysics, Bioinformatics and others. Starting from the above assumptions, the overall structure of the project has been presented, emphasising its components, which are typically based upon or extend established open source solutions, and the relations among them.
During the second part of the project, the activities carried out at INFN-CNAF will continue in order to ensure the fulfilment of the project objectives. In particular, the already available software solutions will be enriched with advanced functionalities (provided by the JRAs) aimed at addressing the use case requirements provided by NA2. The implementation and related testing of those new solutions will be performed in the testbeds maintained by SA1. SA1 will also continue its activities aimed at further validating the software, its robustness and its scalability, and will follow the preparation of the second project release, codenamed Quasar, foreseen for the second half of 2019.
Moreover, the XDC project can complement and integrate with other running projects and communities and with existing multi-national, multi-community infrastructures. As an example, XDC is collaborating with the Designing and Enabling E-Infrastructures for intensive Processing in Hybrid Data Clouds (DEEP-HybridDataCloud) \cite{deep} project, aimed at promoting the integration of specialised, and expensive, hardware under a Hybrid Cloud platform and targeting the evolution of the corresponding Cloud services supporting these intensive computing techniques to production level.

\section*{Acknowledgments}
eXtreme DataCloud has been funded by the European Commission H2020 research and innovation programme under grant agreement RIA 777367.

\section{References}
\begin{thebibliography}{}
\bibitem{xdc} Web site: http://www.extreme-datacloud.eu
\bibitem{EOSC} Web site: https://ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud
\bibitem{EGI} Web site: https://www.egi.eu/
\bibitem{wlcg} Web site: https://wlcg.web.cern.ch/
\bibitem{einfracall21} Web site: http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/h2020/topics/einfra-21-2017.html
\bibitem{indigo} Web site: https://www.indigo-datacloud.eu
\bibitem{xdc-ko} Web site: http://www.extreme-datacloud.eu/kickoff/
\bibitem{xdc-pulsar} Web site: http://www.extreme-datacloud.eu/pulsar-out/
%\bibitem{lifewatch} Web site: https://www.lifewatch.eu
%\bibitem{cta} Web site: https://www.cta-observatory.org
%\bibitem{ecrin} Web site: https://www.ecrin.org
%\bibitem{xfel} Web site: https://www.xfel.eu
\bibitem{nginx} Web site: https://www.nginx.com/
\bibitem{paasorch} Web site: https://www.indigo-datacloud.eu/paas-orchestrator
\bibitem{rucio} Web site: https://rucio.cern.ch/
\bibitem{tosca} TOSCA Simple Profile in YAML Version 1.0, edited by Derek Palma, Matt Rutkowski and Thomas Spatzier, 27 August 2015, OASIS Committee Specification Draft 04 / Public Review Draft 01
\bibitem{dcache} Web site: https://www.dcache.org
\bibitem{dynafed} Web site: http://lcgdm.web.cern.ch/dynafed-dynamic-federation-project
\bibitem{onedata} Web site: https://onedata.org
\bibitem{oidc} Web site: https://openid.net/connect/
\bibitem{webdav} Web site: http://www.webdav.org/
\bibitem{fts} Web site: https://information-technology.web.cern.ch/services/file-transfer
\bibitem{eos} Web site: https://eos.web.cern.ch
\bibitem{deep} Web site: https://deep-hybrid-datacloud.eu/
\end{thebibliography}
\end{document}