\documentclass[a4paper]{jpconf}
\usepackage[english]{babel}
% \usepackage{cite}
\usepackage{biblatex}
%\bibliographystyle{abnt-num}
%%%%%%%%%% Start TeXmacs macros
\newcommand{\tmtextit}[1]{{\itshape{#1}}}
\newenvironment{itemizedot}{\begin{itemize} \renewcommand{\labelitemi}{$\bullet$}\renewcommand{\labelitemii}{$\bullet$}\renewcommand{\labelitemiii}{$\bullet$}\renewcommand{\labelitemiv}{$\bullet$}}{\end{itemize}}
%%%%%%%%%% End TeXmacs macros

\begin{document}

\title{Evaluating Migration of INFN--T1 from CREAM-CE/LSF to HTCondor-CE/HTCondor}
\author{Stefano Dal Pra$^1$}
\address{$^1$ INFN-CNAF, Bologna, IT}
\ead{stefano.dalpra@cnaf.infn.it}

\begin{abstract}
The Tier--1 datacentre provides computing resources to a variety of HEP and Astrophysics experiments, organized in Virtual Organizations that submit their jobs to our computing facilities through Computing Elements, which act as Grid interfaces to the Local Resource Management System. We plan to phase out our current LRMS (IBM/Platform LSF 9.1.3) and CEs (CREAM), adopting HTCondor as a replacement for LSF and HTCondor--CE instead of CREAM. A small cluster has been set up to practice the management of these components and to evaluate a migration plan towards the new LRMS and CE set. This document reports on our early experience.
\end{abstract}

\section{Introduction}
The INFN--T1 currently provides a computing power of about 400~kHS06, corresponding to 35000 slots on about one thousand physical Worker Nodes. These resources are accessed through the Grid by 24 Grid VOs and locally by 25 user groups. The IBM/Platform LSF 9.1.3 batch system arbitrates access among all the competing user groups, both Grid and local, according to a \tmtextit{fairshare} policy designed to prevent underutilization of the available resources or starvation of lower priority groups, while ensuring a medium--term share proportional to the configured quotas. The CREAM--CEs act as a frontend for Grid users to the underlying LSF batch system, submitting jobs on their behalf.

This setup has proven to be an effective solution for several years. However, the integration between CREAM and HTCondor is less tight than with LSF. Moreover, active development of CREAM has recently ceased, so we cannot expect new versions to be released, nor better HTCondor support to be implemented by an official development team. Having decided to migrate our batch system from LSF to HTCondor, we therefore also need to replace our CEs. We selected HTCondor--CE as the natural choice, because it is maintained by the same development team as HTCondor. In the following we report on our experience with HTCondor and HTCondor--CE.

\section{The HTCondor cluster}
To get acquainted with the new batch system and CEs, to evaluate how they work together and how other components, such as the monitoring, provisioning and accounting systems, can be integrated with HTCondor and HTCondor--CE and, finally, to devise a reasonable migration plan, a small HTCondor 8.6.13 cluster was set up during spring 2018. A HTCondor--CE was added soon after, in late April. HTCondor is a very mature open source product, deployed at several major Tier--1 sites for years, so we already know that it fits our use cases. The HTCondor--CE, on the other hand, is a more recent product, and a number of issues might be too problematic for us to deal with. Our focus is therefore on ensuring that this CE implementation can be a viable solution for us.

\subsection{The testbed}
The test cluster consists of:
\begin{itemizedot}
  \item a HTCondor--CE on top of
  \item a HTCondor Central Manager and Collector
  \item 3 Worker Nodes (Compute Nodes, in HTCondor terms), with 16 slots each.
\end{itemizedot}
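As a quick sanity check of such a testbed, the \tmtextit{python bindings} shipped with HTCondor can be used to ask the Collector which slots the Worker Nodes are advertising. The following is a minimal sketch, not part of our production tooling: it assumes the bindings are installed on a node whose HTCondor configuration points at the testbed Central Manager, and it only uses standard machine ClassAd attributes.

\begin{verbatim}
#!/usr/bin/env python
# Minimal sketch: list the slots advertised by the testbed Worker Nodes.
# Assumes the htcondor python bindings are installed and that the local
# HTCondor configuration points at the testbed Central Manager.
import htcondor

collector = htcondor.Collector()   # Central Manager taken from the local config
slots = collector.query(htcondor.AdTypes.Startd, "True",
                        ["Machine", "Name", "State", "Activity"])

machines = {}
for ad in slots:
    machines.setdefault(ad["Machine"], []).append(ad["Name"])

# With 3 Worker Nodes of 16 slots each we expect 3 machines and 48 slots.
for machine, names in sorted(machines.items()):
    print("%s: %d slots" % (machine, len(names)))
print("total slots: %d" % len(slots))
\end{verbatim}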
\subsection{HTCondor--CE installation and setup}
The first CE installation was a bit tricky. The RPMs were available from the OSG repositories only, meaning that a number of default settings and dependencies did not match EGI standards. Shortly afterwards, however, HTCondor--CE RPMs were made available in the same official repository as HTCondor.

\subsubsection{Setup.}
Puppet modules are available to set up the configuration of HTCondor and HTCondor--CE. Unfortunately, these modules depend on \tmtextit{hiera}, which is not supported by the puppet system at our site. They were later adapted to make them compatible with our configuration management system. In the meantime, the setup was finalized following the official documentation.

\subsubsection{Configuration.}
The first configuration was completed manually. The main documentation source for the HTCondor--CE is the OSG website~\cite{OSGDOC}, which refers to a tool, \tmtextit{osg-configure}, that is not present in the general HTCondor--CE release. Because of this, the setup was completed by trial and error. Once a working setup was obtained, a set of integration notes was added to a public wiki~\cite{INFNWIKI}. This should provide other non-OSG users with supplementary hints to complete their installation.

\subsubsection{Accounting.}
As of 2018, the official EGI accounting tool, APEL~\cite{APEL}, has no support for HTCondor--CE. On the other hand, INFN--T1 has had a custom accounting tool in place for several years now~\cite{DGAS}. The problem thus reduces to finding a suitable way to retrieve from HTCondor the same information that we retrieve from CREAM--CE and LSF. A working way to do so is to use python and the \tmtextit{python bindings}, a set of API interfaces to the HTCondor daemons. These can be used to query the SCHEDD at the CE and retrieve a specified set of data about recently finished jobs, which are subsequently inserted into our local accounting database. It is worth noting that the Grid information (user DN, VO, etc.) is directly available together with all the needed accounting data. This simplifies the accounting problem, as it is no longer necessary to collect Grid data separately from the BLAH component and then match them with the corresponding batch records. This solution has been used during 2018 to provide accounting for the HTCondor--CE testbed cluster.
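The sketch below illustrates this approach in a simplified form; it is not the production accounting script. It assumes the python bindings are available on the CE host, that the history of the local SCHEDD can be queried, and that the listed ClassAd attributes (notably the x509 ones carrying the user DN and the VO) are present in the job ads; \texttt{insert\_into\_accounting} is a placeholder for the site-specific loader that feeds the local accounting database.

\begin{verbatim}
#!/usr/bin/env python
# Simplified sketch: fetch accounting data for recently finished jobs from
# the SCHEDD on the HTCondor-CE; attribute names and the loader function
# are illustrative, not the production code.
import time
import htcondor

ATTRS = ["GlobalJobId", "Owner", "ClusterId", "ProcId",
         "JobStartDate", "CompletionDate", "ExitCode",
         "RemoteWallClockTime", "RemoteUserCpu", "RemoteSysCpu",
         "x509userproxysubject", "x509UserProxyVOName"]

def insert_into_accounting(record):
    # Placeholder for the insertion into the local accounting database.
    print(record)

since = int(time.time()) - 24 * 3600   # jobs finished during the last day
schedd = htcondor.Schedd()              # local SCHEDD on the CE

for ad in schedd.history("CompletionDate > %d" % since, ATTRS, 10000):
    record = dict((attr, ad.get(attr)) for attr in ATTRS)
    insert_into_accounting(record)
\end{verbatim}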
\subsection{Running HTCondor--CE}
After some time spent becoming confident with the main configuration tasks, the testbed began receiving jobs submitted by the four LHC experiments in September 2018. The system proved to be stable, running smoothly and unattended. This confirms that it can be a reliable substitute for CREAM--CE and LSF.

\subsection{Running HTCondor}
The HTCondor batch system is a mature product with a large user base. We have put less effort into investigating it deeply, as we already know that most or all of the needed features work well. Rather, some effort has been put into configuration management.

\subsubsection{Configuration management.}
Even though a standard set of puppet classes has been adapted to our management system, an additional python tool has been written to improve flexibility and readiness. The tool works by reading and enforcing, on each node of the cluster, a set of configuration directives written in text files accessible from a shared filesystem. The actual set of files and the order in which they are read depend on the host role and name. In this way, a large cluster can be managed quite easily as a collection of host sets. The tool is simple and limited, but it can be improved as needed, should more complex requirements arise.

\subsection{The migration plan}
After using the testbed cluster, a possible plan for a smooth migration has been devised:
\begin{itemizedot}
  \item Install and set up a new HTCondor cluster, with a few more HTCondor--CEs and an initial small set of Worker Nodes
  \item Enable the LHC VOs on the new cluster
  \item Gradually add more Worker Nodes to the new cluster
  \item Enable the other Grid VOs
  \item Finally, enable local submissions. These come from a heterogeneous set of users with a potentially rich set of individual needs, and meeting all of them can require a considerable administrative effort.
\end{itemizedot}

\subsection{Conclusion}
A testbed cluster based on HTCondor--CE on top of the HTCondor batch system has been deployed to evaluate these products as a substitute for CREAM--CE and LSF. The evaluation has mostly focused on the HTCondor--CE, as it is the most recent product. Apart from a few minor issues, mainly related to gaps in the available documentation, the CE proved to be a stable component, and the possibility to perform accounting has been verified. Based on the experience gained with the described testbed, a gradual migration plan has been devised.

\section*{References}
\begin{thebibliography}{9}
\bibitem{OSGDOC} \url{https://opensciencegrid.org/docs/compute-element/install-htcondor-ce/}
\bibitem{INFNWIKI} \url{http://wiki.infn.it/progetti/htcondor-tf/htcondor-ce_setup}
\bibitem{DGAS} S. Dal Pra, ``Accounting Data Recovery. A Case Report from INFN-T1'', Nota interna, Commissione Calcolo e Reti dell'INFN, {\tt CCR-48/2014/P}
\bibitem{APEL} \url{https://wiki.egi.eu/wiki/APEL}
\end{thebibliography}

\end{document}