Management of the Batch queue: HTCondor

Batch queuing in Bc2S was handled with PBS (Torque-Maui). The free version of this product starts to show problems when the size of the cluster to be managed approaches the one foreseen for the ReCaS-Bari Datacenter. This forced us to change the batch queue manager.

Several open source products were evaluated with the aim of providing the best service in terms of functionality, scalability and reliability.

In particular, we started by testing SLURM which, while giving satisfactory results, proved to be better suited to a "Parallel Computing" environment, in which applications try to use in parallel the maximum number of available slots.

A careful assessment of HTCondor was subsequently carried out.

The choice of the tool for managing the job queue (Batch System) in ReCaS fell in the end on HTCondor, for a number of reasons (a minimal submission sketch is given after the list):

  • It is an Open Source product;
  • It is designed for High Throughput Computing and is therefore suited to manage the kind of applications that will be executed in the ReCaS Datacenter;
  • It is able to operate with heterogeneous hardware, such as the one in the Bc2S, which was acquired over the course of several years and puts together servers with different technical characteristics;
  • It has proven to be stable and able to handle the load expected for a data center of the size of ReCaS;
  • It has proven to scale easily, should the ReCaS resources increase in the future.
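
To give an idea of how jobs are handed to HTCondor, a minimal submit description file might look like the following sketch (executable and file names are purely illustrative):

    universe     = vanilla
    executable   = analysis.sh
    arguments    = input.dat
    output       = job.$(Cluster).out
    error        = job.$(Cluster).err
    log          = job.$(Cluster).log
    request_cpus = 1
    queue

Such a description is queued with condor_submit and its progress can be followed with condor_q.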

 

Management of the Batch queue on the HPC cluster: SLURM

For job management on the ReCaS-Bari HPC cluster, which is a cluster of modest size, and considering also that HTCondor is not particularly suitable for managing queues of parallel jobs, it was decided to continue using PBS, on the basis of the experience gained over the years with this product.
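
As an indication of how a parallel job is described in the PBS syntax mentioned above, a job script might look like the following sketch (queue name, resource requests and program name are purely illustrative):

    #!/bin/bash
    #PBS -N parallel_test
    #PBS -q hpc
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=02:00:00
    cd $PBS_O_WORKDIR
    mpirun -np 32 ./my_parallel_app

The script is submitted with qsub and monitored with qstat.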

Storage management: GPFS and XRootD

Also for storage management, the solutions were chosen to satisfy the needs of the main users of the ReCaS DataCenter. The choice fell on:

  • GPFS

    GPFS is the general file system that allows all users to access POSIX files recorded on the storage system from all compute nodes of the ReCaS farm. GPFS is the only component used in the ReCaS DataCenter which is not open source.

  • XRootD

    XRootD is the file system used by the ALICE experiment. With the set-up of the ReCaS Datacenter it was preferred to provide access to the ALICE storage directly with this component, instead of mounting it on top of GPFS.
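
To illustrate what direct access means in practice, files on an XRootD storage element are addressed with root:// URLs and can, for instance, be copied with the xrdcp client (host name and paths below are purely hypothetical):

    xrdcp root://alice-se.example.org:1094//alice/data/run123/file.root /tmp/file.root

Port 1094 is the default port of the XRootD service.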

Installation and Configuration: Foreman and Puppet

Given the size of the ReCaS data center, and in order to handle possible future expansions, it was decided to carry out centrally operations such as server installation and subsequent contextualization.
During the evaluation of products capable of providing these two functions, particular attention was paid to their degree of flexibility, the simplicity of their configuration language, and their ability to scale with the size of the farm.
Foreman and Puppet were chosen for the installation and configuration of the servers: Foreman is used to install the servers, while Puppet takes care of their contextualization.
Although the two products are developed and released separately, Foreman is built on top of Puppet, so the two appear strongly integrated.
Both are open source software, giving the freedom to modify them, if necessary, to cope with specific requirements.
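
To give an idea of what the contextualization looks like in practice, a Puppet manifest of the kind applied to the compute nodes might contain resources such as the following sketch (class and package names are purely illustrative):

    # Illustrative class: ensure that the batch system worker daemon
    # is installed and running on every execute node.
    class recas::worker_node {
      package { 'condor':
        ensure => installed,
      }
      service { 'condor':
        ensure  => running,
        enable  => true,
        require => Package['condor'],
      }
    }

In a setup of this kind Foreman typically acts as the Puppet External Node Classifier, assigning classes such as this one to the hosts it has provisioned.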

Monitoring: Zabbix

Zabbix integrates into a single tool all the features desirable in a monitoring system:

  • the possibility to send alerts via IM, SMS and email;
  • the availability of graphical representations of the monitored parameters;
  • the fact that most of the sensors required to monitor the typical parameters of a data center are easily available;
  • the ease of installation;
  • the availability of documentation and of an excellent support system;
  • the ability to retain the history of the measurements for years, with the help of down-sampling and clean-up features.
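
Most of these parameters are covered by the items shipped with the Zabbix agent; as an illustration of how a site-specific sensor can be added, a custom check in zabbix_agentd.conf might look like the following sketch (item key and mount point are purely hypothetical):

    # Report 1 if the GPFS file system is mounted on the node, 0 otherwise.
    UserParameter=recas.gpfs.mounted,mountpoint -q /gpfs && echo 1 || echo 0

The corresponding item and a trigger on its value are then defined on the Zabbix server through the web interface.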

Ticketing: OpenProject

OpenProject was adopted for the planning and management of activities, and for the sharing of code, manuals, guides and information (wiki).