Saturday, 19 July 2014

OpenStack plays Tetris : Stacking and Spreading a full private cloud

At CERN, we're running a large-scale private cloud which provides compute resources for physicists analysing the data from the Large Hadron Collider. With hundreds of VMs created per day, the OpenStack scheduler has to perform a Tetris-like job, fitting the different flavors of VMs onto specific hypervisors.

As we increase the number of VMs that we're running on the CERN cloud, we see the impact of a number of configuration choices made early on in the cloud deployment. One key choice is how to schedule VMs across a pool of hypervisors.

We provide our users with a mixture of flavors for their VMs (for details, see

During the past year in production, we have seen a steady growth in the number of instances to nearly 7,000.

At the same time, we're seeing an increasing elastic load as the user community explores potential ways of using clouds for physics.

Given that CERN has a fixed resource pool and the available budget is defined and fixed, the underlying capacity is not elastic, and we are now starting to encounter scenarios where the private cloud can become full. Users see this as errors when they request VMs for which no free hypervisor can be located.

This situation occurs more frequently for the large VMs. Physics programs can make use of multiple cores to process physics events in parallel and our batch system (which runs on VMs) benefits from a smaller number of hosts. This accounts for a significant number of large core VMs.

The problem occurs as the cloud approaches being full. Using the default OpenStack configuration (known as 'spread'), VMs are evenly distributed across the hypervisors. If the cloud is running at low utilisation, this is an attractive configuration as CPU and I/O load are also spread and little hardware is left idle.

However, as the utilisation of the cloud increases, the resources free on each hypervisor are reduced evenly. To take a simple case, consider a cloud with two 24-core compute nodes handling a variety of flavors. If there are requests for two 1-core VMs followed by one 24-core flavor, the alternative approaches can be simulated.

In a spread configuration,
  • The first VM request lands on hypervisor A leaving A with 23 cores available and B with 24 cores
  • The second VM request arrives and following the policy to spread the usage, this is scheduled to hypervisor B, leaving A and B with 23 cores available.
  • The request for one 24 core flavor arrives and no hypervisor can satisfy it despite there being 46 cores available and only 4% of the cloud used.
In the stacked configuration,

  • The first VM request lands on hypervisor A leaving A with 23 cores available and B with 24 cores
  • The second VM request arrives and following the policy to stack the usage, this is scheduled to hypervisor A, leaving A with 22 cores and B with 24 cores available.
  • The request for one 24 core flavor arrives and is satisfied by B
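The two walkthroughs above can be reduced to a toy sketch (plain Python, only an illustration of the two weighting policies, not the actual nova scheduler code):

```python
def schedule(free_cores, request, policy):
    """Place one VM of `request` cores; return the chosen host or None.

    'spread' prefers the host with the most free cores,
    'stack' the fullest host that can still fit the request.
    """
    candidates = [h for h, free in free_cores.items() if free >= request]
    if not candidates:
        return None  # cloud is "full" for this flavor
    if policy == "spread":
        chosen = max(candidates, key=lambda h: free_cores[h])
    else:  # stack
        chosen = min(candidates, key=lambda h: free_cores[h])
    free_cores[chosen] -= request
    return chosen

# Two 24-core hypervisors, two 1-core VMs, then one 24-core VM
for policy in ("spread", "stack"):
    free_cores = {"A": 24, "B": 24}
    placements = [schedule(free_cores, n, policy) for n in (1, 1, 24)]
    print(policy, placements)
```

With spread, the 24-core request fails (both hosts are left with 23 free cores); with stack, the two small VMs land on A and host B can still take the 24-core flavor.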
A stacked configuration is achieved by making the RAM weight negative (i.e. preferring hypervisors with less free RAM), which has the effect of packing the VMs. This is done through a nova.conf setting as follows
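As a sketch of the kind of setting meant here, assuming the Havana-era ram_weight_multiplier option (whose default is a positive value, which spreads):

```ini
[DEFAULT]
# A negative RAM weight prefers hypervisors with less free RAM,
# packing (stacking) VMs instead of spreading them
ram_weight_multiplier = -1.0
```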


When a cloud is initially being set up, the question of maximum packing does not often come up in the early days. However, once the cloud has workload running under spread, it can be disruptive to move to stacked since the existing VMs will not be moved to match the new policy.

Thus, it is important as part of the cloud planning to reflect on the best approach for each different cloud use case and avoid more complex resource rebalancing at a later date.


  • OpenStack configuration reference for scheduling at

Thursday, 20 March 2014

CERN Cloud Architecture - Update

At the last OpenStack Design Summit in Hong Kong I presented the CERN Cloud Architecture in the talk “Deep Dive into the CERN Cloud Infrastructure”. Since then the infrastructure has grown to a third cell and we have enabled the ceilometer compute-agent. Because of that we needed to make some architecture changes to cope with the number of nova-api calls that the ceilometer compute-agent generates.

The cloud infrastructure now has more than 50,000 cores, and after the hardware deliveries expected during the coming months, more than 35,000 new cores will be added, most of them in the remote Computer Centre in Hungary. Also, we continue to migrate existing servers to OpenStack compute nodes at an average of 100 servers per week.

Fig. 1 – High-level view of CERN Cloud Infrastructure
We are using Cells in order to scale the infrastructure and for project distribution.
At the moment we have three Compute Cells. Two are deployed in Geneva, Switzerland and the other in Budapest, Hungary.

All OpenStack services running in the Cell Controllers are behind a Load Balancer and we have at least 3 running instances for each of them. As message broker we are using RabbitMQ clustered with HA queues.

Metering is an important requirement at CERN to account for resources, and ceilometer is the obvious solution to provide this functionality.
Considering our Cell setup, we are running ceilometer-api and ceilometer-collector in the API Cell, and ceilometer agent-central and ceilometer-collector in the Compute Cells.

Fig.2 – OpenStack services that are running in different Cell layers at CERN Cloud infrastructure. At green the new configured components.
In order to get information about the running VMs, the ceilometer compute-agent calls nova-api. In our initial setup we took the simple approach of using the nova-api already running in the API Cell. This means all ceilometer compute-agents authenticate with keystone in the API Cell and then call the nova-api running there. Unfortunately this approach doesn’t work with cells because of the bug:
Even though ceilometer was not getting the right instance domain and failed to find the VMs on the compute nodes, we noticed a huge increase in the number of calls hitting the nova-api servers in the API Cell, which could degrade the user experience.

We then decided to move this load to the Compute Cells by enabling nova-api compute there and pointing all ceilometer compute-agents to it instead. This approach has 3 main advantages:
1) Isolation of nova-api calls per Cell, allowing better dimensioning of the Compute Cell controllers and separation between user and ceilometer requests.
2) nova-api on the Compute Cells uses the nova Cell databases. This distributes the queries between databases instead of overloading the API Cell database.
3) Because of point 2), the VM domain name is now reported correctly.

However, to deploy nova-api compute at Compute Cell level we also needed to configure other components: keystone, glance-api and glance-registry.

- Keystone is configured per Compute Cell with the following endpoints (local nova-api and local glance-api). Only service accounts can authenticate with the keystones running in the Compute Cells; users are not allowed.
Configuring keystone per Compute Cell allows us to distribute the nova-api load at Cell level. For ceilometer, only the Cell databases are used to retrieve instance information. Keystone load is also distributed: instead of using the API Cell keystone, which is used by every user, ceilometer only uses the Compute Cell keystones, which are completely isolated from the API Cell. This is especially important because we are not using PKI and our keystone configuration is single threaded.

- From the beginning we have been running glance-api at Compute Cell level. This allows us to have an image cache in the Compute Cells, which is especially important for the Budapest Computer Centre since the Ceph deployment is in Geneva.
The ceilometer compute-agent also queries for image information using nova-api. However, we can’t use the existing glance-api because it uses the API Cell keystone for token validation. Because of that we set up another glance-api and glance-registry at Compute Cell level, listening on a different port and using the local keystone.

- Nova-api compute is enabled at Compute Cell level. All nova services running in the Compute Cell controllers use the same configuration file, “nova.conf”. This means that for the nova-api service we needed to override the “glance_api_servers” configuration option to point to the new local glance-api, while keeping the old configuration that is necessary to spawn instances. The nova-api service uses the local keystone. The metadata service is not affected by this change.
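As an illustrative fragment (the hostname and port are placeholders, not our real endpoints), the override in the Compute Cell nova.conf looks something like:

```ini
[DEFAULT]
# Point nova-api in the Compute Cell at the local glance-api
# that validates tokens against the local keystone
glance_api_servers = glance-cell01.example.ch:9293
```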

We expect that with this distribution, if ceilometer starts to overload the infrastructure, user experience will not be affected.
With all these changes in the architecture, we have now enabled the ceilometer compute-agent in all Compute Cells.

Just out of curiosity, I would like to finish this blog post with a plot showing the number of nova-api calls before and after enabling the ceilometer compute-agent across the whole infrastructure. In total it increased by more than 14 times.

Friday, 7 March 2014

Enable Cinder-multi-backend with an existing Ceph backend.

CERN IT is operating a 3 PetaByte Ceph cluster, and one of our use cases is to store our OpenStack volumes and images. For more details on the Ceph cluster, Dan van der Ster's presentation is available at the following link.

After the migration to Havana, we started to provide the volume service to a wider audience and therefore needed to tune our cinder configuration.

This post will show you how we enabled multi-backend support in Cinder, how we dealt with the migration of our existing Ceph volumes to the new volume type, and finally the quality of service we want to enable on the newly created volume type.

We already had Ceph configured as the default backend:

We added the following options in /etc/cinder/cinder.conf to enable multi-backend support:
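The configuration looked roughly like the following (the section name and Ceph options are a sketch of a Havana-era RBD backend, not our exact settings):

```ini
[DEFAULT]
# Comma-separated list of config groups, one per backend
enabled_backends = standard

[standard]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
volume_backend_name = standard
```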


Create the volume type:
# cinder type-create standard
# cinder type-key standard set volume_backend_name=standard

To verify the type has been created you can run:

# cinder extra-specs-list
|                  ID                  |   Name   |              extra_specs              |
| c6ad034a-5d97-443b-97c6-58a8744bf99b | standard | {u'volume_backend_name': u'standard'} |

A restart is needed after these changes:

# for i in volume api scheduler; do service openstack-cinder-$i restart; done

At this point it is not possible to attach/detach volumes that were created without a type.
Nova will fail with the following error:

nova.openstack.common.notifier.rpc_notifier ValueError: Circular reference detected

To fix it, we had to manually update the database in two steps (take a mysqldump of cinder first, don't forget!):
  • Update the volume_type_id column with the output of cinder extra-specs-list :
update volumes set volume_type_id="c6ad034a-5d97-443b-97c6-58a8744bf99b" where volume_type_id is NULL; 
  • For each controller the host column needs an update:
update volumes set  host='p01@standard' where host='p01';
update volumes set  host='p02@standard' where host='p02'; 

All volumes are now of type standard and can be operated as usual. Be sure to have "default_volume_type" defined in your cinder.conf; otherwise it will default to 'None' and these volumes will not be functional.
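A minimal fragment of the kind meant here:

```ini
[DEFAULT]
# Volumes created without an explicit type get this one
default_volume_type = standard
```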

The last step is to delete the old volume settings from your DEFAULT section.

Since we have a large number of disks in the Ceph store, there is a very high potential capacity for IOPS, but we want to be sure that individual VMs cannot monopolise this capacity. This involves enabling the Quality-of-Service features in cinder.

Enabling QoS is straightforward:
# cinder qos-create standard-iops consumer="front-end" read_iops_sec=400 write_iops_sec=200
# cinder qos-associate 10a7b93c-38d7-4061-bfb8-78d01e2fe6d8 c6ad034a-5d97-443b-97c6-58a8744bf99b
If you want to add additional limits:
# cinder qos-key 10a7b93c-38d7-4061-bfb8-78d01e2fe6d8 set read_bytes_sec=80000000
# cinder qos-key 10a7b93c-38d7-4061-bfb8-78d01e2fe6d8 set write_bytes_sec=40000000
# cinder qos-list
For the QoS parameters to take effect on an existing volume, you need to detach and reattach it.

This information was collected from the CERN cloud team, with a big thank you to the Ceph team for their help.

Monday, 24 February 2014

Our Cloud in Havana

TL;DR Upgrading a nearly 50,000 core cloud from Grizzly to Havana can be done with a series of steps, each of which can have short periods of reduced functionality but with constant VM availability.

At CERN, we started our production cloud service on Grizzly in July 2013. The previous OpenStack clouds had been pre-production environments with a fixed lifetime (i.e. they were available for use with an end date to be announced where the users would move to the new version via re-creating instances with tools such as Puppet or snapshot/upload instances).

With the Grizzly release, we made the service available with an agreement to upgrade in place rather than build anew. This blog details our experiences moving to the Havana release.

High Level Approach

We took a rolling upgrade approach, component by component, depending on need and complexity.
Following the release notes and operations guide, the order we chose was the following
  1. Ceilometer
  2. Glance
  3. Keystone
  4. Cinder
  5. Client CLIs
  6. Horizon
  7. Nova
The CERN cloud is based on the RDO distribution. The majority of the servers are running Scientific Linux 6.5, with hypervisors running KVM. We follow a multi-hypervisor approach, so we also have Windows 2012 R2 with Hyper-V. The cloud databases are using MySQL.

Other configurations and different sites may see other issues than listed here so please check that this approach is appropriate for your environment before execution.


Ceilometer

While the cloud was in production in July, we had problems getting ceilometer to work well with cells and with the number of hypervisors we have (over 2,000). Thus, we chose to upgrade early to Havana, as this provided much of the functionality we needed and avoided the need to backport.

Havana Ceilometer worked well with Grizzly Nova and allowed us to progress further with detailed metering of our cloud. We still need the patch (which was subsequently included in the Havana 2013.2.2 stable release, after we upgraded).


Glance

The CERN Glance environment is backed by Ceph. We run multiple glance servers behind an HAProxy load balancer.

For the upgrade, we
  • Stopped the Glance service at 16h00
  • Performed the database upgrade steps
  • Installed the Havana packages in all top and cell controllers
  • Glance was re-enabled at 16h45
One issue was spotted where access to non-public images was possible when using the nova image-list command. The images were not visible when using glance image-list. As a mitigation, access to the Nova image API was blocked and users were told to use the glance command (as was already recommended).

The root cause was related to the bug. The fix for the problem was to add a policy statement for the parameter context_is_admin, which is needed to limit access to images by project.


Keystone

The overall approach taken was to try an online upgrade. We have a keystone in each of our cells (so we can talk to each of them independently of the top-level cell API nodes), so these were good candidates to upgrade first. We use Active Directory for the user credentials, so the Keystone service itself holds only a limited amount of state (related to EC2 credentials and token management).

There were some significant additional functionalities, such as tokens based on PKI (to allow validating tokens without calling home) and the V3 API, which adds lots of functionality we are interested in, such as Domains. We chose to take the incremental small-step approach of migrating with the same function and then enabling additional function once the code had been deployed.

For the main keystone instance, we were aiming to benefit from the load balancing layer and from the fact that there were minimal database changes in Keystone. The largest problem foreseen was the EC2 credentials since these, in Grizzly, are stored in a different column to those in Havana (just Credentials). These columns had to be kept in sync with a script during the phase where both versions were running.

In practice, we performed the upgrade more rapidly than originally planned as some of the clients gave errors when they had received a token from a Grizzly keystone and were authenticating against a Havana one. Thus, we upgraded all the keystone servers as soon as the final functional tests in the production environment were completed.

The detailed error message on a Grizzly client was

2014-01-23 09:10:02    ERROR [root] 'token'
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/keystone/common/", line 265, in __call__
    result = method(context, **params)
  File "/usr/lib/python2.6/site-packages/keystone/token/", line 541, in validate_token
    self._assert_default_domain(context, token_ref)
  File "/usr/lib/python2.6/site-packages/keystone/token/", line 482, in _assert_default_domain
    if (token_ref['token_data']['token']['user']['domain']['id'] !=
KeyError: 'token'

The solution was to complete the upgrade to Keystone on all the servers.

During the upgrade, we also noticed database growth from expired tokens that had not been purged over the past few months; this can now be managed using keystone-manage token-flush. This is not a specific upgrade problem, but it is worth doing to keep the database manageable.
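As a sketch, the flush can be scheduled with cron (the schedule, user and log path below are illustrative, not our production setup):

```shell
# /etc/cron.d/keystone-token-flush (illustrative)
# Flush expired keystone tokens every 6 hours
0 */6 * * * keystone /usr/bin/keystone-manage token-flush >> /var/log/keystone/token-flush.log 2>&1
```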


Cinder

We had backported some Cinder cells functionality to Grizzly so that we could launch our Ceph volume storage function before we migrated to Havana. CERN currently runs a 3.5 PB Ceph service including OpenStack images, OpenStack volumes and several other projects working with the object and block interfaces into Ceph. We had to take a careful migration approach for Cinder to protect the additional columns which were not in the standard Grizzly tables.

The impact for the user community was that volume creation/deletion was not possible for around 15 minutes during the upgrade. Existing VMs with attached storage continued to run without problems.

We followed the standard upgrade procedure:
  • Save a list of all volumes before starting, using nova volume-list --all-t
  • Stop the cinder daemons
  • Perform a full DB backup of cinder
  • Update all RPMs
  • Run cinder-manage db sync
  • Restart the daemons
  • Check that a new nova volume-list --all-t matches the saved list
One minor issue we found was on the client side. Since the nova client depends on the cinder client, we had to upgrade the nova client to get to the latest cinder client version. With the nova client package being backwards compatible, this was not an issue but forced an earlier than planned update of nova client.

Client CLIs

We use the RDO client packages on our Linux machines. The upgrade was a change of repository and yum update.

The Windows and Mac clients were upgraded similarly using pip.


Horizon

We have customised Horizon slightly to add some basic features (help and subscribe buttons on the login page) and to disable some features which aren't currently supported on the CERN cloud, such as security groups. These changes were repeated.

The migration approach was to set up the Horizon web server on a new machine and test out the functionality. Horizon is very tolerant of different versions of code, including partial upgrades such as the CERN environment with Nova downlevel. We found a minor bug with the date picker when using Firefox which we'll be reporting upstream.

Once validated, the alias and virtual host definition in Apache were changed to point to the new machine.

Once under production load, however, we started to see stability issues with the Apache server and large numbers of connections. Users saw authentication problems on login and HTTP 500 errors. In the short term, we put in place a regular web server restart to allow us to analyse the problem. With memcached looking after the session information, this was a workaround we could use without user disruption, but it is not a long-term solution (from the comments below, this is a reported bug at


Nova

The Nova migration testing took the longest time, for several reasons.

It is our most complex configuration, with many cells in two data centres, in Geneva and Budapest. With nearly 50,000 cores and thousands of hypervisors, there are a lot of machines to update.

We have customised Nova to support the CERN legacy network management system. This is a CERN-specific database which contains the hostnames, MAC addresses and IPs, and it is kept up to date by the nova-network component.

To test, we performed an offline database copy and stepped through the migration scripts in a cloned environment. The only problems encountered were due to the local tables we had added. We performed a functional and stress test of the environment. With the dynamic nature of the cloud at CERN, we wanted to be sure that there would not be regression in areas such as performance and stability.

During the testing, we found some minor problems
  • euca2ools did not work on our standard SL 6 configuration (version 2.1.4). An error "instance-type should be of type string" was raised on VM creation. The root cause was a downlevel boto version (see bugzilla ticket)
  • When creating multiple machines using --num-instances on the nova command line with the cells configuration, the unique name created was invalid. A launchpad report was raised but this was a non-blocking issue as the feature is not used often.
We took a very conservative approach to ensure clear steps and post step checkout validation. We had investigated doing the migration cell by cell but we took a cautious approach for our first upgrade. For the actual migration, the steps were as follows:
  • Disable Puppet automatic software updating so we would be in control of which upgrades occurred when.
  • Move the Havana repository from the test/QA environment to the master branch in Puppet
  • Run "yum clean all" on all the nodes with mcollective
  • Stop Puppet running on all nodes
  • Starting with the top API controllers, then the cell controllers and finally the compute nodes
    • Block requests to stop new operations arriving
    • Disable automatic restart of daemons by the monitoring system
    • Stop the daemons and disable RabbitMQ
  • Back up all DBs and create a local dump of each in case we need a rapid restore
  • Update all the Nova RPMs on top controllers
  • Update the top DB
  • Reboot to get the latest Linux kernel
  • Update all the Nova RPMs on the cell controller
  • Update the cell DB
  • Reboot for new kernel
  • Update the RPMs on the compute nodes
  • Enable RabbitMQ
  • Enable services on top controllers
  • Enable services in child cell controllers
  • Enable nova-compute
  • Check out the service communications
  • Update the compute nodes using mcollective
  • Enable monitoring/exceptions on all nodes
  • Enable Puppet on all nodes
  • Perform check out tests
  • Enable user facing APIs at the top cells
Total time to run these steps was around 6 hours, with the VMs running throughout. Backing up the databases, cloning them and testing the final migration scripts on the production data took a couple of hours. The software upgrade steps were also significant: checking that RPMs are deployed across thousands of hypervisors is a lengthy process. Although there were a lot of steps, each one could be performed and checked out before continuing.

Post Production Issues

  • The temporary table containing quota cache data was cleaned as part of the upgrade but does not appear to be 100% recreated. An upstream bug report seems to describe the problem. Logging in with the dashboard fixes the problem in most cases.


Summary

Upgrading OpenStack in production needs some careful planning and testing, but it is a set of standard upgrade steps. There are lots of interesting features in Havana to explore, for both the operations team and the end users of the CERN cloud.


This information was collected from the CERN cloud team that performed the upgrade (Belmiro, Jose, Luis, Marcos, Thomas, Stefano) while also providing user support and adding new hardware.

Monday, 9 December 2013

Swiss and Rhone Alpes User Group Meeting at CERN

A combined meeting of the Swiss OpenStack and Rhone Alpes OpenStack user groups was held at CERN on Friday 6th December 2013. This is the 6th Swiss User Group meeting and the 2nd Rhone Alpes one.

Despite the cold and a number of long-distance travellers, 85 people from banking, telecommunications, academia/research and IT companies gathered to share OpenStack experiences across the region and get the latest news.

The day was split into two parts. In the morning, there was the opportunity to visit the ATLAS experiment, see a 3-D movie on how the experiment was built, and visit the CERN public exhibition points at the Globe and Microcosm. Since the Large Hadron Collider is currently undergoing maintenance until 2015, the experimental areas are accessible to small groups with a guide.

At the ATLAS control room, you could see a model of the detector


The real thing is slightly bigger and heavier... 100m underground, only one end is accessible currently from the viewing gallery.

The afternoon session presentations are available at the conference page.

After a quick introduction, I gave feedback on the latest board meeting in Hong Kong with topics such as the election process and the defcore discussion to answer the "What is OpenStack" question.

The following talk was a set of technical summaries of the Hong Kong summit from Belmiro, Patrick and Gergely. Belmiro covered the latest news on Nova, Glance and Cinder, along with some slides from his deep dive talk on CERN's OpenStack cloud.


Patrick from Bull's xlcloud covered the latest news on Heat which is rapidly becoming a key part of the OpenStack environment as it not only provides a user facing service for orchestration but also is now a pre-requisite for other projects such as Trove, the database as a service.


In a good illustration of the difficulties of compatibility, the OpenOffice document failed to display the key slide in PowerPoint, but Patrick covered the details while the PDF version was brought up. Heat currently supports AWS CloudFormation templates but is now adding a native template language, HOT, to cover additional functions. The Icehouse release will add more auto-scaling features, integration with ceilometer and a move towards the TOSCA standard.

Gergely covered some of the user stories and the latest news on ceilometer as it starts to move into alarming on top of the existing metering function.


Alessandro then covered the online clouds at CERN, which opportunistically use the thousands of servers attached to the CMS and ATLAS experiments when the LHC is not running. The aim is to be able to switch as fast as possible from the farm being used to filter the 1 PB/s from the LHC to performing physics work. Current tests show it takes around 45 minutes to instantiate the VMs on the 1,400 hypervisors.


Jens-Christian gave a talk on the use of Ceph at SWITCH. Many of the aspects seem similar to the block storage that we are looking at within CERN's cloud. SWITCH is aiming at a 1,000-core cluster to serve the Swiss academic community, including dropbox, IaaS and app-store style services. It was particularly encouraging to see that SWITCH have been able to perform online upgrades between versions without problems... the regular warning to be cautious with CephFS was also made, so using rbd rather than filesystem-backed storage makes sense.


Martin gave us a detailed view of the options for geographically distributed clouds using OpenStack. This was intriguing on multiple levels, in view of CERN's ongoing work with the community on federated identity, along with some useful hints and tips on the different kinds of approaches. Martin converged on using Regions to achieve the ultimate goals, but there were several potentially useful intermediate configurations, such as cells, which CERN is using extensively in its multi-data-centre cloud with Budapest and Meyrin. I fully agree with Martin's perspective on the need for cells to become more than just a nova concept, as we require similar functions in Glance and Cinder for the CERN use case. Martin had given the same talk in Paris on Thursday and was giving it again on Monday in Israel, so he is doing a fine job of re-using presentations.


Sergio described Elastic Cluster which is an open source tool for provisioning clusters for researchers. He illustrated the function with a youtube video which demonstrates the scaling of the cluster on top of a cloud infrastructure.


Finally, Dave Neary gave an introduction to Open Shift and how to deploy a Platform-as-a-Service solution. Using a simple git/ssh model, various PaaS instances can be deployed easily on top of an OpenStack cloud. The talk included a memorable demo where Dave showed the depth of his ruby skills but was rescued by the members of the audience and the application deployed to rounds of applause.


Many thanks to the CERN guides and administration for helping with the organisation, to all of the attendees for coming and making it such a lively meeting, and to Sven Michels and Belmiro Moreira for the photos.

Thursday, 17 October 2013

Log handling and dashboards in the CERN cloud

At CERN, we developed a fabric monitoring tool called Lemon around 10 years ago while the computing infrastructure for the Large Hadron Collider was being commissioned.

As in many areas, modern Open Source tools are now providing equivalent function without the effort of maintaining the entire software stack and benefiting from the experience of others.

As part of the ongoing work to align the CERN computing fabric management to open source tools with strong community support, we have investigated a monitoring tool chain which would allow us to consolidate logs, perform analysis for trends and provide dashboards to assist in problem determination.

Given our scalability objectives, we can expect O(100K) endpoints with multiple data sources to be correlated from PDUs, machines and guest instances.

A team looking into a CERN IT reference architecture has divided the problem into four areas:

  • Transport - how to get the data from the client machine to where it can be analysed, with the aim of a low overhead per client. Flume was chosen as the implementation for this part after a comparison with ActiveMQ (which we use in other areas such as monitoring notifications), as there were many pre-existing connectors for sources and sinks.
  • Tagging - where we can associate additional external metadata with the data so each entry is self describing. Elastic Search was the selection here.
  • Dashboard - where queries on criteria are displayed in an intuitive way (such as time series or pie charts) on a service specific dashboard. Kibana was the natural choice in view of the ease of integration with Elastic Search.
  • Long term data repository - to allow offline analysis and trending when new areas of investigation are needed. HDFS was the clear choice for unstructured data, and a number of other projects at CERN require its function too.

Using this architecture, we have implemented the CERN private cloud log handling, taking the logs from the 700 hypervisors and controllers around the CERN cloud and consolidating them into a set of dashboards.

Our requirements are
  • Have a centralized copy of our logs to ease problem investigation
  • Display OpenStack usage statistics such as management dashboards
  • Show the results of functional tests and probes to the production cloud
  • Maintain a long term history of the infrastructure status and evolution
  • Monitor the state of our databases
The Elastic Search configuration is a 14-node cluster with 11 data nodes and 3 HTTP nodes, configured using Puppet and running, naturally, on VMs in the CERN private cloud. We have 5 shards per index with 2 replicas per shard.
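In elasticsearch.yml terms, that index layout corresponds to defaults of the following kind (illustrative, not our exact configuration):

```yaml
# Default index settings: 5 primary shards, 2 replicas per shard
index.number_of_shards: 5
index.number_of_replicas: 2
```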

Kibana is running on 3 nodes, with Shibboleth authentication to integrate to the CERN Single Sign On system.

A number of dashboards have been created. A Nova API dashboard shows the usage of the compute API.

An Active User dashboard helps us to identify whether there are heavy users causing a disturbance.

The dashboards themselves can be created dynamically without needing an in-depth knowledge of the monitoring infrastructure.
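This is possible because a Kibana dashboard (version 3 at the time) is itself just a JSON document, so a service owner can assemble one from panels without touching the underlying infrastructure. A rough sketch of the shape of such a document, with invented titles:

```python
import json

# Rough sketch of a Kibana-3-style dashboard document: a dashboard
# is a titled set of rows, each containing visualisation panels.
dashboard = {
    "title": "Nova API (example)",
    "rows": [
        {
            "title": "Requests over time",
            "panels": [
                {"type": "histogram", "span": 12,
                 "time_field": "@timestamp"},
            ],
        }
    ],
}

print(json.dumps(dashboard, indent=2))
```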

This work has been implemented by the CERN monitoring and cloud teams, with thanks to the Flume, Elastic Search and Kibana communities for their support.

Saturday, 21 September 2013

A tale of 3 OpenStack clouds : 50,000 cores in production at CERN

I had several questions at the recent GigaOm conference in London regarding the OpenStack environments at CERN. This blog explains the different instances and the distinct teams that evaluate and administer them.

The CERN IT department provides services for around 11,000 physicists from around the world. Our central services provide the basic IT infrastructure that any organisation needs, from e-mail and web services to databases and desktop support.

In addition, we provide computing resources for the physics analysis of the 35PB/year that comes from the Large Hadron Collider. The collider is currently being upgraded until 2015 in order to double the energy of the beams and enhance the detectors which are situated around the ring.

During this period, we have a window for significant change to the central IT services. While the analysis of the data from the first run of the LHC continues during this upgrade, the computing infrastructure in CERN IT is moving towards a cloud model based on popular open source technologies used outside of CERN such as Puppet and OpenStack.

As we started working with these communities through mailing lists and conferences, we encountered other people from High Energy Physics organisations around the world who were going through the same transition, such as IN2P3, Nectar, IHEP and Brookhaven. Most surprising was finding out that others on the CERN site were working on OpenStack and Puppet, and that the CERN IT cloud was actually the smallest one!

The two largest experiments, ATLAS and CMS, both have large scale farms close to the detectors to filter the 1 PB/s of data from each detector before sending it to the CERN computer centre for recording and analysis. These High Level Trigger farms consist of over 1,000 servers, each typically a 12 core system. During the upgrade of the LHC, these servers would otherwise be idle, as there are no collisions. However, the servers are attached to CERN's technical network, which is isolated from the rest of CERN's network because it is used for systems closely associated with the accelerator and other online systems. Thus, they could not easily be used for running physics programs, since the data is not accessible from the technical network.

In view of this, the approach taken was to use a cloud whose virtual machines are unable to access the technical network. This allows strong isolation and makes it very easy to start/stop large numbers of programs at a time. During CMS's tests, they were starting/stopping 250 VMs in a five-minute period.
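Bulk starts at that scale are usually done by asking the compute API for many instances per request (nova's min-count/max-count style parameters) rather than booting one VM at a time. A sketch of the batching arithmetic, where the batch size of 50 is an invented example rather than the CMS setup:

```python
def plan_batches(total_vms, batch_size):
    """Split a bulk-boot request into API calls of at most
    batch_size instances each."""
    batches = []
    remaining = total_vms
    while remaining > 0:
        n = min(batch_size, remaining)
        batches.append(n)
        remaining -= n
    return batches

batches = plan_batches(250, 50)
print(batches)  # five boot calls of 50 instances each
```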

For the software selection, each experiment makes its own choices independent of the CERN IT department. However, both teams selected OpenStack as their cloud software. The ATLAS cloud was set up with the help of Brookhaven National Laboratory, which is also running OpenStack in its centre. The CMS cloud was set up and run by two CMS engineers.

For configuration management, ATLAS online teams were already using Puppet and CMS migrated from Quattor to Puppet during the course of the project. This allowed them to use the Stackforge Puppet modules, as we do in the CERN IT department.

Both the experiment clouds are now in production, running physics simulation and some data analysis programs that can fit within the constraints of limited local storage and network I/O to the computer centre.

Thus, the current CERN cloud capacities are as follows.

Cloud                                Cores
ATLAS cloud                          28,800 (HT)
CMS OOOO cloud                       -
CERN IT Ibex and Grizzly clouds      20,952 (HT)

This makes over 60,000 cores in total managed by OpenStack and Puppet at CERN. With hyper-threading enabled on both AMD and Intel CPUs, it is always difficult to judge exactly how much extra performance is achieved; hence my estimate of an effective delivery of 50,000 cores in total.

While the CERN IT cloud currently stands at only around 20,000 cores, we are installing around 100 hypervisors a week in the two data centres in Geneva and Budapest, so we expect this share to grow significantly over the next 18 months as we aim for 300,000 cores in 2015.
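As a rough sanity check on that target, assuming around 48 hyper-threaded cores per new hypervisor (an assumption for illustration, not a published figure):

```python
current_cores = 20_000         # approximate CERN IT cloud today
target_cores = 300_000         # the 2015 goal
hypervisors_per_week = 100
cores_per_hypervisor = 48      # assumed; real hardware varies

cores_per_week = hypervisors_per_week * cores_per_hypervisor
weeks_needed = (target_cores - current_cores) / cores_per_week
print(round(weeks_needed))  # about 58 weeks, comfortably within 18 months
```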