Guest post by Federico De Guio.
Sixty-four petabytes or about 64 million gigabytes. That is how much data the CMS detector has collected from three years of proton collisions at the LHC. These data are processed and analysed using a complex software framework called CMSSW, which was recently showcased by GitHub, the world’s largest code host.
The LHC is the most energetic particle accelerator ever built and smashes together protons in the centre of CMS 40 million times each second. The CMS detector takes “snapshots” of these collisions to look for signs of physics that has not been observed before. Of course, CMS physicists don’t look at these individual snapshots by eye: CMSSW sifts through the data for them, combining the low-level information coming from 60 million channels of CMS to “reconstruct” particle candidates that are characterised by their energy, direction of flight, electrical charge and so forth. These particle candidates are then used as inputs in the final analyses. In addition to data processing for analysis, CMSSW is also used for simulating collision events.
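To give a flavour of what “reconstruction” means, here is a toy sketch in Python. It is not CMSSW code: the `Hit` and `Candidate` classes, the `reconstruct` function and the rule used to group hits (a pre-assigned cluster label) are all assumptions made purely for illustration, and electrical charge is left out to keep things short.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Hit:
    """One low-level detector reading (hypothetical, for illustration only)."""
    cluster: int       # assumed label saying which hits belong together
    energy: float      # energy deposited, in GeV
    direction: tuple   # unit vector (x, y, z) from the collision point


@dataclass
class Candidate:
    """A reconstructed particle candidate: total energy and flight direction."""
    energy: float
    direction: tuple


def reconstruct(hits):
    """Combine hits sharing a cluster label into one candidate per cluster."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit.cluster].append(hit)

    candidates = []
    for label in sorted(groups):
        group = groups[label]
        total_energy = sum(h.energy for h in group)
        # Energy-weighted sum of the hit directions, component by component.
        summed = tuple(
            sum(h.energy * h.direction[i] for h in group) for i in range(3)
        )
        norm = math.sqrt(sum(c * c for c in summed))
        if norm == 0:
            continue  # no usable direction, so skip this cluster
        direction = tuple(c / norm for c in summed)
        candidates.append(Candidate(total_energy, direction))
    return candidates


hits = [
    Hit(cluster=0, energy=10.0, direction=(1, 0, 0)),
    Hit(cluster=0, energy=30.0, direction=(0, 1, 0)),
    Hit(cluster=1, energy=5.0, direction=(0, 0, 1)),
]
candidates = reconstruct(hits)
```

In this sketch the three hits collapse into two candidates, each carrying a summed energy and a normalised direction. The real reconstruction in CMSSW is far more involved, but the principle of turning many channel readings into a handful of physics objects is the same.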
CMSSW is written in C++ and Python, and has several million lines of code. CMSSW must evolve constantly, reflecting both the improvements being made to the code itself and the changing calibrations and other conditions of the CMS sub-detectors. A robust revision-control system is needed to aggregate and keep track of the code, which receives contributions from a few hundred CMS collaborators. CMSSW was initially hosted on CVS, but CERN support for this platform came to an end last year. CMS therefore performed several feasibility studies before migrating to Git a little over a year ago.
Git was invented in 2005 by Linus Torvalds and provides many advantages for collaborative coding: it allows you to review the code easily, it provides a simple way to share developments with your colleagues, and it helps automate the testing and integration of changes.
On the other hand, as it is a relatively new technology, only a tiny fraction of the CMS Collaboration knew how to use Git at the time of the transition from CVS. In a collaboration with more than 3000 scientists, some inertia is to be expected when migrating away from an old and well-known system to a new one. Fortunately, everyone involved learnt to use Git rather quickly and the migration wasn’t so painful, thanks in large part to the efforts of Giulio Eulisse of the CMS Offline team. In one year, nearly 900 people have forked the CMSSW repository and more than 200 are actively contributing to it. Indeed, in the last month alone over 350 commits were merged to the main branch of the code.
The collaboration decided to host the CMSSW code publicly on GitHub, which provided excellent support for the migration of the code. This popular platform hosts several high-profile research and commercial projects, including the Linux kernel. The choice to make the CMSSW code completely public makes CMS a mostly open-source* experiment! We are therefore very pleased that GitHub has chosen to showcase our software and we hope it will inspire future generations of scientists.
* CMSSW uses some closed-source code, such as Oracle databases.