Guest post by Sudhir Malik, University of Puerto Rico at Mayagüez
The CMS Physics Analysis Toolkit (PAT) tutorial, the fifteenth edition of which ran from 30 June to 5 July, has come a long way. Started in January 2009, these tutorials have trained around 500 CMS physicists to use CMSSW, the CMS software framework for physics analyses.
The week-long programme includes lectures and hands-on sessions conducted by a team of 8-12 volunteer tutors. The volunteers are motivated by a desire to share their knowledge and to learn more in the process. Almost all the recent tutors were students in previous PAT tutorials, and some of them have even become PAT developers. Besides the hard work and dedication of the PAT team, the course has been successful thanks to careful analysis of the feedback we receive from the participants. Brieuc Francois, a participant at the latest session, said:
The PAT tutorial was really helpful. We learn a lot, the atmosphere is nice and the team you build is excellent. It was really a pleasure and I hope you will continue to provide such opportunities to the collaboration.
What is PAT and why do we need it?
The LHC provides CMS with millions of collisions each second. CMS must then select the collisions of interest from the multitude produced and store them for analysis. Thousands of physicists in the CMS Collaboration then analyse the data and publish papers. This huge amount of data and the complexity of the detector require a flexible data model that serves all the needs of the collaboration. The data are therefore provided in different tiers.
The most important ones for the user are called RECO and AOD. RECO contains a complete set of analysis-relevant object information that is “reconstructed” from the collision debris. AOD, or Analysis Object Data, is a subset of RECO, optimised to provide information sufficient for the majority of physics analyses in CMS at a small size of around 140 kB per collision event. The data formats of both RECO and AOD are optimised for the performance and flexibility of the reconstruction, but are difficult for end users to access in their analyses.
This is where the Physics Analysis Toolkit (PAT) comes in. PAT provides common data formats and analysis algorithms. The common data formats help avoid the adoption of different formats by different physics groups and simplify controlling the event size. Common skims within physics analysis groups also make it easier to transfer and compare analyses. Further, PAT provides the analysis groups with a common interface to the algorithms developed by physics objects groups. Approved algorithms and agreed-upon defaults enable beginners to jump right into analyses and allow everyone to profit from the latest developments. In addition to learning to manipulate what gets stored in the PAT format, a central part of the PAT tutorial is learning how to configure one's analysis in Python.
PAT tutorials for Run 2 and beyond
The upcoming LHC run, beginning early next year, presents a huge challenge to CMS computing and software: collisions will take place at higher energies, and more of them will occur simultaneously. This can lead to a larger size for each collision event. CMS must take many steps to mitigate the large increase in computing needs, such as reducing disk usage and reducing the size of the AOD data tier.
A new data tier called miniAOD was introduced in spring 2014. It reduces the AOD event size to about 30-50 kB per event and is sufficient for 80% of analyses. It will also be produced centrally in order to optimise computing resources, both CPU and disk storage. This new data tier is in fact an implementation of the PAT workflow with well-defined collections of physics objects. High-level physics objects in miniAOD are saved in the PAT format, which in turn derives from the RECO/AOD formats. The knowledge and experience gained from the PAT tutorials will continue to play a significant role in the future of CMS physics analyses.
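To give a feel for what the quoted event sizes mean in practice, here is a rough back-of-the-envelope sketch in Python. It assumes the sizes above are in kilobytes and uses a purely hypothetical sample of one billion events; the function names are illustrative and are not part of CMSSW or PAT.

```python
# Per-event sizes quoted in the text, read here as kilobytes (an assumption).
AOD_KB = 140
MINIAOD_KB_RANGE = (30, 50)

# Hypothetical sample size, chosen only for illustration.
N_EVENTS = 1_000_000_000


def total_terabytes(kb_per_event, n_events):
    """Total sample size in terabytes, using decimal units (1 TB = 1e9 kB)."""
    return kb_per_event * n_events / 1e9


def reduction_factor(aod_kb, miniaod_kb):
    """How many times smaller a miniAOD event is than an AOD event."""
    return aod_kb / miniaod_kb


if __name__ == "__main__":
    print(f"AOD:     {total_terabytes(AOD_KB, N_EVENTS):.0f} TB")
    for kb in MINIAOD_KB_RANGE:
        size_tb = total_terabytes(kb, N_EVENTS)
        factor = reduction_factor(AOD_KB, kb)
        print(f"miniAOD: {size_tb:.0f} TB at {kb} kB/event "
              f"({factor:.1f}x smaller than AOD)")
```

Under these assumptions, the same sample shrinks from 140 TB in AOD to between 30 and 50 TB in miniAOD, a reduction of roughly a factor of three to five.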