Alice Off-line FAQ

Analysis Topics

 


  • How to obtain a new GRID certificate

    What you need to do in order to get a GRID certificate is described in User Registration.
    You have to complete all 5 steps of the registration procedure.

    Detailed instructions for CERN users can be found here.


  • Vertex

    • Methods to study the relation of a given ESD track to the reconstructed primary vertex
      • Q: Which primary? In AliESD I see GetPrimaryVertex (primary from ESD tracks) and GetPrimary (from SPD). Which one should we use nowadays? (PDC06, for example, pp with standard reconstruction). In AliESDVertex how do I get the coordinates X, Y, Z?
      • A (Yuri Belikov): It is exactly as you say. There are two "versions" of the primary vertex stored in the ESD. One is reconstructed using the SPD RecPoints (before the tracking), and the other one is reconstructed using the ESD tracks (after the tracking). Potentially, the first "version" of the vertex has somewhat worse precision (compared with the second option); however, the reconstruction efficiency of the first method is, again potentially, somewhat higher. The choice is yours.
      • Q: AliESDtrack::RelateToVertex(const AliESDVertex *vtx, Double_t b, Double_t maxd) tells me only whether the track has a distance from the vertex less than maxd or not.
      • A (Yuri Belikov): Not only. The method does much more. It gives you the possibility to "re-attach" an ESD track to an arbitrary vertex. You can create your own vertex, pass the pointer to it to the method, and the method will propagate this track to the DCA to your vertex and try to "constrain" this track to this vertex. (The method is not const! It changes the track parameters.) If, for some reason, this is not possible, the method returns kFALSE. (A short usage sketch is given at the end of this list.)
      • Q: The GetImpactParameters methods return the impact parameters in xy and z (and their errors) referred to what? If RelateToVertex has not been called, are they referred to (0,0,0), and after it has been called, to the actual vertex used there?
      • A (Yuri Belikov): They are referred to the primary vertex stored in the same ESD (with the current version, the vertex reconstructed with the SPD). RelateToVertex is anyway called in the reconstruction, with the pointer to the reconstructed SPD vertex.
      • Q: AliExternalTrackParam offers the various PropagateTo and PropagateToDCA methods. These offer more information than the pure impact parameters. Any recommendation on what is best to use? Are there other methods? Which are the most used?
      • A (Yuri Belikov): PropagateTo propagates this track to a "reference plane" given by the argument. The most typical case when you need it is the tracking. The two PropagateToDCA methods propagate this track to the Distance of Closest Approach either to an arbitrary vertex (given by the argument) or to an arbitrary track (given by the argument). The most typical case where they are used is the secondary vertex reconstruction.
      • Q: I am not able to find in the ESD one piece of information which I think is extremely important: what and where is the first MEASURED point of a track. I see that parameters are offered at various positions and extrapolations can be made. But it is still essential, sometimes, to know where the first used cluster lies. For example, for a track nicely extrapolated back to the vertex, I would like to know if the first point is on the innermost layer of the ITS or only on the 2nd or the 3rd... The same goes for tracks of particles which have been produced inside the TPC.
      • A (Yuri Belikov): Yes, this is useful information. In principle, there is a way to recalculate this information. Or you can figure it out by looking at the stored "Track Points". This is not very "natural", but the problem is that storing it directly in the ESD would additionally increase the size of the ESD, and we are already above the limit.
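
        A minimal sketch tying the methods above together (this is only an illustration: it assumes the AliESDEvent/AliESDtrack interface of a recent AliRoot, and the vertex getters and cut value are examples, not a recommendation):

        // Relate an ESD track to a chosen primary vertex and read its impact parameters
        void RelateTrackToVertex(AliESDEvent *esd, Int_t iTrack)
        {
          // the two "versions" of the primary vertex discussed above; pick the one that suits you
          const AliESDVertex *vtxSPD    = esd->GetPrimaryVertexSPD();     // from SPD RecPoints
          const AliESDVertex *vtxTracks = esd->GetPrimaryVertexTracks();  // from ESD tracks
          const AliESDVertex *vtx = vtxTracks ? vtxTracks : vtxSPD;

          AliESDtrack *track = esd->GetTrack(iTrack);
          Double_t b = esd->GetMagneticField();                // solenoid field

          // re-attach the track to the chosen vertex; note that this modifies the track
          if (!track->RelateToVertex(vtx, b, 3.)) return;      // 3 cm is an illustrative maxd

          // the impact parameters (in xy and z) are now referred to that vertex
          Float_t dxy = 0., dz = 0.;
          track->GetImpactParameters(dxy, dz);

          // for secondary vertexing, AliExternalTrackParam::PropagateToDCA can instead
          // propagate (a copy of) the track parameters to an arbitrary vertex or track
        }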

  • Kinematics Tree

    • Linking reconstructed particles and generated particles
      • Q: There has been a jet-kinematics production in the PDC06 where only 2 files were generated: galice.root and Kinematics.root. I would like to extract the header and kinematics information using the TSelectors in AliEn.
      • A (Jan Fiete Grosse-Oetringhaus): Look at AliSelectorRL in the PWG0 directory of a HEAD AliRoot. It accesses the header as well as the particle stack.
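
        A minimal sketch of reading the header and kinematics directly with AliRunLoader (the loop body and file name are just illustrations):

        void ReadKinematics(const char *galiceFile = "galice.root")
        {
          AliRunLoader *rl = AliRunLoader::Open(galiceFile);
          if (!rl) return;
          rl->LoadHeader();
          rl->LoadKinematics();

          for (Int_t iev = 0; iev < rl->GetNumberOfEvents(); iev++) {
            rl->GetEvent(iev);
            AliHeader *header = rl->GetHeader();     // event header information
            AliStack  *stack  = rl->Stack();         // the generated particles
            for (Int_t ip = 0; ip < stack->GetNtrack(); ip++) {
              TParticle *p = stack->Particle(ip);
              // ... fill your histograms from p->Pt(), p->Eta(), p->GetPdgCode(), ...
            }
          }
          delete rl;
        }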

  • Analysis of data on the grid

    • Access to data
      • Q: How can I get access to data on the grid and how can I copy data to my local storage?
      • A (Yves Schutz): This is best done using the ROOT API. Have a look at the macros in AnalysisGoodies.C and at the TSelector example esdAna.C / esdAna.h.
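
        In practice this boils down to a few ROOT calls, sketched below (the logical file name is only a placeholder):

        // Open a file from the AliEn file catalogue, or copy it to local storage
        void CopyFromGrid(const char *lfn = "/alice/sim/2006/pp_minbias/.../AliESDs.root")
        {
          TGrid::Connect("alien://");                       // requires a valid GRID certificate/token
          TFile *f = TFile::Open(Form("alien://%s", lfn));  // read directly from the grid ...
          if (f) { /* ... access the esdTree here ... */ }
          TFile::Cp(Form("alien://%s", lfn), "file:AliESDs.root");  // ... or make a local copy
        }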
    • Analysis with TSelector
      • Q: How do I implement an analysis following the selector model to perform analyses in PROOF and on the Grid?
      • A (Jan Fiete Grosse-Oetringhaus): The base classes to be used are AliSelector, for analyses that only access the ESD, and AliSelectorRL, which in addition gives access to the RunLoader, Kinematics, MC header etc. An empty skeleton that can be adapted for your own analysis can be found in PWG0/AliEmptySelector.h/.cxx.
    • New Analysis Framework
      • Q: How can I read ESDs within the new analysis framework?
      • A (Yves Schutz): Examples of how to implement an analysis task can be found in $ALICE_ROOT/ESDCheck.
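
        A very condensed sketch of such a task, assuming the AliAnalysisTaskSE interface (the class and histogram names are illustrative; the full examples in $ALICE_ROOT/ESDCheck remain the reference):

        class AliAnalysisTaskPtExample : public AliAnalysisTaskSE {
        public:
          AliAnalysisTaskPtExample(const char *name = "ptExample")
            : AliAnalysisTaskSE(name), fOutput(0), fHistPt(0)
            { DefineOutput(1, TList::Class()); }

          virtual void UserCreateOutputObjects()
          {
            fOutput = new TList();
            fOutput->SetOwner();
            fHistPt = new TH1F("hPt", "track p_{T};p_{T} (GeV/c);counts", 100, 0., 10.);
            fOutput->Add(fHistPt);
            PostData(1, fOutput);        // see also the PostData discussion further down
          }

          virtual void UserExec(Option_t *)
          {
            AliESDEvent *esd = dynamic_cast<AliESDEvent*>(InputEvent());
            if (!esd) return;
            for (Int_t i = 0; i < esd->GetNumberOfTracks(); i++)
              fHistPt->Fill(esd->GetTrack(i)->Pt());
            PostData(1, fOutput);
          }

        private:
          TList *fOutput;   // output list, sent to the output container
          TH1F  *fHistPt;   // example histogram
        };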
    • Analysis of large data samples
      • Q: How can I access and analyze, let's say, a few 10^6 minimum-bias events?
      • A (Panos Christakoglou): I would suggest you try to run it on 1M events by splitting your batch job accordingly. In your JDL you should also define the field SplitMaxInputFileNumber in such a way that every sub-job that is created analyzes a small number of files (~100). One other piece of advice: in your executable (stored in your AliEn $HOME/bin) define something like export XROOTD_MAXWAIT=10, in case you don't want to wait for an infinite amount of time for a requested file to be staged.
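
        For reference, the corresponding JDL fields (also discussed in the splitting question below) would look like this, with purely illustrative values:

        Split = "SE";
        SplitMaxInputFileNumber = "100";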
    • Selection of event type
      • Q: How can I know the kind of event (kPyMbNoHvq, kCharmpp14000wmi, etc.)?
      • A (Panos Christakoglou): It is part of the metadata fields at the run level (metadata associated to the file catalog): generator, version, parameters etc. There is a nice internal note on this written by Markus Oldenburg (http://oldi.web.cern.ch/oldi/MetaData/MetaData.doc ) describing all the corresponding fields.
    • Output of split jobs
      • Q: I send a job that is going to be split into many sub-jobs and request the OutputDir somewhere in my home directory. I was expecting a structure like the one you get in /proc/..., but in my OutputDir I just get a copy of the output from the first (or last) sub-job that finishes. In /proc you do get the nice structure with the sub-jobs, but since it is a temporary place and you don't get a mail when the job has finished, you will most probably lose your output (it is cleaned up after a while).
      • A (Yves Schutz): use the alien counter #alien_counter_s# in the definition of the Output directory, e.g. OutputDir="/alice/cern.ch/user/s/schutz/analysis/output/QA/$1/#alien_counter_s#"

  • Data Challenge

    • Productions
    • Restrictions on data analysis
      • Q: When I try to analyse the data online from run 5113 to 5199 I get the following error in split:
        Fri Jul 20 10:32:56 2007 [state ]: Job 4449059 inserted from amarin@pcapiserv04.cern.ch
        Fri Jul 20 10:34:29 2007 [error ]: There are 968 files in the collection /alice/sim/2006/pp_minbias/collections/collection.5196_001-5196_999.968.xml. Putting the job to error
        Fri Jul 20 10:34:29 2007 [error ]: Error splitting: The job can't be split
        Fri Jul 20 10:34:29 2007 [state ]: Job state transition to ERROR_SPLT
      • A (Pablo Saiz): The message is slightly wrong: your job is trying to access 2903 files (and not 968). At the moment we are killing all the jobs that try to access more than 1000 files. Could you please submit your analysis on a smaller collection?

 


  • Question of the day

    • Framework-related

      • Q: When do I have to call PostData in my task?

      • A: (Andrei Gheata) The documentation tells you that you have to call PostData in UserExec after filling your histograms. Note that PostData is not like TH1::Fill. It just makes the pointer to your output data (which in fact never changes) available, so that the framework knows to write this data to the output file(s). PostData also sends a notification that makes all the client tasks of a given output container active (so that their Exec gets called).

        Failing to ever call PostData() will typically end up with your output file (or your folder in it) missing.
        Posting data during execution is indeed required only for those events which are interesting for the task (or its possible client tasks).

        Now there is a symptom that can appear since we use cuts (namely the physics selection), especially on the grid. A given sub-job may be assigned a data sample that contains no interesting events, so the task's UserExec is never called (and neither is PostData). If this task runs alone, the job is not validated since there is no output file. If the task runs in a train, nothing will be visible at run time because the task just fails to fill its folder in the common output file. Still, there will be jobs that did select events and produced output. In this case the analysis will die during merging, since the file merger fails to match the data from the different output files.

        To cure this, we had to change the policy so that the output objects DO get written even if no event was selected. That is: better empty histograms than no histograms. The simplest way to achieve this is to call PostData for every output slot of your task at the end of UserCreateOutputObjects (which is called once per session on the worker).

        All tasks must do this systematically.
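
        A minimal sketch of this pattern (MyTask, fOutput and the event-selection call are illustrative names):

        void MyTask::UserCreateOutputObjects()
        {
          fOutput = new TList();
          fOutput->SetOwner();
          // ... create the histograms and add them to fOutput ...
          PostData(1, fOutput);     // once per session: guarantees a (possibly empty) output
        }

        void MyTask::UserExec(Option_t *)
        {
          if (!IsEventInteresting()) return;   // hypothetical selection, e.g. the physics selection
          // ... fill the histograms ...
          PostData(1, fOutput);                // for every processed event
        }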

    • Grid-related

      • Q: How does job splitting work on the grid? How should I choose the splitting parameters for my grid analysis?

         

      • A: (Latchezar Betev) The long version with all options is available here:
        http://alien2.cern.ch/index.php?option=com_content&view=article&id=52&Itemid=100#Splitting_jobs
        (documentation section on http://alien.cern.ch)

        -> The short version is as follows:

        1. The predominantly used option in user analysis is splitting by SE (Split = "SE";), whereby the AliEn Job Optimizer splits the master job into chunks which match the data distribution at the site(s) that have replicas of the input files.
        2. The limiting argument SplitMaxInputFileNumber = "x"; instructs the optimizer to use at most 'x' files for a given sub-job. This argument should be used only if the user knows why the splitting should be limited.
        3. For the majority of the current analysis jobs, 'x' can be large or, better yet, the argument should not be used at all.

         A: (Andrei Gheata) The "SE" option is the default for the AliEn plugin, while SplitMaxInputFileNumber is set by default to 100. For no limitation on the number of files, one can in principle give a big number as the argument to plugin->SetSplitMaxInputFileNumber(nfiles). Limiting the number of files to be processed per sub-job may be needed if:
         - the time to process these files exceeds the TTL (time to live). Putting a TTL bigger than 12*3600 seconds will not help; from experience, the success rate goes significantly down beyond 6 hours.
         - memory leaks accumulate and you start getting watchdog messages; in this case one should really start looking at the code.
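
         The corresponding AliEn plugin calls would look roughly like this (method names as provided by AliAnalysisAlien; the values are purely illustrative):

         AliAnalysisAlien *plugin = new AliAnalysisAlien();
         plugin->SetSplitMode("se");                // produces Split = "SE" in the generated JDL
         plugin->SetSplitMaxInputFileNumber(100);   // limit files per sub-job; omit or raise it if not needed
         plugin->SetTTL(6*3600);                    // keep sub-jobs well below the ~6 h sweet spot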

         

    • Alien plugin

      • Q: What is the logic behind the usage of the plugin in "terminate" mode? How does the system know which files should be merged?

      • A: (Andrei Gheata) The "terminate" mode should be run using exactly the same analysis macro as in "full" mode but using:
        plugin->SetRunMode("terminate")

        Note that when you quit the alien shell opened by the "full" mode, the plugin will try to go through the "terminate" phase anyway. This may lead to problems: if at least one of the sub-jobs got done, the plugin will merge their outputs and you will get a partial result. Then, when running in "terminate" mode, these files will be skipped. Before running in "terminate" mode, one should therefore make sure that no output file of the analysis is present in the current directory (!). To have this done automatically and avoid surprises, one should configure the plugin using:

        plugin->SetOverwriteMode(kTRUE)

        Note that this has recently become the default. The plugin knows which output files your analysis should produce (if you used SetDefaultOutputs, it loops over the files pointed to by the output containers connected to the analysis manager) and where the output directory on the grid is, so in the "terminate" phase it can locate them using the 'find' command and merge them with a file merger. It does not matter how many runs you processed or which splitting parameters you used; this only changes the number of files to merge, which is sometimes too big to merge locally. For such cases the plugin by default resumes merging after a failure, but it is recommended to instruct it to merge via JDL:

        plugin->SetMergeViaJDL();

        This will run one merging JDL per master job (which you can tune via SetNrunsPerMaster()). The current problem is that there is no "supermerge" done at the final stage to merge those, so it works correctly only if there is a single master job. This is still under development.
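
        A compact sketch of a plugin configuration following the above (mgr is assumed to be your AliAnalysisManager; the value passed to SetNrunsPerMaster is illustrative):

        AliAnalysisAlien *plugin = new AliAnalysisAlien();
        plugin->SetRunMode("terminate");    // use "full" for the submission pass of the same macro
        plugin->SetOverwriteMode(kTRUE);    // avoid stale partial outputs in the local directory
        plugin->SetMergeViaJDL(kTRUE);      // run the merging as grid jobs, one per master job
        plugin->SetNrunsPerMaster(10);
        mgr->SetGridHandler(plugin);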