Distributed Tracking, Storage, and Re-use of Job State Information on the Grid

Kouril, D.; Krenek, A.; Matyska, L.; Mulac, M.; Pospíšil, J.; Ruda, M.; Salvet, Z.; Sitera, J.; Škrabal, J.; Vocu, M.; Andreetto, P.; Borgia, S.; Dorigo, A.; Gianelle, A.; Mordacchini, M.; Sgaravatto, M.; Zangrando, L.; Andreozzi, S.; Ciaschini, V.; Di Giusto, C.; Giacomini, F.; Medici, V.; Ronchieri, E.; Avellino, G.; Beco, S.; Maraschini, A.; Pacini, F.; Terracina, A.; Guarise, A.; Patania, G.; Marchi, M.; Mezzadri, M.; Prelz, F.; Rebatto, D.; Monforte, S.; Pappalardo, M.

doi:10.5170/CERN-2005-002.798

The Logging and Bookkeeping (L&B) service tracks jobs as they pass through the Grid. It collects important events generated by both grid middleware components and applications, and processes them at a chosen L&B server to derive the job state. The events are transported through secure and reliable channels. Job tracking is fully distributed and does not depend on a single information source; robustness is achieved through speculative job state computation when events are reordered, delayed, or lost. The state computation is easily adaptable to a modified job control flow. The events are also passed to the related Job Provenance (JP) service, whose purpose is long-term storage of information on job execution, environment, and the executable and input sandbox files. These data can be used for debugging, post-mortem analysis, or re-running jobs. The JP storage service keeps the data in a compressed format, accessible on a per-job basis. A complementary index service finds particular jobs according to configurable criteria, e.g. submission time or "tags" assigned by the user. Both the L&B server and the JP index server provide web-service interfaces for querying. These interfaces will eventually evolve to comply with the On-demand producer specification of the R-GMA infrastructure, so that R-GMA capabilities will be available for complex distributed queries across multiple servers. Aggregate information about job collections can also be easily provided. The L&B service was deployed in the EU DataGrid and CERN LCG projects; both L&B and JP will be deployed in the EGEE project.
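
The idea of computing a job state that tolerates reordered, delayed, or lost events can be illustrated with a minimal sketch. It is not the L&B algorithm, which the abstract does not specify. It assumes a simplified linear job life cycle, and the state names, event kinds, and the `Event` and `compute_state` names are all hypothetical. Each event implies a state, and the reported state is the most advanced one implied by any event received so far, so the result does not depend on arrival order and a missing intermediate event does not block progress.

```python
from dataclasses import dataclass

# Hypothetical, simplified linear life cycle (an assumption for illustration).
STATES = ["SUBMITTED", "WAITING", "SCHEDULED", "RUNNING", "DONE"]
RANK = {name: i for i, name in enumerate(STATES)}

# Hypothetical mapping from event kind to the state that event implies.
EVENT_TO_STATE = {
    "registered": "SUBMITTED",
    "queued": "WAITING",
    "matched": "SCHEDULED",
    "started": "RUNNING",
    "finished": "DONE",
}


@dataclass(frozen=True)
class Event:
    job_id: str
    kind: str    # one of the keys of EVENT_TO_STATE
    source: str  # component that reported the event


def compute_state(events):
    """Return the most advanced state implied by any event seen so far."""
    best = None
    for ev in events:
        state = EVENT_TO_STATE.get(ev.kind)
        if state is None:
            continue  # ignore event kinds this sketch does not know
        if best is None or RANK[state] > RANK[best]:
            best = state
    return best if best is not None else "UNKNOWN"


# "started" arrives before "registered", and "queued"/"matched" never arrive.
seen = [
    Event("job-1", "started", "worker-node"),
    Event("job-1", "registered", "user-interface"),
]
print(compute_state(seen))
```

Because the result is the maximum over the events seen, feeding the same events in any order gives the same state. A real job control flow has branches such as resubmission and cancellation, which a single linear ranking cannot express.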