Automated Hardware Events Organizer in the ACA Correlator.

The \ac{ACA} Correlator defines a data communication interface built around a central machine that captures events from all associated hardware. These events come from 52 independent hardware modules at a rate of several thousand messages per second. To diagnose a specific problem accurately, a defined time-stamp range must be studied by retrieving the failure information and translating each error code against a database of 425 failure codes. This was initially done line by line: an experienced engineer had to build a picture of the failure by studying all hardware events over a certain period (usually the 30 seconds before the problem was reported). That tedious task took several hours, and the results could be misleading if the user translated even a single code incorrectly. To speed up the interpretation of those errors, I developed a new automated solution.

The Python-based code handles this work by first loading a pre-defined list of hardware errors, called the “hardware errors dictionary”. A file search is then performed on the system according to the time stamp provided by the user. If the time stamp refers to an old event, the search continues on the archive server, where the compressed file contents are extracted. According to the time span required for the investigation, a window of logs is established around the time stamp, and all events within it are collected. This collection is analyzed line by line, translating each error code into its error message and the possible workaround to resolve the problem.
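The translation step can be illustrated with a minimal sketch. It is not the actual implementation: the log line layout (ISO time stamp, module name, error code), the dictionary entries, and the names `ERROR_DICTIONARY`, `parse_line` and `translate_window` are all assumptions made for illustration, and the archive lookup is left out.

```python
from datetime import datetime, timedelta

# Hypothetical excerpt of the "hardware errors dictionary":
# error code -> (message, suggested workaround). The real one has 425 codes.
ERROR_DICTIONARY = {
    "E017": ("Clock signal lost", "Check the reference cable and reset the module"),
    "E102": ("Buffer overflow", "Restart the data transfer"),
}


def parse_line(line):
    """Split an assumed 'ISO-timestamp module code' log line into fields."""
    stamp, module, code = line.split()
    return datetime.fromisoformat(stamp), module, code


def translate_window(lines, reference, seconds=30):
    """Translate every event logged in the `seconds` before `reference`."""
    start = reference - timedelta(seconds=seconds)
    translated = []
    for line in lines:
        stamp, module, code = parse_line(line)
        if not (start <= stamp <= reference):
            continue  # event falls outside the window under investigation
        # Unknown codes are flagged rather than silently dropped.
        message, workaround = ERROR_DICTIONARY.get(
            code, ("Unknown code", "No workaround recorded")
        )
        translated.append((stamp, module, code, message, workaround))
    return translated


# Made-up log lines, for illustration only.
LOG_LINES = [
    "2012-03-01T11:59:20 MOD-03 E017",  # 40 s before the report: skipped
    "2012-03-01T11:59:45 MOD-07 E102",
    "2012-03-01T11:59:58 MOD-07 E999",  # code missing from the dictionary
]
reference = datetime(2012, 3, 1, 12, 0, 0)
events = translate_window(LOG_LINES, reference)
for stamp, module, code, message, workaround in events:
    print(stamp.isoformat(), module, code, message, "->", workaround)
```

Marking unknown codes explicitly, instead of skipping them, keeps a gap in the dictionary visible in the translated output.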

At a later stage, a summary is obtained by grouping events by common hardware resource: a particular module, a group of modules or the entire quadrant. The results are saved in two files: one with the summary of the most common failures, and a second with the full translation of all recorded events. The program can widen or narrow the search according to the time span specified by the user (30 seconds by default), and the direction of the search in time can also be specified (forward, backward or both). A timing study carried out over one year of usage showed a yearly saving of 400 hours of engineering time.
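The search window and the summary can be sketched as follows. The function names, the `"dual"` keyword, the backward default, the reading of the dual direction as the full span on each side of the time stamp, and the grouping by (module, message) are all assumptions for illustration rather than the tool's actual behavior.

```python
from collections import Counter
from datetime import datetime, timedelta


def search_window(reference, seconds=30, direction="backward"):
    """Return the (start, end) of the search window around `reference`."""
    span = timedelta(seconds=seconds)
    if direction == "backward":
        return reference - span, reference
    if direction == "forward":
        return reference, reference + span
    if direction == "dual":
        # Assumed here: the span is applied on both sides of the time stamp.
        return reference - span, reference + span
    raise ValueError(f"unknown direction: {direction!r}")


def summarize(events):
    """Count failures per (module, message), most common first."""
    return Counter(events).most_common()


# Made-up translated events as (module, message) pairs.
sample_events = [
    ("MOD-07", "Buffer overflow"),
    ("MOD-07", "Buffer overflow"),
    ("MOD-03", "Clock signal lost"),
]
reference = datetime(2012, 3, 1, 12, 0, 0)
window = search_window(reference, seconds=10, direction="dual")
summary = summarize(sample_events)
for (module, message), count in summary:
    print(f"{count:3d}  {module}  {message}")
```

Grouping by a group of modules or by quadrant would only change the key passed to the counter, for example by mapping each module name to its group before counting.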
