Notes on System Failures, System Hangs and Memory Dumps for MPE/iX
 

Written By: Stan Sieler

1995-05-11
(Updated 2004-10-11)

0. Introduction

Sometimes, the computer "dies"... so this note discusses system failures, system hangs, memory dumps, subsystem numbers, and interpreting a system abort number. Sometimes, the system is alive ... so the free speedometer is discussed.

There are two basic kinds of system failure that an MPE/iX (or MPE XL) user will encounter: a "System Failure" and a "system hang". The former is easily identified by the "System Failure" message that appears on the hardware console (ldev 20). The latter is typified by users complaining that the machine is "hung". Each will be discussed below.


1. System Failure

A System Failure reports the following information on the hardware console:

SYSTEM ABORT 504 FROM SUBSYSTEM 143
SECONDARY STATUS: INFO = -34, SUBSYS = 107
SYSTEM HALT 7, $01F8

Additionally, the hex display (enabled on the console by typing control-B) displays something like:

B007 0101 02F8 DEAD

Note that the "504" and "$1F8" above are the same value, shown in decimal and in hex. Further, the hex display shows "0101" and "02F8". These two numbers are reporting the following:

0101 02F8

In each of these 4-digit hex numbers, the first two hex digits (01 and 02 above) are a "packet number", and the last two hex digits (01 and F8 above) are that packet's portion of the System Abort number. Packets 1 and 2, taken together, give the System Abort number ($01F8).

Note: if the System Abort number is in the range 0 to $FF (decimal 255), only one packet will be needed to represent it, and no packet 2 will be shown.
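
The packet scheme can be sketched in a few lines of Python. This is only an illustration, not HP code: the function name is made up, and it assumes that the packets are joined in packet-number order with packet 1 as the high byte. That matches the 0101 02F8 example above, but has not been checked against other values.

```python
def abort_number_from_packets(words):
    """Rebuild a System Abort number from hex-display packet words.

    Each word is 4 hex digits: the first two are the packet number,
    the last two are one byte of the abort number.
    Assumption: packet 1 is the most significant byte.
    """
    packets = {}
    for word in words:
        value = int(word, 16)
        packets[value >> 8] = value & 0xFF   # packet number -> data byte
    number = 0
    for packet_no in sorted(packets):
        number = (number << 8) | packets[packet_no]
    return number


abort = abort_number_from_packets(["0101", "02F8"])
print(f"#{abort} = ${abort:X}")   # prints "#504 = $1F8"
```

A single packet such as 0142 would simply yield $42, in line with the note above about values up to $FF.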


1.1 Interpreting the System Failure Number

The System Failure number (504 in the example above) can be converted to "English" by doing the following on a live MPE/iX machine:

:hello manager.sys (or any user with PM capability)
:debug
= errmsg (#504, #98)
(the "#" signs are required!)
'Prefetch of needed data for a READ/WRITE request could not be made.'
c

The above "= errmsg" command looks up the System Failure number (message #504) in the system error catalog (set #98...a magic number). This catalog is not complete, and some System Failure numbers are not in the catalog.

1.2 Interpreting the Subsystem Number

If the System Failure reported a subsystem (143 in the example above), the following might convert it to a subsystem name:

:debug
= errmsg (#32765, #143)

c

Note: In the above example, the "#" signs are required, and the 32765 is a "magic" number.

Here are two examples, one which succeeds, and one which fails:

:debug
= errmsg (#32765, #143)
'File System'

= errmsg (#32765, #129)
'External error - subsys: #129 info: #32765'

If the above doesn't produce a useful string, you can try two other approaches, both of which use the appropriate SYMOS file after loading the DAT macros.

The first uses a macro called "subsysstr", which knows about 30 hand-coded subsystem numbers, and also knows how to use the "errmsg" function (shown above):

:debug
use datinit.dat.telesup
macstart , '1'
= subsysstr (#129)
'7978 Tape Device Mgr'

If that doesn't work, then you probably are looking at a relatively unusual subsystem number. The slowest, but most reliable, method of translating a subsystem number into something is to search the SYMOS for a constant of the form SUBSYS_xxxxx. Here's an example, using subsystem number #129.

:debug
use datinit.dat.telesup
macstart , '1'
env filter '129'
set dec       /* important, because "129" is decimal, not hex */
symlist subsys@ ,,c
   SUBSYS_7978_DM                          CONST   INTEGER      #129
   SUBSYS_TAPE                             CONST   INTEGER      #129
env filter ''

In the above example, two lines matched the filter, showing that subsystem #129 is either "SUBSYS_TAPE" or "SUBSYS_7978_DM". Since a 7978 is a tape drive, I'd suspect that SUBSYS_7978_DM is the most likely "answer" to the "what is subsystem 129" question. (I also submitted a bug report to HP: no two *different* SUBSYS constants should ever have the same value!)

An optional step to dramatically improve the performance of the SYMLIST command is to prefetch the SYMOS file into memory. An example is:

:fetch symos.osb79.telesup

The SYMOS file you should fetch is the one that was opened by the MACSTART command above. You can see which one this is by doing a SYMINFO command:

symf

1.3 Interpreting the Secondary Status Number

The Secondary Status line may provide some additional information about the System Failure, if the INFO and SUBSYS values are not 0.

Take the two numbers (in the above example, INFO = -34, and SUBSYS = 107), and use the "errmsg" function as follows:

:debug
=errmsg (-#34, #107) /* "#"s are necessary */
'The length specified was beyond the bounds of the specified object.'
c

Not all Secondary Status messages are in the catalog. If you had tried one that is not, you would see:

:debug
= errmsg (-#51, #107)
'External error - subsys: #107 info: #51'

Note: I recommend submitting a bug report to HP for any System Failure or Secondary Status values that are not in the catalog!


1.4 Other System Halts

The most common type of system failure is a deliberate call to an internal MPE routine called system_abort. When this kind of system failure occurs, the three-line message shown above is printed. Note the third line, which says "SYSTEM HALT 7, $01F8". The "7" means that system_abort was called. At least seven other kinds of system halts are defined (SYSTEM HALT 0 through SYSTEM HALT 6).

The SYSTEM HALTS 0..6 represent system failures for problems other than system_abort, and usually reflect a problem "lower" in the operating system (e.g.: in the interrupt handling code).

SYSTEM HALTS 1..7 should produce a multi-line printout on the console. SYSTEM HALT 0 does not.

If the console output is missing, or corrupted, you can determine the type of SYSTEM HALT that occurred by looking at the hex display.

You can think of the hex display as presenting a series of 16-bit numbers (4 hex digits) in a sequence. The sequence is repeated over and over, with a pause of about 1/2 second between each number.

The last number in the sequence is usually $DEAD. The first number is usually of the form $Bnxx. The "xx" portion (the bottom two hex digits) reports the type of SYSTEM HALT that occurred.

In the example at the start of this note, the hex display is showing:

B007 0101 02F8 DEAD

The "07" means: SYSTEM HALT 7 (system_abort was called).
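
If you have written down the hex display sequence, the halt type can be picked out mechanically. The Python below is a sketch only: the function name is hypothetical, and it relies on nothing more than the $Bnxx layout described above. Because the note says the first value is only "usually" of that form, anything else is reported as unrecognized rather than guessed at. It also assumes you start the list at the $Bnxx value, since the display repeats the sequence over and over.

```python
def halt_type_from_display(sequence):
    """Return the SYSTEM HALT type from a hex-display sequence, or None.

    sequence: list of 4-hex-digit strings in display order,
    e.g. ["B007", "0101", "02F8", "DEAD"].
    """
    if not sequence:
        return None
    first = sequence[0].upper()
    if len(first) != 4 or first[0] != "B":
        return None                      # not of the form $Bnxx
    try:
        return int(first[2:], 16)        # "xx": the bottom two hex digits
    except ValueError:
        return None                      # not valid hex


words = ["B007", "0101", "02F8", "DEAD"]
print(halt_type_from_display(words))     # prints 7
print(words[-1].upper() == "DEAD")       # prints True
```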


2. System Hangs

Sometimes, the system seems to "hang", and little or no response is seen by the users. When this happens, it is important to characterize what is hung, and what isn't. The following questions should be asked before stopping the machine and taking a memory dump:

  1. What does the hex display show? (See: Speedometer in section 4 below)
  2. Does any terminal get a response from the Command Interpreter?
    (If a terminal is sitting with a ":" prompt, hit return. Does another ":" prompt come out?)
  3. Is the hardware console (ldev 20) hung?
  4. If a terminal can be found that is working, does a :SHOWPROC command hang the terminal?
  5. Does a control-A at the hardware console (ldev 20) result in an "=" prompt?
  6. Are the disc drives active?

The answers to these questions will aid the person who analyzes the dump.


3. Dump Loading

Once a memory dump has been taken, and the system rebooted, you will probably want to load the dump for analysis. The following steps should be done:

  1. Logon as MGR.TELESUP, DUMPS

    (Note: if the DUMPS group does not exist, logon as MGR.TELESUP and do: NEWGROUP DUMPS, and then CHGROUP DUMPS)

  2. Enter: DAT.DAT

    This will run the DAT (Dump Analysis Tool) program.

  3. Enter: GETDUMP FOO

    "FOO" will be the name of the dump. This name must begin with a letter, and be 1 to 5 letters and/or digits long. One recommendation is to call the dump S#### where #### is the System Abort number (e.g.: S0504).

    DAT will request a tape whose formal name is DUMPTAPE (this may be file equated before running DAT, if necessary).

  4. REPLY to the tape request.

    DAT will read the first few records of the tape, and report how much disc storage will be required to hold the dump. DAT will then allocate all of the necessary disc storage "up front", before reading the rest of the tape.

    If DAT is able to allocate enough disc space, and if the dump is on a single tape (or DDS), you can now walk away for a while.

    *** Please do the next two steps even if you think you don't want to analyze the dump yourself! It saves 5 to 15 minutes for the next person who analyzes the dump!

  5. Enter: MACSTART "FOO", "1"

    This will tell DAT that you want to start analyzing the dump. The "FOO" (in quotes!) is the name of the dump you used on the earlier GETDUMP command (which didn't use quotes!). The extra "1" tells DAT that you are only interested in "macros" for the operating system.

    If this process encounters a few errors, please do a screen capture (e.g.: PSCREEN) so we can analyze them later.

  6. Enter: PROCESS_WAIT ; UI_SHOWJOB

    These two commands may take up to 15 minutes to run.

  7. Enter: EXIT

You have now loaded a dump, FOO, and "prepared" it. If you want to send the dump to anyone for analysis, use STORE to store it as follows:

STORE FOO@

The "@" is important, because the dump is actually stored on disc as FOOMEM and FOOVAR, where "FOO" is the name you picked for the dump. Someday, dumps may be stored as even more files (e.g.: FOO001, FOOMEM, FOO002, FOOVAR), so the "@" will always be needed.
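
The naming rule from step 3 and the S#### recommendation can be expressed as a small Python sketch. The function names are hypothetical, and the sketch assumes that the #### part is the decimal System Abort number padded to four digits, as in the S0504 example. The MEM and VAR suffixes are the ones mentioned above.

```python
import re

# Rule from step 3: begins with a letter, 1 to 5 letters and/or digits.
DUMP_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,4}")


def is_valid_dump_name(name):
    """Check a proposed dump name against the rule from step 3."""
    return DUMP_NAME.fullmatch(name) is not None


def suggested_dump_name(abort_number):
    """Build the recommended S#### name, e.g. 504 -> S0504."""
    if not 0 <= abort_number <= 9999:
        raise ValueError("abort number does not fit in four digits")
    return f"S{abort_number:04d}"


name = suggested_dump_name(504)
print(name, is_valid_dump_name(name))               # prints "S0504 True"
print([name + suffix for suffix in ("MEM", "VAR")])  # files STORE S0504@ would pick up today
```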


4. Speedometer

The HP 3000, running MPE/iX (or MPE XL) has a free "speedometer", which tells us how busy the computer is.

When the system is alive, the hex display on the hardware console functions as a speedometer, reporting how busy the system is. (Remember: some machines have LED hex displays, and all have the ability to put the hex display on the status line of the hardware console, when control-B is hit.)

The speedometer will typically cycle between two values: FxFF and FFFF.
Ignore the FFFF value.
The "x" digit in the FxFF value reports how busy the CPU is. Multiply the digit (read as a hex value) by 10 to obtain the percentage busy.

Examples:

  • F4FF

    means: 40% busy.

  • FAFF

    means: 100% busy ("A" is the hex value for decimal 10).

  • F0FF

    means: idle (0% busy).
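
The same arithmetic as a Python sketch, in case you want to log the speedometer readings. The function name is made up. The note only describes digits 0 through A, so treating B through E as unrecognized is my assumption, not something HP documents here.

```python
def cpu_busy_percent(code):
    """Convert a speedometer value of the form FxFF to percent busy.

    Returns None for FFFF (which is to be ignored) and for anything
    that is not of the form FxFF.
    """
    code = code.upper()
    if len(code) != 4 or code == "FFFF":
        return None
    if code[0] != "F" or code[2:] != "FF":
        return None
    try:
        digit = int(code[1], 16)
    except ValueError:
        return None
    if digit > 0xA:
        return None          # assumption: only 0..A are meaningful
    return digit * 10


for sample in ["F4FF", "FAFF", "F0FF", "FFFF"]:
    print(sample, cpu_busy_percent(sample))
```

For the four samples this gives 40, 100, 0 and None respectively.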

Note: on newer HP 3000s, you will have to interact with the GSP (Guardian Service Processor) to see the speedometer. A typical scenario is:

  1. Connect to the GSP (press control-B at the hardware console, or telnet to the GSP port, or use a browser and logon to the GSP via a Secure Web Server);


  2. Login to the GSP (often just by pressing <return> twice);


  3. Get to the Virtual Front Panel by entering: VFP <return>


  4. If asked, say "Yes" to the "Proceed with Live Mode of VFP? (Y/[N]) y" question;


  5. Watch for a few updates:
    unknown, no source stated legacy PA HEX chassis-code FFFF
    unknown, no source stated legacy PA HEX chassis-code F0FF
    unknown, no source stated legacy PA HEX chassis-code FFFF
    unknown, no source stated legacy PA HEX chassis-code F0FF
                    
    (The above says: F0FF, which is 0% busy)


  6. Exit the VFP by typing "q": q


  7. (optional) exit the GSP by typing "co": co
 
     
  www.resource3000.com   Resource 3000 - Technical Paper