Written by: Stan Sieler - Allegro Consultants
With the release of MPE XL (now called MPE/iX) on HPPA (Hewlett Packard Precision Architecture), many new features have arrived for the programmer. These include mapped files and a very large address space. One new feature overlooked by many is the RISC architecture. Although RISC means "reduced complexity", optimizing performance on RISC is paradoxically more complex than on the classic HP3000. This paper asks: "what can we do to maximize performance?" Some answers are presented, and particular attention is given to the characteristics of mapped files, the file system, and Native Mode versus Compatibility Mode.
1. Mapped Files
This section will introduce mapped files and discuss their performance characteristics.
1.1 Mapped File Introduction
From a programmer's viewpoint, MPE/iX has two basic types of files: the ordinary, record-oriented files that have existed since the birth of MPE, and mapped files.
A mapped file is an MPE ordinary file that is going to be accessed via virtual memory loads and stores instead of (or in addition to) via file system intrinsics. Instead of calling FOPEN, a programmer can call the new HPFOPEN intrinsic, and specify that a file is to be opened for "mapped" access. This will result in two pieces of information being returned to the program: a file number (like FOPEN would have returned), and a virtual memory address. The virtual memory address returned is the address of the first byte of data in the file. If the address is stored in a pointer, as shown in the following example, and the pointer is then "de-referenced", the first byte from the file is brought into memory.
HP Pascal/XL                     SPLash!
var filedata  : ^char;           virtual byte pointer filedata;
    firstbyte : char;            byte firstbyte;
                                 double filedata'spaceid = filedata;
hpfopen (..., filedata, ...);    hpfopen (..., filedata'spaceid, ...);
firstbyte := filedata^;          firstbyte := filedata;
Note: the above example was done with HP Pascal/XL, but most of the rest of the examples in this paper will be done in SPLash!, a native mode version of SPL/V, which allows easy manipulation of 32 bit and 64 bit virtual addresses. Mapped file access is also available in HP C/XL. (The SPLash! example shows the need to get the pointer passed by reference...the intrinsic declaration of HPFOPEN has no way of telling a compiler that that parameter expects to be a pointer-by-reference, so SPLash! would treat "filedata" as a request to pass the address that filedata *points to*, instead of the address of the filedata pointer itself.)
With the above fragment of code, let's look at fetching the first two 80-byte records.
byte array
   rec0' (0 : 79),
   rec1' (0 : 79);
move rec0' := filedata, (80); ! get first 80 bytes
move rec1' := filedata (80), (80); ! get second 80 bytes
If the file system had been used to access the first two records, as in:
fread (fid, rec0', -80);
fread (fid, rec1', -80);
then the total CPU utilized by the FREADs would be much greater than the CPU used by the two "move" statements.
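For readers without access to an MPE/iX system, the same idea can be sketched with Python's mmap module. This is only an analogy, not the HPFOPEN interface: the scratch file, the 80-byte record length, and the helper names read_record and demo are assumptions made for illustration.

```python
import mmap
import os
import tempfile

RECORD_LEN = 80  # assumed fixed record length, as in the fragment above


def read_record(mapped, recnum, reclen=RECORD_LEN):
    """Return record `recnum` (counting from 0) by slicing the mapping."""
    start = recnum * reclen
    return mapped[start:start + reclen]


def demo():
    # Build a small scratch file holding two 80-byte records.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"A" * RECORD_LEN + b"B" * RECORD_LEN)
        with open(path, "rb") as f:
            # Map the whole file; pages are brought into memory on first touch.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                rec0 = read_record(mapped, 0)  # first 80 bytes
                rec1 = read_record(mapped, 1)  # second 80 bytes
        return rec0, rec1
    finally:
        os.remove(path)
```

As with the two "move" statements, no read call is issued per record; each record is simply a slice of memory at offset recnum * 80.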
1.2 How are Mapped Files Implemented?
In MPE/iX, all files are stored on disc as an array of bytes. A file is called a "mapped file" if it happens to have been opened by a user who requested its virtual address be returned as a result of the HPFOPEN intrinsic. At the lowest level of MPE/iX, ALL disc files are always opened as mapped files. Usually, we call a file a "mapped file" if we intend to access its data via virtual memory along with (or instead of via) the file system intrinsics.
Two aspects of disc files have changed from MPE V to MPE/iX:
- The file label is not stored as part of the file.
- There is no wasted space between records or between blocks.
The first change is a decade overdue. The second change is a direct result of the virtual memory system of HPPA.
When any disc file is opened in MPE/iX, a module called the "Virtual Space Manager" allocates a range of virtual addresses sufficient to cover the entire file. The process is called "mapping", as in: mapping the file into virtual memory. "Mapping" provides a one-to-one correspondence between a virtual memory address and a byte of disc data for every byte in the file.
If a program tries to use a virtual address that has been mapped onto a file to fetch a byte of data, the following is done by hardware:
- Extract the upper 53 bits of the 64 bit virtual address, calling it the VPN (Virtual Page Number).
- Is the virtual page "in" memory? (I.e.: is there a physical page of 2,048 bytes that has been assigned to that VPN?)
- If yes, then using the bottom 11 bits (the page "offset") of the original 64 bit virtual address, index into the physical page, fetch the byte, and return.
- If no, interrupt and ask the software to bring our page into physical memory.
- When our page arrives in memory, our process will be restarted at step #1 above.
The above process can be phrased in a simpler manner:
- If the virtual address is in real memory, fetch the data; otherwise do a "page fault" and swap the page into memory and then fetch the byte.
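The address split described above can be modelled in a few lines. This is a toy sketch under stated assumptions, not how the hardware or MPE/iX is implemented: the dictionary resident_pages stands in for the set of pages in memory, and the function names are hypothetical.

```python
OFFSET_BITS = 11                 # 2,048-byte physical page
PAGE_SIZE = 1 << OFFSET_BITS     # 2048
ADDRESS_BITS = 64


def split_virtual_address(va):
    """Split a 64-bit virtual address into (VPN, page offset)."""
    if not 0 <= va < (1 << ADDRESS_BITS):
        raise ValueError("not a 64-bit address")
    vpn = va >> OFFSET_BITS          # upper 53 bits
    offset = va & (PAGE_SIZE - 1)    # lower 11 bits
    return vpn, offset


def fetch_byte(va, resident_pages):
    """Toy model: resident_pages maps VPN -> bytes of one physical page.

    Returns the byte value, or None to signal a page fault.
    """
    vpn, offset = split_virtual_address(va)
    page = resident_pages.get(vpn)
    if page is None:
        return None   # page fault: software would bring the page in and retry
    return page[offset]
```

A None result corresponds to the "If no" branch: the operating system reads the page from disc and the load is then retried.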
Note: this description of virtual memory is simplified, and omits features such as the Translation Lookaside Buffer (TLB).
Thus, to fetch the first byte of record number 100 of an 80-byte record file (counting records from 0, as in the fragment above), we can simply take the virtual address of the first byte of the file, add 8000 to it, and then fetch a byte from that address. Sooner or later, the byte will appear in the register that we asked it to be loaded into.
The detailed workings of virtual memory are quite complex, and beyond the scope of this paper. For now, let's just remember:
- When bytes of a file are accessed via a virtual address, the data is brought into memory as needed by the operating system via "page faults". Once a page is in memory, its data can be accessed at main-memory speeds. On a typical MPE/iX machine, many millions of bytes of mapped files could be in memory all at the same time.
- If anything is stored into the virtual address, the physical page is marked dirty. Dirty pages are eventually written out to disc, but this process might not occur for quite some time.
When we talk about a "page" in reference to the CPU hardware, we generally mean a "physical page" of 2,048 bytes. At most other times, "page" refers to a "logical page" (sometimes incorrectly called a "virtual page") of 4,096 bytes. When a logical page is brought into memory, it will occupy two consecutive physical pages.
"Prefetching" is the act of bringing more data from disc into memory than was immediately requested by a user, in an attempt to prevent a second disc read shortly after the first.
The disc caching code on MPE V had two "dials" the system manager could twist to control the amount of data prefetched. One dial controlled the size of cache domains created for sequential disc reads, and the other controlled the size of domains created for random disc reads.
On MPE/iX, the system manager has no such controls. Instead, the prefetch size is determined (at present) by one primary factor: what subsystem is asking for the data to be read from disc. If the request to read data from disc is from the memory manager (due to a page fault), one logical page is read. If the request is from the file system, several logical pages are read.
Clearly, this has enormous performance implications. Consider a program accessing a file of 256 byte records in a sequential manner. Assuming the file has about 90,000 records, and assuming that the file system requests 4 logical pages at a time, then the memory mapped access will have 5,625 page faults versus 1,406 for the file system accessor. (Remember: a logical page is 4,096 bytes, and a physical page is 2,048 bytes. Unless dealing with the lowest levels of MPE/iX, we normally refer to logical pages.)
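The arithmetic behind those two figures can be checked with a short sketch. The function name estimated_disc_reads is hypothetical; the page size and record counts come from the text.

```python
LOGICAL_PAGE = 4096  # bytes per logical page


def estimated_disc_reads(records, record_len, pages_per_read):
    """Estimate disc reads for one sequential pass over a file."""
    total_bytes = records * record_len
    total_pages = -(-total_bytes // LOGICAL_PAGE)   # ceiling division
    return -(-total_pages // pages_per_read)        # ceiling division
```

With 90,000 records of 256 bytes, one page per read gives 5,625 reads. Four pages per read gives 1,407 when the final partial read is counted; the figure of 1,406 quoted above drops that last fraction.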
As a test of the above, a program was run that did a simple sequential read of the file SL.PUB.SYS (89,867 records of 256 bytes). This file takes about 22 megabytes of disc space. In between each run, a separate 16 megabyte file was read in an attempt to flush as much of the SL.PUB.SYS file data from memory as possible (see the section: Measurement Problems).
The following table shows the CPU and elapsed times the test program needed to read SL.PUB.SYS. The test program was running in Native Mode.
SL.PUB.SYS sequential read (times in milliseconds)
  CPU   Elapsed    Delta   Access Method
-----   -------   ------   -------------
19686    146298   126612   Memory Mapped
35398     44361     8963   FreadDir
36590     44957     8367   Fread
39465     46802     7337   FreadDir & FreadSeek
48650     51949     3299   Fread & FreadSeek
The "Delta" column shows the amount of time the program was presumably waiting for the data to come from disc.
The "FreadDir" access method consisted of using the FREADDIR intrinsic with ascending record numbers, which results in reading exactly the same records as the FREAD intrinsic. The last two rows added a call to the FREADSEEK intrinsic in an attempt to have MPE/iX prefetch data before it was read. For those two tests, FREADSEEK was called once every 4 reads, with a request to prefetch the fourth record following the current.
- Use sequential FREADDIR to sequentially read a file that is not already in memory (see note below);
- Don't use FREADSEEK. At least in these tests, it never seems to help, and only costs extra CPU time.
Taking the first delta figure, 126,612, and guessing that we can do a disc read in 22.5 milliseconds, we get an estimate of 5,627 disc reads, which matches our prediction.
If we take the delta for the FREAD test, 8,367, and using the same estimate of 22.5 milliseconds per disc read, we see 372 disc reads. This implies that FREAD is prefetching in chunks of 15 or 16 logical pages, not the 4 originally assumed.
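The two estimates above follow from dividing the delta by the assumed time per disc read. A minimal sketch, in which the 22.5 millisecond figure is the guess made in the text and the function names are hypothetical:

```python
def implied_disc_reads(delta_ms, ms_per_read=22.5):
    """Number of disc reads implied by the time spent waiting."""
    return delta_ms / ms_per_read


def implied_pages_per_read(delta_ms, total_pages, ms_per_read=22.5):
    """Average logical pages delivered per disc read."""
    return total_pages / implied_disc_reads(delta_ms, ms_per_read)
```

Using the predicted 5,625 logical pages, the FREAD delta of 8,367 milliseconds works out to roughly 15 pages per read. The result is only as good as the 22.5 millisecond guess.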
Note that with the FREAD & FREADSEEK test the delta was cut by more than half (from 8,367 to 3,299 milliseconds), at the cost of greatly increased CPU time (48,650 versus 36,590 milliseconds).
A second large file was tested, NL.PUB.SYS (64,275 records of 256 bytes each, 16 megabytes):
NL.PUB.SYS sequential read (times in milliseconds)
  CPU   Elapsed    Delta   Access Method
-----   -------   ------   -------------
11507     74920    63413   Memory Mapped
22109     26240     4131   FreadDir
23857     27364     3507   Fread
25735     28124     2389   FreadDir & FreadSeek
28887     31151     2264   Fread & FreadSeek
These results mirror those for reading SL.PUB.SYS.
1.4 Memory Resident Data
The previous section examined the performance of mapped files versus the file system for data that was out on disc. Frequently, the data for a file will happen to be resident in memory. This is the case when a file is accessed multiple times in a relatively short period. This section examines the performance of accessing file data that is already in memory. Using the same Native Mode program (an SPL/V program compiled with SPLash!), the file CATALOG.PUB.SYS was sequentially read. This file has 7040 records of 80 bytes each for a total of 0.5 megabytes.
 CPU   Elapsed   Access Method
----   -------   -------------
 181       182   Mapped File
1660      1677   FreadDir
1678      1680   Fread
1959      1976   FreadDir & FreadSeek
1977      1994   Fread & FreadSeek
The file CATALOG was read once to bring it into memory. The time to do this is not reflected in the above table.
Note that the elapsed time is just slightly more than the CPU time. This is because the process is never paused to wait for disc I/O.
- If the file's data is likely to be in memory, use mapped file access!
- FREADSEEK should not be used for files where the data is in memory already.
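To put the memory-resident figures in per-record terms, the CPU times in the table above can be divided by the 7,040 records read. The sketch below only restates the table; the dictionary and function names are hypothetical.

```python
RECORDS = 7040  # records in CATALOG.PUB.SYS, from the text

# CPU milliseconds from the memory-resident table above.
cpu_ms = {
    "Mapped File": 181,
    "FreadDir": 1660,
    "Fread": 1678,
    "FreadDir & FreadSeek": 1959,
    "Fread & FreadSeek": 1977,
}


def per_record_microseconds(total_ms, records=RECORDS):
    """Average CPU cost of one record, in microseconds."""
    return total_ms * 1000.0 / records


def slowdown_vs_mapped(method):
    """How many times more CPU a method used than mapped access."""
    return cpu_ms[method] / cpu_ms["Mapped File"]
```

On these numbers mapped access costs about 26 microseconds of CPU per record, and FREAD uses a little over nine times as much.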
1.5 NM vs CM vs OCT
MPE/iX can execute in any of three modes: Native Mode (executing RISC instructions), Compatibility Mode (emulating classic HP3000 CISC instructions), and a blend of the two produced by the Object Code Translator (OCT). Briefly, a Compatibility Mode (CM) program can be run through the OCT to produce a hybrid program file that contains the original CISC instructions as well as their translation into RISC instructions. OCT'ed programs must obey ALL the same restrictions as CM programs (e.g.: 16-bit wide stack of 65,535 bytes). (For more information on OCT, CM, and NM, the reader is directed to the book "Beyond RISC" from Software Research Northwest.)
The data in the preceding tests was obtained from a Native Mode program. This section examines the performance of the file system when called from the three types of program code: NM, OCT, and CM. As a reminder of what can be accomplished by what my partner, Steve Cooper, calls the "second migration", mapped file access is also shown in the table. The "second migration" is the process of adapting programs to take advantage of the new features in MPE/iX. The "first migration" is the one HP talks about: porting a program to Native Mode (which usually means minimal changes).
The file CATALOG.PUB.SYS was sequentially read in the same manner as before, with the IDENTICAL program compiled in SPL/V (CM), run through the Object Code Translator (OCT), and compiled by SPLash! (NM). The following table shows the results:
CATALOG.PUB.SYS (times in milliseconds)
 CPU   Elapsed   Mode   Access Method
----   -------   ----   -------------
 181       182   NM     Mapped (requires NM)
1660      1677   NM     FreadDir
1678      1680   NM     Fread
1959      1976   NM     FreadDir & FreadSeek
1977      1994   NM     Fread & FreadSeek
3326      3343   OCT    FreadDir
3838      3854   OCT    Fread
4196      4214   CM     FreadDir
4850      4881   CM     Fread
5196      5216   OCT    FreadDir & FreadSeek
5670      5690   OCT    Fread & FreadSeek
6471      6493   CM     FreadDir & FreadSeek
7473      7493   CM     Fread & FreadSeek
- NM is far faster than CM or OCT.
- Calling FREADSEEK from CM or OCT programs is even more of a penalty than calling it from NM programs.
- FREADDIR is still slightly faster than FREAD.
The test program was produced from the source file "READER" with the following commands:
CM:   spl reader, $newpass, $null
      prep $oldpass, reader.cm
OCT:  octcomp reader.cm, readero.cm, , noovf
NM:   splasm reader
Note that the "noovf" option on the "octcomp" command tells the OCT that the program does not expect to generate arithmetic overflows and to optimize its translation with that in mind. This results in slightly faster OCT'ed programs.
The basic reason that the CM and OCT programs are so much slower is that simple disc files are handled by Native Mode portions of MPE/iX. Some types of disc files are still handled by Compatibility Mode portions of MPE/iX, ported from MPE V/E. These include message files, RIO files, Circular files, and KSAM files.
When a CM or OCT program calls the FREAD intrinsic to read a record from an ordinary disc file, the FREAD intrinsic must "switch" to Native Mode and call the Native Mode FREAD intrinsic. This switch is not inexpensive. OCT programs pay the same switch overhead as CM programs because they are still emulating the Classic instruction set, albeit faster than the emulator. NM programs (e.g.: HP Pascal/XL and SPLash!) are already in Native Mode when they call FREAD, so no switch is necessary.
The next test shows the results of serially reading a KSAM file of 1,000 records of 80 bytes each from NM, OCT, and CM programs. As in the CATALOG test, the file was brought into memory before the start of the test.
 CPU   Elapsed   Mode   Access Method
----   -------   ----   -------------
2475      2494   OCT    Fread
2677      2696   CM     Fread
3239      3257   NM     Fread
Note that the FREAD intrinsic returns the records in key order, not the chronological order in which they were written.
Note that the FreadDir test was dropped. The FREADDIR intrinsic cannot be used on KSAM files.
The mapped file test was dropped because it reads the data in chronological order, not key order.
If KSAM is being used heavily, don't migrate the programs into NM until a native mode version of KSAM is available (from HP or another vendor).
[Update 96/03/07: KSAM/iX is in Native Mode]
2. Memory & Disc Utilization
In MPE V, stacks were limited to a maximum of 65,535 bytes. In MPE/iX, the limitation is 1 gigabyte (1,073,741,824 bytes). (This limit includes the CM stack & heap, the NM stack, the NM heap, and the XRT.)
In MPE V, if any part of the stack was in memory, then the entire stack was in memory. In MPE/iX, only the logical pages recently referenced are likely to be in memory at any time. Additionally, only those pages that have EVER been referenced are allocated disc storage. As more and more stack/heap pages are touched, more and more pages are allocated on disc. This means that having an array of 1,000,000 bytes in SPLash! (or Pascal/XL, or any NM language) is not expensive...until you use it. A megabyte array will have 1 million bytes of virtual address assigned to it, but the disc storage will range from 0 to 256 logical pages!
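The "0 to 256 logical pages" range can be illustrated with a small sketch. It assumes the array starts on a logical page boundary and that backing storage is allocated one logical page at a time; the function names are hypothetical.

```python
LOGICAL_PAGE = 4096  # bytes per logical page


def max_pages_touched(array_bytes, page=LOGICAL_PAGE):
    """Upper bound on pages backing a page-aligned array of this size."""
    return -(-array_bytes // page)   # ceiling division


def pages_backed(touched_offsets, page=LOGICAL_PAGE):
    """Count distinct logical pages for a set of touched byte offsets."""
    return len({offset // page for offset in touched_offsets})
```

An untouched array costs nothing, an array touched at a handful of offsets costs a handful of pages, and the upper bound of 256 corresponds to a megabyte of 1,048,576 bytes (an array of exactly 1,000,000 bytes tops out at 245).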
Disc files are allocated storage exactly like the stack/heap: only those pages ever touched are allocated disc sectors. (Since extents may be allocated several logical pages at a time, some rounding-up does occur.) This means that it is feasible to have "sparse" files. For example, a file with 1 byte for every possible Social Security number would have a limit of 999,999,999 bytes. If a single write is done to record 2345, then a single extent will be allocated. A test done on MPE XL 1.1 resulted in an extent of 2,048 sectors being allocated. This does not mean that all future extents will be of equal size. Unfortunately, the programmer has no control over the extent size.
3. Data Alignment
On the Classic HP3000, the natural data alignment was 16 bits. With rare exceptions, 32-bit and 64-bit data could be placed at any 16-bit boundary with impunity and no performance ramifications.
On the HPPA HP3000s, the natural data alignment is 32 bits for 32-bit data, and (sometimes) 64 bits for 64-bit data. (The 64-bit alignment applies primarily to IEEE 64 bit floating point numbers.)
As a result, if code is ported from a CM language to its NM equivalent, one of two problems can result: program aborts (or other errors) due to misaligned data; or performance slowdowns.
Most NM compilers provide a means of specifying that certain variables are only 16-bit aligned. When this is done, then the compilers will typically emit 3 instructions to load a 32 bit variable instead of the 1 that would have been required if the variable was aligned on a 32-bit boundary. This is necessary because the RISC hardware does not allow the LDW (Load 32-bit Word) instruction to be given an address that is not a multiple of 4 bytes (32 bits). Instead, 2 LDH (Load 16-bits) instructions and one DEP (deposit) instruction must be used to build the 32 bit value in a register.
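The effect of the three-instruction sequence can be modelled as follows. This is a sketch of the idea only, not compiler output: memory is a plain byte string read in big-endian order, and the function names are hypothetical stand-ins for LDW, LDH, and the LDH/LDH/DEP combination.

```python
import struct


def load_halfword(memory, addr):
    """Model of LDH: unsigned 16-bit load; addr must be a multiple of 2."""
    if addr % 2:
        raise ValueError("halfword load needs a 2-byte aligned address")
    return struct.unpack_from(">H", memory, addr)[0]


def load_word_aligned(memory, addr):
    """Model of LDW: 32-bit load; addr must be a multiple of 4."""
    if addr % 4:
        raise ValueError("word load needs a 4-byte aligned address")
    return struct.unpack_from(">I", memory, addr)[0]


def load_word_16bit_aligned(memory, addr):
    """Build a 32-bit value from two halfword loads plus a deposit."""
    high = load_halfword(memory, addr)       # first LDH
    low = load_halfword(memory, addr + 2)    # second LDH
    return (high << 16) | low                # DEP: combine the two halves
```

The aligned load is one step; the 16-bit aligned load is three, which is the 1 versus 3 instruction count discussed here.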
No performance data is shown here because the implications are clear from the instruction count: 1 versus 3.
4. SORT vs HPSORT
Compatibility Mode programs that call the SORT intrinsics still get the old sort package, running in OCT.
Native Mode programs have a choice of two intrinsics to do sorting: SORT and HPSORT. These two intrinsics are interfaces to a new sort package which runs in Native Mode. The native mode sort package lacks some of the features of the CM sort facility (e.g.: the ability to pass procedures to do the comparison), and has one additional wrinkle: sometimes it calls the CM sort to do the sort!
In its present incarnation, NM Sort will call CM Sort when it gets a "difficult" sort. This includes sorts that specify an alternate collating sequence.
Additionally, when NM Sort does stay in NM, it does NOT open a temporary file called SORTSCR. Instead, it uses two temporary files that are either nameless or have a name like HPSORT1 and HPSORT2 (?), depending on the release of MPE/iX. This means that if a fairly simple sort is requested from a NM program, the programmer cannot point the sort scratch file to a disc drive he/she knows is separate from the input and output data.
In short, NM Sort is still evolving. Test runs should be made before converting to NM simply to call NM sort.
5. System Performance
The overall system performance can still be affected by proper tuning of the C, D, and E subqueues via the TUNE command.
The choice of disc drives for a file can also be controlled in the usual manner (e.g.: BUILD FOO;DEV=3). However, the number of extents cannot be easily controlled any more. The basic choice is one extent or many extents.
Main memory is vital to the performance of the system. Unlike MPE V, which tended to degrade slowly, MPE/iX will suffer a very sharp drop in performance when not enough memory is available. Economize on everything else ... and buy memory.
A 950 (and 955) will support up to 256 megabytes (128 per memory controller). Three vendors offer memory for the machine: HP, Kelly Computer Systems (the first to put 256 megabytes in a user's computer), and EMC. Sites with Classic HP3000s may be interested in Kelly's RAMDISC for the 3000, which can be traded in on HPPA memory when needed.
6. NM vs. CM : Intrinsics
In an earlier section, we determined that some types of files are still implemented with Compatibility Mode code.
File system intrinsics are not the only ones that might actually be implemented in CM. The ASCII, BINARY, DASCII and DBINARY Native Mode intrinsics currently switch to CM to do their work. Although this may change in the future, the performance implications are still interesting today.
Porting a program into Native Mode may reveal other intrinsics that are still implemented in Compatibility Mode.
The following table shows the result of calling the ASCII intrinsic a large number of times from programs written in NM, OCT, and CM:
  CPU   Elapsed   Mode
-----   -------   ----
 9051      9084   NM
11688     11728   OCT
12211     12252   CM
Although the Native Mode program was the fastest, it is by a very narrow margin.
The ASCII/BINARY/etc. intrinsics have always been a performance bottleneck on MPE V. They haven't changed in MPE/iX. The following table shows the results of calling the ASCII intrinsic versus calling a "clone" of the intrinsic:
 CPU   Elapsed   Mode   Procedure
----   -------   ----   ---------
 457       471   NM     ASCII clone
9009      9040   NM     ASCII intrinsic
Similar savings can be obtained for BINARY, DASCII, DBINARY, and CTRANSLATE. Contact the author for NMOBJ files that can be used as replacement intrinsics.
7. Measurement Problems
Measuring performance on MPE/iX is extremely difficult. Unlike MPE V, MPE/iX provides no control over what disc data is (or is not) in memory. As a result, tests must be run multiple times with best-case (or average-case) times used.
(Note: the program FLUSH can be used to flush pages of closed files out of memory.)
The difficulty of measuring performance is at its worst when looking at disc I/O. The following is a partial list of features that would aid this type of analysis:
- An intrinsic that will make free all pages of memory that are not marked memory-resident or locked.
- An intrinsic that will force all pages that are dirty to disc.
- An intrinsic that would return for a virtual address information like: size of object and number of logical pages currently in memory.
The first feature would allow the system to be returned to a known "blank slate" state, allowing repeatable performance testing.
Note: an intrinsic gives the system manager/performance tuner/software developer the ability to exercise the above functions programmatically. This is clearly superior to simply having a command for two reasons:
- A command can be written by the user which simply calls the intrinsic. The opposite is not inexpensively true.
- Intrinsics are not as easy to abuse by the casual user.
[Update 96/03/07: the author has a utility program, FFLUSH, which flushes the pages for all closed files out of memory. This allows for relatively easy and repeatable testing for many of the timing questions considered in this paper.]
One valuable tool used in this paper is DEBUG. Given a virtual address associated with a mapped file, the debugger can be used to determine the number of logical pages that are currently in memory. Assuming the file starts at virtual address $123.0, then the debugger command:
= vainfo ($123.0, "pages_in_mem"), #
will report (in decimal) the number of 4,096 byte logical pages that are currently in memory.
Obtaining optimum performance with MPE/iX is more difficult than on MPE V ... there are more things to tune, with much less knowledge. Things to remember:
- The amount of memory on the machine is critical;
- Migration to Native Mode is important, but should not be done blindly. If an application is a heavy KSAM or message file user, do some timing tests first.
- The "second migration" is more important ... it means taking advantage of the new features.
Perhaps, when MPE/iX begins to stabilize, and third-party performance tools are developed and marketed, the folklore on how to maximize performance will begin to grow as it did under MPE V. In the meantime, keep the faith!
NOTE: All timings in this paper were obtained running under MPE XL 1.1. Initial testing on MPE XL 1.2 shows no major differences.
FREADSEEK has been given a bad name in this article. Well, like the "goto", it has its uses. Further testing (and a lot of thought) resulted in a modification to the test program that was reading SL.PUB.SYS with the results:
SL.PUB.SYS sequential read (times in milliseconds)
  CPU   Elapsed    Delta   Access Method
-----   -------   ------   -------------
15273     49525    34252   Memory Mapped & FreadSeek  <---
19686    146298   126612   Memory Mapped
35398     44361     8963   FreadDir
35903     65936    30033   FreadDir & FreadSeek
36590     44957     8367   Fread
39769     64529    24760   Fread & FreadSeek
Notice the incredible change in the "Memory Mapped & FreadSeek" numbers (first line). The crucial difference here is in the timing and quantity of calls to the FREADSEEK intrinsic. Earlier testing showed that the best case "throughput" for reading data with a mapped file (where the data was already memory resident) was about 3,111 bytes per millisecond. (Obtained from the memory-resident speed of reading CATALOG.PUB.SYS (80 * 7040 bytes) in 181 milliseconds.) Clearly, any prefetch should be done far enough ahead of time that the data is in memory by the time it is needed. The above calculation showed that if we assume it takes 30 milliseconds to read data from disc, then it must be requested 30 * 3,111 bytes before it is needed.
The test program was adjusted to prefetch 128 records ahead (instead of 4 records ahead). The next round of timings showed a gain, but not as much as hoped for. Then, we realized that the prefetch was reading 8 logical pages. So, after processing 24 logical pages (100,000 bytes) the test program was prefetching 8 logical pages instead of 24! The program was modified again, to fetch 24 logical pages at a time, resulting in the times shown above.
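The prefetch-distance reasoning above reduces to a few lines of arithmetic. The sketch below uses the figures from the text (563,200 bytes in 181 milliseconds, and an assumed 30 milliseconds per disc read); the function names are hypothetical.

```python
LOGICAL_PAGE = 4096  # bytes per logical page


def mapped_throughput(bytes_read, cpu_ms):
    """Best-case bytes consumed per millisecond for memory-resident data."""
    return bytes_read / cpu_ms


def prefetch_lead_bytes(disc_ms, bytes_per_ms):
    """How far ahead of the current position a prefetch must be issued."""
    return disc_ms * bytes_per_ms


def prefetch_lead_pages(disc_ms, bytes_per_ms, page=LOGICAL_PAGE):
    """The same lead distance, rounded up to whole logical pages."""
    lead = int(prefetch_lead_bytes(disc_ms, bytes_per_ms))
    return -(-lead // page)   # ceiling division
```

With 3,111 bytes per millisecond and 30 milliseconds per read, the lead is 93,330 bytes, or 23 logical pages when rounded up. The test program used a slightly more generous 24.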
Moral: prefetching via FREADSEEK is worth the time, but ONLY after careful analysis. Failure to prefetch at the right time, or not enough data, is worse than not prefetching at all.