.nr LL 8i
.ND September 13, 1985
.RP
.TL
A MULTIUSER BENCHMARK
.br
TO COMPARE
.br
UNIX SYSTEMS
.AU
Philip M. Mills
.AI
NCR Corporation
3325 Platt Springs Road
West Columbia, SC 29169
phone 803 791 6377
.2C
.SH
Unix Benchmarks Inadequate
.PP
Despite all the evaluation of Unix systems using benchmarks,
no readily available benchmark today provides an accurate
estimate of a multiuser Unix system's overall
processing effectiveness.
Published articles showing benchmark results as well as
the results from executing benchmarks,
typically provide tables of data containing
measurements of different parts of the system,
leaving one with the job of interpreting the results
as best they can.
What one really wants to know is will a set of application programs
execute faster on one Unix system compared
to another Unix system.
.PP
The problem with the results provided by most
benchmarks is that one cannot accurately infer
which system can execute a set of application programs
the best for the following reasons:
.IP (1)
The benchmark does not contain a workload that provided
an adequate simulation of a Unix multiuser application environment.
.IP (2)
The benchmark ignores the system's ability to overlap operations.
.IP (3)
One is unable to correlate the time to perform simple operations with the
system's overall ability to process work.
.IP (4)
The benchmark cannot evaluate Unix systems with multiple main processors.
.PP
To overcome the difficulties just described,
a self-measuring portable multiuser multiprocessor unix
benchmark has been developed.
This benchmark provides:
.IP (1)
A reasonable approximation to a multiuser multiprocessor
unix application environment.
.IP (2)
A single measure of the overall system's processing effectiveness.
.IP (3)
An easy way to compare the relative overall system's
processing effectiveness between systems.
.IP (4)
A method of relating application requirements to the benchmark results.
.SH
Defining Performance
.PP
Whether buying computers or using them,  one
of the qualities of major importance is performance.
Since performance may be inadvertently confused with other attributes
of a computer system it is useful to describe
performance as it applies to computers and other mechanisms.
.PP
Performance may be described as
the result of carrying out a set
of actions in a specific environment to satisfy an objective
or goal.
In track and field each event has a
different environment and goal and the
results may be running a distance in
some amount of time or jumping
so many feet.
In performance the results of carrying out the prescribed set
of actions is usually
described in terms of measurable quantities.
.PP
There are two main results that are important
to measure in a Unix based computer system:
responsiveness and throughput.
Responsiveness
measures the system's ability to respond to a
work request in a given amount of time.
Throughput measures the total amount of
work performed over a given period of
time.
The work on a computer system
consists of the actions of electronic and electro-mechanical
processing.
Performance results are often measured by executing a given workload and
measuring the time spent processing the work.
.SH
Functionality
.PP
A computer has three major areas of functionality
that a workload should contain.
The first is the processing by the central processing unit (CPU)
where all the calculations are performed;
the second is the disk subsystem
where the computer stores data files; the third is serial I/O subsystem which
controls such items as video displays, terminals and printers.
.SH
Overlapped Operations
.PP
In a multiuser Unix system there
are many user processes or programs executing at
the same time.
This type of processing
allows the I/O processing of some user programs
to be overlapped with the CPU processing of other user programs.
The result
of overlapping the I/O operations with
CPU operations depends on the amount of intelligence
built into the I/O processors.
.PP
An intelligent serial I/O subsystem
may handle interrupts from multiple
on-line users, provide a slave
serial I/O processor, handle buffering for both
input and output and may even execute
parts of the Unix kernel's tty code.
Besides providing overlapped serial I/O operations for
the terminals and printers
with the main CPU processing, the intelligent serial I/O
subsystem may also greatly reduce
the amount of processing required by the main CPU.
The reason for this reduction is that functions or processing
that use to be done by the main CPU are now done by the serial
I/O subsystem.
.PP
An intelligent disk subsystem will have one or more
disk controllers and disk drives and may also have
a slave processor.
The disk subsystem
handles the control of the
disk drive operations, handles interrupts, handles buffering
and provides a high speed DMA data transfer in and out of main memory.
If a slave processor is available it may be used to execute some of
the Unix kernel file code.
Intelligent disk subsystems provide overlapped operations with the main CPU
and also reduces the load on the main processor.
In some disk subsystems, operations on separate disk drives can
also be overlapped with each other producing a much higher effective transfer
rate for the disk subsystem.
.SH
Overlapped Performance
.PP
The effects of overlapped operations
are extremely important to computer system performance.
This point is illustrated
by the following example.
Assume a workload
is developed that contains 100 seconds
of CPU processing, 100 seconds of disk file
processing and 100 seconds of terminal I/O
processing on computer system A and also on computer
system B.
System A has intelligent
disk and tty subsystems
which can overlap operations with the main CPU and each other
and system A also supports the common Unix multiuser mode.
Therefore system A can execute the workload in 100 seconds.
Although system B has the same CPU power it cannot overlap I/O operations.
System B will take 300 seconds of real time to 
execute the workload.
For the workload
described system A 
is three times more powerful than system B
in that it will have a throughput rate
three times that of system B.
Another way of putting
it is that for a fixed amount of time
system A will accomplish three times as
much work as system B.
.PP
Although the example was developed
to illustrate a point, a typical unix
workload contains a significant amount of disk and serial I/O
and the overlapping
of these operations by the hardware/software can make a
substantial improvement in performance.
.SH
Application
Benchmarks
.PP
A sound approach to evaluating the
performance of various computer systems is to
take the application or sets of application
programs that make up the user environment
and execute them on each computer system
to be evaluated.
It is most important that the set of application
programs be truly representative of the overall
user environment.
This approach is often not feasible because the
set of applications do not exist or it would
take too much manpower to port the application
programs to run on each computer system being
evaluated.
Also it may take too much time to execute a large
set of applications on a number of computer systems.
.SH
Portable Benchmarks
.PP
For the reasons just given portable 
benchmarks that are easy to obtain and
execute in a relatively short period of time
are often used to simulate the application program
environment.
These benchmarks should measure the results needed in performance evaluation.
The most important results
that should be provided are measurements
of throughput and responsiveness.
.PP
A well designed general purpose
benchmark used to simulate the workload
of the Unix application environment
should have the following features:
.IP (1)
A broad weighted mix of basic CPU operations used with different data
referencing modes.
.IP (2)
A number of user processes executing in a multiprocess mode.
.IP (3)
Terminal I/O taking place with a number of terminals at the same time.
.IP (4)
Disk I/O taking place with a number of files scattered 
over the disk drive surface to cause head movement.
The benchmark should have the ability to use
multiple disk drives.
.IP (5)
Results that can provide an accurate
measure of the computer system's ability to execute work
over time.
The workload producing the results should contain all the
processing operations on the system so the system's ability
to overlap operations will be taken into account.
.IP (6)
Benchmark should be portable across various Unix systems.
.SH
Evaluating Existing Benchmarks
.PP
Before developing a
new benchmark an evaluation of existing
available benchmarks was performed.
The benchmarks are of two types.
The first type evaluates the time to perform a set of
some simple basic operations such as add integer by placing
each operation in a loop and measuring the time to complete the loop.
The second type of benchmark consists of one or more
programs that are designed to be appropriate for a benchmark.
These programs are run and the time to execute the programs is measured.
.PP
The AIM Benchmark Suite II is a well known benchmark of the first type
used to evaluate Unix computer system's
performance (see Vol 11 No. 3 Page 89 for an example).
Given the objectives of providing an
accurate estimate of throughput or responsiveness
through the simulation of a Unix multiuser
application environment this benchmark did not meet our
goals for the following reasons:
.IP (1)
The benchmark failed to include the basic operations for comparisons,
logical, bit and branch operations which
find significant use in the Unix environment.
.IP (2)
The benchmark uses mostly register variables.
Where data references across
the various basic operations, such as, automatic,
static, pointer and array references
are commonly used in the Unix environment.
Array referencing is included as a separate test.
.IP (3)
The tty test only involves one or two terminals.
The use of multiple tty lines all active at the same time
represents the actual user
environment and its need can best be illustrated
by an example.
Assume the tty peripheral controller can handle 3200 characters per
second total on output and it is connected to 8 tty lines at 9600 baud.
For a single user the tty controller can maintain 960 characters per second
to that user terminal.
However, for 8 heavily active users the tty controller
can only maintain an average of only 400 characters per second per tty
line.
For certain workloads this limitation will affect
the overall throughput of the computer system.
.IP (4)
The disk test consists of reading, writing and copying a single file.
In a multiuser Unix environment a number of separate file disk I/O requests
are made which are scattered randomly over the disk drive surface.
This sequence of random disk I/O requests requires significant disk head movement.
.IP (5)
Unix computer systems
are often purchased with multiple disk drives.
It is therefore  necessary to be able to include
in the benchmark multiple disk requests at the same
time for all the disk drives in the system.
It is important to evaluate the system in this mode
because it more accurately simulates the user environment.
Many systems are able to overlap some disk I/O operations
between drives as well as with the main CPU.
The AIM benchmark did not provide this capability.
.IP (6)
There is no way to correlate the point system used by the AIM benchmark
in a multiuser environment with throughput or responsiveness.
The AIM point system cannot take into account the overlapped operations of tty 
and disk I/O and the reduction in the main CPU
processing time due to the intelligent I/O processors.
Even a simple floppy disk controller can overlap operations with the
main CPU.
.PP
Although the AIM Benchmark Suite II was developed to execute in a single user
mode it has been used extensively to evaluate multiuser Unix systems.
For our purposes of evaluating Unix multiuser systems the AIM benchmark
does not provide an adequate simulation of the Unix user environment,
and second, does not provide an accurate measure of the Unix computer
system's ability to process work.
.PP
The AIM benchmark does provide information that is quite useful in tuning
unix systems. This information can best be used (and is) by organizations
marketing Unix computer systems who are able to modify the hardware and
unix software to improve performance.
.PP
The Dhrystone is another 
benchmark which provides a good mix of statement
types, operators and data reference types but it does not
provide any I/O operations which are needed for the multiple
tty lines and multiple disk drives used by the Unix multiuser environment.
.SH
Throughput Vs. Responsiveness
.PP
In the development of the 
benchmark a decision had to be made to
measure throughput or responsiveness.
Although response time is a very important performance
measure in an interactive Unix multiuser
system it requires the use of a remote
terminal emulator computer to provide tty inputs
to obtain repeatable results.
.PP
Because responsiveness and throughput are
both very much related to a computer system's ability
to process work the relationship 
between the two was investigated.
The results of this investigation showed that a 20% increase in throughput would
provide a 20% or greater improvement
in response time.
How much greater than 20% the improvement in response time would be
depends on the ratio of the time to process an interactive user
request to the average user think time.
If the think time goes to zero, throughput and
responsiveness vary the same from system to system
for a given workload.
.PP
In an interactive system like Unix the user think time
tends to create idle time on the various system components.
Given that the user is not a part of the computer system that is to be purchased,
throughput is really the best measure of the computer system's
ability to process work.
For the reasons just stated the benchmark
measures throughput as a result of
executing a particular workload.
.SH
The System Characterization Benchmark
.PP
The purpose of the benchmark is
to compare Unix systems on their ability
to do work (throughput) using
a simulated Unix multiuser environment with
multiple tty lines, multiple disk drives and one or more main processors.
For a fair comparison each computer system should
execute the same workload with the same number
of tty lines and the same number of disk drives.
.PP
The application requirements for a computer
system will vary in each environment such as
word processing, accounting, graphics, data base management, spread sheet, compilers and
scientific computations.
For this reason the
benchmark (SCB) executes a number of different
workloads each with a different mix of I/O operations.
Because of the many different workloads used by the benchmark,
it is more of a system characterization than a single benchmark.
For this reason the benchmark is called the
System Characterization Benchmark (SCB).
.PP
The SCB has the following characteristics:
.IP (1)
The SCB measures the rate (throughput) at which fixed
units of work can be executed in
a unit of time (minutes).
.IP (2)
The SCB takes into account the effects of all
overlapped operations executed in parallel,
including multiple main processor units, if they exist.
.IP (3)
The SCB simulates a multiuser Unix application
environment with
.RS
.IP (a)
A broad mix of main processor operations and data references
.IP (b)
Multiple tty lines with terminals
.IP (c)
Multiple disk drives each with its own set of data files
.RE
.IP (4)
The SCB consists of 50 different runs to provide
a system characterization that will cover the many
different application environments.
.IP (5)
The SCB provides graphical output for a quick analysis
and comparisons on the basis of throughput.
.IP (6)
The SCB provides information on the effective rates
of the main central processing unit (CPU),
the disk subsystem and the tty subsystem.
.PP
The first part of the SCB consists of executing three programs
crun, drun and trun to produce the report shown in Figure 1 that
shows the effective capabilities of the one or more main CPU , the disk subsystem
and the tty subsystems respectively.
.PP
The purpose of crun is to obtain the relative processing power of
the one or more main CPUs.
The crun program also provides for multiprocessor systems
an estimate of the ratio of the multiprocessor's throughput
to a single processor's throughput.
If there is only one CPU this ratio is one.
.PP
The crun code consists of a weighted mix of basic operations
on various types of data references.
The frequency of execution of the basic operations involving the
different types of data references are weighted according to an
internal unpublished compiler study of Unix based C code.
Detail information on the mix of operations and data references
is contained in the SCB documentation.
The time to execute the crun code is measured and used to
determine the user CPU time per work unit process.
.PP
In setting up the SCB each disk drive
specified by the user has 200 files of
20K bytes each placed on it.
In the execution
of drun the user may specify the percent of
files to be read and written on each disk drive.
The disk program drun produces the
information on the disk subsystem in the
subsystem report by reading and writing file blocks
randomly over the different disk drive surfaces specified.
.PP
The serial I/O subsystem is usually made up
of one or more tty controllers where each
tty controller handles a fixed number of
tty lines/terminals.
The tty controller typically
can handle some maximum level
of characters per
second for tty output and input across
all lines.
The information in the subsystem report is obtained by writing 32
character lines
to all
tty ports as fast as the tty controllers can handle them.
.PP
The second part of the SCB consists of 50
separate runs controlled by a shell script.
Each run is made up of 200 work units.
The user processes execute 8 or some setable
multiuser level at a time.
Each work unit consists of the following components.
.IP (1)
The execution of a copy of xrun code with a
CPU user time proportional to the system's crun time.
.IP (2)
A given number of 1K disk block reads and writes.
.IP (3)
A given number of 32 character line writes
to a tty line/terminal.
.PP
Each work unit process in the run is
identical.
The disk and tty requests are equally spaced in the
CPU user time to provide more consistent and repeatable results.
Each work unit has its own unique disk read file and write file.
The disk files are preassigned to each work unit at random.
A work unit process has a unique tty line during its execution.
On completion the tty line is given to the next work unit process started.
.PP
The 50 runs made by a shell script uses all the combinations of
the following disk and tty levels:
.IP
disk blocks 0, 4, 8, 12, 16, 20, 24, 28, 32, 36
.IP
tty lines  0, 30, 60, 90, 120
.PP
The disk operations are divided evenly between reads and writes of existing
files.
The work units executed per minute are calculated for each of the 50 runs and
the results are presented in a graphical form.
.SH
The Results Of The SCB Benchmark
.PP
The results of executing the SCB are presented in the graphical
form shown in Figure 2.
The left axis represents the number of work units
that can be executed per minute (throughput).
The bottom axis is the number of disk requests
in a unit of work process.
.PP
The letters on the graph (A B C E F) represent the different levels
of tty line writes contained in each work unit process.
.LD
Letter        Number of tty 
              line writes

   A                  0
   B                 30
   C                 60
   D                 90
   E                120

.DE
.PP
The graphical results are produced on a 130 column printer and
the border on the graphical report is 8 1/2" x 11" so that it
may be placed into reports when desired.
.PP
There are 50 points on a graph each specified by a letter.
Each point represents the throughput rate in work units per minute.
Each point is determined by executing 200 work unit processes each with
a unit of CPU user processing and the number of disk and tty requests
specified on the graph.
.PP
The SCB results show how a computer system performs under a variety
of loads.
On moving to the right on the graph, the effect of executing more
disk requests is indicated by the steepness of the curve towards
the bottom axis.
The more limited the disk subsystem in capability the steeper
the downward slope to the right.
The different letters show the effect of increasing the
tty load per work unit process.
The more limited the tty subsystem in capability the greater the downward
distance between the curves of the same letter.
Point A next to the left axis shows the system's
ability to execute CPU user processing with no I/O.
.PP
Facilities are also provided for comparing two different computer
systems by taking the ratio of the throughputs of all 50 points and
graphing the results shown in Figure 3.
The left axis shows the ratio and the bottom axis and the
letters (A B C E F) are the same as Figure 2.
This graph shows in which area or areas (main CPU, tty, disk) one system
is superior to the other.
Ratios greater than one show an increase in capability in percent above one
and ratios less than one show less capability in percent under one.
.PP
The ratio graph of Figure 3 shows machine A is superior in handling disk requests.
Machine B has a cpu/compiler speed that is about five percent faster.
The two machines are equivalent in handling tty write requests.
.SH
Obtaining And Running The SCB
.PP
The software for the system characterization benchmark (SCB)
may be obtained free on IBM PC or NCR Tower compatible floppy from NCR
by writing on company letterhead to:
.ID

SCB Distribution
NCR Corporation
3325 Platt Springs Road
West Columbia, SC  29169
Attention: Ms. Darlene Amick
.DE
.PP
If you wish the software on nine track tape you need to send one
with your request.
The set up and operation is described in the help
file on the media.
Once a simple config file is set up, two shell scripts that are provided
handle the rest.
.SH
Relating User Application To The SCB
.PP
Using the ratio graph system A can now be compared
with system B across a broad mix of different workloads.
However system A may be better than system B (ratio greater than one)
over certain regions of the graph
and worse than system B (ratio less than one)
on other regions of the graph.
For this reason users may wish to identify the region
of the SCB graph for their typical applications.
.PP
To obtain information on the application,
execute the application with the unix
timex command.
The timex command with the o option
gives the CPU user time, the total number of blocks read or written to disk
and the total characters transferred to the terminals.
Divide the total number of
CPU user seconds into total disk blocks
read and written and into the total
characters transferred to the terminal.
This gives two quantities:
the number of disk transfers per second of user CPU time and
the number of tty characters transferred per second of user CPU time.
.PP
These two quantities can be
related to a location on the SCB graphs.
First obtain the user CPU time per work unit process by taking it
from the subsystem report
and calling this quantity C.
Multiply C by the number of disk requests per second
of user CPU time from the application.
This gives the number of disk requests on the bottom
axis of the SCB graphs.
Next multiply C
times the number of characters transferred
per second of user CPU time from the application
divided by 32 giving the number of tty lines on the SCB graphs.
Having related the application to a location on the SCB graphs
the relative performance for the application can now be estimated
for every system on which the SCB was run.
The run time for the application on a new system can be estimated
by multiplying the run time on the old system by the ratio on the SCB
ratio graph at the application point.
.PP
The Unix timex command may also be used to execute shell commands
which means whole sets of applications may be related to locations on the
SCB graphs.
Using the SCB ratio graphs the percent improvement
in throughput can be estimated for a number of new systems
for different types of applications.
These estimates can be obtained without the need to port any of
the application programs.
.SH
Conclusions
.PP
By the very nature of benchmarks they are
an approximation of the user application environment.
While there are many different benchmarks measuring various elements
of Unix computer systems, a decision to buy a certain computer should be based
on the system's ability to handle the workload in the form
of electronic and electro-mechanical processing.
The time to perform an integer add  or read a single
disk file is useful information but is certainly not sufficient information
to base a computer purchase on.
.PP
The problem with gathering a lot of details on Unix computer systems
in the form of specifications and the results of running various
benchmarks is that it is extremely difficult, if not impossible,
because of the system's internal complexity to accurately
relate this information to a single measure of processing effectiveness.
The SCB uses the system itself to relate all the various operations
and
provide one measure of processing effectiveness.
Because the processing effectiveness varies with the nature of
the workload the SCB estimates performance
for a number of different workloads.
Providing workloads with various I/O levels is an important part
of the SCB design because many Unix systems are I/O intensive.
For a number of applications I/O turns out to be the limiting factor
that determines throughput.
.PP
The SCB provides a reasonable approximation to a multiuser Unix application
environment, provides an easy way to compare the relative processing
effectiveness between systems and provides a method of narrowing
performance estimates to  specific or sets of application programs.
.PP
The SCB is easy to set up and run and it provides a comprehensive and
accurate benchmark.
.SH
Acknowledgment
.PP
I wish to thank Myron Merritt for his many hours of discussion and his ideas used in the development of the system characterization benchmark.
