.HS I
.if "\*(.T"mag" \{\
.	nr PS 12
.	nr VS 14
.\}
.pl 9.75i
.TR
.DR
.de CN
..
.TL
Design Considerations for Remote File Systems
(Extended Abstract)
.AU
T. Brunhoff
.AI
Computer Environments Group
Applied Research Group
Tektronix, Inc.
.AB
Several remote file systems have been written,
including one by the author called \fBRemotefs\fR.
This paper covers the design choices
that can be made at several software levels,
from where the hooks for a remote file system
lie in the operating system,
on up to the user interface,
and reveals those made by \fBRemotefs\fP.
The reader should have a strong familiarity
with the 4.2 BSD kernel function \fInamei()\fP,
the concept of mount points,
the system call interface and
the 4.2 BSD socket paradigm.
.AE
.SH
History
.PP
The Computer Research Labs within the Applied Research Group
has approximately forty-five internally\-designed workstations,
called
.I Magnolias,
twenty newly announced Tektronix 4404 AI workstations,
called \fIPegasus\fP, a
.I VAX 11/780
and a
.I VAX 11/750.
The Computer Environments Group,
within the Computer Research Labs,
cares for most of these machines and the software that runs on them.
.PP
After porting 4.2 BSD Unix to the
.I Magnolia,
the amount of software available quickly outgrew
the capacity of its 35-megabyte winchester drives.
To alleviate this,
the author designed and, in December of 1984, began to write a
remote file system based on an implementation paradigm used
by K. McKusick in his implementation of the 4.2 BSD file system;
i.e., "write it in user-mode to fit in the kernel".
This paper is in part
about that implementation,
and in part about the design and
implementation of remote (or distributed) file systems in general.
At this writing,
the design still lies mostly in the user level,
linked in by the \fBld(1)\fP flag
\fI\-lremote\fP,
with a few new system calls.\(dg
.FS \(dg
This remote file system,
known simply as \fBRemotefs\fP,
should not be confused with another,
more complete remote file system,
called \fBDFS\fR.
The latter is available on the Tektronix 6000 series workstations.
.FE
.NH 1
Choosing a Springboard for the Software
.PP
The focus of I/O activity on 
.B UNIX
is the inode;
each time a file is opened, read, written, locked, closed, etc.,
the inode is referred to.
These system calls
converge on the system call interface which dispatches calls to
the appropriate internal routine.
Any system calls that involve a path name must call \fInamei()\fP
for the inode information, and similarly, any system calls that
deal with file descriptors must refer to the inode information
generated by an \fIopen()\fP or \fIcreat()\fR.
Only then can the data on the disks be accessed.
.so fig1.\*(.T
.PP
For example, in Figure 1,
\fIopen()\fP makes a request to the system call interface;
the system call interface determines that the \fIopen()\fP system call must
be executed (the kernel \fIopen()\fP is just a call to \fIcopen()\fP).
\fICopen()\fP then calls \fInamei()\fP to get the inode information,
which in turn calls the appropriate disk device driver to get the inode
from the correct disk.
Subsequent \fIread()\fP
or \fIwrite()\fP calls use this information to access the disk.
It makes sense to make \fInamei()\fP the focal point for the remote
file system implementation
because of its critical role.
But there are other approaches.
.NH 2
\&\.\.\. From the Device Driver
.PP
Since \fInamei()\fP gets its information from the
disk via the disk driver,
we have only to
replace the disk driver with a \fIremote\fP disk driver.
This remote disk driver would be designed to send requests for disk
blocks directly to a remote host to be satisfied
from a single partition on its own disk.
.so fig2.\*(.T
.PP
Now, following the previous example in Figure 2,
\fInamei()\fP may instead encounter a mount point while
trying to find an inode for a file,
and will get its inode information from the remote disk driver.
Similarly,
reads and writes
request blocks from the \fIremote\fP disk driver
using this inode information.
.PP
This is where early implementations put remote file systems.
It offers speed and a good deal of portability with
the kernel changes limited to the device driver,
but it limits usefulness because each partition
on every remote system must be mounted,
and access can only be read\-only.
.NH 2
\&\.\.\. In \fInamei()\fP
.PP
There are two ways of checking for ``remoteness''
in \fInamei()\fP,
but the key change to \fInamei()\fP is that it must fail in
its inode lookup when it encounters a path name component
on a remote machine; then it must return with a special error.
This \fInamei()\fP failure mechanism will be alluded to later.
.PP
One method, depicted in Figure 3,
is to catch any reference to a special
syntax of path name,
such as \fI/\.\.\|/host/pathname\fP,
\fI/net/host/pathname\fP
or \fI//host/pathname\fP.
This is a cue to \fInamei()\fP
to return a special error code to the invoking system call.
It is then
the responsibility of that system call to send
a request to a server on the remote host.
This special syntax is very convenient because the
\fIhost\fP component of the path
need not correspond to some existing ``mount point''.
Hence, hosts can be mounted and unmounted on demand
if the implementor cares to take the trouble.
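.PP
As a rough illustration of the special-syntax check,
the sketch below recognizes the three example prefixes
and splits off the host name.
The function name \fIsplit_remote_path()\fP and its arguments
are hypothetical and are not taken from \fBRemotefs\fP
or from any kernel;
a real \fInamei()\fP would make this test while walking the path.
.DS
```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: recognize a path that begins with one of the
 * special syntaxes and split it into a host name and the rest of the
 * path.  Returns 1 for a remote path, 0 for a local one (or when the
 * host name is empty or does not fit in the caller's buffer).
 */
int split_remote_path(const char *path, char *host, size_t hostlen,
                      const char **rest)
{
    static const char *const prefixes[] = { "/../", "/net/", "//" };
    size_t i;

    for (i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
        size_t plen = strlen(prefixes[i]);
        const char *h, *slash;
        size_t n;

        if (strncmp(path, prefixes[i], plen) != 0)
            continue;
        h = path + plen;                /* start of the host name */
        slash = strchr(h, '/');         /* end of the host name, if any */
        n = slash ? (size_t)(slash - h) : strlen(h);
        if (n == 0 || n >= hostlen)
            return 0;
        memcpy(host, h, n);
        host[n] = 0;
        *rest = slash ? slash : "/";    /* path as seen by the remote host */
        return 1;
    }
    return 0;
}
```
.DE
Under these assumptions,
\fI//tek/usr/lib\fP yields the host \fItek\fP and the path \fI/usr/lib\fP,
while \fI/usr/lib\fP is reported as local.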
.so fig3.\*(.T
.PP
A second strategy is very similar
except that it uses a more natural
syntax of \fI/host/pathname\fP
(without needing symbolic links).
An important point is that the host cannot be ``mounted''
on a directory,
but rather on a special mount point,
or even a plain file.
The reason for this is a bit obscure,
but will be clarified shortly.
.PP
The special path names like \fI/\.\.\|/host/pathname\fP
or mount points like \fI/host\fP are needed partly because no
UNIX program should ever find these gateways through normal perusal of
a file system.
Imagine how long the command ``\fIfind / -print\fP'' would take
if it traversed every remote host as well as itself!
For this reason, using a directory for a mount point would
not be appropriate.
.PP
\fBRemotefs\fP uses a plain file as a mount point because of some extra
benefits:
the simplicity of the code changes to \fInamei()\fP,
and not having to add another file type for \fBUNIX\fP utilities to learn.
The path name \fI/host\fP remains a valid local filename,
but \fI/host/\fP or anything longer results in
a special case, which \fInamei\fP labels with the error
\fBENOTDIR\fP
(See Appendix A).
It is this place in the \fBUNIX\fP kernel
that \fBRemotefs\fP detects all remote file references.
.NH 3
An Aside: When to Follow a Symbolic Link in \fINamei()\fP.
.NH 2
\&\.\.\. At the User Level.
.PP
A slight variation of the above,
shown in Figure 4,
is to simply place the check for ``remoteness'' in
the appropriate system calls within the C runtime library,
\fIlibc.a\fP
(see Appendix B for this list).
Unfortunately,
this requires the user level software to duplicate
what \fInamei()\fP does
whenever a system call involving a path name returns the
error \fBENOENT\fR or \fBENOTDIR\fR.
This implementation approach is typically slower,
but very portable.
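.PP
A minimal sketch of such a library wrapper follows.
It assumes the plain-file mount point convention described earlier,
and the names \fIremote_prefix()\fP and \fIwrapped_open()\fP
are hypothetical.
The wrapper only reports that the path is remote;
sending the request to a server is left out,
and a real library would presumably also consult a table of known hosts
rather than treat every plain file as a gateway.
.DS
```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: redo at user level part of what namei() does.
 * Walk the path one component at a time and return the length of the
 * first prefix that names a plain file yet is followed by '/'.
 * Returns 0 if there is no such prefix.
 */
size_t remote_prefix(const char *path)
{
    char buf[1024];
    struct stat st;
    const char *p;

    if (path[0] == 0 || strlen(path) >= sizeof buf)
        return 0;
    for (p = strchr(path + 1, '/'); p != NULL; p = strchr(p + 1, '/')) {
        size_t n = (size_t)(p - path);

        memcpy(buf, path, n);
        buf[n] = 0;
        if (stat(buf, &st) != 0)
            return 0;           /* prefix does not exist at all */
        if (S_ISREG(st.st_mode))
            return n;           /* plain file used as a directory */
    }
    return 0;
}

/*
 * Hypothetical wrapper for open(): try the local call first and look
 * for a remote mount point only after ENOENT or ENOTDIR.
 */
int wrapped_open(const char *path, int flags, int *is_remote)
{
    int fd = open(path, flags);

    *is_remote = 0;
    if (fd < 0 && (errno == ENOENT || errno == ENOTDIR)) {
        int saved = errno;

        if (remote_prefix(path) > 0)
            *is_remote = 1;     /* a real library would call the server */
        errno = saved;
    }
    return fd;
}
```
.DE
The extra \fIstat()\fP calls are made only after the local call has failed,
so purely local references pay nothing;
they are nevertheless one reason this approach is slower
than a check inside \fInamei()\fP.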
.so fig4.\*(.T
.NH 1
File Descriptors
.PP
Once an \fIopen()\fP or \fIcreat()\fP has succeeded
on the remote host and returned a file descriptor, say \fIi\fP,
we must allocate a real file descriptor, \fIj\fP, on the local machine.
This may be done in the kernel or user level code,
but it is most important that the user's idea of the ordinal
value of \fIj\fP remain inviolate.
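.PP
One way to do this at user level,
sketched here under assumptions of our own,
is to reserve the local slot \fIj\fP with a placeholder descriptor
and to keep a table that maps \fIj\fP to the remote descriptor \fIi\fP.
The table size, the use of \fI/dev/null\fP as the placeholder,
and all function names are hypothetical.
.DS
```c
#include <fcntl.h>
#include <unistd.h>

#define MAXFD 64                /* hypothetical table size */

/* remote_fd[j] holds i + 1 when local j stands for remote i, else 0 */
static int remote_fd[MAXFD];

/* Reserve a real local descriptor j for remote descriptor i. */
int bind_remote_fd(int i)
{
    int j = open("/dev/null", O_RDONLY);   /* placeholder holding slot j */

    if (j < 0)
        return -1;
    if (j >= MAXFD) {
        close(j);
        return -1;
    }
    remote_fd[j] = i + 1;
    return j;
}

/* Return the remote descriptor behind j, or -1 if j is purely local. */
int lookup_remote_fd(int j)
{
    if (j < 0 || j >= MAXFD || remote_fd[j] == 0)
        return -1;
    return remote_fd[j] - 1;
}

/* Forget the mapping and give slot j back to the kernel. */
int release_remote_fd(int j)
{
    if (lookup_remote_fd(j) < 0)
        return -1;
    remote_fd[j] = 0;
    return close(j);
}
```
.DE
Because the kernel itself hands out \fIj\fP,
the value follows the usual lowest-free-descriptor rule,
and a later local \fIopen()\fP cannot be given the same number.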
.NH 2
File Descriptors Handled at User Level
.NH 2
File Descriptors Handled at Kernel Level
.NH 2
Inheriting File Descriptors Across a \fIfork()\fP or \fIexec()\fP
.NH 2
Reading Directories on a Remote Host
.NH 1
Changing Directories
.PP
Implementing the ability to change
directories is a big win for any implementation
because interactive shells will then allow you to
peruse directories on remote hosts.
However, inheritance of file descriptors must
be implemented,
as explained in section 2.3.
The \fIchdir()\fP executing on the remote host
does nothing special.
If it succeeds, all is well.
But on the local side,
the software cannot change state (the current
working directory)
to match what has occurred on the remote machine.
Instead,
it must simply remember the change in some way.
.NH 2
Interpreting Pathnames
.PP
If the remote file system software lies entirely in user\-level code,
then the only solution is for the software
to remember \fIchdir()\fP's path name argument
and that the current directory is on a remote host.
Then when a new path name is passed to a system call,
the software need only check whether it is absolute
or relative (with or without a leading '/' character, respectively).
If it is relative,
then the request must be sent to the remote host.
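.PP
The user-level bookkeeping might look like the following sketch.
The names and the fixed-size buffer are hypothetical,
and the sketch ignores a relative \fIchdir()\fP
made while already on the remote host.
.DS
```c
#include <string.h>

/* Remembered remote directory; an empty string means the cwd is local. */
static char remote_cwd[1024];

/* Record a chdir() that has succeeded on the remote host. */
int note_remote_chdir(const char *dir)
{
    if (strlen(dir) >= sizeof remote_cwd)
        return -1;
    strcpy(remote_cwd, dir);
    return 0;
}

/* Record a chdir() back to a local directory. */
void note_local_chdir(void)
{
    remote_cwd[0] = 0;
}

/*
 * Decide where a path name must be interpreted.
 * Returns 1 and builds the remote path in out when the request must be
 * sent to the remote host, 0 when the ordinary local handling applies,
 * and -1 when out is too small.
 */
int resolve_path(const char *path, char *out, size_t outlen)
{
    if (path[0] == '/' || remote_cwd[0] == 0)
        return 0;               /* absolute path, or local cwd */
    if (strlen(remote_cwd) + 1 + strlen(path) >= outlen)
        return -1;
    strcpy(out, remote_cwd);
    strcat(out, "/");
    strcat(out, path);
    return 1;
}
```
.DE
An absolute path returns 0 here only because it needs no help from
the remembered directory;
it is still subject to the ordinary check for remoteness.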
.PP
On the other hand, if the software uses a special
mount point like \fI/host\fP,
the kernel can arrange
for the process's working directory inode
to be the mount point's inode.
This is very convenient because no absolute vs. relative
checks are necessary
and nothing need be added to \fInamei()\fP.
For example,
absolute path names in a system call will still cause the mechanism
to function normally.
And relative path names will immediately fail in \fInamei()\fP
(remember our key change in section 1.2).
See Appendix A.
.NH 2
Pwd(1) and Changing to ``/\.\.'' on the Remote Host
.NH 1
Special Problems
.NH 2
\fIExec()\fP
.NH 2
\fIFork()\fP and \fIvfork()\fP
.NH 2
\fISelect()\fP
.NH 2
Uniqueness of Files Across Hosts
.NH 1
Permissions Across Hosts
.NH 2
Database Model
.NH 2
Dynamic Model
.NH 1
Server Design
.NH 2
When to \fIfork()\fP
.NH 2
How to Change Uid/Gid permissions
.NH 2
File Descriptor Overload
.NH 2
Communication Model
.NH 3
Protocol
.NH 3
Who Answers the Phone?
.NH 3
Speed Improvements
.PP
Imagine
a very loose view of the protocol (moving downward):
.so fig5.\*(.T
A remote file system implementation
has a decidedly synchronous flavor to it,
and for most system calls,
nothing else is appropriate.
But \fIread()\fP and \fIwrite()\fP system calls
lend themselves very well to optimization,
specifically, lookahead.
.PP
A change in the protocol could be made
based on expected requests.
After,
say, two consecutive \fIread()\fP requests
on the same file descriptor for the same number of bytes,
the local host could ask the server to continue servicing
the same request until further notice.
The response would contain the same information
that would be expected on a normal request,
and would, of course,
terminate on an error or end\-of\-file.
The remote host could easily detect and recover from a termination
of this kind, too.
The difficult part would be for the local host to try to stop
the servicing before end\-of\-file.
So,
our protocol now would be
.so fig6.\*(.T
Notice that the responses may continue on beyond
the request to stop,
but the acknowledgment of the request to stop would put
the hosts back in sync.
The remote host has only to reset its read pointer
back to the point where it had serviced
\fIn\fP requests,
and the local host must read the responses up to and including
the acknowledgment.
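.PP
The bookkeeping on each side can be sketched as follows.
This is a model of the idea only, not a wire protocol:
the structures, message names and functions are all hypothetical,
and every response is assumed to carry a full \fIlen\fP bytes.
.DS
```c
enum msg { MSG_DATA, MSG_STOP_ACK };

struct read_hist {              /* local host: last read() seen */
    int fd;
    long nbytes;
    int run;                    /* consecutive identical requests */
};

struct stream {                 /* remote host: continuation state */
    long start;                 /* file offset when streaming began */
    long len;                   /* bytes per request */
    long sent;                  /* responses sent so far */
};

/*
 * Local host: note a read() and return 1 once two consecutive requests
 * on the same descriptor asked for the same number of bytes.
 */
int note_read(struct read_hist *h, int fd, long nbytes)
{
    if (h->run > 0 && h->fd == fd && h->nbytes == nbytes) {
        h->run++;
    } else {
        h->fd = fd;
        h->nbytes = nbytes;
        h->run = 1;
    }
    return h->run >= 2;
}

/*
 * Remote host: the local host asks to stop after using n responses.
 * Returns the offset to which the read pointer must be reset.
 */
long server_stop(struct stream *s, long n)
{
    if (n > s->sent)
        n = s->sent;
    s->sent = n;
    return s->start + n * s->len;
}

/*
 * Local host: read in-flight messages up to and including the
 * acknowledgment.  Returns how many were read, or -1 if the
 * acknowledgment has not arrived yet.
 */
int drain_until_ack(const enum msg *queue, int count)
{
    int i;

    for (i = 0; i < count; i++)
        if (queue[i] == MSG_STOP_ACK)
            return i + 1;
    return -1;
}
```
.DE
If the server began at offset 4096, sent ten responses of 512 bytes,
and is told that only six were used,
the read pointer goes back to 4096 + 6 * 512 = 7168.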
.PP
The protocol also may
have to refuse continuation service for file descriptors
that read from devices.
.PP
\fIWrite()\fP is similar,
but recovery when the remote host
reaches an end\-of\-file or encounters an error
would be much more complicated, and in some cases impossible.
The local host,
on receipt of a request to stop from the remote host,
would have to not only reset its idea of the write pointer,
but perhaps the read pointer from which the data was gathered
to do the write.
Considering that the reading may have been done from multiple
files or the data was transformed in some way,
the remote file system software may not be able to accomplish the task.
It appears to be feasible only if the implementor is willing to sacrifice
identical behavior of user\-level software on remote vs. local file
systems.
.NH 1
Status of \fBRemotefs\fP
.NH 1
Appendix A
.so appendixB.out
.NH 1
Appendix C
