/* * lzcomp [-options] infile outfile */ #ifdef DOCUMENTATION title lzcomp File Compression index File compression synopsis .s.nf lzcomp [-options] [infile [outfile]] .s.f description lzcomp implements the Lempel-Ziv file compression algorithm. (Files compressed by lzcomp are uncompressed by lzdcmp.) It operates by finding common substrings and replaces them with a variable-size code. This is deterministic, and can be done with a single pass over the file. Thus, the decompression procedure needs no input table, but can track the way the table was built. Options may be given in either case. .lm +8 .p -8 -B Input file is "binary", not "human readable text". This is necessary on Dec operating systems, such as VMS and RSX-11M, that treat these files differently. (Note that binary support is rudamentary and probably insufficient as yet.) (On VMS version 4, this is ignored unless the -x option is specified or the input file is record-oriented.) .p -8 -M bits Write using the specified number of bits in the code -- necessary for big machines making files for little machines. For example, if compressing a file on VMS which is to be read on a PDP-11, you should select -M 12. .p -8 -V [n] Verbose if specified. If a value is specified, it will enable debugging code (if compiled in). .p -8 -X [n] "Export" -- write a file format that can be read by other operating systems. Only the bytes in the file are copied; file attributes are not preserved. If specified, the value determines the level of compatiblity. If not specified, or specified with an explicit value of zero, and lzcomp is running on Vax/VMS version 4 under VaxC and the input file is a disk or magtape file (block-oriented), a VMS-private output format is used which is incompatible with the Unix compress utility, but which preserves VMS file attributes. -X may take on the following values: .lm +4.s .i -4;#0##Choose VMS private format. See restrictions below. .i -4;#1##Compatible with Unix compress version 3.0: this is the default if -x is given without a value. .i -4;#2##As above, but supress "block compression" .i -4;#3##Supress block compression and do not output a compress header block. This is for compatiblity with a quite early version of Unix compress (and requires conditional-compilation to use). .lm -4.s Note that the -B (binary) option is ignored unless the input file is "record-oriented", such as a terminal or mailbox. .lm -8.s The other two arguments are the input and output filenames respectively. Redirection is supported, however, the output must be a disk/tape file. The file format is almost identical to the current Unix implementation of compress (V4.0). Files written by Unix compress should be readable by lzdcmp. Files written by lzcomp in export (-x) format will be readable by Unix compress (except that lzcomp outputs two "clear" codes to mark EOF. A patch to Unix compress is available.) VMS Restrictions VMS Private mode stores the true name and attributes of the input file into the compressed file and lzdcmp restores the attributes (and filename if requested). The following restrictions apply -- they may be lifted in the future as they are primarily due to the author's lack of understanding of the intricacies of of VMS I/O: All files must be stored on disk. The lzcomp output file must be specified directly. Also, for all usage on VMS, the compressed file must be written to, and read from disk. LZW compression algorithm This section is abstracted from Terry Welch's article referenced below. The algorithm builds a string translation table that maps substrings in the input into fixed-length codes. The compress algorithm may be described as follows: 1. Initialize table to contain single-character strings. 2. Read the first character. Set (the prefix string) to that character. 3. (step): Read next input character, K. 4. If at end of file, output code(); exit. 5. If K is in the string table: Set to K; goto step 3. 6. Else K is not in the string table. Output code(); Put K into the string table; Set to K; Goto step 3. "At each execution of the basic step an acceptable input string has been parsed off. The next character K is read and the extended string K is tested to see if it exists in the string table. If it is there, then the extended string becomes the parsed string and the step is repeated. If K is not in the string table, then it is entered, the code for the successfully parsed string is put out as comprssed data, the character K becomes the beginning of the next string, and the step is repeated." The decompression algorithm translates each received code into a prefix string and extension [suffix] character. The extension character is stored (in a push-down stack), and the prefix translated again, until the prefix is a single character, which completes decompression of this code. The entire code is then output by popping the stack. "An update to the string table is made for each code received (except the first one). When a code has been translated, its final character is used as the extension character, combined with the prior string, to add a new string to the string table. This new string is assigned a unique code value, which is the same code that the compressor assigned to that string. In this way, the decompressor incrementally reconstructs the same string table that the decompressor used.... Unfortunately ... [the algorithm] does not work for an abnormal case. The abnormal case occurs whenever an input character string contains the sequence KKK, where K already appears in the compressor string table." The decompression algorithm, augmented to handle the abnormal case, is as follows: 1. Read first input code; Store in CODE and OLDcode; With CODE = code(K), output(K); FINchar = K; 2. Read next code to CODE; INcode = CODE; If at end of file, exit; 3. If CODE not in string table (special case) then Output(FINchar); CODE = OLDcode; INcode = code(OLDcode, FINchar); 4. If CODE == code(K) then Push K onto the stack; CODE == code(); Goto 4. 5. If CODE == code(K) then Output K; FINchar = K; 6. While stack not empty Output top of stack; Pop stack; 7. Put OLDcode,K into the string table. OLDcode = INcode; Goto 2. The algorithm as implemented here introduces two additional complications. The actual codes are transmitted using a variable-length encoding. The lowest-level routines increase the number of bits in the code when the largest possible code is transmitted. Periodically, the algorithm checks that compression is still increasing. If the ratio of input bytes to output bytes decreases, the entire process is reset. This can happen if the characteristics of the input file change. VMS Private File Structure In VMS Private mode, the compressed data file contains a variable-length (but compressed) file header with the file "attributes" needed by the operating system to construct the file. This allows the decompression program to recreate the file in its original format, which is essential if ISAM databases are compressed. The overall file format is as follows: .lm +8 .p -8 LZ_SOH "start of header" signal (this value cannot appear in user data). A variable-length data record (maximum 256 bytes) containing the header name, followed by whitespace, followed by header-specific information. In this case, the name record will contain the string "vms$attributes" followed by the number of bytes in the attribute data block. (I assume that the name record will consist of a facility name, such as "vms", followed by a dollar sign, followed by a facility-unique word.) .p -8 LZ_EOR Signals "end of record". This is followed by a VMS file attributes record (generated by a VMS system library routine). .p -8 LZ_ETX Signals "end of segment". .p -8 ST_STX Signals "start of text" (i.e., start of data file). This is followed by the user data file. .p -8 LZ_ETX Signals "end of segment" .p -8 LZ_ETX Two in a row signals "end of file". .s.lm -8 Note that this format can easily be extended to include trailer records (with file counts and checksums) and/or multiple data files in one compressed file. Note also that the LZ_CLEAR code may appear in headers or data files to cause the decompression program to "readapt" to the characteristics of the input data. LZ_STX and LZ_SOH reset the compression algorithm. LZ_EOR does not. Authors The algorithm is from "A Technique for High Performance Data Compression." Terry A. Welch. IEEE Computer Vol 17, No. 6 (June 1984), pp 8-19. This revision is by Martin Minow. Unix Compress authors are as follows: .s.nf Spencer W. Thomas (decvax!harpo!utah-cs!utah-gr!thomas) Jim McKie (decvax!mcvax!jim) Steve Davies (decvax!vax135!petsd!peora!srd) Ken Turkowski (decvax!decwrl!turtlevax!ken) James A. Woods (decvax!ihnp4!ames!jaw) Joe Orost (decvax!vax135!petsd!joe) .s.f #endif /* * Compatible with compress.c, v3.0 84/11/27 */ /*)BUILD $(PROGRAM) = lzcomp $(INCLUDE) = lz.h $(CPP) = 1 $(FILES) = { lzcmp1.c lzcmp2.c lzcmp3.c lzio.c lzvio.c } */ #include "lz.h" #ifdef unix #include #include #endif /* * These global parameters are written to the compressed file. * The decompressor needs them. The initialized values are defaults * and are modified by command line arguments. */ short maxbits = BITS; /* settable max # bits/code */ code_int maxmaxcode = 1 << BITS; /* Totally largest code */ code_int hsize = HSIZE; /* Actual hash table size */ /* * Flags (command line arguments) to control compression. */ #if VMS_V4 flag export = 0; /* Assume vms "private" mode */ #else flag export = 1; /* Assume Unix compatible mode */ #endif flag block_compress = TRUE; /* Assume block compression */ flag binary = FALSE; /* Reading text file if FALSE */ flag noheader = FALSE; /* No magic header if TRUE */ flag verbose = VERBOSE_DEFAULT; /* Non-zero for status/debug */ flag background = FALSE; /* TRUE (Unix) if detached */ readonly flag is_compress = TRUE; /* for lzio.c (needed?) */ long fsize; /* Input file size in bytes */ char *infilename = NULL; /* For error printouts */ char *outfilename = NULL; /* For openoutput and errors */ int firstcode; /* First code after internals */ count_int tot_incount = 0; /* Total number of input bytes */ count_int tot_outcount = 0; /* Total number of output codes */ extern count_int in_count; extern count_int out_count; static long start_time; /* Time we started (in msec) */ extern long cputime(); /* Returns process time in msec */ STREAM instream; STREAM outstream; char_type inbuffer[MAXIO]; char_type outbuffer[MAXIO]; static STREAM mem_stream; jmp_buf failure; #if VMS_V4 #include types #include stat #include descrip #ifndef FDLSTUFF #define FDLSTUFF char #endif FDLSTUFF *fdl_input; FDLSTUFF *fdl_output; static struct dsc$descriptor fdl_descriptor; #endif main(argc, argv) int argc; char *argv[]; /* * Compress mainline */ { #ifndef decus /* * background is TRUE if running detached from the command terminal. */ background = (signal(SIGINT, SIG_IGN) == SIG_IGN) ? TRUE : FALSE; if (!background) background = !isatty(fileno(stderr)); if (!background) { if (verbose > 0) signal(SIGINT, abort); else { signal(SIGINT, interrupt); signal(SIGSEGV, address_error); } } #endif if (setjmp(failure) == 0) { setup(argc, argv); /* Command line parameters */ openinput(); /* Open input, set instream */ getfilesize(); /* Get input file size */ gethashsize(); /* Get actual hash table size */ initialize(); /* Set maxbits and the like */ openoutput(); /* Open output file */ if (verbose > 0) start_time = cputime(); put_magic_header(); init_compress(TRUE); compress(&instream); #if VMS_V4 if (export == 0) { outputcode((code_int) LZ_ETX); outputcode((code_int) LZ_ETX); fdl_close(fdl_input); } else #endif if (block_compress) { outputcode((code_int) LZ_CLEAR); outputcode((code_int) LZ_CLEAR); } outputcode((code_int) -1); /* Flush output buffers */ #if VMS_V4 if (export == 0) fdl_close(fdl_output); else { fclose(stdout); } #else fclose(stdout); #endif if (verbose > 0) { start_time = cputime() - start_time; tot_incount += in_count; tot_outcount += out_count; fprintf(stderr, "%ld chars in, %ld bytes out, ", tot_incount, tot_outcount); if (tot_outcount > 0) { divout("compression ratio: ", (long) tot_incount, (long) tot_outcount, ""); divout(" (", ((long) tot_incount - (long) tot_outcount) * 100, (long) tot_incount, "%)\n"); } fprintf(stderr, "%ld.%02ld seconds (process time) for compression.\n", start_time / 1000L, (start_time % 1000L) / 10L); if (start_time > 0) { divout(" ", (long) tot_incount * 10L, (start_time + 50L) / 100L, " input bytes per second.\n"); } } exit(IO_SUCCESS); } else { fprintf(stderr, "Error when compressing \"%s\" to \"%s\"\n", (infilename == NULL) ? "" : infilename, (outfilename == NULL) ? "" : outfilename); if (errno != 0) perror("lzcomp fatal error"); exit(IO_ERROR); } } divout(leader, numer, denom, trailer) char *leader; long numer; long denom; char *trailer; /* * Print numer/denom without floating point on small machines. */ { fprintf(stderr, "%s%ld.%02ld%s", leader, numer / denom, ((numer % denom) * 100L) / denom, trailer); } static initialize() /* * Mung some global values. */ { if (maxbits < INIT_BITS) /* maxbits is set by the -M */ maxbits = INIT_BITS; /* option. Make sure it's */ if (maxbits > BITS) /* within a reasonable range */ maxbits = BITS; maxmaxcode = 1 << maxbits; /* Truly biggest code */ if (export == 0) firstcode = LZ_FIRST; /* VMS private */ else if (block_compress) firstcode = LZ_CLEAR + 1; /* Default */ else firstcode = 256; /* Backwards compatible */ } put_magic_header() /* * Write the magic header bits. */ { #ifndef COMPATIBLE if (export && !noheader) { PUT(HEAD1_MAGIC, &outstream); PUT(HEAD2_MAGIC, &outstream); PUT(maxbits | ((block_compress) ? BLOCK_MASK : 0), &outstream); } #if VMS_V4 else if (export == 0) { char text[256]; /* * VMS private mode (with attribute block) */ PUT(HEAD1_MAGIC, &outstream); PUT(VMS_HEAD2_MAGIC, &outstream); PUT((char) (maxbits | BLOCK_MASK), &outstream); PUT(firstcode - 0x100, &outstream); init_compress(); outputcode(LZ_SOH); #if DEBUG if (strlen(ATT_NAME) != ATT_SIZE) { fprintf("\"%s\", expected %d, got %d\n", ATT_NAME, ATT_SIZE, strlen(ATT_NAME)); } #endif sprintf(text, "%s%d;", ATT_NAME, fdl_descriptor.dsc$w_length); mem_compress(text, strlen(text)); outputcode(LZ_EOR); mem_compress(fdl_descriptor.dsc$a_pointer, fdl_descriptor.dsc$w_length); fdl_free(&fdl_descriptor); outputcode(LZ_ETX); outputcode(LZ_STX); } #endif #endif } mem_compress(datum, length) char_type *datum; int length; /* * Compress from memory */ { mem_stream.bp = mem_stream.bstart = datum; mem_stream.bsize = length; mem_stream.bend = datum + length; mem_stream.func = lz_eof; compress(&mem_stream); } /* * This routine is used to tune the hash table size according to * the file size. If the filesize is unknown, fsize should be * set to zero. */ typedef struct TUNETAB { long fsize; code_int hsize; } TUNETAB; static readonly TUNETAB tunetab[] = { #if HSIZE > 5003 { 1 << 12, 5003 }, #endif #if HSIZE > 9001 { 1 << 13, 9001 }, #endif #if HSIZE > 18013 { 1 << 14, 18013 }, #endif #if HSIZE > 35023 { 1 << 15, 35023 }, { 47000, 50021 }, #endif { 0, 0 }, }; static gethashsize() /* * Tune the hash table parameters for small files. * We don't have a good way to find the file size on vms V3. * fsize is set to zero if we can't find it. */ { register TUNETAB *tunep; hsize = HSIZE; if (fsize > 0) { for (tunep = tunetab; tunep->fsize != 0; tunep++) { if (fsize < tunep->fsize) { hsize = tunep->hsize; break; } } } } static getfilesize() /* * Set fsize to the input filesize (in bytes) if possible. * Magic for all operating systems. */ { #ifdef rsx extern char f_efbk; /* F.EFBK -- highest block in file */ #define fdb(p,offset) (stdin->io_fdb[((int) &p + offset)] & 0xFF) #define efbk(offset) fdb(f_efbk, offset) extern char f_rtyp; /* F.RTYP -- Record type */ extern char f_ratt; /* F.RATT -- Record attributes */ /* * Note: Block number is stored high-order word first. */ fsize = efbk(2) + (efbk(3) << 8) + (efbk(0) << 16) + (efbk(1) << 24); fsize *= 512; #endif #ifdef rt11 fsize = stdin->io_size; /* Set by Decus C */ fsize *= 512; #endif #ifdef vms #if VMS_V4 struct stat statbuf; fsize = 0; if (export != 0) { if (fstat(fileno(stdin), &statbuf) == 0) fsize = (long) statbuf.st_size; } else { fsize = (long) fdl_fsize(fdl_input); } #else fsize = 0; /* Can't find filesize */ #endif #endif #ifdef unix struct stat statbuf; fsize = 0; if (fstat(fileno(stdin), &statbuf) == 0) fsize = (long) statbuf.st_size; #endif } static readonly char *helptext[] = { "The following options are valid:", "-B\tBinary file (important on VMS/RSX, ignored on Unix)", "-M val\tExplicitly set the maximum number of code bits", "-V val\tPrint status information (or debugging) to stderr", "-X val\tSet export (compatiblity) mode:", #if VMS_V4 " -X 0\tExplicitly choose VMS Private mode", #endif " -X 1\t(default if -X specified, output format is compatible", "\twith Unix compress V3.0", " -X 2\tCompatible with Unix compress 3.0, block compression", "\tsupressed.", #ifdef COMPATIBLE " -X 3No header (file is readable by old compress)", #endif NULL, }; static setup(argc, argv) int argc; char *argv[]; /* * Get parameters and open files. Exit fatally on errors. */ { register char *ap; register int c; char **hp; auto int i; int j; #ifdef vms argc = getredirection(argc, argv); #endif for (i = j = 1; i < argc; i++) { ap = argv[i]; if (*ap++ != '-' || *ap == EOS) /* Filename? */ argv[j++] = argv[i]; /* Just copy it */ else { while ((c = *ap++) != EOS) { if (islower(c)) c = toupper(c); switch (c) { case 'B': binary = TRUE; break; case 'M': maxbits = getvalue(ap, &i, argv); if (maxbits < MIN_BITS) { fprintf(stderr, "Illegal -M value\n"); goto usage; } break; case 'V': verbose = getvalue(ap, &i, argv); break; case 'X': export = getvalue(ap, &i, argv); if (export < 0 || export > 3) { fprintf(stderr, "Illegal -X value: %d\n", export); goto usage; } block_compress = "\1\1\0\0"[export]; noheader = "\0\0\0\1"[export]; export = "\0\1\1\1"[export]; break; default: fprintf(stderr, "Unknown option '%c' in \"%s\"\n", *ap, argv[i]); usage: for (hp = helptext; *hp != NULL; hp++) fprintf(stderr, "%s\n", *hp); FAIL("usage"); } /* Switch on options */ } /* Everything for -xxx */ } /* If -option */ } /* For all argc's */ /* infilename = NULL; */ /* Set "stdin" signal */ /* outfilename = NULL; */ /* Set "stdout" signal */ switch (j) { /* Any file arguments? */ case 3: /* both files given */ if (!streq(argv[2], "-")) /* But - means stdout */ outfilename = argv[2]; case 2: /* Input file given */ if (!streq(argv[1], "-")) { infilename = argv[1]; } break; case 0: /* None! */ case 1: /* No file arguments */ break; default: fprintf(stderr, "Too many file arguments\n"); FAIL("too many files"); } } static int getvalue(ap, ip, argv) register char *ap; int *ip; char *argv[]; /* * Compile a "value". We are currently scanning *ap, part of argv[*ip]. * The following are possible: * -x123 return (123) and set *ap to EOS so the caller * ap^ cycles to the next argument. * * -x 123 *ap == EOS and argv[*ip + 1][0] is a digit. * return (123) and increment *i to skip over the * next argument. * * -xy or -x y return(1), don't touch *ap or *ip. * * Note that the default for "flag option without value" is 1. This * can only cause a problem for the -M option where the value is * mandatory. However, the result of 1 is illegal as it is less * than INIT_BITS. */ { register int result; register int i; i = *ip + 1; if (isdigit(*ap)) { result = atoi(ap); *ap = EOS; } else if (*ap == EOS && argv[i] != NULL && isdigit(argv[i][0])) { result = atoi(argv[i]); *ip = i; } else { result = 1; } return (result); } openinput() { #ifdef decus if (infilename == NULL) { infilename = malloc(256 + 1); fgetname(stdin, infilename); infilename = realloc(infilename, strlen(infilename) + 1); } else { if (freopen(infilename, (binary) ? "rn" : "r", stdin) == NULL) { perror(infilename); FAIL("can't reopen input"); } } #else #ifdef vms #if VMS_V4 if (export == 0) { char *fname; char filename[256]; if ((fname = infilename) == NULL) { fgetname(stdin, filename); fname = filename; } if ((fdl_input = fdl_open(fname, &fdl_descriptor)) == NULL) { if ((fdl_status & 01) == 0) { fdl_message(NULL, fname); FAIL("can't fdl_open"); } fprintf(stderr, "Cannot open \"%s\" in vms private format,", fname); fprintf(stderr, " trying export format.\n"); export = TRUE; goto try_export; } fclose(stdin); stdin = NULL; infilename = malloc(256 + 1); infilename = realloc(fname, strlen(fname) + 1); if (verbose > 1) { fprintf(stderr, "FDL information for \"%s\"\n", filename); fdl_dump(&fdl_descriptor, stderr); } goto opened; } try_export: #endif if (infilename == NULL) { infilename = malloc(256 + 1); fgetname(stdin, infilename); infilename = realloc(infilename, strlen(infilename) + 1); } else { #if VMS_V4 if ((stdin = freopen(infilename, "r", stdin)) == NULL) { #else if (freopen(infilename, "r", stdin) == NULL) { #endif perror(infilename); exit(IO_ERROR); } } #else if (infilename == NULL) infilename = "stdin"; else { if (freopen(infilename, "r", stdin) == NULL) { perror(infilename); exit(IO_ERROR); } } #endif #endif opened: instream.bp = instream.bend = NULL; instream.bstart = inbuffer; instream.bsize = sizeof inbuffer; instream.func = lz_fill; } openoutput() /* * Open the output file (after the input file has been opened). * if outfilename == NULL, it's already open on stdout. */ { if (outfilename == NULL) { #if VMS_V4 #if 0 /* The following doesn't work */ outfilename = malloc(256 + 1); fgetname(stdout, outfilename); outfilename = realloc(outfilename, strlen(outfilename) + 1); if (export == 0) { fclose(stdout); stdout = NULL; /* Can't do terminal test below */ if ((fdl_output = fdl_create(NULL, outfilename)) == NULL) { if ((fdl_status & 01) == 0) fdl_message(NULL, outfilename); fprintf(stderr, "Can't create \"%s\"\n", outfilename); FAIL("can't fdl_create"); } } #else fprintf(stderr, "Restriction: The output file must be specified.\n"); FAIL("can't redirect on VMS V4"); #endif #else #ifdef vms outfilename = malloc(256 + 1); fgetname(stdout, outfilename); outfilename = realloc(outfilename, strlen(outfilename) + 1); #else #ifdef decus outfilename = malloc(256 + 1); fgetname(stdout, outfilename); outfilename = realloc(outfilename, strlen(outfilename) + 1); #else outfilename = ""; #endif #endif #endif } else { #if VMS_V4 if (export == 0) { fclose(stdout); stdout = NULL; /* Can't do terminal test below */ if ((fdl_output = fdl_create(NULL, outfilename)) == NULL) { if ((fdl_status & 01) == 0) fdl_message(NULL, outfilename); fprintf(stderr, "Can't create \"%s\" (VMS private)\n", outfilename); FAIL("can't fdl_create"); } } else { if (freopen(outfilename, "w", stdout) == NULL) { perror(outfilename); FAIL("can't create"); } } #else #ifdef decus if (freopen(outfilename, "wn", stdout) == NULL) { perror(outfilename); FAIL("can't create"); } #else if (freopen(outfilename, "w", stdout) == NULL) { perror(outfilename); FAIL("can't create"); } #endif #endif } if (stdout != NULL && isatty(fileno(stdout))) { fprintf(stderr, "%s: is a terminal. We object.\n", outfilename); FAIL("can't create"); } outstream.bp = outstream.bstart = outbuffer; outstream.bend = outbuffer + sizeof outbuffer; outstream.bsize = sizeof outbuffer; outstream.func = lz_flush; }