######## ################## ###### ###### ##### ##### #### #### ## ##### #### #### #### #### #### ##### ##### ## ## #### ## ## ## ### ## #### ## ## ## ##### ######## ## ## ## ##### ## ## ## ## ## ##### ## ## ######## ## ## ## ### ## ## #### ## ## ##### #### #### #### #### ##### #### #### #### #### #### ###### ##### ## ###### ###### Issue #17 ################## November 15, 1998 ######## ............................................................................... "The words of the wise are as goads, and as nails fastened by the masters of assemblies." Ecclesiastes 12 "Before criticizing a man I try to walk a mile in his shoes. That way, if he gets mad he's a mile away and barefoot." John Ianetta ............................................................................... BSOUT For me, fall is a time for reflection. The trees descend into their golden time and then seem to die. And yet, under the surface they are quite alive, and teeming with activity at a smaller, less-visible scale, waiting to burst forwards again in full bloom. I think there's a great metaphor for C= in this. But I have no idea what it is. In fact things are totally hectic around here, and I haven't given more than a few moments thought towards the 64, so this will be a mighty short editorial. Between a PhD thesis and begging for jobs there hasn't been much 64-time, but with a little here and a little there this issue is finally squeaking out. Everybody worked hard over the summer, and my goal was to get it out in September. Well, you know, these days, if something in the 64 world is only two months late it's doing pretty good, so no big whoop. C=Hacking ought to appear reguarly after December, though. On the down side, some things, such as the challenge problem from last time, will have to wait until the next issue. I also stayed pretty low-key while putting this issue together, but future issues will be more public in soliciting articles (e.g. on comp.sys.cbm). In other news, a C64 C-compiler has finally appeared! This outstanding effort is the work of Ullrich von Bassewitz, uz@musoftware.de, the force behind Elite128 among other things. The cc65 webpage is at http://www.von-bassewitz.de/uz/cc65/ so have a look, and tell Ullrich what a stellar guy he is :). Also, as most of you know, the Chicago Expo took place on October 24, and was a real hoot! Check out http://driven.c64.org/ for some nice pictures from the Expo, taken by Mark Seelye. Meanwhile, sit back, relax, and enjoy these latest musings from these, our masters of assembly. ....... .... .. . C=H #17 ::::::::::::::::::::::::::::::::::: Contents :::::::::::::::::::::::::::::::::: BSOUT o Voluminous ruminations from your unfettered editor. Jiffies o Is it a bug or a feature? The C=Hallenge o To be continued... Side Hacking o "SuperCPU Software Repair", by S. Judd . An amateur's excursion into correcting errant wares. Main Articles o "An Optimizing Hybrid LZ77 RLE Data Compression Program, aka Improving Compression Ratio for Low-Resource Decompression", by Pasi 'Albert' Ojala Part two of a two-part article on data compression, giving a detailed description of the compression algorithms used in pucrunch, not to mention the decompression code. o "VIC-20 Kernel ROM Disassembly Project", by Richard Cini This is the first in a series of articles which aims to present a complete, commented disassembly of the VIC-20 ROMs. o Masters Class: "NTSC/PAL fixing, part I", by Russell Reed , Robin Harbron , and S. Judd. Sit up straight and pay attention. In the Masters Class, a Commodore luminary attempts to instruct a couple of ignorant plebians in his art. In this case, Robin and I set out to learn NTSC/PAL fixing from one of the greats, Decomp/Style. Our first fix, a demo from the obscure Finnish group Pu-239, is included, along with detailed descriptions of our experiences. o "The Herd Mentality", by Bil Herd This is a collection of entertaining musings on Commodore and the development of the C128, as provided by Bil Herd (and that's no bull). If you don't know who Bil Herd _is_, why not type SYS 32800,123,45,6 on a 128 sometime... .................................. Credits ................................... Editor, The Big Kahuna, The Car'a'carn..... Stephen L. Judd C=Hacking logo by.......................... Mark Lawrence Special thanks to Marko Makela, Olaf Seibert, and the rest of the cbm-hackers for their many otherwise unacknowledged contributions. Legal disclaimer: 1) If you screw it up it's your own damn fault! 2) If you use someone's stuff without permission you're a dork! For information on the mailing list, ftp and web sites, send some email to chacking-info@jbrain.com. About the authors: Pasi 'Albert' Ojala is a 29 year old software engineer, currently working at a VLSI design company on a RISC DSP core C compiler. Around 1984 a friend introduced him to the VIC-20, and a couple of years later he bought a 64+1541 to replace a broken Spectrum48K. He began writing his own BBS, using ML routines for speed, and later wrote a series of demos under the Pu-239 label. In addition to pucrunch and his many C=Hacking articles, Pasi was written an Amiga 1581 filesystem, a graphics conversion package, a C64 burstloader, and a number of demos. He is currently uses his 64 for hobbyist pursuits, and is contemplating multipart demos for the 64 and the VIC-20, in addition to future C=Hacking articles. Pasi is also a huge Babylon-5 fan, and has a B5 quote page at http://www.cs.tut.fi/~albert/Quotes/B5-quotes.html Richard Cini is a 31 year old senior loan underwriter who first became involved with Commodore 8-bits in 1981, when his parents bought him a VIC-20 as a birthday present. Mostly he used it for general BASIC programming, with some ML later on, for projects such as controlling the lawn sprinkler system, and for a text-to-speech synthesyzer. All his CBM stuff is packed up right now, along with his other "classic" computers, including a PDP11/34 and a KIM-1. In addition to collecting old computers Richard enjoys gardening, golf, and recently has gotten interested in robotics. As to the C= community, he feels that it is unique in being fiercely loyal without being evangelical, unlike some other communities, while being extremely creative in making the best use out of the 64. Robin Harbron is a 26 year old internet tech support at a local independent phone company. He first got involved with C= 8-bits in 1980, playing with school PETs, and in 1983 his parents convinced him to spend the extra money on a C64 instead of getting a VIC-20. Like most of us he played a lot of games, typed in games out of magazines, and tried to write his own games. Now he writes demos, dabbles with Internet stuff, writes C= magazine articles, and, yes, plays games. He is currently working on a few demos and a few games, as well as the "in-progress-but-sometimes-stalled-for-a-real-long-time- until-inspiration-hits-again Internet stuff". He is also working on raising a family, and enjoys music (particularly playing bass and guitar), church, ice hockey and cricket, and classic video games. ................................... Jiffies .................................. 0 REM SHIFT-L by the cbm.hackers Everybody knows the old REM shift-L trick in BASIC 2.0, which generates a syntax error upon listing. But why does it work? The answer turns out to be quite interesting. Normally, when the BASIC interpreter tokenizes a line it strips out characters with the high bit set. One exception is characters within quotes; the other is characters after REM. In those cases, characters are embedded literally into the program line. Now, BASIC tokens all have the high bit set. When LIST encounters a character with the high bit set, it checks whether it is in quote mode. If it is, the character is outputted as normal. If not, the character is treated as a token, which is expanded by QPLOP (located at $A717). The part of QPLOP which prints keywords looks like this: :LOOP1 DEX ;Traverse the keyword table BEQ :PLOOP :LOOP2 INY ;read a keyword LDA RESLST,Y BPL :LOOP2 BMI :LOOP1 :PLOOP INY ;Print out the keyword LDA RESLST,Y BMI LISTENT1 ;Exit if on last char JSR $AB47 ;Print char, AND #$FF BNE :PLOOP The keyword strings in RESLST all stored dextral character inverted (the last char has the high bit set), and the above code just moves forward through the table until it has counted out .X keywords. At that point, :PLOOP prints out the next keyword to the screen, and hops back into LIST. Shift-L is character code 204, or $CC. When LIST sees this inside of a REM, it sees the high bit set and de-tokenizes it. As it turns out, though, the last token is $CB, which is keyword GO (so that GO TO works). It also turns out that RESLST, the list of BASIC keywords, uses 255 characters. The 256th character is zero (value zero, not character zero). So, the above code goes through the list, and then prints out token $CC, the first character of which is a null. This zero is sent to $AB47. $AB47 sends it to JSR $FFD2 (which does nothing with the character), performs an AND #$FF, and exits. But that makes the BNE :PLOOP branch _not_ get taken, and the code erroneously moves forwards into... the code which executes the FOR statement! And the first thing FOR does is look for a valid variable. When you LIST a program, the character immediately following the LIST is a statement terminator (a colon : or else an end of line null). LET (which is called by FOR) reads this character, decides it's an invalid variable name, and generates a ?SYNTAX ERROR. Because QPLOP uses absolute addressing (LDA RESLST,Y), .Y can wrap around through 255 to traverse the list again. This is why shift-M shift-N etc. all generate valid keywords. Only shift-L will force an error, and it is all due to the zero in the keyword table. Similar things can happen in other BASICs. In BASIC 4.0, token $DB does the trick. In BASIC 1.0, $CB ought to do it. The problem seems to have been fixed in BASIC 7.0; at least the trick doesn't seem to work on a 128. Finally, like most things on the 64, embedding tokens into REM statements can naturally be put to some creative use. For example, RUN once ran a contest where readers submitted stories and riddles and such, which were read by LISTing the program. They were very clever and entertaining, and I close this summary with the one I've remembered since high school: 10 REM WHAT'S AN APPLECOSTA? 20 REM {C=-V}T A {INST CTRL-0}E ............................... The C=Hallenge ............................... Wait until next time! ................................ Side Hacking ................................ SuperCPU software repair ---------------------------> by S. L. Judd One of the feature articles in this issue deals with NTSC/PAL fixing. But have you ever thought about SCPU fixing? You know how it goes: you have that program that could really benefit from the speed boost, but doesn't work, and usually because of some silly little thing. Well, it really bugs me to have programs not like my nice hardware for dumb reasons, so I decided I would try my hand at fixing up some programs. The one that really did it for me was the game "Stunt Car Racer" -- I had never played it before, but after getting ahold of it it was clear that here was a game that would be just great with a SuperCPU. I had never done something like this before, but it seemed a doable problem and so I jumped in head first, and this article sums up my inexpert experience to date. By the way, stuntcar-scpu is totally cool :). To date I have fixed up just three games: Stunt Car Racer, Rescue on Fractalus, and Stellar 7. My goal was really to "CMD-fix" these programs, to make them run off of my FD-2000 as well as my SCPU. Although these are all games, the techniques should apply equally well to application programs with a bad attitude. Before discussing the fixes, it is probably worthwhile to discuss a few generalities. I also note that programmers who don't have a SuperCPU might find some of this information helpful in designing their programs to work with SCPUs. Finally, my fixes are available in the Fridge. Tools and Process ----------------- The tools I used were: o Action Replay o Wits o Paper for taking notes (backs of receipts/envelopes work) I think this is all that is necessary, although a good sector editor can come in handy for certain things. After trying a number of different approaches to the problem, the process I've settled on goes roughly like the following: - Have an idea of what will need fixing - Familiarize yourself with the program - Track down the things that need fixing - Figure out free areas of memory - Apply patches, and test Most programs work in more or less the same way: there are some initialization routines, there's a main loop, and there's an interrupt routine or series of routines. The interrupts are easy to find, via the vectors at either $FFFA or at $0314 and friends. The initialization routine can be tougher, but can be deduced from the loader or decompressor; also, some programs point the NMI vector to the reset code, so that RESTORE restarts the program. Finding the things that need fixing usually involves freezing the program at the appropriate time, and doing a little disassembly. Sometimes a hunt for things like LDA $DC01 is helpful, too. Figuring out free areas of memory is easy, by either getting a good feel for the program, or filling some target memory with a fill byte and then checking it later, to see if it was overwritten. Once the patch works on the 64, all that remains is to test it on the SCPU, and it's all done! Diagnosis --------- It seems to me that, at the fundamental level, the SCPU is different from a stock machine in three basic ways: it is a 65816, it runs at 20MHz, and it has hardware registers/different configurations. There are also some strange and mysterious problems that can arise. All possible opcodes are defined on the 65816, which means that "illegal" or quasi-opcodes will not work correctly. On the 65xx chips, the quasi-opcodes aren't real opcodes -- they are like little holes in the cpu, and things going through those holes fall through different parts of the normal opcode circuitry. Although used by very few programs, a number of copy protection schemes make use of them, so sometimes the program works fine with a SCPU but the copy protection makes it choke -- how very annoying (example: EA's Lords of Conquest). Naturally, disk-based protection methods mean it won't work on an FD-2000, either. Running at 20Mhz makes all sorts of problems. Any kind of software loop will run too fast -- delay loops, countdown loops, input busy-loops, etc. Also main program loops, so that the game runs unplayably fast (most 3D games never had to worry about being too fast). It can also lead to flickering screens, as we shall see later, and the "play" of some games is designed with 1Mhz in mind -- velocities, accelerations, etc. What looks smooth at the low frame rate might look poor at the high, as we shall also see later. Finally, fastloaders invariably fail at 20Mhz, like any other code using software-based timing. The SuperCPU also has a series of configuration registers located at $D07x and $D0Bx, which determine things like software speed and VIC optimization modes (which areas of memory are mirrored/copied to the C64's RAM). Note also that enabling hardware registers rearranges $E000 ROM routines. Although it is possible for programs to accidentally reconfigure the SCPU, it is awfully unlikely, since the enable register, which switches the hardware registers in, is sandwiched between disable registers: $D07D Hardware register disable $D07E Enable hardware registers $D07F Hardware register disable Strangely enough, though, different hardware configurations can sometimes cause problems. For example, newer (v2) SCPUs allow selective mirroring of the stack and zero page, and by default have that mirroring turned OFF. For some totally unknown reason, this caused major problems with an early attempt of mine to fix Stunt Car Racer -- I am told that the old version would slow down to just double-speed, flicker terribly, and more. Turning mirroring back on apparently fixes up the problem. (I have an older SCPU, and hence did not have this problem). So before going after a big fix, it is worthwhile to invest a few minutes in trying different configurations. Finally, there are other strange problems that can arise. For example, I have two 128s: one is a flat 128, one a 128D. With my 128D, if $D030 is set then the SCPU sometimes -- but not always -- freaks out and locks up. The flat 128 does not have this problem. One reason this is important is that many decompressors INC $D030 to enable 2MHz mode. A simple BIT ($2C) fixes this problem up, but the point is that the SCPU has to interact with the computer, so perhaps that interaction can lead to problems in obscure cases. Now, if the goal is to CMD-fix the program, there may be a few disk-related things that may need fixing. In addition to stripping out (or possibly fixing up) any fastloaders, most programs annoyingly assume drive #8 is the only drive in town. Also, if the program uses a track-based loader (instead of a file-based loader), then that will need to fixed up as well, and any disk-based copy protection will have to be removed. There's one other thing to consider, before you fix: is the program really busted? For example, if you've tried a chess program with the SCPU, chances are that you saw no speed improvement. Why not? It turns out that most chess programs use a timer-based search algorithm -- changing the playing strength changes the amount of time the program spends searching, and not the depth of the search. (The reason is to make the gameplay flow a little better -- otherwise you have very slow play at the beginning, when there are many more moves to consider). So although it might look like it isn't working right with the SCPU, it is actually working quite well. And that pretty much covers the basic ideas. The first program I fixed up was Stunt Car Racer. Stunt Car Racer --------------- Stunt Car Racer, in case you don't know, is a 3D driving game, and quite fun. It is also too fast, unplayably fast, at 20MHz. Therefore, it needs to be slowed down! My first plan... well, suffice to say that most of my original plans were doomed to failure, either from being a bad idea, or from poor implementation. It is clear enough that some sort of delay is needed, though, in the main loop, or perhaps by intercepting the joystick reading routine. The program has a main loop and an interrupt loop as well. The interrupt handles the display and other things, and all of the game calculations are done in the main loop, which flows like Do some calculations Draw buffer 1 Swap buffers Do some calculations Draw buffer 2 Swap buffers JMP loop One of my first thoughts was to intercept the joystick I/O, which is easy to find by hunting for LDA $DC01 (or DC00, whichever joystick is used). The patch failed, and possibly because I didn't check that the memory was safe, and possibly because it was in the interrupt routine (I simply don't remember). Before patching, it is very important to make sure that the patch will survive, and not interfere with the program, so it is very important to find an area of memory that is not used by the program. It took me a little while to figure this out! Finding unused memory was pretty easy -- I just filled the suspect areas with a fill byte, ran the program, and checked that memory. Mapping out the memory areas also aids in saving the file, as un-needed areas don't need to be saved, or can be cleared out to aid compression. The first free area of memory I found was at $C000. It turns out that this is a sprite, though, and so put some garbage on the screen. The second I tried was $8000, which worked great in practice mode but got overwritten in competition mode -- always test your patches thoroughly! (I had only tested in practice mode). Finally, I found a few little spots in low memory that survived, and placed the patch there. The program does a whole lot of memory moving, and uses nearly all memory. I also left some initialization code at $8000, since it only needed to be run once, at the beginning (to turn on mirroring in v2 SCPUs). Recall that the main loop has two parts -- one for buffer 1, and one for buffer 2. The trick is to find some code that is common to both sections, like a subroutine call: JSR SomeRoutine Draw buffer 1 JSR SomeRoutine Draw buffer 2 The patch routine I used was a simple delay loop, redirected from those two JSRs: LDX #$FF CPX $D012 BNE *-5 DEX CPX #$FC BNE *-10 JMP SomeRoutine Of course, this will also slow the program down at 1Mhz; later on I became smarter about my patches, but this one works pretty well. To save the game and patches, I simply froze it from AR. Just saving from the monitor generally failed; the initialization routine doesn't initialize all I/O settings. Part of the freezing process involves RLE compression, so if you freeze it is a good idea to fill all unused portions of memory -- temporary areas, bitmaps, etc. Another thing to do is to set a freeze point at the init routine, and then JMP there from the monitor. By clearing the screen, you won't have to look at all the usual freezer ugliness, and at this point freezing isn't any different than saving from the ML monitor and RLE-packing the file. Once saved, I tested a few times from the 64 side, to make sure things worked right. Whether freezing or saving from the monitor, if the file size is larger than 202 blocks or so, it can't be loaded on the SCPU without a special loader -- unless you compress it first. I naturally recommend using pu-crunch for that purpose, but if you want to do it on the 64 then I recommend using ABCrunch, which works well with the SCPU and gives about as good compression as you can get without an REU. The result was stuntcar-scpu, which is *awfully* fun when fixed. Rescue on Fractalus ------------------- Next on my list was Rescue on Fractalus, an older (and quite cool) Lucasfilm game that just didn't cut it in the 64 conversion, for a number of reasons (that perhaps could have been avoided). There are at least two versions of the game, one of which doesn't even work on a 128 (good 'ol $D030), but I have the older version, which does work. With a SuperCPU, though, there are a number of problems. The display flickers terribly. The gameplay is smooth and not at all too fast -- in fact, it is too slow. Specifically, the velocities and turning rates and such do not give a convincing illusion of speed or excitement. The game is copy- protected and uses a track-based fastloader, loaded from disk via B-E, which also saves the high scores to disk. Clearly, this one is a bigger job: the display is too fast, the game constants need adjusting, and the highscore code needs to be replaced by some kernal calls. The structure of this code is a little different. The main loop handles the (double-buffered) display -- it does all the calculations and draws to the two buffers. The multi-part interrupt loop does the rest -- it swaps buffers, changes the display in different parts of the screen, reads the joystick, and performs the game updates which change your position and direction. It also handles enemies such as saucers, but doesn't handle the bunkers which fire at you from the mountains (the main loop takes care of those). What does all this mean? First, that the game can be a good ten steps ahead of the screen, which makes things like targeting very difficult. Second, that the bunkers almost never fire at you at 1MHz (they go crazy at 20). Third, that things like velocity and turning rate are rather low, because advancing or turning too quickly would get the game way out of sync (unfortunately, they are still too fast for 1MHz, making targeting difficult and movement clunky). On the other hand, having the movement in the interrupt is the reason that the game does not become unplayably fast at 20MHz, and means that something besides a delay loop is needed. The interrupt swaps buffers, but the main loop draws them, and because it draws so quickly it can start clearing and drawing to the visible buffer. To make sure this was what I was seeing, I reversed the buffer swap code in the interrupt, so that the drawing buffer was always on-screen. Sure enough, that's what the 20Mhz version looked like. It turned out to be pretty easy to force the main loop to wait on the interrupt. Although I messed around (unsuccessfully) with intercepting the interrupt loop, the buffer swap code actually modifies a zero-page variable upon swapping. So all the main loop has to do is wait on that variable before charging ahead. I may have made it wait for two frames, because it made the game play a little better. Now, how to find the velocity and turn code? Well it takes a keypress to change the velocity, so by hunting for LDA $DC01, and tracing back, the routine can be found; at the very least the affected variables may be found, and hunted for. For example, if the result is stored in $D0, then you can search for LDA $D0. The point is to locate the keypress processing code. From there, a little trial and error (setting freeze points and pressing the velocity key) locates the piece of code which deals with changing the velocity, and in particular which variable corresponds to velocity. Finally, from there it just takes another hunt for LDA velocity, ADC velocity, etc. to figure out where the code for updating position and direction is. In this case, I was pretty sure I had found it, as it went something like LDA velocity LSR LSR ADC #$20 and this was added to the position. To check that this was the code, I just changed the ADC, or removed an LSR, to see that the speed changed. The code for turning left and right and moving up and down was similar, and again after a little trial and error it was clear what code did what. Again, it wasn't necessary to totally understand how these routines worked exactly -- just the general idea of them, in this case to see that a multiple of the velocity was used to change the position and orientation of the player. So, to fix it up, I just changed that multiple -- probably I NOPed out an LSR above, to basically double the speed, and changed the turning rates similarly. This took a little experimentation, as it not only needed to be playably fast, but also couldn't overflow at high speeds, etc. But once that was working, all that remained was the highscore table. Finding the table location was pretty easy -- I just got a high score, and while entering my name froze the program, and figured out what got stored where. From there it was pretty easy to figure out what was saved to disk. From the original loader, I also knew where the highscores needed to be loaded to initially (the highscore table gets copied around a lot -- it doesn't just stay at that one location). Figuring out the exact number of bytes to save took a little bit of effort (saving either too many or too few bytes screws it up), but from there it was clear what memory needed to be saved. So all that remained was to add the usual SETLFS etc. kernal calls, right? Wrong. The program uses all the usual kernal variables (from $90-$C0) for its own purposes. Also recall that I wanted the program to work with device 9, etc. To get around this, I did two things. First, when the program first starts, I save some of the relevant variables to an unused part of memory -- in particular, I save the current drive number. Second, before saving the highscore file, I actually copy all zero page variables from $90-$C2 or so to a temporary location, and then copy them back after saving. That way there are no worries about altering important locations. Finding memory for the load/save patch was easy -- I just used the area which was previously used for the fastload load/save code. There was enough for the necessary code as well as temporary space for saving the zero page variables. Finally, I changed some text from Rescue on Fractalus to Behind Jaggi Lines, to distinguish it from the original, and that was that. Works great! And is now more playable and challenging; in short, more the game it always should have been. Stellar 7 --------- Finally, I tried my hand at Stellar 7. Stellar 7 had several problems. At the main screen, a delay loop tosses you to the mission screen after a while, if no keys are pressed. This is a software loop, and so passes very quickly. The game itself is too fast, so some sort of delay is needed. The mission display is also too fast, and has software delay loops, so that needs fixing. Finally, the game uses kernal calls for loading and saving, but is attached to drive #8; also, my version was split into a bunch of files, and I wanted to cut the number of files down. Well, by this time it was all pretty straightforward. From the loader, it was easy to figure out which files went where. The mission and main displays were loaded in when needed, and swapped into unused parts of memory when not, so I loaded them in and adjusted the swap variable accordingly -- this left just the highscore and seven level files. Finding the delay loops was easy -- I just went to the relevant sections of code, froze, and took a look at the loops. There were your basic :LOOP LDA $D4 ;Check for keypress BMI :key DEX BNE :LOOP DEY BNE :LOOP DEC counter BNE :LOOP :key LDX #$00 ... Luckily, all routines were pretty much the same as the above. The interrupt routine is in the $0314 vector, and the same routine is used during gameplay. So the patch is very easy at this point. First, change the IRQ code which does a JMP $EA7B to JMP $CE00 . CE00 $EE INC $CFFF . CE03 $4C JMP $EA7B To fix up the keypress routines, the idea is to change the LDA $D0 into a JSR patch. How to substitute 3 bytes for 2 bytes? The trick is to place the LDX #$00 into the patch routine: . CE06 $20 JSR $CE15 ;Wait for $CFFF . CE09 $A5 LDA $D4 . CE0B $10 BPL $CE11 . CE0D $A2 LDX #$00 ;If key pressed, then LDX #$00 . CE0F $29 AND #$FF . CE11 $60 RTS The actual delay is accomplished by waiting on $CFFF: . CE15 $AD LDA $CFFF . CE18 $C9 CMP #$04 . CE1A $90 BCC $CE15 . CE1C $A9 LDA #$00 . CE1E $8D STA $CFFF . CE21 $60 RTS As you can see, I waited a (default) of 4 frames. The patch in the game/mission rendering routine works similarly -- I just patched the rendering code to basically JSR $CE15. I also decided to try something new: let the user be able to change that CMP #$04 to make things faster or slower, to suit their tastes. The keyscan values were pretty easy to figure out, so this just required a little patch to check for the "+" and "-" keys, and change location $CE19 accordingly. Well, that about sums it up. Perhaps if you do some fixing, you might send me a little email describing your own experiences? ....... .... .. . C=H #17 :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: An Optimizing Hybrid LZ77 RLE Data Compression Program, aka Improving Compression Ratio for Low-Resource Decompression =========================================================== by Pasi Ojala Short: Pucrunch is a Hybrid LZ77 and RLE compressor, uses an Elias Gamma Code for lengths, mixture of Gamma Code and linear for LZ77 offset, and ranked RLE bytes indexed by the same Gamma Code. Uses no extra memory in decompression. -------------------------------------------------------------------------- Introduction ------------ Since I started writing demos for the C64 in 1989 I have always wanted to program a compression program. I had a lot of ideas but never had the time, urge or knowledge to create one. In retrospect, most of the ideas I had then were simply bogus ("magic function theory" as Mark Nelson nicely puts it). But years passed, I gathered more knowledge and finally got an irresistible urge to finally realize my dream. The nice thing about the delay is that I don't need to write the actual compression program to run on a C64 anymore. I can write it in portable ANSI-C code and just program it to create files that would uncompress themselves when run on a C64. Running the compression program outside of the target system provides at least the following advantages. * I can use portable ANSI-C code. The compression program can be compiled to run on a Unix box, Amiga, PC etc. And I have all the tools to debug the program and gather profiling information to see why it is so slow :-) * The program runs much faster than on C64. If it is still slow, there is always multitasking to allow me to do something else while I'm compressing a file. * There is 'enough' memory available. I can use all the memory I possibly need and use every trick possible to increase the compression ratio as long as the decompression remains possible on a C64. * Large files can be compressed as easily as shorter files. Most C64 compressors can't handle files larger than around 52-54 kilobytes (210-220 disk blocks). * Cross-development is easier because you don't have to transfer a file into C64 just to compress it. Memory Refresh and Terms for Compression ---------------------------------------- Statistical compression Uses the uneven probability distribution of the source symbols to shorten the average code length. Huffman code and arithmetic code belong to this group. By giving a short code to symbols occurring most often, the number of bits needed to represent the symbols decreases. Think of the Morse code for example: the characters you need more often have shorter codes and it takes less time to send the message. Dictionary compression Replaces repeating strings in the source with shorter representations. These may be indices to an actual dictionary (Lempel-Ziv 78) or pointers to previous occurrences (Lempel-Ziv 77). As long as it takes fewer bits to represent the reference than the string itself, we get compression. LZ78 is a lot like the way BASIC substitutes tokens for keywords: one-byte tokens expand to whole words like PRINT#. LZ77 replaces repeated strings with (length,offset) pairs, thus the string VICIICI can be encoded as VICI(3,3) -- the repeated occurrence of the string ICI is replaced by a reference. Run-length encoding Replaces repeating symbols with a single occurrence of the symbol and a repeat count. For example assembly compilers have a .zero keyword or equivalent to fill a number of bytes with zero without needing to list them all in the source code. Variable-length code Any code where the length of the code is not explicitly known but changes depending on the bit values. Some kind of end marker or length count must be provided to make a code a prefix code (uniquely decodable). Compare with ASCII (or Latin-1) text, where you know you get the next letter by reading a full byte from the input. A variable-length code requires you to read part of the data to know how many bits to read next. Universal codes Universal codes are used to encode integer numbers without the need to know the maximum value. Smaller integer values usually get shorter codes. Different universal codes are optimal for different distributions of the values. Universal codes include Elias Gamma and Delta codes, Fibonacci code, and Golomb and Rice codes. Lossless compression Lossless compression algorithms are able to exactly reproduce the original contents unlike lossy compression, which omits details that are not important or perceivable by human sensory system. This article only talks about lossless compression. My goal in the pucrunch project was to create a compression system in which the decompressor would use minimal resources (both memory and processing power) and still have the best possible compression ratio. A nice bonus would be if it outperformed every other compression program available. These understandingly opposite requirements (minimal resources and good compression ratio) rule out most of the state-of-the-art compression algorithms and in effect only leave RLE and LZ77 to be considered. Another goal was to learn something about data compression and that goal at least has been satisfied. I started by developing a byte-aligned LZ77+RLE compressor/decompressor and then added a Huffman backend to it. The Huffman tree took 384 bytes and the code that decoded the tree into an internal representation took 100 bytes. I found out that while the Huffman code gained about 8% in my 40-kilobyte test files, the gain was reduced to only about 3% after accounting the extra code and the Huffman tree. Then I started a more detailed analysis of the LZ77 offset and length values and the RLE values and concluded that I would get better compression by using a variable-length code. I used a simple variable-length code and scratched the Huffman backend code, as it didn't increase the compression ratio anymore. This version became pucrunch. Pucrunch does not use byte-aligned data, and is a bit slower than the byte-aligned version because of this, but is much faster than the original version with the Huffman backend attached. And pucrunch still does very well compression-wise. In fact, it does very well indeed, beating even LhA, Zip, and GZip in some cases. But let's not get too much ahead of ourselves. To get an improvement to the compression ratio for LZ77, we have only a few options left. We can improve on the encoding of literal bytes (bytes that are not compressed), we can reduce the number of literal bytes we need to encode, and shorten the encoding of RLE and LZ77. In the algorithm presented here all these improvement areas are addressed both collectively (one change affects more than one area) and one at a time. 1. By using a variable-length code we can gain compression for even 2-byte LZ77 matches, which in turn reduces the number of literal bytes we need to encode. Most LZ77-variants require 3-byte matches to get any compression because they use so many bits to identify the length and offset values, thus making the code longer than the original bytes would've taken. 2. By using a new literal byte tagging system which distinguishes uncompressed and compressed data efficiently we can reduce number of extra bits needed to make this distinction (the encoding overhead for literal bytes). This is especially important for files that do not compress well. 3. By using RLE in addition to LZ77 we can shorten the encoding for long byte run sequences and at the same time set a convenient upper limit to LZ77 match length. The upper limit performs two functions: + we only need to encode integers in a specific range + we only need to search strings shorter than this limit (if we find a string long enough, we can stop there) Short byte runs are compressed either using RLE or LZ77, whichever gets the best results. 4. By doing statistical compression (more frequent symbols get shorter representations) on the RLE bytes (in this case symbol ranking) we can gain compression for even 2-byte run lengths, which in turn reduces the number of literal bytes we need to encode. 5. By carefully selecting which string matches and/or run lengths to use we can take advantage of the variable-length code. It may be advantageous to compress a string as two shorter matches instead of one long match and a bunch of literal bytes, and it can be better to compress a string as a literal byte and a long match instead of two shorter matches. This document consists of several parts, which are: * C64 Considerations - Some words about the target system * Escape codes - A new tagging system for literal bytes * File format - What are the primaries that are output * Graph search - How to squeeze every byte out of this method * String match - An evolution of how to speed up the LZ77 search * Some results on the target system files * Results on the Calgary Corpus Test Suite * The Decompression routine - 6510 code with commentary -------------------------------------------------------------------------- Commodore 64 Considerations --------------------------- Our target environment (Commodore 64) imposes some restrictions which we have to take into consideration when designing the ideal compression system. A system with a 1-MHz 3-register 8-bit processor and 64 kilobytes of memory certainly imposes a great challenge, and thus also a great sense of achievement for good results. First, we would like it to be able to decompress as big a program as possible. This in turn requires that the decompression code is located in low memory (most programs that we want to compress start at address 2049) and is as short as possible. Also, the decompression code must not use any extra memory or only very small amounts of it. Extra care must be taken to make certain that the compressed data is not overwritten during the decompression before it has been read. Secondly, my number one personal requirement is that the basic end address must be correctly set by the decompressor so that the program can be optionally saved in uncompressed form after decompression (although the current decompression code requires that you say "clr" before saving). This also requires that the decompression code is system-friendly, i.e. does not change KERNAL or BASIC variables or other structures. Also, the decompressor shouldn't rely on file size or load end address pointers, because these may be corrupted by e.g. X-modem file transfer protocol (padding bytes may be added). When these requirements are combined, there is not much selection in where in the memory we can put the decompression code. There are some locations among the first 256 addresses (zeropage) that can be used, the (currently) unused part of the processor stack (0x100..0x1ff), the system input buffer (0x200..0x258) and the tape I/O buffer plus some unused bytes (0x334-0x3ff). The screen memory (0x400..0x7ff) can also be used if necessary. If we can do without the screen memory and the tape buffer, we can potentially decompress files that are located from 0x258 to 0xffff. The third major requirement is that the decompression should be relatively fast. After 10 seconds the user begins to wonder if the program has crashed or if it is doing anything, even if there is some feedback like border color flashing. This means that the arithmetic used should be mostly 8- or 9-bit (instead of full 16 bits) and there should be very little of it per each decompressed byte. Processor- and memory-intensive algorithms like arithmetic coding and prediction by partial matching (PPM) are pretty much out of the question, and that is saying it mildly. LZ77 seems the only practical alternative. Still, run-length encoding handles long byte runs better than LZ77 and can have a bigger length limit. If we can easily incorporate RLE and LZ77 into the same algorithm, we should get the best features from both. A part of the decompressor efficiency depends on the format of the compressed data. Byte-aligned codes, where everything is aligned into byte boundaries, can be accessed very quickly; non-byte-aligned variable length codes are much slower to handle, but provide better compression. Note that byte-aligned codes can still have other data sizes than 8. For example you can use 4 bits for LZ77 length and 12 bits for LZ77 offset, which preserves the byte alignment. -------------------------------------------------------------------------- The New Tagging System ---------------------- I call the different types of information my compression algorithm outputs primaries. The primaries in this compression algorithm are: * literal (uncompressed) bytes and escape sequences, * LZ77 (length,offset)-pairs, * RLE (length,byte)-pairs, and * EOF (end of file marker). Literal bytes are those bytes that cannot be represented by shorter codes, unlike a part of previously seen data (LZ77), or a part of a longer sequence of the same byte (RLE). Most compression programs handle the selection between compressed data and literal bytes in a straightforward way by using a prefix bit. If the bit is 0, the following data is a literal byte (uncompressed). If the bit is 1, the following data is compressed. However, this presents the problem that non-compressible data will be expanded from the original 8 bits to 9 bits per byte, i.e. by 12.5 %. If the data isn't very compressible, this overhead consumes all the little savings you may have had using LZ77 or RLE. Some other data compression algorithms use a value (using variable-length code) that indicates the number of literal bytes that follow, but this is really analogous to a prefix bit, because 1-byte uncompressed data is very common for modestly compressible files. So, using a prefix bit may seem like a good idea, but we may be able to do even better. Let's see what we can come up with. My idea was to somehow use the data itself to mark compressed and uncompressed data and thus not need any prefix bits. Let's assume that 75% of the symbols generated are literal bytes. In this case it seems viable to allocate shorter codes for literal bytes, because they are more common than compressed data. This distribution (75% are literal bytes) suggests that we should use 2 bits to determine whether the data is compressed or a literal byte. One of the combinations indicates compressed data, and three of the combinations indicate a literal byte. At the same time those three values divide the literal bytes into three distinct groups. But how do we make the connection between which of the three bit patters we have and what are the literal byte values? The simplest way is to use a direct mapping. We use two bits (let them be the two most-significant bits) _from the literal bytes themselves_ to indicate compressed data. This way no actual prefix bits are needed. We maintain an escape code (which doesn't need to be static), which is compared to the bits, and if they match, compressed data follows. If the bits do not match, the rest of the literal byte follows. In this way the literal bytes do not expand at all if their most significant bits do not match the escape code, and fewer bits are needed to represent the literal bytes. Whenever those bits in a literal byte would match the escape code, an escape sequence is generated. Otherwise we could not represent those literal bytes which actually start like the escape code (the top bits match). This escape sequence contains the offending data and a new escape code. This escape sequence looks like # of escape bits (escape code) 3 (escape mode select) # of escape bits (new escape bits) 8-# of escape bits (rest of the byte) = 8 + 3 + # of escape bits = 13 for 2-bit escapes, i.e. expands the literal byte by 5 bits. Read further to see how we can take advantage of the changing escape code. You may also remember that in the run-length encoding presented in the previous article two successive equal bytes are used to indicate compressed data (escape condition) and all other bytes are literal bytes. A similar technique is used in some C64 packers (RLE) and crunchers (LZ77), the only difference is that the escape condition is indicated by a fixed byte value. My tag system is in fact an extension to this. Instead of a full byte, I use only a few bits. We assumed an even distribution of the values and two escape bits, so 1/4 of the values have the same two most significant bits as the escape code. I call this probability that a literal byte has to be escaped the "hit rate". Thus, literal bytes expand in average 25% of the time by 5 bits, making the average length 25% * 13 + 75% * 8 = 9.25. Not surprising, this is longer than using one bit to tag the literal bytes. However, there is one thing we haven't considered yet. The escape sequence has the possibility to change the escape code. Using this feature to its optimum (escape optimization), the average 25% hit rate becomes the -maximum- hit rate. Also, because the distribution of the literal byte values is seldom flat (some values are more common than others) and there is locality (different parts of the file only contain some of the possible values), from which we can also benefit, the actual hit rate is always much smaller than that. Empirical studies on some test files show that for 2-bit escape codes the actual realized hit rate is only 1.8-6.4%, while the theoretical maximum is the already mentioned 25%. Previously we assumed the distribution of 75% of literal bytes and 25% of compressed data (other primaries). This prompted us to select 2 escape bits. For other distributions (differently compressible files, not necessarily better or worse) some other number of escape bits may be more suitable. The compressor tries different number of escape bits and select the value which gives the best overall results. The following table summarizes the hit rates on the test files for different number of escape bits. 1-bit 2-bit 3-bit 4-bit File 50.0% 25.0% 12.5% 6.250% Maximum 25.3% 2.5% 0.3% 0.090% ivanova.bin 26.5% 2.4% 0.8% 0.063% sheridan.bin 20.7% 1.8% 0.2% 0.041% delenn.bin 26.5% 6.4% 2.5% 0.712% bs.bin 9.06 8.32 8.15 8.050 bits/Byte for bs.bin As can be seen from the table, the realized hit rates are dramatically smaller than the theoretical maximum values. A thought might occur that we should always select 4-bit (or longer) escapes, because it reduces the hit rate and presents the minimum overhead for literal bytes. Unfortunately increasing the number of escape bits also increases the code length of the compressed data. So, it is a matter of finding the optimum setting. If there are very few literal bytes compared to other primaries, 1-bit escape or no escape at all gives very short codes to compressed data, but causes more literal bytes to be escaped, which means 4 bits extra for each escaped byte (with 1-bit escapes). If the majority of primaries are literal bytes, for example a 6-bit escape code causes most of the literal bytes to be output as 8-bit codes (no expansion), but makes the other primaries 6 bits longer. Currently the compressor automatically selects the best number of escape bits, but this can be overridden by the user with the -e option. The cases in the example with 1-bit escape code validates the original suggestion: use a prefix bit. A simple prefix bit would produce better results on three of the previous test files (although only slightly). For delenn.bin (1 vs 0.828) the escape system works better. On the other hand, 1-bit escape code is not selected for any of the files, because 2-bit escape gives better overall results. -Note:- for 7-bit ASCII text files, where the top bit is always 0 (like most of the Calgary Corpus files), the hit rate is 0% for even 1-bit escapes. Thus, literal bytes do not expand at all. This is equivalent to using a prefix bit and 7-bit literals, but does not need separate algorithm to detect and handle 7-bit literals. For Calgary Corpus files the number of tag bits per primary (counting the escape sequences and other overhead) ranges from as low as 0.46 (book1) to 1.07 (geo) and 1.09 (pic). These two files (geo and pic) are the only ones in the suite where a simple prefix bit would be better than the escape system. The average is 0.74 tag bits per primary. In Canterbury Corpus the tag bits per primary ranges from 0.44 (plrabn12.txt) to 1.09 (ptt5), which is the only one above 0.85 (sum). The average is 0.61 tag bits per primary. -------------------------------------------------------------------------- Primaries Used for Compression ------------------------------ The compressor uses the previously described escape-bit system while generating its output. I call the different groups of bits that are generated primaries, whether it is the correct term or not. You are welcome to suggest a better term for them. The primaries in this compression algorithm are: literal byte (and escape sequence), LZ77 (length,offset)-pair, RLE (length, byte)-pair, and EOF (end of file marker). If the top bits of a literal byte do not match the escape code, the byte is output as-is. If the bits match, an escape sequence is generated, with the new escape code. Other primaries start with the escape code. The Elias Gamma Code is used extensively. This code consists of two parts: a unary code (a one-bit preceded by zero-bits) and a binary code part. The first part tells the decoder how many bits are used for the binary code part. Being a universal code, it produces shorter codes for small integers and longer codes for larger integers. Because we expect we need to encode a lot of small integers (there are more short string matches and shorter equal byte runs than long ones), this reduces the total number of bits needed. See the previous article for a more in-depth delve into statistical compression and universal codes. To understand this article, you only need to keep in mind that small integer value equals short code. The following discusses the encoding of the primaries. The most frequent compressed data is LZ77. The length of the match is output in Elias Gamma code, with "0" meaning the length of 2, "100" length of 3, "101" length of 4 and so on. If the length is not 2, a LZ77 offset value follows. This offset takes 9 to 22 bits. If the length is 2, the next bit defines whether this is LZ77 or RLE/Escape. If the bit is 0, an 8-bit LZ77 offset value follows. (Note that this restricts the offset for 2-byte matches to 1..256.) If the bit is 1, the next bit decides between escape (0) and RLE (1). The code for an escape sequence is thus e..e010n..ne....e, where E is the byte, and N is the new escape code. Example: * We are using 2-bit escapes * The current escape code is "11" * We need to encode a byte 0xca == 0b11001010 * The escape code and the byte high bits match (both are "11") * We output the current escape code "11" * We output the escaped identification "010" * We output the new escape bits, for example "10" (depends on escape optimization) * We output the rest of the escaped byte "001010" * So, we have output the string "1101010001010" When the decompressor receives this string of bits, it finds that the first two bits match with the escape code, it finds the escape identification ("010") and then gets the new escape, the rest of the original byte and combines it with the old escape code to get a whole byte. The end of file condition is encoded to the LZ77 offset and the RLE is subdivided into long and short versions. Read further, and you get a better idea about why this kind of encoding is selected. When I studied the distribution of the length values (LZ77 and short RLE lengths), I noticed that the smaller the value, the more occurrences. The following table shows an example of length value distribution. LZLEN S-RLE 2 1975 477 3-4 1480 330 5-8 492 166 9-16 125 57 17-32 31 33 33-64 8 15 The first column gives a range of values. The first entry has a single value (2), the second two values (3 and 4), and so on. The second column shows how many times the different LZ77 match lengths are used, the last column shows how many times short RLE lengths are used. The distribution of the values gives a hint of how to most efficiently encode the values. We can see from the table for example that values 2-4 are used 3455 times, while values 5-64 are used only 656 times. The more common values need to get shorter codes, while the less-used ones can be longer. Because in each "magnitude" there are approximately half as many values than in the preceding one, it almost immediately occurred to me that the optimal way to encode the length values (decremented by one) is: Value Encoding Range Gained 0000000 not possible 0000001 0 1 -6 bits 000001x 10x 2-3 -4 bits 00001xx 110xx 4-7 -2 bits 0001xxx 1110xxx 8-15 +0 bits 001xxxx 11110xxxx 16-31 +2 bits 01xxxxx 111110xxxxx 32-63 +4 bits 1xxxxxx 111111xxxxxx 64-127 +5 bits The first column gives the binary code of the original value (with x denoting 0 or 1, xx 0..3, xxx 0..7 and so on), the second column gives the encoding of the value(s). The third column lists the original value range in decimal notation. The last column summarizes the difference between this code and a 7-bit binary code. Using the previous encoding for the length distribution presented reduces the number of bits used compared to a direct binary representation considerably. Later I found out that this encoding in fact is Elias Gamma Code, only the assignment of 0- and 1-bits in the prefix is reversed, and in this version the length is limited. Currently the maximum value is selectable between 64 and 256. So, to recap, this version of the Gamma code can encode numbers from 1 to 255 (1 to 127 in the example). LZ77 and RLE lengths that are used start from 2, because that is the shortest length that gives us any compression. These length values are first decremented by one, thus length 2 becomes "0", and for example length 64 becomes "11111011111". The distribution of the LZ77 offset values (pointer to a previous occurrence of a string) is not at all similar to the length distribution. Admittedly, the distribution isn't exactly flat, but it also isn't as radical as the length value distribution either. I decided to encode the lower 8 bits (automatically selected or user-selectable between 8 and 12 bits in the current version) of the offset as-is (i.e. binary code) and the upper part with my version of the Elias Gamma Code. However, 2-byte matches always have an 8-bit offset value. The reason for this is discussed shortly. Because the upper part can contain the value 0 (so that we can represent offsets from 0 to 255 with a 8-bit lower part), and the code can't directly represent zero, the upper part of the LZ77 offset is incremented by one before encoding (unlike the length values which are decremented by one). Also, one code is reserved for an end of file (EOF) symbol. This restricts the offset value somewhat, but the loss in compression is negligible. With the previous encoding 2-byte LZ77 matches would only gain 4 bits (with 2-bit escapes) for each offset from 1 to 256, and 2 bits for each offset from 257 to 768. In the first case 9 bits would be used to represent the offset (one bit for gamma code representing the high part 0, and 8 bits for the low part of the offset), in the latter case 11 bits are used, because each "magnitude" of values in the Gamma code consumes two more bits than the previous one. The first case (offset 1..256) is much more frequent than the second case, because it saves more bits, and also because the symbol source statistics (whatever they are) guarantee 2-byte matches in recent history (much better chance than for 3-byte matches, for example). If we restrict the offset for a 2-byte LZ77 match to 8 bits (1..256), we don't lose so much compression at all, but instead we could shorten the code by one bit. This one bit comes from the fact that before we had to use one bit to make the selection "8-bit or longer". Because we only have "8-bit" now, we don't need that select bit anymore. Or, we can use that select bit to a new purpose to select whether this code really is LZ77 or something else. Compared to the older encoding (which I'm not detailing here, for clarity's sake. This is already much too complicated to follow, and only slightly easier to describe) the codes for escape sequence, RLE and End of File are still the same length, but the code for LZ77 has been shortened by one bit. Because LZ77 is the most frequently used primary, this presents a saving that more than compensates for the loss of 2-byte LZ77 matches with offsets 257..768 (which we can no longer represent, because we fixed the offset for 2-byte matches to use exactly 8 bits). Run length encoding is also a bit revised. I found out that a lot of bits can be gained by using the same length encoding for RLE as for LZ77. On the other hand, we still should be able to represent long repeat counts as that's where RLE is most efficient. I decided to split RLE into two modes: * short RLE for short (e.g. 2..128) equal byte strings * long RLE for long equal byte strings The Long RLE selection is encoded into the Short RLE code. Short RLE only uses half of its coding space, i.e. if the maximum value for the gamma code is 127, short RLE uses only values 1..63. Larger values switches the decoder into Long RLE mode and more bits are read to complete the run length value. For further compression in RLE we rank all the used RLE bytes (the values that are repeated in RLE) in the decreasing probability order. The values are put into a table, and only the table indices are output. The indices are also encoded using a variable length code (the same gamma code, surprise..), which uses less bits for smaller integer values. As there are more RLE's with smaller indices, the average code length decreases. In decompression we simply get the gamma code value and then use the value as an index into the table to get the value to repeat. Instead of reserving full 256 bytes for the table we only put the top 31 RLE bytes into the table. Normally this is enough. If there happens to be a byte run with a value not in the table we use a similar technique as for the short/long RLE selection. If the table index is larger than 31, it means we don't have the value in the table. We use the values 32..63 to select the 'escaped' mode and simultaneously send the 5 most significant bits of the value (there are 32 distinct values in the range 32..63). The rest 3 bits of the byte are sent separately. If you are more than confused, forget everything I said in this chapter and look at the decompression pseudo-code later in this article. -------------------------------------------------------------------------- Graph Search - Selecting Primaries ---------------------------------- In free-parse methods there are several ways to divide the file into parts, each of which is equally valid but not necessary equally efficient in terms of compression ratio. "i just saw justin adjusting his sting" "i just saw", (-9,4), "in ad", (-9,6), "g his", (-25,2), (-10,4) "i just saw", (-9,4), "in ad", (-9,6), "g his ", (-10,5) The latter two lines show how the sentence could be encoded using literal bytes and (offset, length) pairs. As you can see, we have two different encodings for a single string and they are both valid, i.e. they will produce the same string after decompression. This is what free-parse is: there are several possible ways to divide the input into parts. If we are clever, we will of course select the encoding that produces the shortest compressed version. But how do we find this shortest version? How does the data compressor decide which primary to generate in each step? The most efficient way the file can be divided is determined by a sort of a graph-search algorithm, which finds the shortest possible route from the start of the file to the end of the file. Well, actually the algorithm proceeds from the end of the file to the beginning for efficiency reasons, but the result is the same anyway: the path that minimizes the bits emitted is determined and remembered. If the parameters (number of escape bits or the variable length codes or their parameters) are changed, the graph search must be re-executed. "i just saw justin adjusting his sting" \___/ \_____/ \_|___/ 13 15 11 13 \____/ 15 Think of the string as separate characters. You can jump to the next character by paying 8 bits to do so (not shown in the figure), unless the top bits of the character match with the escape code (in which case you need more bits to send the character "escaped"). If the history buffer contains a string that matches a string starting at the current character you can jump over the string by paying as many bits as representing the LZ77 (offset,length)-pair takes (including escape bits), in this example from 11 to 15 bits. And the same applies if you have RLE starting at the character. Then you just find the least-expensive way to get from the start of the file to the end and you have found the optimal encoding. In this case the last characters " sting" would be optimally encoded with 8(literal " ") + 15("sting") = 23 instead of 11(" s") + 13("ting") = 24 bits. The algorithm can be written either cleverly or not-so. We can take a real short-cut compared to a full-blown graph search because we can/need to only go forwards in the file: we can simply start from the end! Our accounting information which is updated when we pass each location in the data consists of three values: 1. the minimum bits from this location to the end of file. 2. the mode (literal, LZ77 or RLE) to use to get that minimum 3. the "jump" length for LZ77 and RLE For each location we try to jump forward (to a location we already processed) one location, LZ77 match length locations (if a match exists), or RLE length locations (if equal bytes follow) and select the shortest route, update the tables accordingly. In addition, if we have a LZ77 or RLE length of for example 18, we also check jumps 17, 16, 15, ... This gives a little extra compression. Because we are doing the "tree traverse" starting from the "leaves", we only need to visit/process each location once. Nothing located after the current location can't change, so there is never any need to update a location. To be able to find the minimal path, the algorithm needs the length of the RLE (the number of the identical bytes following) and the maximum LZ77 length/offset (an identical string appearing earlier in the file) for each byte/location in the file. This is the most time-consuming -and- memory-consuming part of the compression. I have used several methods to make the search faster. See String Match Speedup later in this article. Fortunately these searches can be done first, and the actual optimization can use the cached values. Then what is the rationale behind this optimization? It works because you are not forced to take every compression opportunity, but select the best ones. The compression community calls this "lazy coding" or "non-greedy" selection. You may want to emit a literal byte even if there is a 2-byte LZ77 match, because in the next position in the file you may have a longer match. This is actually more complicated than that, but just take my word for it that there is a difference. Not a very big difference, and only significant for variable-length code, but it is there and I was after every last bit of compression, remember. Note that the decision-making between primaries is quite simple if a fixed-length code is used. A one-step lookahead is enough to guarantee optimal parsing. If there is a more advantageous match in the next location, we output a literal byte and that longer match instead of the shorter match. I don't have time or space here to go very deeply on that, but the main reason is that in fixed-length code it doesn't matter whether you represent a part of data as two matches of lengths 2 and 8 or as matches of lengths 3 and 7 or as any other possible combination (if matches of those lengths exist). This is not true for a variable-length code and/or a statistical compression backend. Different match lengths and offsets no longer generate equal-length codes. Note also that most LZ77 compression algorithms need at least 3-byte match to break even, i.e. not expanding the data. This is not surprising when you stop to think about it. To gain something from 2-byte matches you need to encode the LZ77 match into 15 bits. This is very little. A generic LZ77 compressor would use one bit to select between a literal and LZ77, 12 bits for moderate offset, and you have 2 bits left for match length. I imagine the rationale to exclude 2-byte matches also include "the potential savings percentage for 2-byte matches is insignificant". Pucrunch gets around this by using the tag system and Elias Gamma Code, and does indeed gain bits from even 2-byte matches. After we have decided on what primaries to output, we still have to make sure we get the best results from the literal tag system. Escape optimization handles this. In this stage we know which parts of the data are emitted as literal bytes and we can select the minimal path from the first literal byte to the last in the same way we optimized the primaries. Literal bytes that match the escape code generate an escape sequence, thus using more bits than unescaped literal bytes, and we need to minimize these occurrences. For each literal byte there is a corresponding new escape code which minimizes the path to the end of the file. If the literal byte's high bits match the current escape code, this new escape code is used next. The escape optimization routine proceeds from the end of the file to the beginning like the graph search, but it proceeds linearly and is thus much faster. I already noted that the new literal byte tagging system exploits the locality in the literal byte values. If there is no correlation between the bytes, the tagging system does not do well at all. Most of the time, however, the system works very well, performing 50% better than the prefix-bit approach. The escape optimization routine is currently very fast. A little algorithmic magic removed a lot of code from the original version. A fast escape optimization routine is quite advantageous, because the number of escape bits can now vary from 0 (uncompressed bytes always escaped) to 8 and we need to run the routine again if we change the number of escape bits used to select the optimal escape code changes. Because escaped literal bytes actually expand the data, we need a safety area, or otherwise the compressed data may get overwritten by the decompressed data before we have used it. Some extra bytes need to be reserved for the end of file marker. The compression routine finds out how many bytes we need for safety buffer by keeping track of the difference between input and output sizes while creating the compressed file. $1000 .. $2000 |OOOOOOOO| O=original file $801 .. |D|CCCCC| C=compressed data (D=decompressor) $f7.. $1000 $2010 |D| |CCCCC| Before decompression starts ^ ^ W R W=write pointer, R=read pointer If the original file is located at $1000-$1fff, and the calculated safety area is 16 bytes, the compressed version will be copied by the decompression routine higher in memory so that the last byte is at $200f. In this way, the minimum amount of other memory is overwritten by the decompression. If the safety are would exceed the top of memory, we need a wrap buffer. This is handled automatically by the compressor. The read pointer wraps from the end of memory to the wrap buffer, allowing the original file to extend up to the end of the memory, all the way to $ffff. You can get the compression program to tell you which memory areas it uses by specifying the "-s" option. Normally the safety buffer needed is less than a dozen bytes. To sum things up, Pucrunch operates in several steps: 1. Find RLE and LZ77 data, pre-select RLE byte table 2. Graph search, i.e. which primaries to use 3. Primaries/Literal bytes ratio decides how many escape bits to use 4. Escape optimization, which escape codes to use 5. Update RLE ranks and the RLE byte table 6. Determine the safety area size and output the file. -------------------------------------------------------------------------- String Match Speedup -------------------- To be able to select the most efficient combination of primaries we of course first need to find out what kind of primaries are available for selection. If the file doesn't have repeated bytes, we can't use RLE. If the file doesn't have repeating byte strings, we can't use LZ77. This string matching is the most time-consuming operation in LZ77 compression simply because of the amount of the comparison operations needed. Any improvement in the match algorithm can decrease the compression time considerably. Pucrunch is a living proof on that. The RLE search is straightforward and fast: loop from the current position (P) forwards in the file counting each step until a different-valued byte is found or the end of the file is reached. This count can then be used as the RLE byte count (if the graph search decides to use RLE). The code can also be optimized to initialize counts for all locations that belonged to the RLE, because by definition there are only one-valued bytes in each one. Let us mark the current file position by P. unsigned char *a = indata + P, val = *a++; int top = inlen - P; int rlelen = 1; /* Loop for the whole RLE */ while(rlelen did manage to snip off 2 bytes, thanks! However, there are some features you may consider unnecessary. The code can be shortened by: * No basic end address set: 8 bytes * No 2 MHz mode set/reset: 6 bytes * No wrap option: 12 bytes Actually, if the wrap option is not used, the compressor automatically selects the shorter decompression code (only for the C64 version). processor 6502 BASEND EQU $2d ; start of basic variables (updated at EOF) LZPOS EQU $2d ; temporary, BASEND *MUST* *BE* *UPDATED* at EOF bitstr EQU $f7 ; Hint the value beforehand xstore EQU $c3 ; tape load temp WRAPBUF EQU $004b ; 'wrap' buffer, 22 bytes ($02a7 for 89 bytes) ORG $0801 DC.B $0b,8,$ef,0 ; '239 SYS2061' DC.B $9e,$32,$30,$36 DC.B $31,0,0,0 sei ; disable interrupts inc $d030 ; or "bit $d030" if 2MHz mode is not enabled inc 1 ; Select ALL-RAM configuration ldx #0 ;** parameter - # of overlap bytes-1 off $ffff overlap lda $aaaa,x ;** parameter start of off-end bytes sta WRAPBUF,x ; Copy to wrap/safety buffer dex bpl overlap ldx #block200-end-block200+1 ; $54 ($59 max) packlp lda block200-1,x sta block200--1,x dex bne packlp ldx #block-stack-end-block-stack+1 ; $b3 (stack! ~$e8 max) packlp2 lda block-stack-1,x dc.b $9d ; sta $nnnn,x dc.w block-stack--1 ; (ZP addressing only addresses ZP!) dex bne packlp2 ldy #$aa ;** parameter SIZE high + 1 (max 255 extra bytes) cploop dex ; ldx #$ff on the first round lda $aaaa,x ;** parameter DATAEND-0x100 sta $ff00,x ;** parameter ORIG LEN-0x100+ reserved bytes txa ;cpx #0 bne cploop dec cploop+6 dec cploop+3 dey bne cploop jmp main The first part of the code contains a sys command for the basic interpreter, two loops that copy the decompression code to zeropage/stack ($f7-$1aa) and to the system input buffer ($200-$253). The latter code segment contains byte, bit and Gamma Code input routines and the RLE byte code table, the former code segment contains the rest. This code also copies the compressed data forward in memory so that it won't be overwritten by the decompressed data before we have had a change to read it. The decompression starts at the beginning and proceeds upwards in both the compressed and decompressed data. A safety area is calculated by the compression routine. It finds out how many bytes we need for temporary data expansion, i.e. for escaped bytes. The wrap buffer is used for files that extend upto the end of memory, and would otherwise overwrite the compressed data with decompressed data before it has been read. This code fragment is not used during the decompression itself. In fact the code will normally be overwritten when the actual decompression starts. The very start of the next code block is located inside the zero page and the rest fills the lowest portion of the microprocessor stack. The zero page is used to make the references to different variables shorter and faster. Also, the variables don't take extra code to initialize, because they are copied with the same copy loop as the rest of the code. block-stack #rorg $f7 ; $f7 - ~$1e0 block-stack- bitstr dc.b $80 ; ZP $80 == Empty esc dc.b $00 ; ** parameter (saves a byte when here) OUTPOS = *+1 ; ZP putch sta $aaaa ; ** parameter inc OUTPOS ; ZP bne 0$ ; Note: beq 0$; rts; 0$: inc OUTPOS+1;rts would be ; $0100 ; faster, but 1 byte longer inc OUTPOS+1 ; ZP 0$ rts putch is the subroutine that is used to output the decompressed bytes. In this case the bytes are written to memory. Because the subroutine call itself takes 12 cycles (6 for jsr and another 6 for rts), and the routine is called a lot of times during the decompression, the routine itself should be as fast as possible. This is achieved by removing the need to save any registers. This is done by using an absolute addressing mode instead of indirect indexed or absolute indexed addressing (sta $aaaa instead of sta ($zz),y or sta $aa00,y). With indexed addressing you would need to save+clear+restore the index register value in the routine. Further improvement in code size and execution speed is done by storing the instruction that does the absolute addressing to zero page. When the memory address is incremented we can use zero-page addressing for it too. On the other hand, the most time is spent in the bit input routine so further optimization of this routine is not feasible. newesc ldy esc ; remember the old code (top bits for escaped byte) ldx #2 ; ** PARAMETER jsr getchkf ; get & save the new escape code sta esc tya ; pre-set the bits ; Fall through and get the rest of the bits. noesc ldx #6 ; ** PARAMETER jsr getchkf jsr putch ; output the escaped/normal byte ; Fall through and check the escape bits again main ldy #0 ; Reset to a defined state tya ; A = 0 ldx #2 ; ** PARAMETER jsr getchkf ; X=2 -> X=0 cmp esc bne noesc ; Not the escape code -> get the rest of the byte ; Fall through to packed code The decompression code is first entered in main. It first clears the accumulator and the Y register and then gets the escape bits (if any are used) from the input stream. If they don't match with the current escape code, we get more bits to complete a byte and then output the result. If the escape bits match, we have to do further checks to see what to do. jsr getval ; X = 0 sta xstore ; save the length for a later time cmp #1 ; LEN == 2 ? bne lz77 ; LEN != 2 -> LZ77 tya ; A = 0 jsr get1bit ; X = 0 lsr ; bit -> C, A = 0 bcc lz77-2 ; A=0 -> LZPOS+1 ;***FALL THRU*** We first get the Elias Gamma Code value (or actually my independently developed version). If it says the LZ77 match length is greater than 2, it means a LZ77 code and we jump to the proper routine. Remember that the lengths are decremented before encoding, so the code value 1 means the length is 2. If the length is two, we get a bit to decide if we have LZ77 or something else. We have to clear the accumulator, because get1bit does not do that automatically. If the bit we got (shifted to carry to clear the accumulator) was zero, it is LZ77 with an 8-bit offset. If the bit was one, we get another bit which decides between RLE and an escaped byte. A zero-bit means an escaped byte and the routine that is called also changes the escape bits to a new value. A one-bit means either a short or long RLE. ; e..e01 jsr get1bit ; X = 0 lsr ; bit -> C, A = 0 bcc newesc ; e..e010 ;***FALL THRU*** ; e..e011 srle iny ; Y is 1 bigger than MSB loops jsr getval ; Y is 1, get len, X = 0 sta xstore ; Save length LSB cmp #64 ; ** PARAMETER 63-64 -> C clear, 64-64 -> C set.. bcc chrcode ; short RLE, get bytecode longrle ldx #2 ; ** PARAMETER 111111xxxxxx jsr getbits ; get 3/2/1 more bits to get a full byte, X = 0 sta xstore ; Save length LSB jsr getval ; length MSB, X = 0 tay ; Y is 1 bigger than MSB loops The short RLE only uses half (or actually 1 value less than a half) of the gamma code range. Larger values switches us into long RLE mode. Because there are several values, we already know some bits of the length value. Depending on the gamma code maximum value we need to get from one to three bits more to assemble a full byte, which is then used as the less significant part for the run length count. The upper part is encoded using the same gamma code we are using everywhere. This limits the run length to 16 kilobytes for the smallest maximum value (-m5) and to the full 64 kilobytes for the largest value (-m7). Additional compression for RLE is gained using a table for the 31 top-ranking RLE bytes. We get an index from the input. If it is from 1 to 31, we use it to index the table. If the value is larger, the lower 5 bits of the value gives us the 5 most significant bits of the byte to repeat. In this case we read 3 additional bits to complete the byte. chrcode jsr getval ; Byte Code, X = 0 tax ; this is executed most of the time anyway lda table-1,x ; Saves one jump if done here (loses one txa) cpx #32 ; 31-32 -> C clear, 32-32 -> C set.. bcc 1$ ; 1..31 -> the byte to repeat is in A ; Not ranks 1..31, -> 111110xxxxx (32..64), get byte.. txa ; get back the value (5 valid bits) jsr get3bit ; get 3 more bits to get a full byte, X = 0 1$ ldx xstore ; get length LSB inx ; adjust for cpx#$ff;bne -> bne dorle jsr putch dex bne dorle ; xstore 0..255 -> 1..256 deym bne dorle ; Y was 1 bigger than wanted originally mainbeq beq main ; reverse condition -> jump always After deciding the repeat count and decoding the value to repeat we simply have to output the value enough times. The X register holds the lower part and the Y register holds the upper part of the count. The X register value is first incremented by one to change the code sequence dex ; cpx #$ff ; bne dorle into simply dex ; bne dorle. This may seem strange, but it saves one byte in the decompression code and two clock cycles for each byte that is outputted. It's almost a ten percent improvement. :-) The next code fragment is the LZ77 decode routine and it is used in the file parts that do not have equal byte runs (and even in some that have). The routine simply gets an offset value and copies a sequence of bytes from the already decompressed portion to the current output position. lz77 jsr getval ; X=0 -> X=0 cmp #127 ; ** PARAMETER Clears carry (is maximum value) beq eof ; EOF sbc #0 ; C is clear -> subtract 1 (1..126 -> 0..125) ldx #0 ; ** PARAMETER (more bits to get) jsr getchkf ; clears Carry, X=0 -> X=0 lz77-2 sta LZPOS+1 ; offset MSB ldx #8 jsr getbits ; clears Carry, X=8 -> X=0 ; Note: Already eor:ed in the compressor.. ;eor #255 ; offset LSB 2's complement -1 (i.e. -X = ~X+1) adc OUTPOS ; -offset -1 + curpos (C is clear) sta LZPOS lda OUTPOS+1 sbc LZPOS+1 ; takes C into account sta LZPOS+1 ; copy X+1 number of chars from LZPOS to OUTPOS ;ldy #0 ; Y was 0 originally, we don't change it ldx xstore ; LZLEN inx ; adjust for cpx#$ff;bne -> bne lzloop lda (LZPOS),y jsr putch iny ; Y does not wrap because X=0..255 and Y initially 0 dex bne lzloop ; X loops, (256,1..255) beq mainbeq ; jump through another beq (-1 byte, +3 cycles) There are two entry-points to the LZ77 decode routine. The first one (lz77) is for copy lengths bigger than 2. The second entry point (lz77-2) is for the length of 2 (8-bit offset value). ; EOF eof lda #$37 ; ** could be a PARAMETER sta 1 dec $d030 ; or "bit $d030" if 2MHz mode is not enabled lda OUTPOS ; Set the basic prg end address sta BASEND lda OUTPOS+1 sta BASEND+1 cli ; ** could be a PARAMETER jmp $aaaa ; ** PARAMETER #rend block-stack-end Some kind of a end of file marker is necessary for all variable-length codes. Otherwise we could not be certain when to stop decoding. Sometimes the byte count of the original file is used instead, but here a special EOF condition is more convenient. If the high part of a LZ77 offset is the maximum gamma code value, we have reached the end of file and must stop decoding. The end of file code turns on BASIC and KERNEL, turns off 2 MHz mode (for C128) and updates the basic end addresses before allowing interrupts and jumping to the program start address. The next code fragment is put into the system input buffer. The routines are for getting bits from the encoded message (getbits) and decoding the Elias Gamma Code (getval). The table at the end contains the ranked RLE bytes. The compressor automatically decreases the table size if not all of the values are used. block200 #rorg $200 ; $200-$258 block200- getnew pha ; 1 Byte/3 cycles INPOS = *+1 lda $aaaa ;** parameter rol ; Shift in C=1 (last bit marker) sta bitstr ; bitstr initial value = $80 == empty inc INPOS ; Does not change C! bne 0$ inc INPOS+1 ; Does not change C! bne 0$ ; This code does not change C! lda #WRAPBUF ; Wrap from $ffff->$0000 -> WRAPBUF sta INPOS 0$ pla ; 1 Byte/4 cycles rts ; getval : Gets a 'static huffman coded' value ; ** Scratches X, returns the value in A ** getval inx ; X must be 0 when called! txa ; set the top bit (value is 1..255) 0$ asl bitstr bne 1$ jsr getnew 1$ bcc getchk ; got 0-bit inx cpx #7 ; ** parameter bne 0$ beq getchk ; inverse condition -> jump always ; getbits: Gets X bits from the stream ; ** Scratches X, returns the value in A ** get1bit inx ;2 getbits asl bitstr bne 1$ jsr getnew 1$ rol ;2 getchk dex ;2 more bits to get ? getchkf bne getbits ;2/3 clc ;2 return carry cleared rts ;6+6 table dc.b 0,0,0,0,0,0,0 dc.b 0,0,0,0,0,0,0,0 dc.b 0,0,0,0,0,0,0,0 dc.b 0,0,0,0,0,0,0,0 #rend block200-end -------------------------------------------------------------------------- Target Application Compression Tests ------------------------------------ The following data compression tests are made on my four C64 test files: bs.bin is a demo part, about 50% code and 50% graphics data delenn.bin is a BFLI picture with a viewer, a lot of dithering sheridan.bin is a BFLI picture with a viewer, dithering, black areas ivanova.bin is a BFLI picture with a viewer, dithering, larger black areas Packer Size Left Comment =============================================== bs.bin 41537 ----------------------------------------------- ByteBonker 1.5 27326 65.8% Mode 4 Cruelcrunch 2.2 27136 65.3% Mode 1 The AB Cruncher 27020 65.1% ByteBoiler (REU) 26745 64.4% RLE + ByteBoiler (REU) 26654 64.2% PuCrunch 26415 63.6% -m5 =============================================== delenn.bin 47105 ----------------------------------------------- The AB Cruncher N/A N/A Crashes ByteBonker 1.5 21029 44.6% Mode 3 Cruelcrunch 2.2 20672 43.9% Mode 1 ByteBoiler (REU) 20371 43.2% RLE + ByteBoiler (REU) 19838 42.1% PuCrunch 19734 41.9% -p2 =============================================== sheridan.bin 47105 ----------------------------------------------- ByteBonker 1.5 13661 29.0% Mode 3 Cruelcrunch 2.2 13595 28.9% Mode H The AB Cruncher 13534 28.7% ByteBoiler (REU) 13308 28.3% PuCrunch 12526 26.6% -p2 RLE + ByteBoiler (REU) 12478 26.5% =============================================== ivanova.bin 47105 ----------------------------------------------- ByteBonker 1.5 11016 23.4% Mode 1 Cruelcrunch 2.2 10883 23.1% Mode H The AB Cruncher 10743 22.8% ByteBoiler (REU) 10550 22.4% PuCrunch 9844 20.9% -p2 RLE + ByteBoiler (REU) 9813 20.8% LhA 9543 20.3% Decompressor not included gzip -9 9474 20.1% Decompressor not included ----------------------------------------------- -------------------------------------------------------------------------- Calgary Corpus Suite -------------------- The original compressor only allows files upto 63 kB. To be able to compare my algorithm to others I modified the compressor to allow bigger files. I then got some reference results using the Calgary Corpus test suite. Note that the decompression code is included in the compressed files, although it is not valid for files over 63k (compressed or uncompressed size). About 34 bytes are decompression parameters, the rest (approx. 300 bytes) is 6510 machine language. Kolmogorov complexity, anyone ?:-) To tell you the truth, the results surprised me, because the compression algorithm -IS- developed for a very special case in mind. It only has a fixed code for LZ77/RLE lengths, not even a static one (fixed != static != adaptive)! Also, it does not use arithmetic code (or Huffman) to compress the literal bytes. Because most of the big files are ASCII text, this somewhat handicaps my compressor, although the new tagging system is very happy with 7-bit ASCII input. Also, decompression is relatively fast, and uses no extra memory. I'm getting relatively near LhA, and shorter than LhA for 8 files (300-byte decompressor included!), and relatively near or shorter than LhA in other cases if the decompressor is removed. The table contains the file name (file), compression options (options), the original file size (in) and the compressed file size (out) in bytes, average number of bits used to encode one byte (b/B), remaining size (ratio) and the reduction (gained), and the time used for compression. For comparison, the last three columns show the compressed sizes for LhA, Zip and GZip (with the -9 option), respectively. FreeBSD epsilon3.vlsi.fi PentiumProŽ 200MHz Estimated decompression on a C64 (1MHz 6510) 6:47 LhA Zip GZip-9 file options in out b/B ratio gained time out out out ========================================================================= bib -p4 111261 35457 2.55 31.87% 68.13% 8.3 40740 35041 34900 book1 -p4 768771 318919 3.32 41.49% 58.51% 65.1 339074 313352 312281 book2 -p4 610856 208627 2.74 34.16% 65.84% 43.5 228442 206663 206158 geo -p2 102400 72812 5.69 71.11% 28.89% 11.4 68574 68471 68414 news -p3 377109 144566 3.07 38.34% 61.66% 15.2 155084 144817 144400 obj1 -m6 21504 10750 4.00 50.00% 50.00% 0.1 10310 10300 10320 obj2 246814 83046 2.70 33.65% 66.35% 13.5 84981 81608 81087 paper1 -p2 53161 19536 2.94 36.75% 63.25% 1.5 19676 18552 18543 paper2 -p3 82199 30676 2.99 37.32% 62.68% 4.3 32096 29728 29667 paper3 -p2 46526 19234 3.31 41.35% 58.65% 1.4 18949 18072 18074 paper4 -p1 -m5 13286 6095 3.68 45.88% 54.12% 0.2 5558 5511 5534 paper5 -p1 -m5 11954 5494 3.68 45.96% 54.04% 0.1 4990 4970 4995 paper6 -p2 38105 14159 2.98 37.16% 62.84% 0.8 13814 13207 13213 pic -p1 513216 57835 0.91 11.27% 88.73% 23.2 52221 56420 52381 progc -p1 39611 14221 2.88 35.91% 64.09% 0.7 13941 13251 13261 progl -p1 71646 17038 1.91 23.79% 76.21% 3.8 16914 16249 16164 progp 49379 11820 1.92 23.94% 76.06% 1.3 11507 11222 11186 trans -p2 93695 19511 1.67 20.83% 79.17% 3.7 22578 18961 18862 ------------------------------------------------------------------------- total 3251493 1089796 2.68 33.52% 66.48% 3:18 Canterbury Corpus Suite ----------------------- The following shows the results on the Canterbury corpus. Again, I am quite pleased with the results. For example, pucrunch beats GZip -9 for lcet10.txt if you remove the decompression code. FreeBSD epsilon3.vlsi.fi PentiumProŽ 200MHz Estimated decompression on a C64 (1MHz 6510) 6:00 LhA Zip GZip-9 file opt in out b/B ratio gained time out out out ========================================================================== alice29.txt -p4 152089 55103 2.90 36.24% 63.76% 11.3 59160 54525 54191 ptt5 -p1 513216 57835 0.91 11.27% 88.73% 23.2 52272 56526 52382 fields.c 11150 3505 2.52 31.44% 68.56% 0.1 3180 3230 3136 kennedy.xls 1029744 265887 2.07 25.83% 74.17% 571 198354 206869 209733 sum 38240 13334 2.79 34.87% 65.13% 0.6 14016 13006 12772 lcet10.txt -p4 426754 144585 2.72 33.89% 66.11% 30.8 159689 144974 144429 plrabn12.txt-p4 481861 199134 3.31 41.33% 58.67% 43.6 210132 195299 194277 cp.html -p1 24603 8679 2.83 35.28% 64.72% 0.4 8402 8085 7981 grammar.lsp -m5 3721 1591 3.43 42.76% 57.24% 0.0 1280 1336 1246 xargs.1 -m5 4227 2117 4.01 50.09% 49.91% 0.0 1790 1842 1756 asyoulik.txt-p4 125179 50594 3.24 40.42% 59.58% 7.5 52377 49042 48829 -------------------------------------------------------------------------- total 2810784 802364 2.28 28.55% 71.45% 11:29 -------------------------------------------------------------------------- Conclusions ----------- In this article I have presented a compression program which creates compressed executable files for C64, VIC20 and Plus4/C16. The compression can be performed on Amiga, MS-DOS/Win machine or any other machine with a C-compiler. A powerful machine allows asymmetric compression: a lot of resources can be used to compress the data while needing minimal resources for decompression. This was one of the design requirements. Two original ideas were presented: a new literal byte tagging system and an algorithm using hybrid RLE and LZ77. Also, a detailed explanation of the LZ77 string match routine and the optima parsing scheme was presented. The compression ratio and decompression speed is comparable to other compression programs for Commodore 8-bit computers. But what are then the real advantages of pucrunch compared to traditional C64 compression programs in addition to that you can now compress VIC20 and Plus4/C16 programs? Because I'm lousy at praising my own work, I let you see some actual user comments. I have edited the correspondence a little, but I hope he doesn't mind. My comments are marked with an asterisk. Maybe Steve has something to add also? ---8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<--- A big advantage is that pucrunch does RLE and LZ in one pass. For demos I only used a cruncher and did my own RLE routines as it is somewhat annoying to use an external program for this. These programs require some memory and ZP-addresses like the cruncher does. So it can easily happen that the decruncher or depacker interfere with your demo-part, if you didn't know what memory is used by the depacker. At least you have more restrictions to care about. With pucrunch you can do RLE and LZ without having too much of these restrictions. * Right, and because pucrunch is designed that way from the start, it can * get better results with one-pass RLE and LZ than doing them separately. * On the other hand it more or less requires that you _don't_ RLE-pack the * file first.. This is true, we also found that out. We did a part for our demo which had some tables using only the low-nybble. Also the bitmap had to be filled with a specific pattern. We did some small routines to shorten the part, but as we tried pucrunch, this became obsolete. From 59xxx bytes to 12xxx or 50 blocks, with our own RLE and a different cruncher we got 60 blks! Not bad at all ;) Not to mention that you have the complete and commented source-code for the decruncher, so that you can easily change it to your own needs. And it's not only very flexible, it is also very powerful. In general pucrunch does a better job than ByteBoiler+Sledgehammer. In addition to that pucrunch is of course much faster than crunchers on my C64, this has not only to do with my 486/66 and the use of an HDD. See, I use a cross-assembler-system, and with pucrunch I don't have to transfer the assembled code to my 64, crunch it, and transfer it back to my pc. Now, it's just a simple command-line and here we go... And not only I can do this, my friend who has an amiga uses pucrunch as well. This is the first time we use the same cruncher, since I used to take ByteBoiler, but my friend didn't have a REU so he had to try another cruncher. So, if I try to make a conclusion: It's fast, powerful and extremly flexible (thanks to the source-code). ---8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<--- Just for your info... We won the demo-competition at the Interjam'98 and everything that was crunched ran through the hands of pucrunch... Of course, you have been mentioned in the credits. If you want to take a look, search for KNOOPS/DREAMS, which should be on the ftp-servers in some time. So, again, THANKS! :) Ninja/DREAMS ---8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<----8<--- So, what can I possibly hope to add to that, right?:-) If you have any comments, questions, article suggestions or just a general hello brewing in your mind, send me mail or visit my homepage. See you all again in the next issue! -Pasi -------------------------------------------------------------------------- Appendix: The Log Book ---------------------- 5.3.1997 Tried reverse LZ, i.e. mirrored history buffer. Gained some bytes, but its not really worth it, i.e. the compress time increases hugely and the decompressor gets bigger. 6.3.1997 Tried to have a code to use the last LZ copy position (offset added to the lastly used LZ copy position). On bs.run I gained 57 bytes, but in fact the net gain was only 2 bytes (decompressor becomes ~25 bytes longer, and the lengthening of the long rle codes takes away the rest 30). 10.3.1997 Discovered that my representation of integers 1-63 is in fact an Elias Gamma Code. Tried Fibonacci code instead, but it was much worse (~500 bytes on bs.run, ~300 bytes on delenn.run) without even counting the expansion of the decompression code. 12.3.1997 'huffman' coded RLE byte -> ~70 bytes gain for bs.run. The RLE bytes used are ranked, and top 15 are put into a table, which is indexed by a Elias Gamma Code. Other RLE bytes get a prefix "1111". 15.3.1997 The number of escape bits used is again selectable. Using only one escape bit for delenn.run gains ~150 bytes. If #-option is not selected, automatically selects the number of escape bits (is a bit slow). 16.3.1997 Changed some arrays to short. 17 x inlen + 64kB memory used. opt-escape() only needs two 16-element arrays now and is slightly faster. 31.3.1997 Tried to use BASIC ROM as a codebook, but the results were not so good. For mostly-graphics files there are no long matches -> no net gain, for mostly-code files the file itself gives a better codebook.. Not to mention that using the BASIC ROM as a codebook is not 100% compatible. 1.4.1997 Tried maxlen 128, but it only gained 17 bytes on ivanova.run, and lost ~15 byte on bs.run. This also increased the LZPOS maximum value from ~16k to ~32k, but it also had little effect. 2.4.1997 Changed to coding so that LZ77 has the priority. 2-byte LZ matches are coded in a special way without big loss in efficiency, and codes also RLE/Escape. 5.4.1997 Tried histogram normalization on LZLEN, but it really did not gain much of anything, not even counting the mapping table from index to value that is needed. 11.4.1997 8..14 bit LZPOS base part. Automatic selection. Some more bytes are gained if the proper selection is done before the LZ/RLELEN optimization. However, it can't really be done automatically before that, because it is a recursive process and the original LZ/RLE lengths are lost in the first optimization.. 22.4.1997 Found a way to speed up the almost pathological cases by using the RLE table to skip the matching beginnings. 2.5.1997 Switched to maximum length of 128 to get better results on the Calgary Corpus test suite. 25.5.1997 Made the maximum length adjustable. -m5, -m6, and -m7 select 64, 128 and 256 respectively. The decompression code now allows escape bits from 0 to 8. 1.6.1997 Optimized the escape optimization routine. It now takes almost no time at all. It used a whole lot of time on large escape bit values before. The speedup came from a couple of generic data structure optimizations and loop removals by informal deductions. 3.6.1997 Figured out another, better way to speed up the pathological cases. Reduced the run time to a fraction of the original time. All 64k files are compressed under one minute on my 25 MHz 68030. pic from the Calgary Corpus Suite is now compressed in 19 seconds instead of 7 minutes (200 MHz Pentium w/ FreeBSD). Compression of ivanova.run (one of my problem cases) was reduced from about 15 minutes to 47 seconds. The compression of bs.run has been reduced from 28 minutes (the first version) to 24 seconds. An excellent example of how the changes in the algorithm level gives the most impressive speedups. 6.6.1997 Changed the command line switches to use the standard approach. 11.6.1997 Now determines the number of bytes needed for temporary data expansion (i.e. escaped bytes). Warns if there is not enough memory to allow successful decompression on a C64. Also, now it's possible to decompress the files compressed with the program (must be the same version). (-u) 17.6.1997 Only checks the lengths that are power of two's in OptimizeLength(), because it does not seem to be any (much) worse than checking every length. (Smaller than found maximum lengths are checked because they may result in a shorter file.) This version (compiled with optimizations on) only spends 27 seconds on ivanova.run. 19.6.1997 Removed 4 bytes from the decrunch code (begins to be quite tight now unless some features are removed) and simultaneously removed a not-yet-occurred hidden bug. 23.6.1997 Checked the theoretical gain from using the lastly outputted byte (conditional probabilities) to set the probabilities for normal/LZ77/RLE selection. The number of bits needed to code the selection is from 0.0 to 1.58, but even using arithmetic code to encode it, the original escape system is only 82 bits worse (ivanova.run), 7881/7963 bits total. The former figure is calculated from the entropy, the latter includes LZ77/RLE/escape select bits and actual escapes. 18.7.1997 In LZ77 match we now check if a longer match (further away) really gains more bits. Increase in match length can make the code 2 bits longer. Increase in match offset can make the code even longer (2 bits for each magnitude). Also, if LZPOS low part is longer than 8, the extra bits make the code longer if the length becomes longer than two. ivanova -5 bytes, sheridan -14, delenn -26, bs -29 When generating the output rescans the LZ77 matches. This is because the optimization can shorten the matches and a shorter match may be found much nearer than the original longer match. Because longer offsets usually use more bits than shorter ones, we get some bits off for each match of this kind. Actually, the rescan should be done in OptimizeLength() to get the most out of it, but it is too much work right now (and would make the optimize even slower). 29.8.1997 4 bytes removed from the decrunch code. I have to thank Tim Rogers (timr@eurodltd.co.uk) for helping with 2 of them. 12.9.1997 Because SuperCPU doesn't work correctly with inc/dec $d030, I made the 2 MHz user-selectable and off by default. (-f) 13.9.1997 Today I found out that most of my fast string matching algorithm matches the one developed by [Fenwick and Gutmann, 1994]*. It's quite frustrating to see that you are not a genius after all and someone else has had the same idea. :-) However, using the RLE table to help still seems to be an original idea, which helps immensely on the worst cases. I still haven't read their paper on this, so I'll just have to get it and see.. * [Fenwick and Gutmann, 1994]. P.M. Fenwick and P.C. Gutmann, "Fast LZ77 String Matching", Dept of Computer Science, The University of Auckland, Tech Report 102, Sep 1994 14.9.1997 The new decompression code can decompress files from $258 to $ffff (or actually all the way upto $1002d :-). The drawback is: the decompression code became 17 bytes longer. However, the old decompression code is used if the wrap option is not needed. 16.9.1997 The backSkip table can now be fixed size (64 kWord) instead of growing enormous for "BIG" files. Unfortunately, if the fixed-size table is used, the LZ77 rescan is impractical (well, just a little slow, as we would need to recreate the backSkip table again). On the other hand the rescan did not gain so many bytes in the first place (percentage). The define BACKSKIP-FULL enables the old behavior (default). Note also, that for smaller files than 64kB (the primary target files) the default consumes less memory. The hash value compare that is used to discard impossible matches does not help much. Although it halves the number of strings to consider (compared to a direct one-byte compare), speedwise the difference is negligible. I suppose a mismatch is found very quickly when the strings are compared starting from the third charater (the two first characters are equal, because we have a full hash table). According to one test file, on average 3.8 byte-compares are done for each potential match. A define HASH-COMPARE enables (default) the hash version of the compare, in which case "inlen" bytes more memory is used. After removing the hash compare my algorithm quite closely follows the [Fenwick and Gutmann, 1994] fast string matching algorithm (except the RLE trick). (Although I *still* haven't read it.) 14 x inlen + 256 kB of memory is used (with no HASH-COMPARE and without BACKSKIP-FULL). 18.9.1997 One byte removed from the decompression code (both versions). 30.12.1997 Only records longer matches if they compress better than shorter ones. I.e. a match of length N at offset L can be better than a match of length N+1 at 4*L. The old comparison was "better or equal" (">="). The new comparison "better" (">") gives better results on all Calgary Corpus files except "geo", which loses 101 bytes (0.14% of the compressed size). An extra check/rescan for 2-byte matches in OptimizeLength() increased the compression ratio for "geo" considerably, back to the original and better. It seems to help for the other files also. Unfortunately this only works with the full backskip table (BACKSKIP-FULL defined). 21.2.1998 Compression/Decompression for VIC20 and C16/+4 incorporated into the same program. 16.3.1998 Removed two bytes from the decompression codes. 17.8.1998 There was a small bug in pucrunch which caused the location $2c30 to be decremented (dec $2c30 instead of bit $d030) when run without the -f option. The source is fixed and executables are now updated. -------------------------------------------------------------------------- References 1. http://www.cs.tut.fi/~albert/ 2. http://www.cs.tut.fi/~albert/Dev/ 2. http://www.cs.tut.fi/~albert/Dev/pucrunch/ ....... .... .. . C=H #17 :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: VIC-20 Kernal ROM Disassembly Project Richard Cini Introduction In order to put this project into perspective, a little personal history is needed. I received my first Commodore as a gift from my parents back in 1982. I used Commodore PETs in the school's computer lab, and Radio Shack Model I's in the local R/S store. It was nice to have one of my own to hack on, though. Back then, most of my work was with the built-in BASIC interpreter. My claim to fame (to my family, at least) was a BASIC/machine language mailing list management program, and an allophone speech synthesizer hardware-software hack. Both worked well, which surprised my mother, who claims that nothing I ever built worked right. As I grew-up, my computer-of-choice changed, but I never lost my love for the VIC. It is small, easy to program, has a very capable processor (by 1980's standards) and decent I/O capability. It's peripherals were varied, if not quirky (take the 1515 printer, for example), but at least everything worked well together. Now, fast forward to 1994. Commodore International failed, crippled by years of a weak product strategy, squandered opportunities and the market's increased focus on mainstream PC-compatible or Macintosh machines as productivity tools. After Commodore's failure, I decided that I wanted to try to purchase the Commodore 8-bit intellectual property, including the rights to the Kernal and BASIC source code, primarily for preservation purposes. One of my hobbies is collecting and preserving obsolete and unsupported computers, accessories, documentation, etc. Since Commodore's bankruptcy attorney would not return my calls (no surprise there), I embarked on decompiling the Kernal. This project, although time consuming and rewarding from an informational perspective was not truly trail-blazing. Many before me probably decompiled parts of the Kernal in order to gain some understanding as it related to another project. However, in my research, I don't recall ever seeing complete recompileable source code. Memory and ROM maps, yes; source code, no. Marko Makela manages a great Commodore web site that contains lots of useful information, including these memory and ROM maps. These provided the starting point for my work. See http://www.hut.fi/misc/cbm/docs/ for this and much more information. What I would like to accomplish in a series of articles is to explain the process and to discuss specific Kernal routines that may be of interest to C=Hacking readers. Also, where appropriate, I will make comparisons with the other Commodore machines that were the contemporaries of the VIC, the C64 and the PET, specifically. One might ask, "Why is this project different from all of the other resources already available?" In short, here's why: 1. The end result is a fully modifiable and compilable source file. 2. Only the VIC memory map and ROM location map are available *because* to date everyone has focused on the C64 as it is the more functional (and hence, more popular) Commodore of the era. The same goes for the PET, too. 3. A far as I know, there was no "Mapping the VIC" written, because of #2, above. 4. Because of #3, the fact that I had some spare time, and that I love my VIC (although I don't use it as much these days), I felt that I had to do it. Brief VIC-20 History -------------------- Although I'd like to provide a complete history lesson on the VIC-20, many others before me have done a better job. The March 1985 issue of the IEEE Spectrum has an article, as does Marko's web site. See http://www.hut.fi/misc/cbm/docs/peddle.english.html for a great article on Chuck Peddle, the well-known creator of the 6502 microprocessor. Nonetheless, I will provide a Readers' Digest version of the history of the VIC. In the late-70s, engineers in the Advanced Systems Design Group (ASDG) at Commodore created a multi-function video/sound interface chip, the 6560 (a.k.a., the VIC). Al Charpentier ran the LSI section of the ASDG, and was the lead in designing the VIC-I, and later, the VIC-II. The ASDG was the old MOS Semiconductor operation that Commodore bought in the mid-70s. The 6560 supported complete composite color video, 3-voice plus white noise sound, a volume control, two analog-to-digital converters that supported the use of a game paddle or joystick, and a light pen interface. Feature-rich as it was, no manufacturer wanted to commit a product line to it. Since Commodore could not find anyone to buy the chip, they decided to build a computer that featured the chip. Hence, the VIC-20 was born. Al's buddy, Bob Yannes, was a senior systems designer at Commodore who developed the VIC-20 (and later, the C64) prototype. The VIC was Commodore's first color computer, and the first designed for home use. Relationships with other Commodore Products ------------------------------------------- The structure of the VIC-20 ROM parallels those of Commodore's other machines of that era, the PET and the C64. All three machines share the concept o