Squeezing Blood from a Stone: Extreme 6502 Image Decoding Optimizations
Optimizing computationally intensive algorithms for resource-constrained systems demands radical thinking. When targeting the 1MHz MOS 6502 processor—a cornerstone of iconic 8-bit systems like the Apple II—standard compiler output is woefully inadequate. Transforming a JPEG-like image decoder from an impractical 70-minute runtime to a barely tolerable 60 seconds required abandoning modern conventions and embracing the chip's raw, unforgiving architecture. Here’s how it was done.
Embracing the Unthinkable
Modern software engineering emphasizes safety: bounds checks, abstraction, and strict contracts. On the 6502, these luxuries vanish. The optimization journey began by discarding them entirely:
- Self-Modifying Code: Critical values (like the factor, reused 153,600 times per image) were patched directly into instruction operands (LDA #$FF), eliminating repeated loads. While dangerous and non-reentrant, it saved precious cycles where they mattered most.
- Zero Stack Parameters: Function calls pushing parameters onto the software stack were eliminated. Parameters passed via registers (A, X, Y) or hardcoded zero-page locations slashed call overhead. Critical multiplications moved to direct entry points.
- No Safety Nets: Bounds checks and input validation were stripped. "The data is trusted. Crashes? That's a data problem," the approach declares. Assumptions about valid image encoding became implicit.
Exploiting 8-Bit Reality
The original C algorithm leaned on 16-bit (int) operations. On the 6502, these are cripplingly slow. Every variable was scrutinized:
- 8-Bit Where Possible: Variables that could fit in 8 bits were forced into bytes. The 32-bit bit buffer was reduced to 16 bits, then optimized for single-bit shifts.
- 16-Bit Data Structure Splitting: Arrays of 16-bit values were split into separate high-byte and low-byte arrays. Accessing value = buffer_high[i] << 8 | buffer_low[i] via indexed addressing (LDX index, LDA buffer_low,X, LDY buffer_high,X) proved drastically faster than 16-bit indexed math.
- Lookup Tables Reign Supreme: Operations like (val * factor) >> 4, performed 150,000 times per image, were prime targets. Precomputed 256-byte tables for shift-left-by-4 (_shiftl4) and shift-right-by-4 (_shiftr4) transformed a 34-cycle shift sequence into a 20-cycle lookup:
LDA _shiftr4,Y ; Low byte of the multiply (in Y), shifted right 4
ORA _shiftl4,X ; OR in the high byte (in X), shifted left 4
STA result     ; Low byte done
LDA _shiftr4,X ; High byte shifted right 4
STA result+1
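To see why the tables compose, here is a small C model of the technique (the table names mirror the assembly labels, but the C itself is illustrative, not the shipped code): a 16-bit right-shift-by-4 rebuilt from three byte lookups.

```c
#include <assert.h>
#include <stdint.h>

/* Each table maps one byte to that byte shifted by 4, so a 16-bit >>4
   needs no multi-byte shift loop, only per-byte lookups. */
static uint8_t shiftr4[256]; /* t[i] = i >> 4          (high nibble -> low) */
static uint8_t shiftl4[256]; /* t[i] = (i << 4) & 0xFF (low nibble -> high) */
static int tables_ready = 0;

static void init_shift_tables(void) {
    for (int i = 0; i < 256; i++) {
        shiftr4[i] = (uint8_t)(i >> 4);
        shiftl4[i] = (uint8_t)((i << 4) & 0xFF);
    }
    tables_ready = 1;
}

/* (value >> 4) assembled exactly as the assembly does it:
   result_lo = (lo >> 4) | (hi << 4), result_hi = hi >> 4. */
static uint16_t shift16_right4(uint16_t value) {
    if (!tables_ready) init_shift_tables();
    uint8_t lo = (uint8_t)(value & 0xFF);
    uint8_t hi = (uint8_t)(value >> 8);
    uint8_t out_lo = (uint8_t)(shiftr4[lo] | shiftl4[hi]);
    return (uint16_t)((shiftr4[hi] << 8) | out_lo);
}
```

On the 6502 the win comes from the lookups being single indexed loads; the 512 bytes of table memory are the price paid for the cycle savings.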
Micro-Optimizations: When Cycles Are Gold
Saving 3 cycles per operation adds up fast across millions of iterations:
- Branch Layout: The 6502 has no branch predictor, but taken and not-taken branches cost differently. Code was laid out so the unlikely path (e.g., incrementing a pointer's high byte every 256 bytes) incurred the branch-taken penalty (3 cycles), while the common path (no page cross) used the cheaper branch-not-taken (2 cycles). Reversing a common loop check saved ~33% of its overhead.
- Inlining Ruthlessly: Tiny functions (>>4, <<7) were inlined via macros, eliminating 12 cycles of JSR/RTS overhead per call. Clever bit-twiddling replaced shifts: val << 7 became a single LSR plus carry manipulation.
- Hardcoding Pointers: Indirect accesses (LDA (ptr),Y) to the next_line buffer (accessed ~64 times per column pair) were replaced with absolute indexed addressing (LDA next_line_low,Y). The page bytes of these instructions were patched twice per row, a small cost for avoiding millions of slow indirect accesses.
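The val << 7 trick works because an 8-bit left shift by 7 preserves only bit 0 of the input, moved to bit 7; that is exactly what LSR (bit 0 into carry) followed by rotating the carry into a zeroed accumulator produces. A minimal C sketch of the identity (the shl7 helper name is mine, not from the original code):

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit val << 7: everything except bit 0 shifts out of the byte,
   so the result is just bit 0 relocated to bit 7. */
static uint8_t shl7(uint8_t val) {
    uint8_t carry = val & 1;       /* LSR A: bit 0 falls into the carry */
    return (uint8_t)(carry << 7);  /* ROR into a zeroed A: carry -> bit 7 */
}
```

Seven ASL instructions (14 cycles) collapse into a two-instruction sequence, and the identity holds for every input byte.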
Letting Data Dictate Design
Profiling revealed data patterns enabling radical shortcuts:
- Special-Casing Common Values: Observing that the scaling factor was 255 for >70% of rows allowed replacing next_line[i] = (255 * next_line[i]) / 256 with the near-equivalent next_line[i] -= next_line[i] >> 8. This replaced a ~150-cycle 16x16 multiply/divide with a ~20-cycle subtract-with-borrow.
- Approximation via Lookup: The final pixel clamp (pixel = clamp(16bit_value / factor, 0, 255)), originally costing ~200 cycles per pixel, was revolutionized. A 512-byte lookup table (dyndiv) was built per row (or reused if factor hadn't changed), precomputing clamp(value / factor) for values $0080 to $FF80. Pixel storage collapsed to LDA dyndiv,X; STA buffer,Y (8 cycles). Precision loss? A maximum error of ±1, visually imperceptible.
- Static Tables for Common Factors: Profiling showed factor = 48 dominated 60% of rows. A permanent div48 table was added, eliminating most dynamic table rebuilds. Division time plummeted from 30 s to 300 ms per image.
- Huffman Decoder Specialization: Generic tree walking was replaced with hand-optimized, tree-specific decoders. Skipping unused data used a simpler decoder that only validated codes, not values. Jump tables replaced indirect calls.
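The factor = 255 identity can be checked directly: in exact arithmetic (255 * x) / 256 equals x - x/256, and with integer truncation the cheap form x - (x >> 8) can be high by at most 1. A C sketch of that bound (helper names are mine, not the decoder's):

```c
#include <assert.h>
#include <stdint.h>

/* The cheap form: one shift and one subtract, no 16x16 multiply. */
static uint16_t scale255_fast(uint16_t x) {
    return (uint16_t)(x - (x >> 8));
}

/* The original computation, done in 32 bits to avoid overflow. */
static uint16_t scale255_exact(uint16_t x) {
    return (uint16_t)(((uint32_t)255 * x) / 256);
}
```

The two agree whenever x is a multiple of 256 and differ by exactly 1 otherwise, an error well inside the decoder's overall ±1 pixel tolerance.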
The Cost of Velocity
These gains came at tangible costs: code became brittle, reliant on specific data layouts, and filled with "dangerous" techniques anathema to modern practice. Memory usage ballooned with lookup tables. Yet, for the target—a 1MHz Apple II decoding images within a human lifetime—the trade-off was non-negotiable. This optimization journey underscores a fundamental truth: extreme hardware demands extreme solutions, where elegance bows to raw efficiency, and the programmer's intimate understanding of both data and silicon unlocks performance compilers cannot fathom.
Source: Based on technical deep dive by Colin O'Flynn