Last modified on 13 July 2006, at 21:48

Programming:Fast Sprites

Here are some tips for sprite and screen handling.

1. The best screen width to use is 32 mode 1 characters (the value in CRTC register 1), which gives 64 bytes per pixel row. This is wide enough for most games, and its special property is that if your screen base address is on a 64 byte boundary, no pixel row will cross a 256 byte page boundary, so you can use INC reg8 to move to the next display byte rather than INC reg16. This only saves 1 microsecond, but in a tight loop every one counts.

2. Avoid using PUSH/POP inside your sprite loop. These are slow instructions. Similarly, avoid indexed addressing: (IX+n) and (IY+n) are slow. One case where it may be wise to use an index register is for a loop counter, so you can keep BC for something else.

3. If it doesn't need to be transparent, don't draw it that way. The quickest way to move the data in most cases is LDI. Draw solid sprites separately and, if possible, build a large sprite from smaller ones, some with transparent sections and some opaque.

4. Keep your sprite data from crossing page boundaries within a pixel row (or at all if possible). For the same reason as in 1 above, you may then be able to increment your data pointer using INC reg8.

5. Use mask tables to save memory. If you've got plenty of memory left, you can store the AND and OR masks for each sprite with the sprite data. This is the fastest approach for transparent sprites:

LD A,(DE)
AND (HL)
INC L
OR (HL)
INC L
LD (DE),A
INC E
 ;Total = 11 us per byte

But in most cases, you don't have that much memory to store all your graphics. Instead you can create a 256 byte AND mask table, and possibly also a 256 byte OR mask table, both indexed by the sprite byte. This is the quickest way to mask the data while saving some memory.

LD A,(BC)
INC C
LD L,A
LD A,(DE)
AND (HL)
INC H
OR (HL)
DEC H
LD (DE),A
INC E
 ;Total = 15 us per byte

The other advantage of using this method is that with a small change to the above code you can quite easily use different mask tables. In ZACK, I use a reversing table which has the two MODE 0 pixels reversed, essentially flipping the byte left to right. By then changing the INC C to a DEC C (and starting at a different offset in the sprite data) it's easy to flip a sprite, saving memory for sprites which can face both ways. Another table has the OR masks all set to ink 15 for every non-transparent pixel. This can then be masked again with a colour to create a solid colour version of a sprite.
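
The AND table itself can be built once at startup. As a rough, untested sketch only, assuming MODE 0 with pen 0 as the transparent colour and a table on a 256 byte boundary (the labels mkmask, leftsolid and rightsolid are made up for this example):

LD HL,andmasks ;table must start on a page boundary, so L = 0
.mkmask
LD A,L:AND #AA ;left pixel lives in bits 7,5,3,1
LD A,0 ;LD doesn't change the flags
JR NZ,leftsolid
LD A,#AA ;left pixel is pen 0: keep the background there
.leftsolid
LD C,A
LD A,L:AND #55 ;right pixel lives in bits 6,4,2,0
JR NZ,rightsolid
LD A,C:OR #55:LD C,A ;right pixel is pen 0: keep the background there
.rightsolid
LD (HL),C ;AND mask for sprite byte value L
INC L
JR NZ,mkmask ;all 256 entries done when L wraps to 0

With pen 0 transparent, the matching plain OR table entry is just the sprite byte itself, so only the special tables (reversed, solid colour) need different OR values.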

Out of registers?

As you can see with the above code, all the 8 bit registers are used. What's left for loop counters etc?

1. One method is to use the alternate register set (beware if the OS is still being used). The EXX instruction will swap BC, DE and HL with their alternates in 1 microsecond.

2. Another method (if you're not worried about clipping and all your sprites are the same size) is to unroll the loop n times where n is the width in bytes of the sprite, eg. for 4 bytes wide:

.byte0
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
.byte1
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
.byte2
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
.byte3
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E

This is by far the fastest method if you make sure your sprites never need to clip. If they do need to clip, you can use some tricky code to jump part way into this sequence, eg. store the entry address (byte0, byte1, byte2 or byte3) in IX and do PUSH IX:RET to continue the loop. Non-indexed operations on IX, ie. those not using (IX+n), are only 1 microsecond slower than the same operations on HL.

3. Use the undocumented 8 bit operations on the high and low halves of IX and IY (HX, LX, HY and LY). You could use HX for your loop counter, for example:

.dobyte
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
DB #DD:DEC H ;DEC HX
JR NZ,dobyte

The same of course applies to the unrolled version, where you could use HX for the row counter.
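
Method 1 might look something like this, assuming interrupts are disabled (or the firmware is otherwise kept away from the alternate registers) and B' was loaded with the width beforehand. This is only an untested sketch:

.dobyte
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
EXX:DEC B:EXX ;count in B', EXX leaves the flags alone
JR NZ,dobyte

That costs 3 microseconds per byte for the counter against 2 for DEC HX, but it stays within documented instructions.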

Incrementing the row

How do we start drawing the next screen line? Since we've already used all the registers, this can be quite tricky. Each scan line is offset by #800 bytes, except that every 8 scan lines the offset is actually (CRTC Register 1 * 2) - #3800. Also, we've been incrementing the screen address each byte by 1, so now our offset is also out by 4 bytes if the sprite is 4 bytes wide. Clipping the sprite makes this even more of a challenge, but I won't get into that :)

With the unrolled version above, and a sprite only 4 bytes wide, the easiest method would be to change the last INC E to three DEC E's, ie.

.byte3
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:DEC E:DEC E:DEC E

The three DEC E's undo the three INC E's from the earlier bytes, so E is back in its original state and all that remains is to move down a line. If your sprite is more than 4 bytes wide, or you're not using unrolled loops, you'll need to either subtract the width from E, or make sure you add the appropriate amount (eg. #7FC rather than #800), but this involves using 16 bit addition which is slower, and we're short of registers as usual.

The technique I normally use is to either use some self modifying code, eg.

LD A,E
.sprwid equ $+1 ;address of the SUB operand, patch the width in here
SUB 4
LD E,A

Or to store the width in one of those undocumented 8 bit registers (HX, HY, LX or LY). eg.

LD A,E
DB #DD:SUB L ;SUB LX
LD E,A

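Calculating the next screen address is then usually relatively simple using 8 bit registers. Adding 8 to D moves down one scan line (#800 bytes). If that carries, we've passed the last line of the character row, and adding a further #C040 makes the total #C840, which modulo #10000 is 64 - #3800. This assumes 64 bytes per row (CRTC register 1 = 32) and a screen at #C000:

LD A,D
ADD 8 ;next scan line, #800 further on
LD D,A
JR NC,nowrap ;no carry, still in the same character row
LD A,#40
ADD E ;add #C040 to DE...
LD E,A
LD A,#C0
ADC D ;...including the carry out of E
LD D,A
.nowrap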

There are of course numerous other ways of doing this, maybe even faster ones. One example is to have the stack pointing to a table of screen addresses for each row, and hold an offset in a register (with interrupts disabled, or interrupt safe timing), eg.

POP DE
LD A,I
ADD E
LD E,A
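
Setting this up might look roughly like the following. The names rowtable, xoffset and oldsp are invented for the example, it is untested, and it assumes interrupts can stay off for the whole draw:

DI
LD (oldsp+1),SP ;patch the LD SP below so the stack can be restored
LD A,xoffset:LD I,A ;byte offset of the sprite within the row
LD SP,rowtable ;SP now walks a table of row start addresses
;draw the rows here, using POP DE:LD A,I:ADD E:LD E,A for each one
.oldsp
LD SP,0 ;operand overwritten with the saved SP
EI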

The only thing left to do now is to continue the loop for each row. Depending on where you stored your loop counter(s), this could be as simple as:

DB #FD:DEC H ;DEC HY
JR NZ,dobyte

A complete sprite routine

Here's a pretty fast transparent sprite drawing routine in full. Note that it assumes the screen width is 32 mode 1 characters and the screen base is #C000 with CRTC register 9 set to 7, and that it does no clipping. I've also changed the loop so that it exits with RET Z before calculating the next address if it's the last row. This also allowed me to use jr nc,rowloop to continue the loop if the address doesn't wrap. Please also note that I haven't actually tried to assemble this code yet :) it may need a couple of tweaks to run!

;Entry: BC = Sprite address, D = width, E = height, HL = screen address

db #dd:ld l,d  ;LX = width
db #fd:ld h,e  ;HY = height
ex de,hl
ld h,andmasks / 256
.rowloop
db #dd:ld h,l  ;HX <= width
.dobyte
ld a,(bc):inc c
ld l,a:ld a,(de)
and (hl):inc h:or (hl):dec h
ld (de),a:inc e
db #dd:dec h
jr nz,dobyte
db #fd:dec h
ret z
ld a,e
db #dd:sub l
ld e,a
ld a,d
add 8
ld d,a
jr nc,rowloop
ld a,e
add #40
ld e,a
ld a,d
adc #c0
ld d,a
jr rowloop

Further Optimisation

This code could be further improved by unrolling the loop, but that would involve using some self modifying code to patch the rowloop jumps. If you have a number of predetermined widths for your sprites, you could rewrite the routine unrolled for each available size. Note that this could make it difficult later if you want to clip the sprites.

If you don't need to either paint the sprite in a solid colour, or flip the sprite left to right (or you've got enough memory to store a flipped version), then you can optimise the routine to plot a single byte even further:

LD A,(BC)
INC C
LD L,A
LD A,(DE)
AND (HL)
OR L
LD (DE),A
INC E
 ;Total = 12 us per byte

As you can see, this is now down to 12 microseconds per byte, only 1 microsecond slower than storing the mask with the data, but the sprite data is half the size, so you can store twice as much graphics data in your left-over 2K of memory once you've got the double-buffered screens, music and sound fx code and game logic in place.

Clipping

Depending on the routine you use, clipping can be quite difficult or quite simple.

Clipping vertically is simply a matter of adjusting (a) the start offset in the sprite and (b) the number of rows (passed in E in the example code).

Clipping horizontally presents more of a problem. Simply adjusting the start offset and the width (passed in D in the example) won't do the job, because the INC C won't happen enough times to move the sprite offset on to the next row. There are two ways around this. Either store the value of C in a spare register at the start of the loop (I think there's 2 left in the example: LY and I), then at the end of the horizontal loop add the width to that value (remember the width passed in is not the width of the data!), or precalculate the difference between the actual sprite width and the displayed sprite width, and add this value to C. The second is probably the preferred method.

At the start of the code, pass in the clipped width in E, and the sprite width in A for example:

SUB E:DB #FD:LD L,A ;LY = sprite width - displayed width

Then at the end of the loop (just after RET Z):

LD A,C:DB #FD:ADD L:LD C,A

One last thought

Earlier on in this document, I mentioned not using index registers because they are slow. There could, however, be some merit in using them, especially in unrolled loops, to replace the BC register above. Using the index register may remove the need for the INC C and the LD L,A above, for example:

LD L,(IY + 0)
LD A,(DE)
AND (HL)
OR L
LD (DE),A
INC E
 ;Total = 13 us per byte

This code is only 1 microsecond slower than the previous version, but unrolled (using (IY + 1), (IY + 2) etc) it won't destroy the value in IY, and it also leaves BC free for loop counters or perhaps extra masks. Not destroying the value in IY means you can simply add the width in bytes to LY, even when clipping, to ensure the sprite data pointer lands on the next valid byte. Remember, though, that adding a value to LY will take at least 5 microseconds, plus one extra microsecond per byte in the loop.
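
Unrolled for a 2 byte wide sprite, and only as an untested sketch, that might look like the following. It assumes H already holds the high byte of the AND mask table, and stepping LY through A is just one option:

LD L,(IY+0):LD A,(DE)
AND (HL):OR L
LD (DE),A:INC E
LD L,(IY+1):LD A,(DE)
AND (HL):OR L
LD (DE),A:INC E
DB #FD:LD A,L ;A = LY
ADD 2 ;width of the sprite data in bytes, not the clipped width
DB #FD:LD L,A ;LY = LY + 2, assumes the sprite data never crosses a page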

Executioner 19:55, 12 July 2006 (CDT)