Programming:Fast Sprites
Fast Sprites
Here are some tips for sprite/screen handling.
1. The best screen width to use is 32 (mode 1 characters in CRTC register 1), which gives 64 bytes per pixel row. This is wide enough for most games, and its special property is that if your screen base address is on a 64 byte boundary, no pixel row will cross a 256 byte page boundary. Hence you can use INC reg8 to move to the next display byte rather than INC reg16. This only saves 1 microsecond, but in a tight loop, every one counts.
2. Avoid using PUSH/POP inside your sprite loop. These are slow instructions. Similarly, avoid indexed addressing: (IX+n) and (IY+n) are slow. One case where it may be wise to use an index register is for a loop counter, so you can keep BC for something else.
3. If it doesn't need to be transparent, don't draw it that way. The quickest way to move the data in most cases is LDI. Draw solid sprites separately and, if possible, build a large sprite from smaller ones, some with transparent sections and some opaque.
4. Keep your sprite data from crossing page boundaries on pixel rows (or at all if possible). For the same reason as 1 above, you may be able to increment your data pointer using INC reg8.
5. Use mask tables to save memory. If you've got plenty of memory left, you can put your AND and OR masks for each sprite with the sprite data. This is the fastest approach for transparent sprites:
LD A,(DE)
AND (HL)
INC L
OR (HL)
LD (DE),A
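But in most cases, you don't have that much memory to store all your graphics. Instead, you can create a 256 byte AND mask table, and possibly also a 256 byte OR mask table, both indexed by the sprite byte. This is the quickest way to mask the data while saving some memory. In the code below, each table starts on a 256 byte page boundary, with the OR table in the page directly after the AND table, so INC H and DEC H switch between them: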
LD A,(BC)
INC C
LD L,A
LD A,(DE)
AND (HL)
INC H
OR (HL)
DEC H
LD (DE),A
The other advantage of using this method is that with a small change to the above code you can quite easily use different mask tables. In ZACK, I use a reversing table which has the two MODE 0 pixels reversed, essentially flipping the byte left to right. By then changing the INC C to a DEC C (and starting at a different offset in the sprite data) it's easy to flip a sprite, saving memory for sprites which can face both ways. Another table has the OR masks all set to ink 15 for every non-transparent pixel. This can then be masked again with a colour to create a solid colour version of a sprite.
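For illustration, tables like the ones described above could be generated offline with a short script and pasted into the source as DB lines. This is only a sketch in Python, not the code used in ZACK: it assumes MODE 0 and that ink 0 is the transparent colour, and the names (and_table, or_table, flip_or_table, flip_and_table, solid_table, as_db_lines) are made up for the example.

```python
# Sketch of MODE 0 mask table generation (assumption: ink 0 = transparent).
# In MODE 0 each byte holds two pixels: the left pixel uses bits 7,5,3,1
# (mask #AA) and the right pixel uses bits 6,4,2,0 (mask #55).
LEFT, RIGHT = 0xAA, 0x55


def and_mask(b):
    """Keep the background (mask bits set) wherever a sprite pixel is ink 0."""
    m = 0
    if b & LEFT == 0:
        m |= LEFT
    if b & RIGHT == 0:
        m |= RIGHT
    return m


def flip(b):
    """Swap the left and right pixels of a MODE 0 byte."""
    return ((b & LEFT) >> 1) | ((b & RIGHT) << 1)


and_table = [and_mask(b) for b in range(256)]
or_table = list(range(256))                    # plain OR table: the sprite byte itself
flip_or_table = [flip(b) for b in range(256)]  # OR table for a left/right flipped sprite
flip_and_table = [and_mask(flip(b)) for b in range(256)]
# Every non-transparent pixel forced to ink 15 (all four of its bits set).
solid_table = [and_mask(b) ^ 0xFF for b in range(256)]


def as_db_lines(table):
    """Format a 256 byte table as assembler DB lines, 16 bytes per line."""
    return ["db " + ",".join("#%02x" % v for v in table[i:i + 16])
            for i in range(0, 256, 16)]
```

Whichever pair of tables is in use, the masking code above needs the AND table to start on a page boundary and the matching OR table to sit in the very next page.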
Out of registers?
As you can see with the above code, all the 8 bit registers are used. What's left for loop counters etc?
1. One method is to use the alternate register set (beware if the OS is still being used). The EXX instruction will swap BC,DE and HL in one cycle.
2. Another method (if you're not worried about clipping and all your sprites are the same size) is to unroll the loop n times where n is the width in bytes of the sprite, eg. for 4 bytes wide:
.byte0
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
.byte1
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
.byte2
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
.byte3
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
This is by far the fastest method if you make sure your sprites never need to clip. If they do need to clip, you can use some tricky self modifying code to enter this code part way through, eg. store the entry address (byte0, byte1, byte2 or byte3) in IX and do PUSH IX:RET to continue the loop. Non-indexed operations on IX, ie. those that don't use (IX+n), are only one cycle slower than the same operations on HL.
3. Use the undocumented 8 bit operations on the high and low halves of IX and IY (HX, LX, HY and LY). You could use HX for your loop counter, for example:
.dobyte
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:INC E
DB #DD:DEC H
JR NZ,dobyte
The same of course applies to the unrolled version, where you could use HX for the row counter.
Incrementing the row
How do we start drawing the next screen line? Since we've already used all the registers, this can be quite tricky. Each scan line is offset by #800 bytes, except that every 8 scan lines the offset is actually (CRTC Register 1 * 2) - #3800. Also, we've been incrementing the screen address each byte by 1, so now our offset is also out by 4 bytes if the sprite is 4 bytes wide. Clipping the sprite makes this even more of a challenge, but I won't get into that :)
With the unrolled version above, and a sprite only 4 bytes wide, the easiest method would be to change the last INC E to three DEC E's, ie.
.byte3
LD A,(BC):INC C
LD L,A:LD A,(DE)
AND (HL):INC H:OR (HL):DEC H
LD (DE),A:DEC E:DEC E:DEC E
That puts E back to its original value (three INC E's followed by three DEC E's), so all that's left is to move down a line. If your sprite is more than 4 bytes wide, or you're not using unrolled loops, you'll need to either subtract the width from E, or make sure you add the appropriate amount (eg. #7FC rather than #800), but this involves using 16 bit addition which is slower, and we're short of registers as usual.
The technique I normally use is to either use some self modifying code, eg.
LD A,E
.sprwid equ $+1
SUB 4
LD E,A
Or to store the width in one of those undocumented 8 bit registers (HX, HY, LX or LY). eg.
LD A,E
DB #DD:SUB L
LD E,A
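Calculating the next screen address is usually relatively simple then, using 8 bit registers (this version assumes the screen base is #C000 and the 64 byte wide screen from tip 1):
LD A,D
ADD 8 ;next scan line is #800 bytes further on
LD D,A
JR NC,nowrap ;no carry, so still inside the same character row
LD A,#40 ;otherwise add the row width (64 bytes) to E
ADD E
LD E,A
LD A,#C0 ;and bring D back into the screen at #C000, plus any carry from E
ADC D
LD D,A
.nowrap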
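As a sanity check, the 8 bit arithmetic above can be compared against the screen layout described earlier (#800 bytes per scan line, 8 scan lines per character row). The following is a rough Python model rather than anything that runs on the CPC. It assumes the screen base is #C000 and 64 bytes per row, and the function names are invented for the example.

```python
SCREEN_BASE = 0xC000  # assumption: screen at #C000
ROW_BYTES = 64        # assumption: CRTC register 1 = 32, ie. 64 bytes per row


def screen_addr(y, x):
    """Reference address of byte x on scan line y (8 scan lines per character row)."""
    return SCREEN_BASE + (y % 8) * 0x800 + (y // 8) * ROW_BYTES + x


def next_line(addr):
    """Mirror the 8 bit routine: add 8 to D, then fix up D and E on carry."""
    d, e = addr >> 8, addr & 0xFF
    d += 8
    if d > 0xFF:                       # carry from ADD 8: left the character row
        d &= 0xFF
        e += ROW_BYTES                 # add #40 to E
        carry = e >> 8
        e &= 0xFF
        d = (d + 0xC0 + carry) & 0xFF  # add #C0 plus carry to D
    return (d << 8) | e


def check(lines=200):
    """True if next_line agrees with screen_addr for every byte of every line."""
    return all(next_line(screen_addr(y, x)) == screen_addr(y + 1, x)
               for y in range(lines - 1) for x in range(ROW_BYTES))
```

The interesting cases are the last scan line of each character row, eg. #F800 should step to #C040.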
There are of course numerous other ways of doing this, maybe even faster ones. One example is to have the stack pointing to a table of screen addresses for each row, and hold an offset in a register (with interrupts disabled, or interrupt safe timing), eg.
POP DE
LD A,I
ADD E
LD E,A
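The table the stack points at in that example would hold the address of the left edge of each scan line. One way to produce it is to generate it offline and paste it in as DW lines. This is a sketch in Python under the same assumptions as before (screen at #C000, 64 bytes per row, 200 lines), with made-up function names.

```python
def row_table(lines=200, base=0xC000, row_bytes=64):
    """Address of the first byte of each scan line, top to bottom."""
    return [base + (y % 8) * 0x800 + (y // 8) * row_bytes for y in range(lines)]


def as_dw_lines(table, per_line=8):
    """Format the table as assembler DW lines."""
    return ["dw " + ",".join("#%04x" % a for a in table[i:i + per_line])
            for i in range(0, len(table), per_line)]


def offset_never_carries(table, row_bytes=64):
    """True if adding any x offset below row_bytes to the low byte stays in 8 bits."""
    return all((a & 0xFF) + (row_bytes - 1) <= 0xFF for a in table)
```

Because every entry sits on a 64 byte boundary, adding an x offset of less than 64 to E can never carry into D, which is why an 8 bit ADD is enough after the POP.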
The only thing left to do now is continue the loop for each row. Depending on where you stored your loop counter(s), this could be as simple as:
DB #FD:DEC H ;decrement the row counter in HY
JR NZ,rowloop ;rowloop = wherever the byte counter gets reloaded before dobyte
A complete sprite routine
Here's a pretty fast transparent sprite drawing routine in full. Note that it assumes the screen width is 32 mode 1 characters, and the screen base is #c000 with CRTC register 9 set to 7, and has no clipping. It also assumes the AND mask table starts on a page boundary with the OR table in the next page, and that the sprite data doesn't cross a 256 byte page boundary (only C is incremented). Please note, I've also changed the loop so that it exits with RET Z before calculating the next address if it's the last row. This also allowed me to use jr nc,rowloop to continue the loop if the address doesn't wrap. Please also note that I haven't actually tried to assemble this code yet :) it may need a couple of tweaks to run!
;Entry: BC = Sprite address, D = width, E = height, HL = screen address
db #dd:ld l,d ;LX = width
db #fd:ld h,e ;HY = height
ex de,hl
ld h,andmasks / 256
.rowloop
db #dd:ld h,l ;HX <= width
.dobyte
ld a,(bc):inc c
ld l,a:ld a,(de)
and (hl):inc h:or (hl):dec h
ld (de),a:inc e
db #dd:dec h
jr nz,dobyte
db #fd:dec h
ret z
ld a,e
db #dd:sub l
ld e,a
ld a,d
add 8
ld d,a
jr nc,rowloop
ld a,e
add #40
ld e,a
ld a,d
adc #c0
ld d,a
jr rowloop
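Since the routine above is untested, one way to sanity-check the register arithmetic without an assembler is to model it step by step. The following Python sketch mirrors the routine as I read it and is not a Z80 emulator. It assumes width and height are both at least 1, and the function name and the mem and and_page parameters are made up for the example.

```python
def draw_sprite(mem, sprite, width, height, screen, and_page):
    """Model of the routine above. mem is a 64K bytearray.

    sprite = BC, width = D, height = E, screen = HL on entry.
    and_page = high byte of the AND table (the OR table is in the next page).
    """
    b, c = sprite >> 8, sprite & 0xFF
    d, e = screen >> 8, screen & 0xFF
    for row in range(height):
        for _ in range(width):
            data = mem[(b << 8) | c]       # ld a,(bc)
            c = (c + 1) & 0xFF             # inc c (8 bit, never touches B)
            addr = (d << 8) | e
            masked = mem[addr] & mem[(and_page << 8) | data]         # and (hl)
            mem[addr] = masked | mem[((and_page + 1) << 8) | data]   # inc h:or (hl)
            e = (e + 1) & 0xFF             # inc e
        if row == height - 1:
            return                         # ret z on the last row
        e = (e - width) & 0xFF             # back to the left edge of the sprite
        d += 8                             # next scan line
        if d > 0xFF:                       # carry: move to the next character row
            d &= 0xFF
            e += 0x40
            d = (d + 0xC0 + (e >> 8)) & 0xFF
            e &= 0xFF
```

Drawing a small sprite that starts on the last scan line of a character row (eg. at #F800) is a quick way to exercise the wrap case.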
Further Optimisation
This code could be further improved by unrolling the loop, but that would involve using some self modifying code to patch the rowloop jumps. If you have a number of predetermined widths for your sprites, you could rewrite the routine unrolled for each available size. Note that this could make it difficult later if you want to clip the sprites.
If you don't need to either paint the sprite in a solid colour, or flip the sprite left to right (or you've got enough memory to store a flipped version), then you can optimise the routine to plot a single byte even further:
ld a,(bc):inc c
ld l,a:ld a,(de)
and (hl):or l
ld (de),a:inc e
Executioner 19:55, 12 July 2006 (CDT)