Changes

Programming:Integer Multiplication

6,379 bytes added, 6 March
/* Fast 8bit * 8bit Unsigned with only 512 bytes of tables */ added improved version by calc84maniac
</pre>
== FasterFastest, accurate 8bit * 8bit Unsigned with 16 KB tables == '''Input:''' L = Multiplier, C = Multiplicand '''Output:''' DE = Product '''CPC Cycles:''' 80 = 20 usec '''Size:''' 14 bytes of code + 16 KB tables
I'm currently working on a new routine, based on a routine by Kirk Meyer [http://www.ticalc.org/pub/86/asm/source/routines/vfmult.asm] which uses nibble multiplication tables.
ld d,a ; 20
</pre>
 
== Fast, accurate 8bit * 8bit Unsigned with 8 KB tables ==
 
'''Input:''' L = Multiplier, C = Multiplicand
 
'''Output:''' DE = Product
 
'''CPC Cycles:''' 100 = 25 usec
 
'''Size:''' 19 bytes of code + 8 KB tables
16K is a lot of memory to use for tables, but I'm working on a way to reduce this to 8K while maintaining similar performance (around 26&mu;s). The idea is rather than using tables for the low and high nibbles, use tables for alternate bits. So there would be 16 tables for the values #00, #01, #04, #05, #10, #11, #14, #15, #40, #41, #44, #45, #50, #51, #54, #55. The values can be shifted by 1 (rather than 4) to give values for the alternate bits using a simple ADD HL,HL.
== Very fast 8bit * 8bit Unsigned with only 1K of tables ==
'''Input:''': B A = ''Multiplier'', C L = Multiplicant''Multiplicand''
'''Output:''': DE = ''Product''
'''ClocksCPC Cycles''': 104-112 (108 on average)= 26-28 (27) usec
'''Size''': 24 bytes of code and 1024 bytes for the tables
Now for the actual multiply routine:
'''Input:''' A = ''Multiplier'', L = ''Multiplicand''
 
'''Output:''' DE = ''Product''
<pre>
ld h,umul_tab_lo / #100 ; 2
[[User:Executioner|Executioner]] 06:25, 4 April 2008 (CEST)
== Fast 8bit * 8bit Unsigned with only 512 bytes of tables == '''Input''': A = Multiplier, B = Multiplicant '''Output''': A:E = Product '''CPC Cycles''': 136-172 (154 on average) = 34-43 (38.5) usec '''Size''': 44 bytes of code and 512 bytes for the tables Fast 8 bit unsigned multiplication with 16 bit result. It uses formula<br>x*y = ((x+y)/2)<sup>2</sup> - ((x-y)/2)<sup>2</sup>, if x+y is even<br> = ((x+y-1)/2)<sup>2</sup> - ((x-y-1)/2)<sup>2</sup> + y, if x+y is odd and x>=y <pre> cp b jr nc,l1  ld e,a ld a,b ld b,e l1 ld c,a sub b rra ld d,a ld a,c add a,b rra ld l,a ld h,high(sqrlo) ld a,(hl) ld e,l ld l,d jr nc,l2  sub (hl) ;odd ld l,e ld e,a inc h ;loads high(sqrhi) ld a,(hl) ld l,d sbc a,(hl) ld d,a  ld a,e add a,b ld e,a ld a,d adc a,0 ret l2 sub (hl) ;even ld l,e ld e,a inc h ld a,(hl) ld l,d sbc a,(hl) ret sqrlo ;low(x*x) should be at the page border db 0,1,4,9,$10,$19,$24,$31,$40,$51,$64,$79,$90,$a9,$c4,$e1 db 0,$21,$44,$69,$90,$b9,$e4,$11,$40,$71,$a4,$d9,$10,$49,$84,$c1 db 0,$41,$84,$c9,$10,$59,$a4,$f1,$40,$91,$e4,$39,$90,$e9,$44,$a1 db 0,$61,$c4,$29,$90,$f9,$64,$d1,$40,$b1,$24,$99,$10,$89,4,$81 db 0,$81,4,$89,$10,$99,$24,$b1,$40,$d1,$64,$f9,$90,$29,$c4,$61 db 0,$a1,$44,$e9,$90,$39,$e4,$91,$40,$f1,$a4,$59,$10,$c9,$84,$41 db 0,$c1,$84,$49,$10,$d9,$a4,$71,$40,$11,$e4,$b9,$90,$69,$44,$21 db 0,$e1,$c4,$a9,$90,$79,$64,$51,$40,$31,$24,$19,$10,9,4,$1 db 0,1,4,9,$10,$19,$24,$31,$40,$51,$64,$79,$90,$a9,$c4,$e1 db 0,$21,$44,$69,$90,$b9,$e4,$11,$40,$71,$a4,$d9,$10,$49,$84,$c1 db 0,$41,$84,$c9,$10,$59,$a4,$f1,$40,$91,$e4,$39,$90,$e9,$44,$a1 db 0,$61,$c4,$29,$90,$f9,$64,$d1,$40,$b1,$24,$99,$10,$89,4,$81 db 0,$81,4,$89,$10,$99,$24,$b1,$40,$d1,$64,$f9,$90,$29,$c4,$61 db 0,$a1,$44,$e9,$90,$39,$e4,$91,$40,$f1,$a4,$59,$10,$c9,$84,$41 db 0,$c1,$84,$49,$10,$d9,$a4,$71,$40,$11,$e4,$b9,$90,$69,$44,$21 db 0,$e1,$c4,$a9,$90,$79,$64,$51,$40,$31,$24,$19,$10,9,4,$1sqrhi ;high(x*x) db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 db 1,1,1,1,1,1,1,2,2,2,2,2,3,3,3,3 db 4,4,4,4,5,5,5,5,6,6,6,7,7,7,8,8 db 9,9,9,$a,$a,$a,$b,$b,$c,$c,$d,$d,$e,$e,$f,$f db $10,$10,$11,$11,$12,$12,$13,$13,$14,$14,$15,$15,$16,$17,$17,$18 db $19,$19,$1a,$1a,$1b,$1c,$1c,$1d,$1e,$1e,$1f,$20,$21,$21,$22,$23 db $24,$24,$25,$26,$27,$27,$28,$29,$2a,$2b,$2b,$2c,$2d,$2e,$2f,$30 db $31,$31,$32,$33,$34,$35,$36,$37,$38,$39,$3a,$3b,$3c,$3d,$3e,$3f db $40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$4a,$4b,$4c,$4d,$4e,$4f db $51,$52,$53,$54,$55,$56,$57,$59,$5a,$5b,$5c,$5d,$5f,$60,$61,$62 db $64,$65,$66,$67,$69,$6a,$6b,$6c,$6e,$6f,$70,$72,$73,$74,$76,$77 db $79,$7a,$7b,$7d,$7e,$7f,$81,$82,$84,$85,$87,$88,$8a,$8b,$8d,$8e db $90,$91,$93,$94,$96,$97,$99,$9a,$9c,$9d,$9f,$a0,$a2,$a4,$a5,$a7 db $a9,$aa,$ac,$ad,$af,$b1,$b2,$b4,$b6,$b7,$b9,$bb,$bd,$be,$c0,$c2 db $c4,$c5,$c7,$c9,$cb,$cc,$ce,$d0,$d2,$d4,$d5,$d7,$d9,$db,$dd,$df db $e1,$e2,$e4,$e6,$e8,$ea,$ec,$ee,$f0,$f2,$f4,$f6,$f8,$fa,$fc,$fe</pre> [[User:Litwr|Litwr]] 10:25, 24 October 2015 (CEST)   == Fast 8bit * 8bit Unsigned with only 512 bytes of tables (improved version) == '''Input''': A = Multiplier, L = Multiplicant '''Output:''' DE = ''Product'' '''CPC Cycles''': 124 = 31 usec '''Size''': 27 bytes of code and 512 bytes for the tables '''by calc84maniac''' Hey, so someone brought up the wiki's 512-byte table routine in the context of Game Boy development, and I thought I'd take a crack at seeing if the speed and size could be made closer to the 1KB table routine. I actually had a good bit of success, by making the following observations:<br>* The x >= y constraint isn't necessary if negative values for floor((x - y) / 2) have their square calculated properly, so no need to swap inputs.<br>* Going from floor((x + y) / 2) to floor((x - y) / 2) simply requires a subtraction of y, so no need to divide by 2 twice or save the input value of x.<br>* To square a negative number z, it's sufficient to calculate z ^ 2 = (z + 256) ^ 2 - z * 512 - 65536, meaning a conditional offset of -(z * 512) in the subtracted square, or z * 512 in the final result.<br>* When conditionally adding y to the result when x + y is odd, the carry can be propagated into the low bit of the 8-bit register holding the upper half of z * 512.<br>* Both of these conditional adds can be turned into branchless adds of either the value or zero.<br><br>This results in the following implementation (which I adjusted to take the same inputs and outputs as the 1KB routine for a fair comparison), which comes out to 27 bytes and a constant 31 usec:<br> <pre>; Input: A = x, L = y; Output: DE = x * y ; C = floor((x + y) / 2) add a,l ; 1 rra ; 2 ld c,a ; 3 ; E = y if x + y is odd, else 0 sbc a,a ; 4 and l ; 5 ld e,a ; 6 ; L = floor((x - y) / 2) ld a,c ; 7 sub l ; 8 ld l,a ; 9 ; D = L if negative, else 0 sbc a,a ; 10 and l ; 11 ld d,a ; 12 ; Calculate low half of product ld h,high(sqrlo) ; 14 ld b,h ; 15 ld a,(bc) ; 17 add a,e ; 18 rl d ; 20, multiplies D by 2 and propagates carry sub (hl) ; 22 ld e,a ; 23 ; Calculate high half of product inc h ; 24, loads high(sqrhi) ld b,h ; 25 ld a,(bc) ; 27 sbc a,(hl) ; 29 add a,d ; 30 ld d,a ; 31</pre> == 16bit * 16bit Unsigned -> 32 bits result == '''Input:''' BC = ''Multiplier'', DE = ''Multiplicand'' '''Output:''' DE,HL = ''Product''<pre>ld hl,0ld a,16muluw add hl,hl rl e rl d jr nc,muluw_cont add hl,bc jr nc,muluw_cont inc demuluw_cont dec a jr nz,muluw</pre> == 16bit * 16bit Unsigned -> 24 bits result ==
'''Input:''' BC = ''Multiplier'', DE = ''Multiplicand''