“Whats wrong with this code?”
for ( int i=0; i<10; i++ ) { //do stuff }
As it turns out, I will soon start working as an embedded software developer. It seems then, now would be a good time to start learning how to program properly. I’ll be putting up some tidbits here for my own reference and hopefully others to learn something as well!
Several months ago, during a (the) job interview, I was told about reverse loops. Supposedly, computer scientists will scoff at this, but since I never learnt about it until interviewing for my first job after five years of electrical engineering studies and even more time doing hobby-level coding, I thought I might just put one more mention of it on the internetz.
So, if you find yourself writing code for a system with heavy resource constraints, it would seem the school-book example of a for-loop above has some room for improvement. Assuming the loop body does not depend on i incrementing upwards, it would be better to do this:
for ( int i=10; i>=0; i-- ) { //do stuff }
Why? It all depends on what instructions the code compiles down to. When evaluating whether or not to stay in the loop, the processor has to first put the number 10 into a register, then do a comparison to see which one is larger. The comparison in its turn is usually implemented by subtracting and checking the sign of the result. Since most architectures have a designated “zero register”, the reverse loop may skip loading the value to a register and go straight to comparing. Before we move any further however, there is one more thing we can do:
for ( int i=10; i--; ) { //do stuff }
Since the loop criterion has to be fulfilled in order to stay in the loop, we are looking at the zero register (Z) to find out when to break it. This means we might as well use the decrementing itself as the loop criterion, saving us the whole compare instruction altogether!
Sweet! I was nearly going to leave it at that but of course you should not have to take my word for it. Lets try to verify this on an MSP430 using msp430-gcc. First, lets implement the three loop variants:
#include <msp430.h> //Normal loop void loop1() { for ( int i=0; i<10; i++ ) { __delay_cycles( 1 ); } } //Slightly optimized void loop2() { for ( int i=10; i>=0; i-- ) { __delay_cycles( 1 ); } } //Super optimized! void loop3() { for ( int i=10; i--; ) { __delay_cycles( 1 ); } } void main() { loop1(); loop2(); loop3(); }
Next, I compiled the code and fed it through msp430-objdump to get the assembly output. Note that compiler optimizations were set to the lowest level with the -O0 flag and variable declaration inside the for-statement was made possible with -std=c99 (let’s not get into that today):
$ msp430-gcc -O0 -std=c99 main.c -mmcu=msp430g2553 -o main.elf $ msp430-objdump -DS main.elf > main.lst
Now this is where it gets interesting. Watch what happens in the relevant parts of the output:
0000c058 <loop1>: c058: 04 12 push r4 c05a: 04 41 mov r1, r4 c05c: 24 53 incd r4 c05e: 21 83 decd r1 c060: 84 43 fc ff mov #0, -4(r4) ;r3 As==00, 0xfffc(r4) c064: 03 3c jmp $+8 ;abs 0xc06c c066: 03 43 nop c068: 94 53 fc ff inc -4(r4) ;0xfffc(r4) c06c: b4 90 0a 00 cmp #10, -4(r4) ;#0x000a, 0xfffc(r4) c070: fc ff c072: f9 3b jl $-12 ;abs 0xc066 c074: 21 53 incd r1 c076: 34 41 pop r4 c078: 30 41 ret 0000c07a <loop2>: c07a: 04 12 push r4 c07c: 04 41 mov r1, r4 c07e: 24 53 incd r4 c080: 21 83 decd r1 c082: b4 40 0a 00 mov #10, -4(r4) ;#0x000a, 0xfffc(r4) c086: fc ff c088: 03 3c jmp $+8 ;abs 0xc090 c08a: 03 43 nop c08c: b4 53 fc ff add #-1, -4(r4) ;r3 As==11, 0xfffc(r4) c090: 84 93 fc ff tst -4(r4) ;0xfffc(r4) c094: fa 37 jge $-10 ;abs 0xc08a c096: 21 53 incd r1 c098: 34 41 pop r4 c09a: 30 41 ret 0000c09c <loop3>: c09c: 04 12 push r4 c09e: 04 41 mov r1, r4 c0a0: 24 53 incd r4 c0a2: 21 83 decd r1 c0a4: b4 40 0a 00 mov #10, -4(r4) ;#0x000a, 0xfffc(r4) c0a8: fc ff c0aa: 01 3c jmp $+4 ;abs 0xc0ae c0ac: 03 43 nop c0ae: 5f 43 mov.b #1, r15 ;r3 As==01 c0b0: 84 93 fc ff tst -4(r4) ;0xfffc(r4) c0b4: 01 20 jnz $+4 ;abs 0xc0b8 c0b6: 4f 43 clr.b r15 c0b8: b4 53 fc ff add #-1, -4(r4) ;r3 As==11, 0xfffc(r4) c0bc: 4f 93 tst.b r15 c0be: f6 23 jnz $-18 ;abs 0xc0ac c0c0: 21 53 incd r1 c0c2: 34 41 pop r4 c0c4: 30 41 ret
See that? The most optimised loop, loop3(), has the most instructions! How can that be!? Lets try again with -Os for size optimization:
0000c054 <loop1>: c054: 3f 40 0a 00 mov #10, r15 ;#0x000a c058: 03 43 nop c05a: 3f 53 add #-1, r15 ;r3 As==11 c05c: fd 23 jnz $-4 ;abs 0xc058 c05e: 30 41 ret 0000c060 <loop2>: c060: 3f 40 0b 00 mov #11, r15 ;#0x000b c064: 03 43 nop c066: 3f 53 add #-1, r15 ;r3 As==11 c068: fd 23 jnz $-4 ;abs 0xc064 c06a: 30 41 ret 0000c06c <loop3>: c06c: 3f 40 0b 00 mov #11, r15 ;#0x000b c070: 01 3c jmp $+4 ;abs 0xc074 c072: 03 43 nop c074: 3f 53 add #-1, r15 ;r3 As==11 c076: fd 23 jnz $-4 ;abs 0xc072 c078: 30 41 ret
Sorry, loop3() is still loosing!!
So what have we learned from this exercise? Unless I’m missing something, this seems to be a typical case of what shall henceforth be known as “the-people-who-wrote-the-compiler-were-smarter-than-you”. Even on the lowest optimisation setting, the compiler is already reversing the loop, as it detects the loop variable is not used. Supposedly, a more complex loop body would change this. Still, I’m not sure why they don’t all compile down to the same code then. Maybe one day I will have become smart enough to fix it!
While the theory behind the reverse loop as I presented it above seems sound, it appears the compiler optimizes the “normal” code best anyway, at least on this one platform with this one compiler. Until I actually have a resource constrained system to shoehorn into a low-power device, I’ll just keep writing my loops as I used to. This also has another major advantage: readability. A traditional loop statement is simply easier to understand. Also, I don’t feel like explaining that my loops are not broken, but brilliant. Especially when they are, in fact, not even brilliant.
Leave a Reply