Time For Loops
This is my first blog post in a long time, so let's try something simple.
Let’s explore some loops. We will mostly be using C++ and looking at the assembly the compiler generates for each one. Will it be fast? Not sure: how fast a loop runs mostly depends on what happens inside the loop body. All code samples here are compiled as C++20 unless otherwise stated.
Normal Loops
C-Style Simple Loops
Let’s start simple. Here’s a simple for loop wrapped in a function.
C++ Code
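// gcc 15.2: -std=c++2a -O3
#include <span>
#include <cstdint>
using namespace std;
int simple_c_style_for_loop(int* arr, const size_t n) {
    int sum = 0;
    // i is an int compared against a size_t, so it is converted to unsigned here
    for (int i = 0; i < n; i++) {
        sum += arr[i] + sum / 10;  // each iteration needs the previous sum
    }
    return sum;
}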
Assembly (x86-64)
simple_c_style_for_loop(int*, unsigned long):
        test    rsi, rsi
        je      .L4
        lea     rsi, [rdi+rsi*4]
        xor     edx, edx
.L3:
        movsx   rax, edx
        mov     ecx, edx
        add     rdi, 4
        imul    rax, rax, 1717986919
        sar     ecx, 31
        sar     rax, 34
        sub     eax, ecx
        add     eax, DWORD PTR [rdi-4]
        add     edx, eax
        cmp     rsi, rdi
        jne     .L3
        mov     eax, edx
        ret
.L4:
        xor     edx, edx
        mov     eax, edx
        ret
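The imul / sar / sub sequence in the listing above is how GCC computes sum / 10 without a division instruction: it multiplies by 1717986919 (0x66666667, which is 2^34 / 10 rounded up), shifts right by 34, and then corrects the rounding for negative inputs. Here is a minimal sketch of the same arithmetic in C++. The name div10_magic is made up for this illustration, and the code relies on C++20's guarantee that right-shifting a negative signed integer is an arithmetic shift.

```cpp
#include <cstdint>

// Hypothetical helper: mirrors the instruction sequence GCC emitted for `sum / 10`.
constexpr std::int32_t div10_magic(std::int32_t x) {
    // imul rax, rax, 1717986919   (sign-extended 64-bit multiply)
    const std::int64_t product = static_cast<std::int64_t>(x) * 1717986919LL;
    // sar rax, 34   (arithmetic shift: rounds toward negative infinity)
    const std::int32_t quotient = static_cast<std::int32_t>(product >> 34);
    // sar ecx, 31   (-1 when x is negative, 0 otherwise)
    const std::int32_t sign = x >> 31;
    // sub eax, ecx  (adds 1 for negative x, so the result truncates toward zero)
    return quotient - sign;
}

// Compile-time spot checks against the built-in division.
static_assert(div10_magic(0) == 0);
static_assert(div10_magic(99) == 99 / 10);
static_assert(div10_magic(-99) == -99 / 10);
static_assert(div10_magic(2147483647) == 2147483647 / 10);
static_assert(div10_magic(-2147483647 - 1) == (-2147483647 - 1) / 10);
```

Because every iteration feeds the previous sum through this multiply-and-shift chain, the next iteration cannot start its arithmetic until the current one finishes.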
What about a while loop?
C++ Code
// gcc 15.2: -std=c++2a -O3
#include <span>
#include <cstdint>
using namespace std;
int simple_c_style_while_loop(int* arr, const size_t n) {
    int loop_counter = 0;
    int sum = 0;
    // same body as the for loop; the increment is folded into the array access
    while (loop_counter < n) {
        sum += arr[loop_counter++] + sum / 10;
    }
    return sum;
}
Assembly (x86-64)
simple_c_style_while_loop(int*, unsigned long):
        test    rsi, rsi
        je      .L4
        lea     rsi, [rdi+rsi*4]
        xor     edx, edx
.L3:
        movsx   rax, edx
        mov     ecx, edx
        add     rdi, 4
        imul    rax, rax, 1717986919
        sar     ecx, 31
        sar     rax, 34
        sub     eax, ecx
        add     eax, DWORD PTR [rdi-4]
        add     edx, eax
        cmp     rsi, rdi
        jne     .L3
        mov     eax, edx
        ret
.L4:
        xor     edx, edx
        mov     eax, edx
        ret
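The assembly is identical to the for loop's, instruction for instruction; only the function label differs. That makes sense, since a for loop is just a while loop with the initialization and the increment written in the header.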
Finally, a do-while loop for completeness’s sake.
C++ Code
// gcc 15.2: -std=c++2a -O3
#include <span>
#include <cstdint>
using namespace std;
int simple_c_style_do_while_loop(int* arr, const size_t n) {
    int loop_counter = 0;
    int sum = 0;
    // the body runs once before n is checked, so arr needs at least one element
    do {
        sum += arr[loop_counter++] + sum / 10;
    } while (loop_counter < n);
    return sum;
}
Assembly (x86-64)
simple_c_style_do_while_loop(int*, unsigned long):
        mov     r8, rdi
        xor     ecx, ecx
        mov     rdi, rsi
        xor     edx, edx
.L2:
        movsx   rax, edx
        mov     esi, edx
        imul    rax, rax, 1717986919
        sar     esi, 31
        sar     rax, 34
        sub     eax, esi
        add     eax, DWORD PTR [r8+rcx*4]
        add     rcx, 1
        add     edx, eax
        cmp     rcx, rdi
        jb      .L2
        mov     eax, edx
        ret
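The assembly here is shorter, for two reasons. First, there is no up-front test rsi, rsi / je check: a do-while always runs its body once, so the compiler has no n == 0 case to handle separately. That also means this function reads arr[0] even when n is 0, unlike the other two. Second, the loop keeps an index in rcx and loads through [r8+rcx*4] instead of bumping a pointer, and it ends with an unsigned jb comparison against n rather than a jne against an end pointer. The arithmetic on sum is the same in all three.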
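The do-while listing has no test rsi, rsi / je pair before the loop because its body always runs at least once. Here is a minimal sketch of putting that check back by hand, which is roughly the shape the compiler gave the for and while versions: a guard followed by a bottom-tested loop. The names guarded_do_while_loop and plain_while_loop are made up for this illustration, and both use a size_t counter instead of the int counter in the listings above.

```cpp
#include <cstddef>

// Hypothetical helper: a do-while with an explicit n == 0 guard.
// The guard plays the role of `test rsi, rsi` / `je .L4` in the other listings.
int guarded_do_while_loop(const int* arr, const std::size_t n) {
    int sum = 0;
    if (n == 0) {
        return sum;  // without this guard the body would read arr[0]
    }
    std::size_t i = 0;
    do {
        sum += arr[i++] + sum / 10;
    } while (i < n);
    return sum;
}

// Hypothetical helper: the top-tested while loop, for comparison.
int plain_while_loop(const int* arr, const std::size_t n) {
    int sum = 0;
    std::size_t i = 0;
    while (i < n) {
        sum += arr[i++] + sum / 10;
    }
    return sum;
}
```

With the guard in place the two functions agree for every n, including n == 0, where both return 0 without touching the array. For the input {10, 20, 30} both return 64 (10, then 10 + 20 + 1 = 31, then 31 + 30 + 3 = 64).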
Sooo… Is there any actual performance difference??
Performance Comparison
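To see if there’s any real-world difference, I ran some benchmarks using Google Benchmark on my personal computer. As expected, there’s no real difference: all three loops take just under 2 ns per iteration at every size. That fits the assembly, since the three loop bodies do the same arithmetic and each iteration needs the previous value of sum, so the loop bookkeeping is unlikely to be the bottleneck. On your machine or compiler version the do-while loop might behave differently. The throughput column is consistent with bytes processed per second for 4-byte ints, in Google Benchmark's Gi/s notation.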
| Benchmark | Loop iterations (n) | Time (ns) | Throughput |
|---|---|---|---|
| Do-While | 1,000 | 1,965 | 1.90 Gi/s |
| Do-While | 10,000 | 19,716 | 1.89 Gi/s |
| Do-While | 100,000 | 197,121 | 1.89 Gi/s |
| Do-While | 10,000,000 | 19,809,072 | 1.88 Gi/s |
| While | 1,000 | 1,963 | 1.90 Gi/s |
| While | 10,000 | 19,711 | 1.89 Gi/s |
| While | 100,000 | 197,262 | 1.89 Gi/s |
| While | 10,000,000 | 19,817,713 | 1.88 Gi/s |
| For | 1,000 | 1,971 | 1.89 Gi/s |
| For | 10,000 | 19,709 | 1.89 Gi/s |
| For | 100,000 | 197,470 | 1.89 Gi/s |
| For | 10,000,000 | 19,846,456 | 1.88 Gi/s |
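The throughput figures can be checked from the other two columns: n elements of 4 bytes each, divided by the time, gives bytes per second. Here is a minimal sketch of that arithmetic together with a bare-bones timing loop. This is not the harness behind the table: it uses std::chrono instead of Google Benchmark, and simple_loop, throughput_gib_per_s and time_loop_ns are names made up for this illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// The kernel from the for-loop example, with a size_t counter.
int simple_loop(const int* arr, const std::size_t n) {
    int sum = 0;
    for (std::size_t i = 0; i < n; i++) {
        sum += arr[i] + sum / 10;
    }
    return sum;
}

// Bytes processed per second in GiB/s (sizeof(int) is 4 on x86-64).
double throughput_gib_per_s(const std::size_t n, const double elapsed_ns) {
    const double bytes = static_cast<double>(n) * sizeof(int);
    const double seconds = elapsed_ns * 1e-9;
    return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}

// Average wall-clock time per call in nanoseconds over `repeats` calls.
double time_loop_ns(const std::vector<int>& data, const int repeats) {
    volatile int sink = 0;  // stores the result so the calls are not dropped entirely
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; r++) {
        sink = simple_loop(data.data(), data.size());
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

For the first row, throughput_gib_per_s(1000, 1965.0) comes out at about 1.90, matching the table. Two caveats if you try this yourself. An optimizer may still hoist the repeated call out of the timing loop, which is the kind of thing Google Benchmark's benchmark::DoNotOptimize is meant to prevent. And the input needs some care: with positive values, sum grows by roughly 10% per iteration and overflows int after a few hundred elements, so something like alternating +1 and -1 keeps it small.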
In the next part, I'll continue with the other kinds of loops in C++.