
Benchmarks of list initialization to arr[i]=i¶


Benchmarks of filling an array of 1E8 ints (32 bits, generally). I place the descriptions UNDER the code, because that's where the results show! This was done using Python 3.8, which is a little old.

Cling was built with ROOT as follows:

   ------------------------------------------------------------------
  | Welcome to ROOT 6.30/04                        https://root.cern |
  | (c) 1995-2024, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on Jul 26 2024, 08:45:57                 |
  | From heads/master@tags/v6-30-04                                  |
  | With c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0                   |
  | Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q'  |
   ------------------------------------------------------------------

Most of the code is now here: The benchmarks

The gcc runs were a mix of gcc 9.4 and 14.2. I updated mid-project and convinced myself that there was no significant difference in the asm or the resulting times.

The C# work was using dotnet 9.0 (latest and greatest).

I started this to compare observable containers in C# with Lists, and AI actually made the original code for that almost instantly, with a little correction. But then I figured, why not arrays? Why not unsafe pointers? Hey, Jupyter has a cling kernel, why not try that? Then I found a cling bug, then I dug into g++ and assembly too, and figured I may as well look at numpy and Python while I was at it.

For consistency, everything below compares the time for 10^8 writes of 32-bit integers, so 400MB. A time of 0.4 seconds corresponds to 1GB/s. The best times achieved, ~0.035s, correspond to about 12 GB/s, where this machine has a maximum theoretical memory throughput of probably around 17GB/s, without digging into my BIOS to look at exact RAM timings. Anyway, 0.035s (I've seen 0.033 occasionally) is probably hitting the hardware limit.
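The bandwidth arithmetic above can be sanity-checked in a few lines (a trivial sketch; the 17 GB/s ceiling is my rough estimate, not a measurement):

```cpp
#include <cassert>

// Effective write bandwidth in GB/s for n 32-bit stores taking `seconds`.
double write_gbps(double n_writes, double seconds) {
    return n_writes * 4.0 / seconds / 1e9;  // 4 bytes per int
}
```

So 1E8 writes in 0.4 s is exactly 1 GB/s, and in 0.035 s it is a bit over 11 GB/s, consistent with the "about 12 GB/s" figure.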

C++ indexed vs. pointer arithmetic¶


The first set of tests uses an inline loop of arr[i]=i.
The arr variable is declared volatile to ensure nothing is optimized away, but I checked the assembly too:

%use xcpp14
#include <iostream>
#include <chrono>
#include <string>
#include <vector>
// Include necessary libraries
#ifdef ASM
extern "C" void index_test(int arr[],int count);
#else
void index_test(int arr[],int count);
void ptr_test(int arr[],int count);
void vector_test(std::vector<int>& vec); 
#endif
void the_benchmark(void (*loop_test)(int[], int),std::string msg) {
    const int loop_write_count = (int)(1E7);
    int* arr = new int[loop_write_count];  // Create a dynamic array
    const int multiplier = 10;  // loop the loop this many times
    //prefill to page memory maybe:
    for (int i = 0; i < loop_write_count; i++) {
        arr[i] = loop_write_count-i; 
    }
    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++){
        // compiler cannot optimize away what it can't see!
        loop_test(arr,loop_write_count);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cerr << "Loop validation: " << (bool)arr[loop_write_count-1] << std::endl;
    auto duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::chrono::duration<double> duration = std::chrono::duration<double>(duration_ns);
    std::cout << msg << duration.count() << " seconds" << std::endl;
}

//It's actually hard to template without impacting anything:
void the_benchmark(void (*loop_test)(std::vector<int>&), std::string msg) {
    const int loop_write_count = (int)(1E7);
    std::vector<int> vec(loop_write_count); 
    const int multiplier = 10; 
    //prefill to page memory maybe:
    for (int i = 0; i < loop_write_count; i++) {
        vec[i] = loop_write_count-i; 
    }
    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++){
        // compiler cannot optimize away what it can't see!
        loop_test(vec);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cerr << "Loop validation: " << (bool)vec[loop_write_count-1] << std::endl;
    auto duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::chrono::duration<double> duration = std::chrono::duration<double>(duration_ns);
    std::cout << msg << duration.count() << " seconds" << std::endl;
}
int benchmark() {
        the_benchmark(index_test, "Indexed  :    ");
#ifndef ASM
        the_benchmark(ptr_test, "Pointer math: ");
        the_benchmark(vector_test, "Vector : ");
#endif
        return 0;
}
//int main(){  // main is commented out so cling runs the statements at top level
    benchmark();
    return 0;
//}

Results follow as run with different interpreters/compilers:

Xeus-cling (a cling kernel specifically for Jupyter notebook installs). **This was the test that started all this.**
See github issue

Indexed:      0.386421 seconds
Pointer math  29.3262 seconds

Note ROOT's cling gives about:

Indexed:      0.126567 seconds
Pointer math  27.2834 seconds

gcc -O0:

Indexed:      0.388667 seconds
Pointer math  0.454149 seconds

gcc -O3, volatile prevents SSE/SIMD:

Indexed:      0.136248 seconds
Pointer math  0.130748 seconds

This is the asm for the last one:

.L9:
    mov DWORD PTR [rdx], eax
    add eax, 1
    add rdx, 4
    cmp eax, 10000000
    jne .L9

But this is a little wasteful. x86 allows scaled addressing like:

.L9:
    mov DWORD PTR [rdi+4*rax], eax
    inc eax
    cmp eax, 10000000
    jne .L9

which might be a little faster even without unrolling.
I didn't test that, and that alone might make up the difference to clang.

But clang went further, gaining about 30% with this:

.LBB0_17:                              
	mov	dword ptr [rbx + 4*rax], eax
	lea	ecx, [rax + 1]
	mov	dword ptr [rbx + 4*rax + 4], ecx
	lea	ecx, [rax + 2]
	mov	dword ptr [rbx + 4*rax + 8], ecx
	lea	ecx, [rax + 3]
	mov	dword ptr [rbx + 4*rax + 12], ecx
	lea	ecx, [rax + 4]
	mov	dword ptr [rbx + 4*rax + 16], ecx
	lea	ecx, [rax + 5]
	mov	dword ptr [rbx + 4*rax + 20], ecx
	lea	ecx, [rax + 6]
	mov	dword ptr [rbx + 4*rax + 24], ecx
	lea	ecx, [rax + 7]
	mov	dword ptr [rbx + 4*rax + 28], ecx
	add	rax, 8
	cmp	rax, 10000000
	jne	.LBB0_17
Indexed:      0.083 seconds  
Pointer math  0.083 seconds  

This is better than gcc because the inner loop is unrolled, diluting the loop overhead.

(Note: lea breaks the dependency chain and should allow the moves and adds to parallelize, as opposed to add ecx, 1. Clang somewhat absurdly fully unrolls the outer loop too, making 10 copies of this code for almost no gain.)

But still, neither is using vectorized instructions, and the reason seems to be that they're afraid to because of the volatile declaration. This is kind of silly, but it's what they do, so...
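For reference, the inline variant boils down to something like this (my sketch, not the exact benchmark code): with the pointee qualified volatile, every store is an observable side effect and must be emitted in order, which is exactly what blocks SSE here.

```cpp
// Sketch: each arr[i] store through a volatile pointer is an observable
// side effect, so the compiler may neither vectorize nor elide it.
void index_test_volatile(volatile int* arr, int size) {
    for (int i = 0; i < size; ++i)
        arr[i] = i;
}
```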

C++ Loop as separate compilation unit:¶


This can lead to slightly different results because: a) there is no possibility of fully optimizing away the loop, volatile or not, since the compiler doesn't know how it will be used; b) likewise, volatile isn't restricting its optimizations; c) but it doesn't know the benefit of loop unrolling since it can't see the size, so it compromises a little for smaller arrays.

void index_test(int arr[],int size){    
    for (int i = 0; i < size; ++i) {
            arr[i] = i;
    }
}
void vector_test(std::vector<int>& vec) {
    for (int i = 0; i < vec.size(); ++i) {
        vec[i] = i;
    }
}

... and likewise for pointer math.
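The pointer version isn't shown above; it's presumably along these lines (my sketch of the obvious pointer-arithmetic equivalent, not the exact source):

```cpp
// Pointer-arithmetic equivalent of index_test: walk a pointer across the
// array instead of indexing, writing the running count through it.
void ptr_test(int arr[], int size) {
    int i = 0;
    for (int* p = arr; p != arr + size; ++p) {
        *p = i++;
    }
}
```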

These were passed a non-volatile, unaligned input.

g++ -O0 (the assembly here is scalar and horrible: it moves i from the stack to a register twice in every loop, has two conditionals, and more. It's just ugly horribleness):

Indexed  :    0.146968 seconds
Pointer math: 0.14157 seconds
Vector : 0.423789 seconds

g++ -O3: The assembly here is OK. It uses a single lane of SSE/SIMD vectorized addition, and no unrolling. It wastes a register and an add in what I believe is a failed attempt to break a cyclic dependency (you just cannot pipeline i++), but it basically hits what seems to be my RAM speed limit in my VM, so that's as good as I can do anyway. I did write a version using the maximum eight lanes, without the Texas two-step, but I think improvements are just limited by I/O.

Indexed  :    0.0361919 seconds
Pointer math: 0.0383006 seconds
Vector : 0.0356294 seconds

Here's the core of the asm loop for the indexed case:

.L4:
	movdqa	xmm1, xmm0 ; copy to parallelize add and move
	add	rax, 16                       
	paddd	xmm0, xmm2
	movups	XMMWORD PTR -16[rax], xmm1
	cmp	rax, rdx
	jne	.L4

clang++ -O3 used two vectorized lanes and a single side-step but no wasted registers:

.LBB0_4:                               
	movdqa	xmm3, xmm0
	paddd	xmm3, xmm1
	movdqu	xmmword ptr [rdi + rsi], xmm0
	movdqu	xmmword ptr [rdi + rsi + 16], xmm3
	paddd	xmm0, xmm2
	add	rsi, 32
	cmp	rdx, rsi
	jne	.LBB0_4

But results are reaching my I/O limit and depend strongly on the fluctuating state of background tasks. I killed as many processes as possible for this and the gcc run above, and took the best result of several runs. Clang should be better here, but I can't easily prove it.

Indexed  :    0.0382706 seconds

Furthermore, if I hard-coded the loop size, both clang and gcc unrolled significantly more. Presumably, without knowing the loop size, they're compromising a little for smaller arrays.

I also ran a custom asm with the maximum 8 lanes of vectorization and unrolling but could observe no improvement in this or the unrolled compiler versions, as I'm just at my I/O limit anyway.
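I won't reproduce that raw asm, but the idea of a manually vectorized fill can be sketched with SSE2 intrinsics (one lane of 4 ints here for brevity; the custom version used more lanes and unrolling):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Manually vectorized fill: store 4 consecutive indices per iteration,
// then bump the whole vector by 4. A scalar loop handles any tail.
void index_test_sse(int* arr, int size) {
    __m128i v    = _mm_setr_epi32(0, 1, 2, 3);
    __m128i step = _mm_set1_epi32(4);
    int i = 0;
    for (; i + 4 <= size; i += 4) {
        _mm_storeu_si128((__m128i*)(arr + i), v);  // unaligned 16-byte store
        v = _mm_add_epi32(v, step);
    }
    for (; i < size; ++i)  // scalar tail
        arr[i] = i;
}
```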

PYTHON...¶


Normal python list arr[i]=i¶


About 4.5 seconds on a good run.

In [12]:
%use python3
#numba.py
import time
#import numpy as np


def prefill(arr):  
    size = len(arr)
    for i in range(size):
        arr[i] = size-i  

def test_loop(arr):
    size = len(arr)
    for i in range(size):
        arr[i] = i  

def benchmark():
    size = int(1E8)
    arr = [0]*size # Std python list
    prefill(arr)

   
    start = time.time()
    test_loop(arr) 
    end = time.time()

    duration = end - start
    print(f": {duration:.6f} seconds")

# Run the benchmark
benchmark()
: 4.357281 seconds

Numpy array (arr = np.zeros(size), then arr[i]=i)¶


More than twice as slow as pure Python!! Be careful! (Each scalar write goes through NumPy's per-element indexing and type-conversion machinery, which costs more than a plain list store.)

In [10]:
%use python3
#numba.py
import time
import numpy as np


def prefill(arr):  # to pre-page memory
    size = len(arr)
    for i in range(size):
        arr[i] = size-i  # Populate the array with values

def test_loop(arr):
    size = len(arr)
    for i in range(size):
        arr[i] = i  # Populate the array with values

def benchmark():
    size = int(1E8)
    arr = np.zeros(size)  # Use a typed NumPy array
    prefill(arr)

    # Step 1: Measure the time taken for the forward fill
    start = time.time()
    test_loop(arr)  # plain Python loop writing into the NumPy array
    end = time.time()

    duration = end - start
    print(f": {duration:.6f} seconds")

# Run the benchmark
benchmark()
: 12.055232 seconds

Numpy vectorized syntax, not necessarily vectorized cpu instructions:¶


The test below is a little finicky. It's slow on the first few tries and then speeds up. It almost makes one wonder if it's real, but it doesn't seem to be caching exactly. The same is true from a command line, though less so than in Jupyter. The CLI version tends to run slightly faster, but similar within fluctuations. It can get down to about 0.2s for 1E8 writes. Quite good.

In [13]:
%use python3
#vectorized.py
import numpy as np
import time
import sys
print(sys.version)

def benchmark():
    size = int(1E8)
    arr = np.arange(size, 0, -1, dtype=int)  # prefill (descending values) to page the memory
    
    start = time.time()
    # The array is filled with values from 0 to size-1 using NumPy's vectorized operation
    arr = np.arange(size, dtype=int)
    end = time.time()
    
    duration = end - start
    print(f"NumPy Vectorized fill: {duration:.6f} seconds")

# Run the benchmark
benchmark()
3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0]
NumPy Vectorized fill: 0.255053 seconds

Numpy with numba, static types, scalar (arr[i]=i)¶


Typically around 0.11s on a good run.

In [17]:
%use python3
#numba.py
import time
import numpy as np
from numba import jit, types
from numba.typed import List

# Numba-compiled function with a statically typed NumPy array

@jit(nopython=True)
def prefill(arr):  # to pre-page memory
    size = len(arr)
    for i in range(size):
        arr[i] = size-i  # Populate the array with values

@jit(nopython=True)
def benchmark_inner(arr):
    size = len(arr)
    for i in range(size):
        arr[i] = i  # Populate the array with values

def benchmark():
    size = int(1E8)
    arr = np.zeros(size, dtype=np.int32)  # Use a typed NumPy array
    prefill(arr)

    # Step 1: Measure the time taken for the forward fill
    start = time.time()
    benchmark_inner(arr)  # Call the JIT-compiled function to fill forwards
    end = time.time()

    duration = end - start
    print(f"Indexed with Numba (typed NumPy): {duration:.6f} seconds")

# Run the benchmark
benchmark()
Indexed with Numba (typed NumPy): 0.108047 seconds

Numpy with numba, static types, vectorized (arr = np.arange(size))¶


Typically about 0.36s on a good run, Jupyter or CLI.

In [13]:
%use python3
#numba.py
import time
import numpy as np
from numba import jit, types
from numba.typed import List

# Numba-compiled function with a statically typed NumPy array

@jit(nopython=True)
def prefill(arr):  # intended to pre-page memory
    size = len(arr)
    arr = np.arange(size)  # note: rebinds only the local name; the caller's array is untouched

@jit(nopython=True)
def benchmark_inner(arr):
    size = len(arr)
    arr = np.arange(size)  # times allocation plus vectorized fill; the result is discarded

def benchmark():
    size = int(1E8)
    arr = np.zeros(size, dtype=np.int32)  # Use a typed NumPy array
    prefill(arr)

    # Step 1: Measure the time taken for the forward fill
    start = time.time()
    benchmark_inner(arr)  # Call the JIT-compiled function to fill forwards
    end = time.time()

    duration = end - start
    print(f"Indexed with Numba, vectorized numpy: {duration:.6f} seconds")

# Run the benchmark
benchmark()
Indexed with Numba, vectorized numpy: 0.376308 seconds

Numba with un-typed python arrays (no numpy):¶


I ran 10x fewer writes here; scaled to the same 1E8, it's ~110s !!

In [16]:
%use  python3
import time
from numba import jit, types


def prefill(arr):  # to pre-page memory
    size = len(arr)
    for i in range(size):
        arr[i] = size - i  # Populate the array with values

@jit
def benchmark_inner(arr):
    size = len(arr)
    for i in range(size):
        arr[i] = i  # Populate the array with values


def benchmark():
    size = int(1E7)
    arr = [0] * size  # Use a Python list
    prefill(arr)

    # Step 1: Measure the time taken for the forward fill
    start = time.time()
    benchmark_inner(arr)  # Call the JIT-compiled function to fill forwards
    end = time.time()

    duration = end - start
    print(f"Indexed with Numba (typed Python list): {duration:.6f} seconds TIMES TEN")


# Run the benchmark
benchmark()
Indexed with Numba (typed Python list): 10.798340 seconds TIMES TEN

C#¶


I got organized here and wrote a proper tester. A link to the code follows, but results first.

Points:

  • Vectors (register-mapped variables) are as fast as gcc/clang.
    • Impressive when you get bounds checking and a full runtime system.
    • But frankly these stink to program with. Barely better than assembly.
    • You have to manually vectorize and probably parallelize registers.
  • But raw arrays are only ~1.5 times slower! And they auto-vectorize!
  • They are 2x faster than numba with numpy.
  • Even Lists aren't so much worse than numba, and you get a proper language.

Advanced nitpicks and annoyances:

  • Collections are slow; that's fine. List and IList are the most useful types in C# (List internally, IList publicly).
  • There are weird things happening with interfaces (IList). Interfaces are like duck types in Python (abstract classes in C++), but safer and very useful. They make Collections faster, but they make Arrays even slower than Lists (which use arrays internally). It's possible to accept IList and conditionally downcast to arrays/lists if optimization is needed. Passing a direct array is also fine for internal use, just not for tools.

The results below were from a particularly good run, hard to reproduce. Vectorized can often be closer to 50 ms, but this is system-state dependent, as g++ was also running fast at the time. Generally, regardless, Vectorized runs around the same speed as g++ -O3.

Array access time: 45 ms
Array access time: 53 ms
Array access time: 58 ms
Array access time: 63 ms
ArrayAsIList access time: 322 ms
ArrayAsIList access time: 385 ms
ArrayAsIList access time: 387 ms
ArrayAsIList access time: 386 ms
Vectorized access time: 39 ms
Vectorized access time: 37 ms
Vectorized access time: 34 ms
Vectorized access time: 38 ms
Unmanaged access time: 49 ms
Unmanaged access time: 48 ms
Unmanaged access time: 62 ms
Unmanaged access time: 58 ms
List access time: 142 ms
List access time: 180 ms
List access time: 176 ms
List access time: 191 ms
List as IList access time: 140 ms
List as IList access time: 175 ms
List as IList access time: 187 ms
List as IList access time: 180 ms
Collection access time: 813 ms
Collection access time: 877 ms
Collection access time: 913 ms
Collection access time: 887 ms
Collection as IList access time: 299 ms
Collection as IList access time: 321 ms
Collection as IList access time: 360 ms
Collection as IList access time: 319 ms
Observable Collection access time: 5232 ms
Observable Collection access time: 5729 ms
Observable Collection access time: 6459 ms
Observable Collection access time: 5788 ms
Re-running Array for throttling check.
Array access time: 45 ms
Array access time: 47 ms
Array access time: 51 ms
Array access time: 55 ms

The tester¶

The full code is available here: Csharp benchmark tests

It uses a modular harness that runs many tests, each defining a warmup (Setup()) phase and a timed Test() phase.

Generally the tests look something like the two below, the simplest and the most complex. The Test() function is what gets timed, after Setup() has run.

  public class ArrayTest : Tester {

        protected override string _name { get; } = "Array";
        private int[] _array = [];

        protected override void Setup() {
            //warmup
            _array = new int[Size];
            for (int j = 0; j < Multiplier; j++) {
                for (int i = 0; i < Size; i++) {
                    _array[i] = i;
                }
            }
        }

        protected override void Test() {
            for (int j = 0; j < Multiplier; j++) {
                for (int i = 0; i < Size; i++) {
                    _array[i] = i;
                }
            }
        }

    }


 public class VectorizedTest : Tester {
        protected override string _name { get; } = "Vectorized";
        private int[] _array = [];

        static int _lanes = 3;
        static int _vectorSize = Vector<int>.Count; // Get the number of ints in a vector
        static Vector<int> _delta = new Vector<int>(_lanes * _vectorSize); // increment per group: lanes * vectorSize
        static int _groupCount = _lanes * _vectorSize;
        Vector<int>[] vectors = new Vector<int>[_lanes];

        int _vectorizedSize = 0;
        static int _remainder = 0;

        protected override void Setup() {
            _array = new int[Size];
            // Ensure the size is a multiple of lanes * vectorSize
            _vectorizedSize = Size / (_lanes * _vectorSize) * (_lanes * _vectorSize);
            _remainder = Size - _vectorizedSize;
        }

        protected override void Test() {
            for (int j = 0; j < Multiplier; j++) {
                vectors = new Vector<int>[_lanes];
                for (int k = 0; k < _lanes; k++) {
                    vectors[k] = new Vector<int>(Enumerable.Range(k * _vectorSize, _vectorSize).ToArray()); // lane k starts at k * vectorSize
                }

                for (int i = 0; i < _vectorizedSize; i += _groupCount) {
                    //Copy the initial values
                    vectors[0].CopyTo(_array, i + 0 * _vectorSize);
                    vectors[1].CopyTo(_array, i + 1 * _vectorSize);
                    vectors[2].CopyTo(_array, i + 2 * _vectorSize);
                    vectors[0] += _delta;
                    vectors[1] += _delta;
                    vectors[2] += _delta;
                }
                // Handle the remainder (if any)
                for (int i = _vectorizedSize; i < Size; i++) {
                    _array[i] = i;
                }
            }
        }
    }