You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Matrix Multiplication Performance Across Python, C++, and C
Execution Environment
A 400×400 matrix multiplication benchmark was executed across three programming languages: Python, C++, and C. The recorded execution times were 19.86 seconds for Python, 1.01 seconds for C++, and 0.37 seconds for C. The sections below document the technical factors responsible for these results.
Python Execution Model
CPython, the reference implementation of Python, executes programs by first compiling source code to bytecode and then interpreting that bytecode at runtime. Each operation within a loop requires runtime type resolution, reference count management, and object attribute lookup. An expression such as A[i][j] involves multiple layers of pointer indirection that compiled languages resolve at compile time and eliminate from the execution path entirely.
Python's dynamic type system means that no type information is statically guaranteed. The interpreter must determine the type of every operand at the moment of execution. In a nested loop performing 64 million multiply-accumulate operations, this per-operation overhead compounds into the observed execution time.
The Global Interpreter Lock (GIL) present in CPython prevents concurrent execution of Python bytecode across multiple threads, which further constrains throughput in compute-bound workloads.
Performance-critical Python libraries such as NumPy address this by delegating computation to compiled C routines. The Python layer in such cases serves only as an interface.
Compilation Pipeline
A common misconception holds that C produces faster executables than C++ because C compiles directly to machine code while C++ requires an intermediate compilation stage. This is incorrect.
Both gcc and g++ employ an identical compilation pipeline. Source code is translated into an Intermediate Representation, passed through a series of optimization passes, lowered to architecture-specific assembly, and assembled into a binary. The source language label does not alter this pipeline. When a C program and a C++ program implement equivalent logic using equivalent data structures, the resulting machine code is typically identical.
Memory Layout and CPU Cache Behavior
The performance difference between the C++ and C implementations originates from memory layout, not from any property of the languages themselves.
The C++ implementation used std::vector<std::vector<long long>>. This structure allocates each row of the matrix as an independent heap object at an arbitrary memory address. Consecutive rows are not guaranteed to occupy adjacent memory locations. During the inner loop of matrix multiplication, the CPU must dereference a pointer to locate each row before accessing its elements. This introduces non-contiguous memory access patterns.
Modern processors rely on spatial locality. When a memory address is accessed, the CPU loads an entire cache line — typically 64 bytes of adjacent memory — into L1 or L2 cache in anticipation of sequential access. A vector-of-vectors structure defeats this mechanism. Each row transition requires following a pointer to an unpredictable address, generating a cache miss and stalling the CPU pipeline until data is retrieved from main memory, which has a latency several hundred cycles higher than L1 cache.
The C implementation used a two-dimensional array declared as long long A[SIZE][SIZE]. This stores the entire matrix as a single contiguous block of memory. Row elements are adjacent in memory, and consecutive rows follow immediately. The CPU prefetcher can load data ahead of time and the pipeline operates without cache-miss stalls.
This performance gap is therefore a consequence of contiguous versus fragmented memory layout. C++ using a flat one-dimensional allocation, std::vector<long long> A(SIZE * SIZE), with index arithmetic A[i * SIZE + j], produces contiguous memory and yields equivalent performance to the C implementation under the same compiler optimization flags.
Stack Allocation Considerations
A 400×400 matrix of 64-bit integers occupies 400 × 400 × 8 = 1,280,000 bytes, approximately 1.22 mebibytes. The default stack size on most operating systems is between 1 and 8 megabytes. Declaring three such matrices as local variables simultaneously risks exceeding the stack limit and produces undefined behavior.
A correct C implementation at this scale uses one of two approaches. The first is global or static array declaration, which places the arrays in the BSS segment, a region of fixed size allocated before program execution. The second is heap allocation via malloc using a flattened one-dimensional layout, which produces the same contiguous memory properties as a stack-allocated two-dimensional array without the stack size constraint.
Zero-Initialization
std::vector value-initializes its elements to zero upon construction. Zero-initializing 1,280,000 bytes is a single sequential memory write. At contemporary memory bandwidth rates this completes in the order of microseconds. The multiplication loop performs 64 million multiply-accumulate operations. The initialization cost is not a meaningful contributor to the observed execution time difference at this scale.
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Matrix Multiplication Performance Across Python, C++, and C
Execution Environment
A 400×400 matrix multiplication benchmark was executed across three programming languages: Python, C++, and C. The recorded execution times were 19.86 seconds for Python, 1.01 seconds for C++, and 0.37 seconds for C. The sections below document the technical factors responsible for these results.
Python Execution Model
CPython, the reference implementation of Python, executes programs by first compiling source code to bytecode and then interpreting that bytecode at runtime. Each operation within a loop requires runtime type resolution, reference count management, and object attribute lookup. An expression such as
A[i][j]involves multiple layers of pointer indirection that compiled languages resolve at compile time and eliminate from the execution path entirely.Python's dynamic type system means that no type information is statically guaranteed. The interpreter must determine the type of every operand at the moment of execution. In a nested loop performing 64 million multiply-accumulate operations, this per-operation overhead compounds into the observed execution time.
The Global Interpreter Lock (GIL) present in CPython prevents concurrent execution of Python bytecode across multiple threads, which further constrains throughput in compute-bound workloads.
Performance-critical Python libraries such as NumPy address this by delegating computation to compiled C routines. The Python layer in such cases serves only as an interface.
Compilation Pipeline
A common misconception holds that C produces faster executables than C++ because C compiles directly to machine code while C++ requires an intermediate compilation stage. This is incorrect.
Both
gccandg++employ an identical compilation pipeline. Source code is translated into an Intermediate Representation, passed through a series of optimization passes, lowered to architecture-specific assembly, and assembled into a binary. The source language label does not alter this pipeline. When a C program and a C++ program implement equivalent logic using equivalent data structures, the resulting machine code is typically identical.Memory Layout and CPU Cache Behavior
The performance difference between the C++ and C implementations originates from memory layout, not from any property of the languages themselves.
The C++ implementation used
std::vector<std::vector<long long>>. This structure allocates each row of the matrix as an independent heap object at an arbitrary memory address. Consecutive rows are not guaranteed to occupy adjacent memory locations. During the inner loop of matrix multiplication, the CPU must dereference a pointer to locate each row before accessing its elements. This introduces non-contiguous memory access patterns.Modern processors rely on spatial locality. When a memory address is accessed, the CPU loads an entire cache line — typically 64 bytes of adjacent memory — into L1 or L2 cache in anticipation of sequential access. A vector-of-vectors structure defeats this mechanism. Each row transition requires following a pointer to an unpredictable address, generating a cache miss and stalling the CPU pipeline until data is retrieved from main memory, which has a latency several hundred cycles higher than L1 cache.
The C implementation used a two-dimensional array declared as
long long A[SIZE][SIZE]. This stores the entire matrix as a single contiguous block of memory. Row elements are adjacent in memory, and consecutive rows follow immediately. The CPU prefetcher can load data ahead of time and the pipeline operates without cache-miss stalls.This performance gap is therefore a consequence of contiguous versus fragmented memory layout. C++ using a flat one-dimensional allocation,
std::vector<long long> A(SIZE * SIZE), with index arithmeticA[i * SIZE + j], produces contiguous memory and yields equivalent performance to the C implementation under the same compiler optimization flags.Stack Allocation Considerations
A 400×400 matrix of 64-bit integers occupies 400 × 400 × 8 = 1,280,000 bytes, approximately 1.22 mebibytes. The default stack size on most operating systems is between 1 and 8 megabytes. Declaring three such matrices as local variables simultaneously risks exceeding the stack limit and produces undefined behavior.
A correct C implementation at this scale uses one of two approaches. The first is global or static array declaration, which places the arrays in the BSS segment, a region of fixed size allocated before program execution. The second is heap allocation via
mallocusing a flattened one-dimensional layout, which produces the same contiguous memory properties as a stack-allocated two-dimensional array without the stack size constraint.Zero-Initialization
std::vectorvalue-initializes its elements to zero upon construction. Zero-initializing 1,280,000 bytes is a single sequential memory write. At contemporary memory bandwidth rates this completes in the order of microseconds. The multiplication loop performs 64 million multiply-accumulate operations. The initialization cost is not a meaningful contributor to the observed execution time difference at this scale.Beta Was this translation helpful? Give feedback.
All reactions