Investigating Go, Cgo, Metal Shading Language, Metal Performance Shaders, and benchmarking different approaches to matrix multiplication
Below I’ll describe the process of using cgo to interface between Go and native C, how that can be used to call Objective-C bindings for Apple’s Metal Performance Shaders framework and custom GPU code (shaders) written in the Metal Shading Language, and finally how all of that benchmarks against hand-written Go and OpenBLAS-backed matrix multiplication implementations. This was written to run on my M2 MacBook.
The full source is available here on GitHub.
That’s a lot of ground to cover, so I’ll break it down into the following sections, or feel free to just jump right to the benchmarks.
- GPUs and Floating-Point Parallelism
- Metal GPU Basics
- Metal Shading Language
- Objective-C Bindings
- Metal Performance Shaders Framework
- Go and cgo
- Go Implementation Baselines and OpenBLAS
- Results
GPUs and Floating-Point Parallelism
I assume most people at this point are intuitively familiar with the idea that GPUs are incredibly powerful at certain kinds of computational tasks, especially the kinds that underpin machine learning. It wasn’t until I started playing around with Metal that I understood first-hand how much more powerful they can be than CPUs.
GPUs by design are extremely efficient at massively parallel floating-point operations which demand high memory bandwidth. My MacBook M2 has 8 CPU cores and 8 GPU cores, but for comparison, the Nvidia RTX 4090 contains 16384 cores and the H100 contains 16896 CUDA cores with hundreds of additional specialized tensor cores. GPUs usually support SIMD processing, allowing them to execute the same instruction simultaneously across multiple data points.
Outside of graphics, matrix multiplication and linear algebraic tasks in general benefit from this concurrency thanks to their highly parallelizable algorithms. This in turn supports core machine learning workloads like training and inference [1] [2].
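To make that concrete: each element of the output matrix, C[i][j] = sum over k of A[i][k] * B[k][j], depends only on row i of A and column j of B, so every one of the n² output elements of an n × n result can be computed as an independent dot product and handed to a separate thread.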
CUDA is probably the most well-known GPU programming platform, and it is specific to Nvidia hardware. There are also mathematical frameworks available for OpenGL. Frameworks like TensorFlow and PyTorch can integrate easily and reasonably transparently with GPU hardware. This was an interesting writeup about the performance improvements gained from adding Metal-based GPU support to the spaCy NLP library.
Metal GPU Basics
Programming direct GPU computation is not as simple as writing code for on-device CPUs. When working with Apple’s Metal framework, a rough series of operations for executing code on the GPU looks like:
- Find an appropriate GPU device
- Create a queue for executing commands (i.e. the MTLCommandQueue)
- Wrap pointers to data arrays in structured buffers; if the data is executable code, wrap it in a pipeline state instead of a regular buffer. Apple GPUs use a unified memory space, meaning we don’t need to actually copy any data into GPU-specific physical memory
- Commit the command buffer for execution, and either wait for results or set an event handler on completion
- Extract the bytes out of the response buffer and format them locally with CPU program code
Raw GPU programming like this uses an asynchronous model.
Metal Shading Language
Metal Shading Language is a derivative of C++14 which can be used to compose custom logic (called “shaders”) to run on Metal-compatible GPUs. In general, you’d probably be better off using the MPS framework (discussed later) whenever it offers equivalent functionality; it tends to be highly optimized for common classes of GPU-aligned use cases (like matrix multiplication or neural networks).
Debugging of MSL code is pretty difficult. You can use the Shader Debugger through Xcode, but if you want to inspect or print intermediate values without Xcode, you need to write data to a response debug buffer, and parse the primitives in your C++ or Objective-C wrapper.
MSL functions are exposed as public entry points through the kernel designation. The Metal framework passes in IDs for the calling thread’s position within the grid (or its thread group), which can be used to ensure non-overlapping writes. Threads are addressed by a three-dimensional ID system; the dimensions of this thread space are configured in the wrapper code.
The following is an implementation of the naive matrix multiplication algorithm, combined with some loop unrolling which surprisingly improved its performance significantly. This is just for comparison purposes; normally the MPSMatrixMultiplication functionality from MPS would be more suitable.
kernel void matrix_multiply_naive(
    device const MatrixParams *params,
    constant float *A,
    constant float *B,
    device float *C,
    // Indicates the thread's unique position within the entire grid of
    // threads being executed. The uint2 type is a 2D coordinate, with
    // fields x and y representing its indices on each axis.
    // This parameter is not directly provided from the calling code,
    // but provided by the Metal framework.
    uint2 gid [[thread_position_in_grid]]
) {
    if (gid.x >= params->a_rows || gid.y >= params->b_cols) {
        return; // This thread is out of matrix dimensionality range, do nothing
    }

    float sum = 0.0;
    int k;

    // Loop unrolling; improves performance by a notable margin
    for (k = 0; k <= params->a_cols - 4; k += 4) {
        sum += A[gid.x * params->a_cols + k]
             * B[k * params->b_cols + gid.y];
        sum += A[gid.x * params->a_cols + k + 1]
             * B[(k + 1) * params->b_cols + gid.y];
        sum += A[gid.x * params->a_cols + k + 2]
             * B[(k + 2) * params->b_cols + gid.y];
        sum += A[gid.x * params->a_cols + k + 3]
             * B[(k + 3) * params->b_cols + gid.y];
    }

    // Handle any remaining elements
    for (; k < params->a_cols; ++k) {
        sum += A[gid.x * params->a_cols + k] * B[k * params->b_cols + gid.y];
    }

    C[gid.x * params->b_cols + gid.y] = sum;
}
I implemented a transposed-naive function in MSL as well, for comparison. Given a pre-transposed B matrix, this is a trivial adjustment to the logic above: the inner loop now runs along B’s rows rather than down its columns:
    // Loop unrolling; improves performance by a notable margin
    for (k = 0; k <= params->a_cols - 4; k += 4) {
        sum += A[gid.x * params->a_cols + k]
             * B[gid.y * params->b_cols + k]; // Note this is gid.y * cols plus k
        sum += A[gid.x * params->a_cols + k + 1]
             * B[gid.y * params->b_cols + k + 1];
        sum += A[gid.x * params->a_cols + k + 2]
             * B[gid.y * params->b_cols + k + 2];
        sum += A[gid.x * params->a_cols + k + 3]
             * B[gid.y * params->b_cols + k + 3];
    }

    // Handle any remaining elements
    for (; k < params->a_cols; ++k) {
        sum += A[gid.x * params->a_cols + k] * B[gid.y * params->b_cols + k];
    }
I discussed this approach in an earlier blog post as a pretty easy way to improve the scalar performance of the naive algorithm, at least on CPUs. More on that later.
Objective-C Bindings
The Metal framework provides the ability to compile a library from Metal source code. Once the file contents are loaded, the binding code looks for the kernel functions by name, and initializes a new MTLComputePipelineState representing the compiled function code.
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
NSError *error = nil;

// Compile and initialize a new library located at the provided source path.
MTLCompileOptions *compileOptions = [MTLCompileOptions new];
compileOptions.languageVersion = MTLLanguageVersion3_0;

// Wrap input source path string
NSString *ss = [NSString stringWithUTF8String:source_path];

// Initialize new library containing compiled shader functions
id<MTLLibrary> lib = [device newLibraryWithSource:ss
                                          options:compileOptions
                                            error:&error];

// Create a representation of the naive multiplication public shader function in
// the Metal library created above
id<MTLFunction> naiveFunction =
    [lib newFunctionWithName:@"matrix_multiply_naive"];

// Create the new compute pipeline state
id<MTLComputePipelineState> pipelineStateNaive =
    [device newComputePipelineStateWithFunction:naiveFunction
                                          error:&error];
To actually call the native Metal code, the thread configuration needs to be set, and the GPU buffers initialized.
[computeEncoder setComputePipelineState:pipelineStateNaive];
MTLSize threadsPerGrid = MTLSizeMake(params->a_cols, params->a_rows, 1);

// Calculate a threadgroup size.
// https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes?language=objc
NSUInteger w = pipelineStateNaive.threadExecutionWidth;
NSUInteger h = pipelineStateNaive.maxTotalThreadsPerThreadgroup / w;
MTLSize threadsPerThreadgroup = MTLSizeMake(w, h, 1);

// Encode kernel function inputs
[computeEncoder setBytes:params length:16 atIndex:0];
[computeEncoder setBuffer:bufferA offset:0 atIndex:1];
[computeEncoder setBuffer:bufferB offset:0 atIndex:2];
[computeEncoder setBuffer:bufferC offset:0 atIndex:3];

// Encode the compute command.
[computeEncoder dispatchThreads:threadsPerGrid
          threadsPerThreadgroup:threadsPerThreadgroup];

// End the compute pass.
[computeEncoder endEncoding];

// Execute the command.
[commandBuffer commit];
This is a lot, so I’ll illustrate the relationships here:
Metal Performance Shaders Framework
The MPS Framework is a high-performance library provided by Apple for use with its Metal family of GPUs. It offers GPU-accelerated functionality ranging from image processing to neural network support.
APIs are primarily available through Swift or Objective-C, though there is also the Metal-cpp library for use from C++.
The MPSMatrixMultiplication API is reasonably easy to use. As with the MSL code above, the MPS commands still need to be encoded into the MTLCommandBuffer and asynchronously committed for execution.
// Define Matrix "descriptions", accounting for matrix dimensionality and byte size
MPSMatrixDescriptor *descriptorA = [MPSMatrixDescriptor matrixDescriptorWithDimensions:a_rows
                                                                               columns:a_cols
                                                                              rowBytes:a_cols * sizeof(float)
                                                                              dataType:MPSDataTypeFloat32];
MPSMatrixDescriptor *descriptorB = [MPSMatrixDescriptor matrixDescriptorWithDimensions:b_rows
                                                                               columns:b_cols
                                                                              rowBytes:b_cols * sizeof(float)
                                                                              dataType:MPSDataTypeFloat32];

// Output matrix
MPSMatrixDescriptor *descriptorC = [MPSMatrixDescriptor matrixDescriptorWithDimensions:a_rows
                                                                               columns:b_cols
                                                                              rowBytes:b_cols * sizeof(float)
                                                                              dataType:MPSDataTypeFloat32];

// Initialize matrix representations using above descriptions and matrix buffers
MPSMatrix *matrixA = [[MPSMatrix alloc] initWithBuffer:bufferA descriptor:descriptorA];
MPSMatrix *matrixB = [[MPSMatrix alloc] initWithBuffer:bufferB descriptor:descriptorB];
MPSMatrix *matrixC = [[MPSMatrix alloc] initWithBuffer:bufferC descriptor:descriptorC];

// Creates the multiplication instance
MPSMatrixMultiplication *matrixMultiplication = [[MPSMatrixMultiplication alloc] initWithDevice:device
                                                                                      resultRows:a_rows
                                                                                   resultColumns:b_cols
                                                                                 interiorColumns:a_cols];

// Encodes the multiplication command into the command buffer for the GPU
id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
[matrixMultiplication encodeToCommandBuffer:commandBuffer
                                 leftMatrix:matrixA
                                rightMatrix:matrixB
                               resultMatrix:matrixC];
Go and cgo
I don’t particularly like working with Objective-C, and the point of this program is to run code on the GPU originating from a Go program.
Cgo is a Go language feature which allows the Go compiler to understand compiler directives, contained within comments, that relate to native C code. It provides a form of foreign function interface.
Directive configuration is a little brittle, but the comments immediately preceding the line import "C" (called “the preamble”) are interpreted as header imports and compilation/linking arguments for the referenced C code. For example:
/*
#cgo LDFLAGS: -framework Foundation -framework CoreGraphics -framework Metal -framework MetalPerformanceShaders -L/opt/homebrew/opt/openblas/lib -lopenblas
#include <stdlib.h>
#include "metal.h"
*/
import "C"
- Passes linking flags to the linker via command-line LDFLAGS
- Compiles C code with the standard header stdlib.h
- Compiles C code with the local project header metal.h
It took some trial-and-error to get the right set of linker flags to work on macOS.
- Foundation: base libraries
- CoreGraphics: necessary on MacOS to interface with the GPU
- Metal: libraries and language support for Metal, including MSL
- MetalPerformanceShaders: libraries for MPS discussed above
It turns out that Apple bundles a BLAS implementation in its Accelerate framework, so in addition to installing OpenBLAS through brew, the location of the OpenBLAS library needs to be provided at link time to make sure that implementation (and not Accelerate’s) is the one picked up:
-L/opt/homebrew/opt/openblas/lib -lopenblas
The go:embed directive allows Go programs to include file contents at compile time, which is useful here because we want to pass the contents of the MSL source file (mm.metal) to the Metal framework for compilation, as discussed above.
//go:embed mm.metal
var source string

// Compile the shader source code and initialize pipelines. The metalSource
// param contains the contents of an embedded Metal Shading Language file.
func Compile(metalSource string) {
    // Wrap string in a C string
    src := C.CString(metalSource)

    // Free the above string after command queue is initialized
    defer C.free(unsafe.Pointer(src))

    // Compile the source, initialize pipelines and command queue
    C.initializePipelineAndCommandQueue(src)
}
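As a usage sketch (the real call site in the repo may look a little different), the embedded source just needs to be handed to Compile once before any multiplications are dispatched:

func main() {
    // Compile the embedded MSL source and initialize the Metal pipelines
    // and command queue before any work is dispatched to the GPU.
    Compile(source)

    // ... initialize matrices and run the benchmarks ...
}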
The references to C above are interfacing with C APIs through cgo, for example:
// Calls initializeMTLBuffers from Obj-C bindings
C.initializeMTLBuffers(
    a_data,          // Input opaque pointer for A
    b_data,          // Input opaque pointer for B
    C.int(4),        // Converts 4 into C integer type
    C.int(a.Size()),
    C.int(b.Size()),
    C.int(a.Rows*b.Cols))

params := MatrixParams{
    a_rows: int32(a.Rows),
    a_cols: int32(a.Cols),
    b_rows: int32(b.Rows),
    b_cols: int32(b.Cols),
}

// Return an unsafe pointer to this MatrixParams struct, cast to
// the native C representation defined in the shared header file
return (*C.MatrixParams)(unsafe.Pointer(&params))
Note that this means “C” effectively behaves as a reserved identifier in files that use cgo, and cannot be used as a variable or package name there.
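For reference, the Go-side MatrixParams constructed above is just four int32 fields; a sketch of its definition (assuming it mirrors the C struct declared in the shared metal.h header, as the cast above requires) looks like:

// Go-side mirror of the MatrixParams struct shared with the Obj-C bindings.
// The field order and int32 widths must match the C definition exactly,
// since an unsafe pointer to this struct is handed straight to C code; four
// int32 fields also account for the 16-byte length passed to setBytes earlier.
type MatrixParams struct {
    a_rows int32
    a_cols int32
    b_rows int32
    b_cols int32
}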
Go Implementation Baselines and OpenBLAS
I wanted to compare the performance of GPU-based matrix multiplication both to higher-level implementations, such as the Gonum library, and to intuitive, hand-written (and comparatively inefficient) implementations.
I implemented a couple different algorithms in Go, including this parallel transpose naive algorithm, which naively divides the multiplication work across N goroutines:
func (a Matrix[T]) TransposeMultParallel(b *Matrix[T]) *Matrix[T] {
    if a.Cols != b.Rows {
        panic("matrices are the wrong size for multiplication")
    }

    c_data := make([]T, a.Rows*b.Cols)
    t := b.Transpose()

    var wg sync.WaitGroup

    for i := 0; i < a.Rows; i++ {
        wg.Add(1)        // Add a count to the WaitGroup for the new goroutine
        go func(i int) { // Kick off goroutine
            defer wg.Done() // Decrease the count when the goroutine completes
            ptr := i * b.Cols
            for j := 0; j < b.Cols; j++ {
                var sum T = 0.0
                for k := 0; k < a.Cols; k++ {
                    sum += a.At(i, k) * t.At(j, k)
                }
                c_data[ptr+j] = sum
            }
        }(i)
    }

    wg.Wait() // Wait for all goroutines to complete
    return InitMatrixWithData(a.Rows, b.Cols, c_data)
}
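For comparison with the benchmark results below, the plain naive version (no transposition, no goroutines) is the same triple loop without the transposed access pattern or the WaitGroup. Here is a sketch of it, reusing the same Matrix[T] helpers; the method name is illustrative, and the one in the repo may differ:

// MultNaive is a sketch of single-threaded naive multiplication, walking
// straight down b's columns in the inner loop (which is what makes it
// cache-unfriendly compared to the transposed variant above).
func (a Matrix[T]) MultNaive(b *Matrix[T]) *Matrix[T] {
    if a.Cols != b.Rows {
        panic("matrices are the wrong size for multiplication")
    }

    c_data := make([]T, a.Rows*b.Cols)

    for i := 0; i < a.Rows; i++ {
        ptr := i * b.Cols
        for j := 0; j < b.Cols; j++ {
            var sum T = 0.0
            for k := 0; k < a.Cols; k++ {
                sum += a.At(i, k) * b.At(k, j)
            }
            c_data[ptr+j] = sum
        }
    }

    return InitMatrixWithData(a.Rows, b.Cols, c_data)
}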
Gonum BLAS is a pure Go library which implements BLAS interfaces. However, it can also be configured to divert algebraic operations to a native-code BLAS implementation like OpenBLAS through netlib.
I showed above how cgo can be configured to properly link to an OpenBLAS installation on macOS. Within the application code, the preferred BLAS implementation can be set directly. From the benchmark code:
// Convert primitive arrays into gonum dense matrix types
gonum_a := mat.NewDense(a_rows, a_cols, a64_data)
gonum_b := mat.NewDense(b_rows, b_cols, b64_data)
gonum_c := mat.NewDense(a_rows, b_cols, nil)
gonum_d := mat.NewDense(a_rows, b_cols, nil)
// Configure Gonum to use Gonum-default Go implementation
blas64.Use(gonum.Implementation{})
// Run a multiplication using Gonum BLAS impl
start = time.Now()
gonum_c.Mul(gonum_a, gonum_b)
bdata.TimeGonumNative(start)
// Configure Gonum to use Netlib which forwards operations to a
// native C-code BLAS implementation (OpenBLAS in our case)
blas64.Use(netlib.Implementation{})
// Run a multiplication using OpenBLAS impl through Gonum API
start = time.Now()
gonum_d.Mul(gonum_a, gonum_b)
bdata.TimeGonumOpenBLAS(start)
Results
My benchmarking code runs a few trials each of the following matrix multiplication implementations, and reports the average time each took to multiply two square matrices of gradually increasing dimensionality:
- Naive multiplication, in Go
- Transposed naive multiplication, in Go
- Goroutine-parallelized transposed naive multiplication, in Go
- Gonum pure Go-based BLAS multiplication
- Gonum-wrapped OpenBLAS multiplication, written in C
- Hand-implemented naive multiplication, in MSL, on GPU
- Hand-implemented transposed naive multiplication, in MSL, on GPU
- Metal Performance Shaders framework, called from Objective-C, on GPU
Benchmarking output looks like this (floats are ms):
2023-12-01 11:12:51.644 go-mm[75818:22427382] Using default device Apple M2
elements   naive     transpose  transpose_parallel  metal_naive  metal_transpose  mps   gonum  openblas
160000     196.00    201.00     42.00               8.00         9.67             0.33  4.67   6.00
250000     381.33    387.67     80.67               11.00        11.67            0.00  8.33   21.00
360000     801.00    789.33     159.33              19.00        16.33            0.00  14.33  4.67
490000     1228.00   1075.00    411.00              23.67        24.33            1.00  26.67  16.33
...
Some quick plotting through matplotlib:
As one might expect, my hand-written Go implementations are comparatively out of control. In fact, the other approaches are so fast that you can’t even tell them apart in the graph. Here’s the sliding histogram of GPU usage during this run:
You can see the GPU isn’t particularly busy, because time is mostly being spent on CPU operations. Here’s another run, excluding the slowest three multiplication techniques:
Around 16M elements (4k x 4k), Gonum begins to degrade. You can see clearly here that the GPU-based and OpenBLAS operations outperform the pure Go implementations. Looking just at the GPU-based approaches:
A couple interesting notes here:
- The Metal Performance Shaders library is amazingly fast
- There’s no real performance difference between the naive and transposed naive approaches
On the second point: this is unlike the performance characteristics of the Go-based pair of implementations above. It turns out that memory access patterns which are cache-friendly on CPUs don’t pay off the same way on GPUs, where SIMD groups (or warps) access memory differently. See GPU utilization here for comparison:
Now looking at just OpenBLAS and MPS alone — the two fastest approaches:
Around 35M elements, the OpenBLAS implementation begins to degrade, while MPS is holding steady. The difference here is pretty remarkable, with the latter completing the same 35M-element matrix multiplication operations in < 15% of the time. It’s reasonable to assume that difference would continue to grow with matrix cardinality.
Now of course, there are probably algorithmic differences between these two approaches, so this is not a fair CPU-to-GPU comparison. If I plot the performance differences between my two hand-coded implementations, it looks like this:
This shows the naive MSL-based implementation completing the multiplication of 5M elements in just 1% of the time of my Go implementation, and that ratio seems to keep improving in the GPU’s favor as the matrices grow.