MapReduce Matrix Multiplication in Java

CUda Matrix Multiply library.

INT32 Data Range Limitation: The original cumm matrix multiplication operation raises an error when its data exceeds the int32 range. When the mesh is very large, this ...
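The snippet does not say exactly which quantity overflows. As an assumption for illustration only, the Java sketch below shows one common way very large problems hit an int32 limit: the element count (or flat index) of a big matrix no longer fits in a 32-bit `int`. The class name `Int32RangeDemo` and the 100,000 × 100,000 size are hypothetical. This is not cumm's code.

```java
public class Int32RangeDemo {
    // Element count of a rows x cols matrix in 32-bit arithmetic: silently wraps on overflow.
    public static int elementCountInt(int rows, int cols) {
        return rows * cols;
    }

    // Same count in 64-bit arithmetic; Math.multiplyExact throws ArithmeticException on long overflow.
    public static long elementCountLong(int rows, int cols) {
        return Math.multiplyExact((long) rows, (long) cols);
    }

    // True if the count can still be addressed with an int32 (Java int) index.
    public static boolean fitsInInt32(long count) {
        return count <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int rows = 100_000, cols = 100_000; // hypothetical "very large" sizes
        System.out.println(elementCountInt(rows, cols));   // wrapped 32-bit result
        System.out.println(elementCountLong(rows, cols));  // true count
        System.out.println(fitsInInt32(elementCountLong(rows, cols)));
    }
}
```

With these sizes the 32-bit product wraps to 1410065408 instead of 10000000000. A library that checks for this case, for example with `Math.multiplyExact(int, int)`, would raise an error rather than return a wrong value.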

GitHub

Triton-Optimized MatMul Kernel and MPI-Based Parallel Training

In this project, I implemented a high-performance matrix multiplication kernel using Triton, optimized for execution on NVIDIA T4 GPUs. The kernel computes D = ReLU(A × B + C) by leveraging shared ...
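The fused operation D = ReLU(A × B + C) can be pinned down with a plain CPU reference, which is handy for checking a GPU kernel's output. The sketch below is a straightforward triple loop in Java, assuming dense row-major `double[][]` matrices with A of size m × k, B of size k × n, and C and D of size m × n. It does not reproduce the Triton kernel, its tiling, or its use of GPU memory. The names `FusedMatMulReLU` and `matmulAddRelu` are hypothetical.

```java
import java.util.Arrays;

public final class FusedMatMulReLU {
    // Computes D = ReLU(A x B + C): A is m x k, B is k x n, C and D are m x n.
    public static double[][] matmulAddRelu(double[][] a, double[][] b, double[][] c) {
        int m = a.length, k = b.length, n = b[0].length;
        if (a[0].length != k || c.length != m || c[0].length != n) {
            throw new IllegalArgumentException("incompatible shapes");
        }
        double[][] d = new double[m][n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                double acc = c[i][j];              // start from the C term
                for (int p = 0; p < k; p++) {
                    acc += a[i][p] * b[p][j];      // accumulate row i of A dot column j of B
                }
                d[i][j] = Math.max(0.0, acc);      // ReLU: clamp negatives to zero
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{1, -1}, {0, 1}};
        double[][] c = {{0, 0}, {-10, 0}};
        System.out.println(Arrays.deepToString(matmulAddRelu(a, b, c)));
        // prints [[1.0, 1.0], [0.0, 1.0]]
    }
}
```

In the example, A × B + C gives [[1, 1], [-7, 1]], and the ReLU zeroes the single negative entry. Folding the addition and activation into the inner loop avoids materializing A × B as a separate matrix. That is the same motivation behind fusing them into one GPU kernel.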
