lvwerra (HF staff) committed
Commit 4c7193e
Parents: 4d951cf fff70fc

Merge main into pr/55 and resolve conflicts

assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
dist/assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
dist/assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
dist/index.html CHANGED
@@ -1933,12 +1933,62 @@
 
   <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-  <div class="figure-legend"><p>Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p></div>
+  <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
+    <div>
+      <d-code block language="python">
+        // Host code
+        void vecAdd(float* h_A, float *h_B, float *h_c, int n) {
+            // Allocate vectors in device memory
+            int size = n * sizeof(float);
+            float *d_A, *d_B, *d_C;
+            cudaMalloc(&d_A, size);
+            cudaMalloc(&d_B, size);
+            cudaMalloc(&d_C, size);
+
+            // Copy vectors from host memory to device memory
+            cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
+            cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
+
+            // Invoke kernel
+            int threadsPerBlock = 256;
+            int blocksPerGrid =
+                (N + threadsPerBlock - 1) / threadsPerBlock;
+            VecAdd&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_A, d_B, d_C, N);
+
+            // Copy result from device memory to host memory
+            // h_C contains the result in host memory
+            cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
+
+            // Free device memory
+            cudaFree(d_A);
+            cudaFree(d_B);
+            cudaFree(d_C);
+        }</d-code>
+      <div class="figure-legend">
+        <p>Host code for a CUDA kernel for adding two vectors. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
+      </div>
+    </div>
+    <div>
+      <d-code block language="python">
+        // Device code
+        __global__ void VecAdd(float* A, float* B, float* C, int N)
+        {
+            int i = blockDim.x * blockIdx.x + threadIdx.x;
+            if (i < N)
+                C[i] = A[i] + B[i];
+        }
+      </d-code>
+      <div class="figure-legend">
+        <p>Device code containing the definition of the vector addition kernel adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-  <div class="figure-legend"><p>Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p></div>
+      </div>
+    </div>
+  </div>
 
+  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
+  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  -->
   <p>Kernels are generally scheduled as follow:</p>
 
   <ul>
@@ -1974,8 +2024,9 @@
 
   <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p><img alt="image.png" src="/assets/images/torch-compile-triton.png" /></p>
 
+  <!-- <p><img alt="image.png" src="/assets/images/dp_scaling.svg"/></p> -->
 
   <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the triton kernel generated by @torch.compile . To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
 
@@ -2003,7 +2054,7 @@
   tl.store(out_ptr0 + (x0), tmp6, xmask)
   </d-code>
 
-  <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments, as demonstrated below:</p>
+  <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments (or ask an LLM to do it for us), as demonstrated below:</p>
 
   <d-code block language="python">
   @triton.jit
@@ -2034,23 +2085,25 @@
 
   <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code> we have the following performance:</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p><img alt="image.png" src="/assets/images/torch-compile-triton-kernel.png" /></p>
 
-  <p>This standalone kernel demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely here just an artifact from the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, we can focus on optimizing this generated kernel, saving us time in the process. </p>
+  <p>This standalone kernel even demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely just an artifact of the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, remember that you can start from such generated kernels and focus your attention to optimizing its performance, saving you a lot of time in the process. </p>
 
-  <p>However, in Triton, sometimes, we cannot fully achieve the peak performance of the device due to limitations in handling shared memory and scheduling within streaming multiprocessors (SMs). Our access is restricted to blocks, allowing us only to manage the scheduling of blocks across SMs. To gain even more control, we will need to implement kernels in CUDA, where we have access to all the underlying components.</p>
+  <p>Even in Triton, sometimes, we cannot fully achieve the peak performance of the device due to the language limitations to handle low level details like shared memory and scheduling within streaming multiprocessors (SMs). Triton capabilities are restricted to blocks and scheduling of blocks across SMs. To gain an even deeper control, you will need to implement kernels directly in CUDA, where you will have access to all the underlying low-level details.</p>
 
-  <p>In CUDA, there are various techniques that can be employed to make kernels more efficient; we will present just a few. These include optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times. In summary, the tools for writing code to execute instructions on the GPU are:</p>
+  <p>Moving down to CUDA, various techniques can be employed to improve the efficiency of kernels. We will just cover a few here: optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times.</p>
+
+  <p> Before we dive deeper in CUDA examples, let's summarize the tools we've seen that let us write kernel code to execute instructions on the GPU:</p>
 
-  <ul>
+  <ol>
   <li>Pytorch: easy but slow</li>
   <li>torch.compile: easy, fast, but not flexible</li>
   <li>triton: harder, faster, and more flexible</li>
   <li>CUDA: hardest, fastest, and flexiblest (if you get it right)</li>
 
-  </ul>
+  </ol>
 
-  <p>Let’s talk about one of the most frequent technique we can use: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
+  <p>Let’s talk about one of the most frequent technique we can use in CUDA: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
 
   <h4>Memory Coalescing</h4>
 
@@ -2081,8 +2134,12 @@
 
   <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
 
-  <p><img alt="image.png" src="/assets/images/memorycoalescing2.png" /></p>
-  <p><img alt="image.png" src="/assets/images/memorycoalescing3.png" /></p>
+  <div class="large-image-background">
+    <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
+  </div>
+  <div class="large-image-background">
+    <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
+  </div>
 
 
   <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
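The diffed article text above points out that the Triton kernel generated by @torch.compile can be inspected by setting TORCH_LOGS to "output_code". A minimal sketch of that workflow, assuming a CUDA-capable machine and PyTorch 2.x; the elu function and tensor sizes are illustrative stand-ins, not the article's exact example:

    # Minimal sketch: inspect the Triton code that torch.compile (Inductor) generates.
    # Run as:  TORCH_LOGS="output_code" python inspect_kernel.py
    import torch

    @torch.compile
    def elu(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        # Simple elementwise op; Inductor lowers it to a single Triton kernel.
        return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

    if __name__ == "__main__":
        x = torch.randn(4096, 4096, device="cuda")
        elu(x)  # first call triggers compilation; the generated Triton source is logged

Setting the environment variable in the shell before launching Python is the simplest way to make sure the logging configuration is picked up at import time.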
dist/main.bundle.js CHANGED
@@ -5396,7 +5396,7 @@ function _loadFragments() {
         while (1) switch (_context5.prev = _context5.next) {
           case 0:
             fragmentName = element.id.replace('fragment-', '');
-            fragmentPath = "/fragments/".concat(fragmentName, ".html");
+            fragmentPath = "fragments/".concat(fragmentName, ".html");
             return _context5.abrupt("return", new Promise(/*#__PURE__*/function () {
               var _ref = _asyncToGenerator(/*#__PURE__*/_regeneratorRuntime().mark(function _callee4(resolve, reject) {
                 var fetchPromise;
dist/main.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
dist/style.css CHANGED
@@ -424,3 +424,15 @@ d-article {
 d-code {
     font-size: 12px;
 }
+
+.large-image-background {
+    width: 100vw;
+    padding-top: 10px;
+    padding-bottom: 10px;
+    margin-left: calc(-50vw + 50%);
+    margin-right: calc(-50vw + 50%);
+    background: white;
+    height: fit-content; /* This will make it match the image height */
+    display: flex;
+    justify-content: center; /* This will center your image */
+}
src/fragmentLoader.js CHANGED
@@ -36,7 +36,7 @@ async function loadFragments() {
 
     async addFetch(element) {
         const fragmentName = element.id.replace('fragment-', '');
-        const fragmentPath = `/fragments/${fragmentName}.html`;
+        const fragmentPath = `fragments/${fragmentName}.html`;
 
         return new Promise(async (resolve, reject) => {
             try {
src/index.html CHANGED
@@ -1933,12 +1933,62 @@
 
   <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-  <div class="figure-legend"><p>Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p></div>
+  <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
+    <div>
+      <d-code block language="python">
+        // Host code
+        void vecAdd(float* h_A, float *h_B, float *h_c, int n) {
+            // Allocate vectors in device memory
+            int size = n * sizeof(float);
+            float *d_A, *d_B, *d_C;
+            cudaMalloc(&d_A, size);
+            cudaMalloc(&d_B, size);
+            cudaMalloc(&d_C, size);
+
+            // Copy vectors from host memory to device memory
+            cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
+            cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
+
+            // Invoke kernel
+            int threadsPerBlock = 256;
+            int blocksPerGrid =
+                (N + threadsPerBlock - 1) / threadsPerBlock;
+            VecAdd&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_A, d_B, d_C, N);
+
+            // Copy result from device memory to host memory
+            // h_C contains the result in host memory
+            cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
+
+            // Free device memory
+            cudaFree(d_A);
+            cudaFree(d_B);
+            cudaFree(d_C);
+        }</d-code>
+      <div class="figure-legend">
+        <p>Host code for a CUDA kernel for adding two vectors. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
+      </div>
+    </div>
+    <div>
+      <d-code block language="python">
+        // Device code
+        __global__ void VecAdd(float* A, float* B, float* C, int N)
+        {
+            int i = blockDim.x * blockIdx.x + threadIdx.x;
+            if (i < N)
+                C[i] = A[i] + B[i];
+        }
+      </d-code>
+      <div class="figure-legend">
+        <p>Device code containing the definition of the vector addition kernel adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-  <div class="figure-legend"><p>Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p></div>
+      </div>
+    </div>
+  </div>
 
+  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
+  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  -->
   <p>Kernels are generally scheduled as follow:</p>
 
   <ul>
@@ -1974,8 +2024,9 @@
 
   <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p><img alt="image.png" src="/assets/images/torch-compile-triton.png" /></p>
 
+  <!-- <p><img alt="image.png" src="/assets/images/dp_scaling.svg"/></p> -->
 
   <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the triton kernel generated by @torch.compile . To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
 
@@ -2003,7 +2054,7 @@
   tl.store(out_ptr0 + (x0), tmp6, xmask)
   </d-code>
 
-  <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments, as demonstrated below:</p>
+  <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments (or ask an LLM to do it for us), as demonstrated below:</p>
 
   <d-code block language="python">
   @triton.jit
@@ -2034,23 +2085,25 @@
 
   <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code> we have the following performance:</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p><img alt="image.png" src="/assets/images/torch-compile-triton-kernel.png" /></p>
 
-  <p>This standalone kernel demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely here just an artifact from the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, we can focus on optimizing this generated kernel, saving us time in the process. </p>
+  <p>This standalone kernel even demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely just an artifact of the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, remember that you can start from such generated kernels and focus your attention to optimizing its performance, saving you a lot of time in the process. </p>
 
-  <p>However, in Triton, sometimes, we cannot fully achieve the peak performance of the device due to limitations in handling shared memory and scheduling within streaming multiprocessors (SMs). Our access is restricted to blocks, allowing us only to manage the scheduling of blocks across SMs. To gain even more control, we will need to implement kernels in CUDA, where we have access to all the underlying components.</p>
+  <p>Even in Triton, sometimes, we cannot fully achieve the peak performance of the device due to the language limitations to handle low level details like shared memory and scheduling within streaming multiprocessors (SMs). Triton capabilities are restricted to blocks and scheduling of blocks across SMs. To gain an even deeper control, you will need to implement kernels directly in CUDA, where you will have access to all the underlying low-level details.</p>
 
-  <p>In CUDA, there are various techniques that can be employed to make kernels more efficient; we will present just a few. These include optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times. In summary, the tools for writing code to execute instructions on the GPU are:</p>
+  <p>Moving down to CUDA, various techniques can be employed to improve the efficiency of kernels. We will just cover a few here: optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times.</p>
+
+  <p> Before we dive deeper in CUDA examples, let's summarize the tools we've seen that let us write kernel code to execute instructions on the GPU:</p>
 
-  <ul>
+  <ol>
  <li>Pytorch: easy but slow</li>
  <li>torch.compile: easy, fast, but not flexible</li>
  <li>triton: harder, faster, and more flexible</li>
  <li>CUDA: hardest, fastest, and flexiblest (if you get it right)</li>
 
-  </ul>
+  </ol>
 
-  <p>Let’s talk about one of the most frequent technique we can use: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
+  <p>Let’s talk about one of the most frequent technique we can use in CUDA: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
 
   <h4>Memory Coalescing</h4>
 
@@ -2081,8 +2134,12 @@
 
   <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
 
-  <p><img alt="image.png" src="/assets/images/memorycoalescing2.png" /></p>
-  <p><img alt="image.png" src="/assets/images/memorycoalescing3.png" /></p>
+  <div class="large-image-background">
+    <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
+  </div>
+  <div class="large-image-background">
+    <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
+  </div>
 
 
   <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
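The article text in this diff also benchmarks the generated kernel with <code>triton.testing.Benchmark</code>. A minimal sketch of how such a comparison is typically set up with that API; the elu function, size grid, and provider names are assumptions for illustration, not the article's actual benchmark:

    # Minimal sketch of triton.testing.Benchmark: compare an eager PyTorch op
    # against its torch.compile version (stand-ins for the article's Triton kernel).
    import torch
    import triton
    import triton.testing

    def elu(x, alpha=1.0):
        return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

    elu_compiled = torch.compile(elu)

    @triton.testing.perf_report(
        triton.testing.Benchmark(
            x_names=["N"],                         # N is the number of columns, as in the article's plot
            x_vals=[2**i for i in range(10, 16)],
            line_arg="provider",
            line_vals=["eager", "compiled"],
            line_names=["PyTorch eager", "torch.compile"],
            ylabel="GB/s",
            plot_name="elementwise-performance",
            args={},
        )
    )
    def benchmark(N, provider):
        x = torch.randn(4096, N, device="cuda", dtype=torch.float32)
        fn = elu if provider == "eager" else elu_compiled
        ms = triton.testing.do_bench(lambda: fn(x))
        # effective bandwidth: one read + one write of the tensor per call
        return 2 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)

    benchmark.run(print_data=True, show_plots=False)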
src/style.css CHANGED
@@ -424,3 +424,15 @@ d-article {
 d-code {
     font-size: 12px;
 }
+
+.large-image-background {
+    width: 100vw;
+    padding-top: 10px;
+    padding-bottom: 10px;
+    margin-left: calc(-50vw + 50%);
+    margin-right: calc(-50vw + 50%);
+    background: white;
+    height: fit-content; /* This will make it match the image height */
+    display: flex;
+    justify-content: center; /* This will center your image */
+}
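
As a rough, self-contained illustration of the memory-coalescing point made in the article text diffed above (contiguous row-major reads vs strided reads of the same data), here is a sketch one could run alongside the CUDA discussion; the tensor size and the use of a transposed copy as the strided case are assumptions, and optimized transpose kernels will narrow the gap:

    # Contiguous copy vs copy through a transposed (strided) view of the same data.
    import torch
    import triton.testing

    A = torch.randn(8192, 8192, device="cuda")
    At = A.t()  # transposed view: same storage, but element (i, j) sits at a large stride

    contig_ms = triton.testing.do_bench(lambda: A.clone())        # reads/writes are contiguous
    strided_ms = triton.testing.do_bench(lambda: At.contiguous()) # one side of the copy is strided

    print(f"contiguous copy: {contig_ms:.3f} ms, strided (transpose) copy: {strided_ms:.3f} ms")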