lvwerra (HF staff) committed
Commit 4c7193e
Parents: 4d951cf fff70fc

Merge main into pr/55 and resolve conflicts

assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
dist/assets/images/torch-compile-triton-kernel.png ADDED

Git LFS Details

  • SHA256: 5089051b4eb8fdce48de619330a97a97813ce9695e3ffa706f08406abda2f776
  • Pointer size: 131 Bytes
  • Size of remote file: 113 kB
dist/assets/images/torch-compile-triton.png ADDED

Git LFS Details

  • SHA256: ee020e48eebdbde5f5b75ae65e63a946961f0219fe3d97969d08712fae81d173
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
dist/index.html CHANGED
@@ -1933,12 +1933,62 @@
 
   <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-  <div class="figure-legend"><p>Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p></div>
+  <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
+    <div>
+      <d-code block language="python">
+        // Host code
+        void vecAdd(float* h_A, float *h_B, float *h_c, int n) {
+            // Allocate vectors in device memory
+            int size = n * sizeof(float);
+            float *d_A, *d_B, *d_C;
+            cudaMalloc(&d_A, size);
+            cudaMalloc(&d_B, size);
+            cudaMalloc(&d_C, size);
+
+            // Copy vectors from host memory to device memory
+            cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
+            cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
+
+            // Invoke kernel
+            int threadsPerBlock = 256;
+            int blocksPerGrid =
+                (N + threadsPerBlock - 1) / threadsPerBlock;
+            VecAdd&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_A, d_B, d_C, N);
+
+            // Copy result from device memory to host memory
+            // h_C contains the result in host memory
+            cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
+
+            // Free device memory
+            cudaFree(d_A);
+            cudaFree(d_B);
+            cudaFree(d_C);
+        }</d-code>
+      <div class="figure-legend">
+        <p>Host code for a CUDA kernel for adding two vectors. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
+      </div>
+    </div>
+    <div>
+      <d-code block language="python">
+        // Device code
+        __global__ void VecAdd(float* A, float* B, float* C, int N)
+        {
+            int i = blockDim.x * blockIdx.x + threadIdx.x;
+            if (i < N)
+                C[i] = A[i] + B[i];
+        }
+      </d-code>
+      <div class="figure-legend">
+        <p>Device code containing the definition of the vector addition kernel adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-  <div class="figure-legend"><p>Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p></div>
+      </div>
+    </div>
+  </div>
 
+  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
+  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  -->
   <p>Kernels are generally scheduled as follow:</p>
 
   <ul>
@@ -1974,8 +2024,9 @@
 
   <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p><img alt="image.png" src="/assets/images/torch-compile-triton.png" /></p>
 
+  <!-- <p><img alt="image.png" src="/assets/images/dp_scaling.svg"/></p> -->
 
   <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the triton kernel generated by @torch.compile . To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
 
@@ -2003,7 +2054,7 @@
   tl.store(out_ptr0 + (x0), tmp6, xmask)
   </d-code>
 
-  <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments, as demonstrated below:</p>
+  <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments (or ask an LLM to do it for us), as demonstrated below:</p>
 
   <d-code block language="python">
   @triton.jit
@@ -2034,23 +2085,25 @@
 
   <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code> we have the following performance:</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p><img alt="image.png" src="/assets/images/torch-compile-triton-kernel.png" /></p>
 
-  <p>This standalone kernel demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely here just an artifact from the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, we can focus on optimizing this generated kernel, saving us time in the process. </p>
+  <p>This standalone kernel even demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely just an artifact of the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, remember that you can start from such generated kernels and focus your attention to optimizing its performance, saving you a lot of time in the process. </p>
 
-  <p>However, in Triton, sometimes, we cannot fully achieve the peak performance of the device due to limitations in handling shared memory and scheduling within streaming multiprocessors (SMs). Our access is restricted to blocks, allowing us only to manage the scheduling of blocks across SMs. To gain even more control, we will need to implement kernels in CUDA, where we have access to all the underlying components.</p>
+  <p>Even in Triton, sometimes, we cannot fully achieve the peak performance of the device due to the language limitations to handle low level details like shared memory and scheduling within streaming multiprocessors (SMs). Triton capabilities are restricted to blocks and scheduling of blocks across SMs. To gain an even deeper control, you will need to implement kernels directly in CUDA, where you will have access to all the underlying low-level details.</p>
 
-  <p>In CUDA, there are various techniques that can be employed to make kernels more efficient; we will present just a few. These include optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times. In summary, the tools for writing code to execute instructions on the GPU are:</p>
+  <p>Moving down to CUDA, various techniques can be employed to improve the efficiency of kernels. We will just cover a few here: optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times.</p>
+
+  <p> Before we dive deeper in CUDA examples, let's summarize the tools we've seen that let us write kernel code to execute instructions on the GPU:</p>
 
-  <ul>
+  <ol>
   <li>Pytorch: easy but slow</li>
   <li>torch.compile: easy, fast, but not flexible</li>
   <li>triton: harder, faster, and more flexible</li>
   <li>CUDA: hardest, fastest, and flexiblest (if you get it right)</li>
 
-  </ul>
+  </ol>
 
-  <p>Let’s talk about one of the most frequent technique we can use: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
+  <p>Let’s talk about one of the most frequent technique we can use in CUDA: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
 
   <h4>Memory Coalescing</h4>
 
@@ -2081,8 +2134,12 @@
 
   <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
 
-  <p><img alt="image.png" src="/assets/images/memorycoalescing2.png" /></p>
-  <p><img alt="image.png" src="/assets/images/memorycoalescing3.png" /></p>
+  <div class="large-image-background">
+    <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
+  </div>
+  <div class="large-image-background">
+    <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
+  </div>
 
 
   <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
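The diffed article text above points out that the Triton kernel generated by @torch.compile can be inspected by setting TORCH_LOGS to "output_code". A minimal sketch of that workflow, assuming a CUDA-capable machine and PyTorch 2.x; the elu function and tensor sizes are illustrative stand-ins, not the article's exact example:

    # Minimal sketch: inspect the Triton code that torch.compile (Inductor) generates.
    # Run as:  TORCH_LOGS="output_code" python inspect_kernel.py
    import torch

    @torch.compile
    def elu(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        # Simple elementwise op; Inductor lowers it to a single Triton kernel.
        return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

    if __name__ == "__main__":
        x = torch.randn(4096, 4096, device="cuda")
        elu(x)  # first call triggers compilation; the generated Triton source is logged

Setting the environment variable in the shell before launching Python is the simplest way to make sure the logging configuration is picked up at import time.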
dist/main.bundle.js CHANGED
@@ -5396,7 +5396,7 @@ function _loadFragments() {
         while (1) switch (_context5.prev = _context5.next) {
           case 0:
             fragmentName = element.id.replace('fragment-', '');
-            fragmentPath = "/fragments/".concat(fragmentName, ".html");
+            fragmentPath = "fragments/".concat(fragmentName, ".html");
             return _context5.abrupt("return", new Promise(/*#__PURE__*/function () {
               var _ref = _asyncToGenerator(/*#__PURE__*/_regeneratorRuntime().mark(function _callee4(resolve, reject) {
                 var fetchPromise;
dist/main.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
dist/style.css CHANGED
@@ -424,3 +424,15 @@ d-article {
 d-code {
     font-size: 12px;
 }
+
+.large-image-background {
+    width: 100vw;
+    padding-top: 10px;
+    padding-bottom: 10px;
+    margin-left: calc(-50vw + 50%);
+    margin-right: calc(-50vw + 50%);
+    background: white;
+    height: fit-content; /* This will make it match the image height */
+    display: flex;
+    justify-content: center; /* This will center your image */
+}
src/fragmentLoader.js CHANGED
@@ -36,7 +36,7 @@ async function loadFragments() {
 
     async addFetch(element) {
         const fragmentName = element.id.replace('fragment-', '');
-        const fragmentPath = `/fragments/${fragmentName}.html`;
+        const fragmentPath = `fragments/${fragmentName}.html`;
 
         return new Promise(async (resolve, reject) => {
             try {
src/index.html CHANGED
@@ -1933,12 +1933,62 @@
 
   <p>To run the kernel, you will also need a specific code part, called <strong>host code</strong>, which is executed on the <strong>CPU/host</strong> and will take care of preparing data allocations and loading data and code.</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-  <div class="figure-legend"><p>Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p></div>
+  <div class="l-body" style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
+    <div>
+      <d-code block language="python">
+        // Host code
+        void vecAdd(float* h_A, float *h_B, float *h_c, int n) {
+            // Allocate vectors in device memory
+            int size = n * sizeof(float);
+            float *d_A, *d_B, *d_C;
+            cudaMalloc(&d_A, size);
+            cudaMalloc(&d_B, size);
+            cudaMalloc(&d_C, size);
+
+            // Copy vectors from host memory to device memory
+            cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
+            cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
+
+            // Invoke kernel
+            int threadsPerBlock = 256;
+            int blocksPerGrid =
+                (N + threadsPerBlock - 1) / threadsPerBlock;
+            VecAdd&lt;&lt;&lt;blocksPerGrid, threadsPerBlock&gt;&gt;&gt;(d_A, d_B, d_C, N);
+
+            // Copy result from device memory to host memory
+            // h_C contains the result in host memory
+            cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
+
+            // Free device memory
+            cudaFree(d_A);
+            cudaFree(d_B);
+            cudaFree(d_C);
+        }</d-code>
+      <div class="figure-legend">
+        <p>Host code for a CUDA kernel for adding two vectors. Adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
+      </div>
+    </div>
+    <div>
+      <d-code block language="python">
+        // Device code
+        __global__ void VecAdd(float* A, float* B, float* C, int N)
+        {
+            int i = blockDim.x * blockIdx.x + threadIdx.x;
+            if (i < N)
+                C[i] = A[i] + B[i];
+        }
+      </d-code>
+      <div class="figure-legend">
+        <p>Device code containing the definition of the vector addition kernel adapted from https://docs.nvidia.com/cuda/cuda-c-programming-guide/ and https://blog.codingconfessions.com/p/gpu-computing</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
-  <div class="figure-legend"><p>Device code containing the definition of the vector addition kernel from https://blog.codingconfessions.com/p/gpu-computing</p></div>
+      </div>
+    </div>
+  </div>
 
+  <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p>Figure 5: Host code for a CUDA kernel for adding two vectors from https://blog.codingconfessions.com/p/gpu-computing</p>
+  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  -->
   <p>Kernels are generally scheduled as follow:</p>
 
   <ul>
@@ -1974,8 +2024,9 @@
 
   <p>The distinction between the compiled and non-compiled versions is striking, especially given that we only added a single decorator. This remarkable difference is illustrated in the graph below (N is the number of columns):</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p><img alt="image.png" src="/assets/images/torch-compile-triton.png" /></p>
 
+  <!-- <p><img alt="image.png" src="/assets/images/dp_scaling.svg"/></p> -->
 
   <p>However, if this performance increase is insufficient, you can consider implementing Triton kernels. As a starting point, you can take a look at the triton kernel generated by @torch.compile . To do so, you simply need to set the environment variable <code>TORCH_LOGS</code> to <code>"output_code"</code>:</p>
 
@@ -2003,7 +2054,7 @@
   tl.store(out_ptr0 + (x0), tmp6, xmask)
   </d-code>
 
-  <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments, as demonstrated below:</p>
+  <p>To enhance readability, we can modify the variable names, add comments, and make slight adjustments (or ask an LLM to do it for us), as demonstrated below:</p>
 
   <d-code block language="python">
   @triton.jit
@@ -2034,23 +2085,25 @@
 
   <p>When we benchmark the generated kernel using <code>triton.testing.Benchmark</code> we have the following performance:</p>
 
-  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+  <p><img alt="image.png" src="/assets/images/torch-compile-triton-kernel.png" /></p>
 
-  <p>This standalone kernel demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely here just an artifact from the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, we can focus on optimizing this generated kernel, saving us time in the process. </p>
+  <p>This standalone kernel even demonstrates superior performance with smaller sizes compared to <code>@torch.compile</code> but this is likely just an artifact of the compilation time of <code>torch.compile</code>. In any case, instead of starting from scratch, remember that you can start from such generated kernels and focus your attention to optimizing its performance, saving you a lot of time in the process. </p>
 
-  <p>However, in Triton, sometimes, we cannot fully achieve the peak performance of the device due to limitations in handling shared memory and scheduling within streaming multiprocessors (SMs). Our access is restricted to blocks, allowing us only to manage the scheduling of blocks across SMs. To gain even more control, we will need to implement kernels in CUDA, where we have access to all the underlying components.</p>
+  <p>Even in Triton, sometimes, we cannot fully achieve the peak performance of the device due to the language limitations to handle low level details like shared memory and scheduling within streaming multiprocessors (SMs). Triton capabilities are restricted to blocks and scheduling of blocks across SMs. To gain an even deeper control, you will need to implement kernels directly in CUDA, where you will have access to all the underlying low-level details.</p>
 
-  <p>In CUDA, there are various techniques that can be employed to make kernels more efficient; we will present just a few. These include optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times. In summary, the tools for writing code to execute instructions on the GPU are:</p>
+  <p>Moving down to CUDA, various techniques can be employed to improve the efficiency of kernels. We will just cover a few here: optimizing memory access patterns to reduce latency, using shared memory to store frequently accessed data, and managing thread workloads to minimize idle times.</p>
+
+  <p> Before we dive deeper in CUDA examples, let's summarize the tools we've seen that let us write kernel code to execute instructions on the GPU:</p>
 
-  <ul>
+  <ol>
  <li>Pytorch: easy but slow</li>
  <li>torch.compile: easy, fast, but not flexible</li>
  <li>triton: harder, faster, and more flexible</li>
  <li>CUDA: hardest, fastest, and flexiblest (if you get it right)</li>
 
-  </ul>
+  </ol>
 
-  <p>Let’s talk about one of the most frequent technique we can use: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
+  <p>Let’s talk about one of the most frequent technique we can use in CUDA: optimizing memory access. The global memory in GPUs (the largest memory in our above graph) has a long latency and low bandwidth in comparison to the cache which often creates a major bottleneck for most applications. Efficiently accessing data from global memory can improve a lot the performance.</p>
 
   <h4>Memory Coalescing</h4>
 
@@ -2081,8 +2134,12 @@
 
   <p>However, when profiling this kernel with a tool like <code>ncu</code>, we can see issues, including low memory throughput and uncoalesced memory accesses.</p>
 
-  <p><img alt="image.png" src="/assets/images/memorycoalescing2.png" /></p>
-  <p><img alt="image.png" src="/assets/images/memorycoalescing3.png" /></p>
+  <div class="large-image-background">
+    <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing2.png" />
+  </div>
+  <div class="large-image-background">
+    <img width="1200px" alt="image.png" src="/assets/images/memorycoalescing3.png" />
+  </div>
 
 
   <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
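The article text in this diff also benchmarks the generated kernel with <code>triton.testing.Benchmark</code>. A minimal sketch of how such a comparison is typically set up with that API; the elu function, size grid, and provider names are assumptions for illustration, not the article's actual benchmark:

    # Minimal sketch of triton.testing.Benchmark: compare an eager PyTorch op
    # against its torch.compile version (stand-ins for the article's Triton kernel).
    import torch
    import triton
    import triton.testing

    def elu(x, alpha=1.0):
        return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

    elu_compiled = torch.compile(elu)

    @triton.testing.perf_report(
        triton.testing.Benchmark(
            x_names=["N"],                         # N is the number of columns, as in the article's plot
            x_vals=[2**i for i in range(10, 16)],
            line_arg="provider",
            line_vals=["eager", "compiled"],
            line_names=["PyTorch eager", "torch.compile"],
            ylabel="GB/s",
            plot_name="elementwise-performance",
            args={},
        )
    )
    def benchmark(N, provider):
        x = torch.randn(4096, N, device="cuda", dtype=torch.float32)
        fn = elu if provider == "eager" else elu_compiled
        ms = triton.testing.do_bench(lambda: fn(x))
        # effective bandwidth: one read + one write of the tensor per call
        return 2 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)

    benchmark.run(print_data=True, show_plots=False)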
src/style.css CHANGED
@@ -424,3 +424,15 @@ d-article {
 d-code {
     font-size: 12px;
 }
+
+.large-image-background {
+    width: 100vw;
+    padding-top: 10px;
+    padding-bottom: 10px;
+    margin-left: calc(-50vw + 50%);
+    margin-right: calc(-50vw + 50%);
+    background: white;
+    height: fit-content; /* This will make it match the image height */
+    display: flex;
+    justify-content: center; /* This will center your image */
+}
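
As a rough, self-contained illustration of the memory-coalescing point made in the article text diffed above (contiguous row-major reads vs strided reads of the same data), here is a sketch one could run alongside the CUDA discussion; the tensor size and the use of a transposed copy as the strided case are assumptions, and optimized transpose kernels will narrow the gap:

    # Contiguous copy vs copy through a transposed (strided) view of the same data.
    import torch
    import triton.testing

    A = torch.randn(8192, 8192, device="cuda")
    At = A.t()  # transposed view: same storage, but element (i, j) sits at a large stride

    contig_ms = triton.testing.do_bench(lambda: A.clone())        # reads/writes are contiguous
    strided_ms = triton.testing.do_bench(lambda: At.contiguous()) # one side of the copy is strided

    print(f"contiguous copy: {contig_ms:.3f} ms, strided (transpose) copy: {strided_ms:.3f} ms")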