[![PyPI](https://img.shields.io/pypi/v/spatial-correlation-sampler.svg)](https://pypi.org/project/spatial-correlation-sampler/)

# PyTorch Correlation module

This is a custom C++/CUDA implementation of the Correlation module, used e.g. in [FlowNetC](https://arxiv.org/abs/1504.06852).

This [tutorial](http://pytorch.org/tutorials/advanced/cpp_extension.html) was used as a basis for the implementation, as well as
[NVIDIA's CUDA code](https://github.com/NVIDIA/flownet2-pytorch/tree/master/networks/correlation_package).

- Build and install the C++ and CUDA extensions by executing `python setup.py install`,
- Benchmark C++ vs. CUDA by running `python benchmark.py {cpu, cuda}`,
- Run gradient checks on the code by running `python grad_check.py --backend {cpu, cuda}`.

# Requirements

This module is expected to compile for PyTorch `2.1.0`.

Before installing, check the _Compute Capability_ of your GPU and its compatibility with your CUDA version in the [NVIDIA docs](https://developer.nvidia.com/cuda-gpus).
For example, the RTX 6000 (Ada generation) has a compute capability of 8.9, so the environment variable is set with

`export TORCH_CUDA_ARCH_LIST="8.9+PTX"`

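If PyTorch is already installed, the value can be derived from the GPU itself instead of being looked up by hand. The snippet below is a minimal sketch: `arch_list_entry` is a hypothetical helper (not part of this module), and it assumes the first visible CUDA device is the one you are compiling for.

```python
def arch_list_entry(major: int, minor: int, with_ptx: bool = True) -> str:
    """Format a compute capability as a TORCH_CUDA_ARCH_LIST entry (hypothetical helper)."""
    entry = f"{major}.{minor}"
    return entry + "+PTX" if with_ptx else entry


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple, e.g. (8, 9)
        major, minor = torch.cuda.get_device_capability(0)
        print(f'export TORCH_CUDA_ARCH_LIST="{arch_list_entry(major, minor)}"')
    else:
        print("No CUDA device visible; look up the compute capability manually.")
```

Run it once and paste the printed line in your shell before building.
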
# Installation

This module requires `python3-dev` to compile the C++ code, e.g. on Ubuntu run:

`apt install python3-dev`

This module is available on pip:

`pip install spatial-correlation-sampler`

For a CPU-only version, you can install from source with

`python setup_cpu.py install`

# Known Problems

This module needs compatible versions of gcc and CUDA to compile.
Namely, CUDA 9.1 and below need gcc5, while CUDA 9.2 and 10.0 need gcc7.
See [this issue](https://github.com/ClementPinard/Pytorch-Correlation-extension/issues/1) for more information.

# Usage

The API has a few differences from NVIDIA's module:

* The output is now a 5D tensor, which reflects the horizontal and vertical shifts.

```
input (B x C x H x W) -> output (B x PatchH x PatchW x oH x oW)
```

* Output sizes `oH` and `oW` no longer depend on the patch size, but only on the kernel size and padding.
* Patch size `patch_size` is now the whole patch, and not only the radii.
* `stride1` is now `stride` and `stride2` is `dilation_patch`, which behaves like dilated convolutions.
* The equivalent `max_displacement` is then `dilation_patch * (patch_size - 1) / 2`.
* `dilation` is a new parameter; it acts on the correlation kernel the same way as in dilated convolutions.
* To get the right parameters for FlowNetC, you would have

```
kernel_size=1,
patch_size=21,
stride=1,
padding=0,
dilation=1,
dilation_patch=2
```
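
The parameter mapping above can be sanity-checked with a few lines of arithmetic. This is a sketch, not the module's code: both helper names are hypothetical, and the `oH`/`oW` formula assumes the usual convolution output-size arithmetic (dilated kernel, padding, stride), which is not spelled out in this README.

```python
def max_displacement(patch_size, dilation_patch):
    """Equivalent NVIDIA-style max_displacement, using the formula given above."""
    return dilation_patch * (patch_size - 1) / 2


def correlation_output_shape(batch, height, width, kernel_size=1, patch_size=1,
                             stride=1, padding=0, dilation=1):
    """Expected 5D output shape (B, PatchH, PatchW, oH, oW).

    ASSUMPTION: oH and oW follow standard convolution arithmetic.
    """
    dilated_kernel = (kernel_size - 1) * dilation + 1
    out_h = (height + 2 * padding - dilated_kernel) // stride + 1
    out_w = (width + 2 * padding - dilated_kernel) // stride + 1
    return (batch, patch_size, patch_size, out_h, out_w)


if __name__ == "__main__":
    # FlowNetC parameters on a 4 x 256 x 48 x 64 feature map
    print(max_displacement(patch_size=21, dilation_patch=2))
    print(correlation_output_shape(4, 48, 64, kernel_size=1, patch_size=21))
```

With the FlowNetC parameters, `patch_size=21` and `dilation_patch=2` correspond to a `max_displacement` of 20, and the 21 x 21 patch shows up as the second and third output dimensions.
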

## Example

```python
import torch
from spatial_correlation_sampler import SpatialCorrelationSampler, spatial_correlation_sample

device = "cuda"
batch_size = 1
channel = 1
H = 10
W = 10
dtype = torch.float32

input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True)
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True)

# You can either use the function or the module. Note that the module doesn't contain any parameter tensor.

# function
out = spatial_correlation_sample(input1,
                                 input2,
                                 kernel_size=3,
                                 patch_size=1,
                                 stride=2,
                                 padding=0,
                                 dilation=2,
                                 dilation_patch=1)

# module
correlation_sampler = SpatialCorrelationSampler(
    kernel_size=3,
    patch_size=1,
    stride=2,
    padding=0,
    dilation=2,
    dilation_patch=1)
out = correlation_sampler(input1, input2)
```
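
To make the meaning of the two patch dimensions concrete, here is a slow pure-Python reference for the simplest case (`kernel_size=1`, `stride=1`, `padding=0`, `dilation=1`, one batch element). It is a sketch of the semantics only: `naive_correlation` is a hypothetical name, and it assumes that shifts falling outside `input2` contribute zero and that the first patch index is the vertical shift.

```python
def naive_correlation(in1, in2, patch_size=1, dilation_patch=1):
    """Reference correlation for kernel_size=1 on nested lists shaped [C][H][W].

    Returns a nested list shaped [patch_size][patch_size][H][W].
    ASSUMPTION: out-of-bounds shifts contribute zero.
    """
    channels, height, width = len(in1), len(in1[0]), len(in1[0][0])
    radius = (patch_size - 1) // 2
    out = [[[[0.0] * width for _ in range(height)]
            for _ in range(patch_size)] for _ in range(patch_size)]
    for ph in range(patch_size):
        dy = (ph - radius) * dilation_patch  # vertical shift applied to in2
        for pw in range(patch_size):
            dx = (pw - radius) * dilation_patch  # horizontal shift applied to in2
            for y in range(height):
                for x in range(width):
                    y2, x2 = y + dy, x + dx
                    if 0 <= y2 < height and 0 <= x2 < width:
                        # dot product over channels between in1 and shifted in2
                        out[ph][pw][y][x] = sum(
                            in1[c][y][x] * in2[c][y2][x2] for c in range(channels))
    return out


if __name__ == "__main__":
    a = [[[1, 2], [3, 4]]]  # C=1, H=2, W=2
    b = [[[5, 6], [7, 8]]]
    result = naive_correlation(a, b, patch_size=3)
    print(result[1][1])  # zero shift: element-wise product of a and b
```

The centre entry of the patch (`result[1][1]` here) is the plain per-pixel product; the other entries compare each pixel of the first input with a shifted pixel of the second.
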

# Benchmark

* Default parameters are from `benchmark.py`. FlowNetC parameters are the same as those used in `FlowNetC` with a batch size of 4, described in [this paper](https://arxiv.org/abs/1504.06852), implemented [here](https://github.com/lmb-freiburg/flownet2) and [here](https://github.com/NVIDIA/flownet2-pytorch/blob/master/networks/FlowNetC.py).
* Feel free to file an issue to add entries to this with your hardware!

## CUDA Benchmark

* See [here](https://gist.github.com/ClementPinard/270e910147119831014932f67fb1b5ea) for a benchmark script working with [NVIDIA](https://github.com/NVIDIA/flownet2-pytorch/tree/master/networks/correlation_package)'s code and PyTorch.
* Benchmarks are launched with the environment variable `CUDA_LAUNCH_BLOCKING` set to `1`.
* Only `float32` is benchmarked.
* FlowNetC correlation parameters were launched with the following commands:

```bash
CUDA_LAUNCH_BLOCKING=1 python benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256 cuda -d float
CUDA_LAUNCH_BLOCKING=1 python NV_correlation_benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256
```

| implementation | Correlation parameters | device  | pass     |       min time |       avg time |
| -------------- | ---------------------- | ------- | -------- | -------------: | -------------: |
| ours           | default                | 980 GTX | forward  |   **5.745 ms** |   **5.851 ms** |
| ours           | default                | 980 GTX | backward |      77.694 ms |      77.957 ms |
| NVIDIA         | default                | 980 GTX | forward  |      13.779 ms |      13.853 ms |
| NVIDIA         | default                | 980 GTX | backward |  **73.383 ms** |  **73.708 ms** |
|                |                        |         |          |                |                |
| ours           | FlowNetC               | 980 GTX | forward  |  **26.102 ms** |  **26.179 ms** |
| ours           | FlowNetC               | 980 GTX | backward | **208.091 ms** | **208.510 ms** |
| NVIDIA         | FlowNetC               | 980 GTX | forward  |      35.363 ms |      35.550 ms |
| NVIDIA         | FlowNetC               | 980 GTX | backward |     283.748 ms |     284.346 ms |
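
For readers who want to produce comparable "min time" and "avg time" figures on their own hardware, the sketch below shows one way to collect them. It is not the code of `benchmark.py`: the `benchmark` helper, its `runs`/`warmup` defaults and the dummy workload are all assumptions made for illustration. CUDA kernels are launched asynchronously, so wall-clock timing of GPU code is only meaningful with `CUDA_LAUNCH_BLOCKING=1` (as above) or an explicit `torch.cuda.synchronize()` inside the timed function.

```python
import time


def benchmark(fn, runs=100, warmup=10, scale=1e3):
    """Return (min, avg) wall-clock time of fn(); scale=1e3 reports milliseconds.

    Hypothetical helper, not taken from benchmark.py.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * scale)
    return min(timings), sum(timings) / len(timings)


if __name__ == "__main__":
    # Dummy CPU workload standing in for a forward or backward pass
    t_min, t_avg = benchmark(lambda: sum(range(10_000)), runs=20)
    print(f"min {t_min:.3f} ms | avg {t_avg:.3f} ms")
```
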

### Notes

* The overhead of our implementation during the backward pass when `kernel_size` > 1 needs some investigation; feel free to
  dive into the code to improve it!
* The backward pass of NVIDIA's implementation is not entirely correct when `stride1` > 1 and `kernel_size` > 1, because not everything
  is computed, see [here](https://github.com/NVIDIA/flownet2-pytorch/blob/master/networks/correlation_package/src/correlation_cuda_kernel.cu#L120).

## CPU Benchmark

* No other implementation is available on CPU.
* Running on CPU is not recommended if you have a GPU.

| Correlation parameters | device               | pass     |   min time |   avg time |
| ---------------------- | -------------------- | -------- | ---------: | ---------: |
| default                | E5-2630 v3 @ 2.40GHz | forward  | 159.616 ms | 188.727 ms |
| default                | E5-2630 v3 @ 2.40GHz | backward | 282.641 ms | 294.194 ms |
| FlowNetC               | E5-2630 v3 @ 2.40GHz | forward  |    2.138 s |    2.144 s |
| FlowNetC               | E5-2630 v3 @ 2.40GHz | backward |    7.006 s |    7.075 s |