tomaszki committed on
Commit
7633fe1
1 Parent(s): 42b3383

Create requirements.txt and update README

Files changed (2)
  1. README.md +10 -72
  2. requirements.txt +6 -0
README.md CHANGED
@@ -1,78 +1,16 @@
- # NumpyAc: Fast Autoregressive Arithmetic Coding

- ## About

- This is a modified version of the [torchac](https://github.com/fab-jul/torchac). NumpyAc takes numpy array as input and can decode in an autoregressive mode.The backend is written in C++, the API is for PyTorch tensors. It will compile in the first run with ninja.The implementation is based on [this blog post](https://marknelson.us/posts/2014/10/19/data-compression-with-arithmetic-coding.html), meaning that we implement _arithmetic coding_. While it could be further optimized, it is already much faster than doing the equivalent thing in pure-Python (because of all the bit-shifts etc.).
-
- ### Set up conda environment
-
- This library has been tested with
-
- - PyTorch 1.5, 1.6, 1.7
- - Python 3.8
- And that's all you need. Other versions of Python may also work,
- but on-the-fly ninja compilation only works for PyTorch 1.5+.
-
- ### Example
-
- ```python
- import numpyAc
- import numpy as np
-
- # Generate random symbols and pdf.
- dim = 128
- symsNum = 2000
- pdf = np.random.rand(symsNum,dim)
- pdf = pdf / (np.sum(pdf,1,keepdims=True))
- sym = np.random.randint(0,dim,symsNum,dtype=np.int16)
- output_pdf = pdf
-
- # Encode to bytestream.
- codec = numpyAc.arithmeticCoding()
- byte_stream,real_bits = codec.encode(pdf, sym,'out.b')
-
- # Number of bits taken by the stream.
- print('real_bits',real_bits)
-
- # Theoretical bits number
- print('shannon entropy',-int(np.log2(pdf[range(0,symsNum),sym]).sum()))
-
- # Decode from bytestream.
- decodec = numpyAc.arithmeticDeCoding(None,symsNum,dim,'out.b')
-
- # Autoregressive decoding and output will be equal to the input.
- for i,s in enumerate(sym):
-     assert decodec.decode(output_pdf[i:i+1,:]) == s
  ```
-
-
- ## Important Implementation Details
-
- ### How we represent probability distributions
-
- The probabilities are specified as [PDFs](https://en.wikipedia.org/wiki/Probability_density_function).
- For each possible symbol, we need one PDF. This means that if there are `symsNum` possible symbols, and the values of them are distributed in `{0, ..., dim-1}`. The PDF ( shape (`symsNum,dim`) ) must specified the value for `symsNum` symbols.
-
- **Example**:
-
  ```
- For a symsNum = 1 particular symbol, let's say we have dim = 3 possible values.
- We can draw 4 CDF from 3 PDF to specify the symbols distribution:
-
- symbol: 0 1 2
- pdf: P(0) P(1) P(2)
- cdf: C_0 C_1 C_2 C_3
-
- This corresponds to the 3 probabilities

- P(0) = C_1 - C_0
- P(1) = C_2 - C_1
- P(2) = C_3 - C_2

- where PDF =[[ P(0), P(1) ,P(2) ]]
- NOTE: The arithmetic coder assumes that P(0) + P(1) + P(2) = 1, C_0 = 0, C_3 = 1
- ```
- The theoretical bits number can be estimated by Shannon’s source coding theorem:
- ![](https://latex.codecogs.com/svg.image?\\sum_{s}-log_2P(s))
- ## Citation
- Reference from [torchac](https://github.com/fab-jul/torchac), thanks!
+ # Python file compressor

+ ## Usage

+ Install dependencies:
  ```
+ pip install -r requirements.txt
+ ```
+ Run the app:
+ ```
+ streamlit run app.py
  ```

+ ## Results

+ We achieve about 10x compression using a 1.1B-parameter model. Compressing a 100-line file takes around 10 seconds on a GPU.
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ transformers >= 4.33.1
+ tokenizers>=0.13.3
+ torch
+ streamlit
+ ninja
+ protobuf==3.20
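
app.py itself is not part of this commit, so the mechanism behind the reported ~10x compression is only implied by the dependency list and by the arithmetic-coding README that this commit removes. As a hedged illustration, the sketch below shows one way a causal language model's next-token probabilities could drive the numpyAc encoder from the old README; the model choice, function name, and wiring are assumptions for illustration, not this repository's actual code.

```python
# Hypothetical sketch only: this commit does not include app.py, so the wiring
# below is an assumption based on the dependencies above and on the numpyAc API
# documented in the README that this commit removes.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import numpyAc  # arithmetic coder from the removed README; assumed available


def compress_file(path, model_name, out_path="out.b"):
    # model_name is a placeholder for any ~1.1B-parameter causal LM checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    text = open(path).read()
    ids = tokenizer(text, return_tensors="pt").input_ids  # shape (1, symsNum)

    with torch.no_grad():
        logits = model(ids).logits  # shape (1, symsNum, vocab)

    # Predicted distribution for each token given its prefix; the first token
    # has no prefix, so it gets a uniform distribution.
    probs = torch.softmax(logits[0, :-1], dim=-1)
    vocab = probs.shape[-1]
    uniform = torch.full((1, vocab), 1.0 / vocab)
    pdf = torch.cat([uniform, probs], dim=0).numpy()  # shape (symsNum, vocab)

    # int16 follows the removed README's example; a vocabulary with ids above
    # 32767 would need a wider integer type.
    sym = ids[0].numpy().astype(np.int16)

    codec = numpyAc.arithmeticCoding()
    byte_stream, real_bits = codec.encode(pdf, sym, out_path)
    print("compressed bits:", real_bits, "original bits:", 8 * len(text.encode()))
    return byte_stream
```

Decompression would mirror the removed README's autoregressive loop: re-run the model token by token and feed each step's pdf to `numpyAc.arithmeticDeCoding(...).decode(...)`.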