---
language: "Python"
tags:
- Code
- GPyT
- code generator
license: "MIT"
---
GPyT is a GPT-2 model trained from scratch (not fine-tuned) on Python code from GitHub. The training set was roughly 80 GB of pure Python code, and the current GPyT model has completed only 2 epochs over this data, so it may benefit greatly from continued training and/or fine-tuning.
Input to the model is code, up to the context length of 1024, with all newlines replaced by the token `<N>`.
Here's an example of a quick converter to take your multi-line code and replace the newlines:
```py
inp = """def do_something():
    print("Hello")
"""
newlinechar = "<N>"
converted = inp.replace("\n", newlinechar)
print("length:", len(converted))
print(converted)
```
This should give you something like:
`def do_something():<N>    print("Hello")<N>`
...which is what the model is expecting as input.
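To turn model output back into multi-line code, just reverse the substitution. A minimal sketch (the `model_output` string here is a hypothetical generation, not real model output):

```py
# Hypothetical model output using the <N> newline token.
model_output = 'def do_something():<N>    print("Hello")<N>'

newlinechar = "<N>"
restored = model_output.replace(newlinechar, "\n")
print(restored)
```

This prints the code with its original line structure restored.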
Considerations:
1. This model is intended for educational and research use only. Do not trust model outputs.
2. The model is highly likely to regurgitate code almost exactly as it saw it. It is up to you to determine licensing if you intend to actually use the generated code.
3. All Python code was blindly pulled from GitHub. This means the included code is a mix of Python 2 and Python 3, with other subtler differences such as indentation being 2 spaces in some files and 4 in others, and more inconsistencies besides.
4. Along with the above, this means the generated code could wind up doing or suggesting just about anything. Run the generated code at your own risk...it could be *anything*.
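As one minimal (and by no means sufficient) sanity check before looking at generated code further, you can at least verify that it parses as valid Python. A sketch using the standard library's `ast` module; `generated` is a hypothetical model output with the `<N>` tokens already converted back to newlines:

```py
import ast

# Hypothetical generated code, after replacing <N> with real newlines.
generated = 'def do_something():\n    print("Hello")\n'

try:
    ast.parse(generated)  # raises SyntaxError if not valid Python 3 syntax
    print("parses as Python 3")
except SyntaxError as e:
    print("invalid syntax:", e)
```

Note that this checks syntax only, and only Python 3 syntax (the training data also includes Python 2); it says nothing about what the code would actually do if executed.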