license: apache-2.0
datasets:
  - lambdasec/cve-single-line-fixes
  - lambdasec/gh-top-1000-projects-vulns
language:
  - code
tags:
  - code
programming_language:
  - Java
  - JavaScript
  - Python
inference: false
model-index:
  - name: SantaFixer
    results:
      - task:
          type: text-generation
        dataset:
          type: openai/human-eval-infilling
          name: HumanEval
        metrics:
          - name: single-line infilling pass@1
            type: pass@1
            value: 0.47
            verified: false
          - name: single-line infilling pass@10
            type: pass@10
            value: 0.74
            verified: false
      - task:
          type: text-generation
        dataset:
          type: lambdasec/gh-top-1000-projects-vulns
          name: GH Top 1000 Projects Vulnerabilities
        metrics:
          - name: pass@1 (Java)
            type: pass@1
            value: 0.26
            verified: false
          - name: pass@10 (Java)
            type: pass@10
            value: 0.48
            verified: false
          - name: pass@1 (Python)
            type: pass@1
            value: 0.31
            verified: false
          - name: pass@10 (Python)
            type: pass@10
            value: 0.56
            verified: false
          - name: pass@1 (JavaScript)
            type: pass@1
            value: 0.36
            verified: false
          - name: pass@10 (JavaScript)
            type: pass@10
            value: 0.62
            verified: false

Model Card for SantaFixer

SantaFixer is an LLM for code that focuses on generating bug fixes using infilling: given the code before and after a faulty span, it generates the replacement.

Model Details

Model Description

How to Get Started with the Model

Use the code below to get started with the model.

# pip install -q transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "lambdasec/santafixer"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# trust_remote_code=True lets transformers run the model code shipped with the checkpoint
model = AutoModelForCausalLM.from_pretrained(checkpoint,
              trust_remote_code=True).to(device)

# Fill-in-the-middle prompt: the model generates the code that belongs
# between the prefix and the suffix.
input_text = ("<fim-prefix>def print_hello_world():\n"
              "<fim-suffix>\n print('Hello world!')"
              "<fim-middle>")
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
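
To use the model for a single-line fix, the suspect line can be masked out and the rest of the file passed as prefix and suffix. The following is a minimal sketch, not the authors' pipeline: the helper names `build_fim_prompt` and `extract_infill` are hypothetical, the masking convention (dropping the whole line including its newline) is an assumption, and `<|endoftext|>` is assumed to be the tokenizer's end-of-text marker.

```python
FIM_PREFIX = "<fim-prefix>"
FIM_SUFFIX = "<fim-suffix>"
FIM_MIDDLE = "<fim-middle>"
EOT = "<|endoftext|>"  # assumption: end-of-text marker emitted by the tokenizer


def build_fim_prompt(source: str, line_no: int) -> str:
    """Mask out line `line_no` (0-based) of `source` and return a FIM prompt."""
    lines = source.splitlines(keepends=True)
    if not 0 <= line_no < len(lines):
        raise IndexError(f"line_no {line_no} is outside the source")
    prefix = "".join(lines[:line_no])       # everything before the masked line
    suffix = "".join(lines[line_no + 1:])   # everything after the masked line
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def extract_infill(decoded: str) -> str:
    """Return the text generated after the middle marker, up to end-of-text."""
    middle = decoded.split(FIM_MIDDLE, 1)[-1]  # whole string if marker is absent
    return middle.split(EOT, 1)[0]


buggy = "def add(a, b):\n    return a - b\n"
prompt = build_fim_prompt(buggy, 1)  # mask the faulty `return` line
```

The resulting `prompt` can be passed to `tokenizer.encode` in place of `input_text`, and `extract_infill` applied to the decoded output to recover the candidate fix.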

Training Details

  • GPU: Tesla P100
  • Time: ~5 hrs

Training Data

The model was fine-tuned on the CVE single-line fixes dataset (lambdasec/cve-single-line-fixes).

Training Procedure

Supervised Fine Tuning (SFT)

Training Hyperparameters

  • optim: adafactor
  • gradient_accumulation_steps: 4
  • gradient_checkpointing: true
  • fp16: false
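
The hyperparameters above share their names with fields of the Hugging Face `TrainingArguments` class. A minimal sketch of how they could be passed along, assuming the `Trainer` API was used (the card does not say so); `SFT_KWARGS`, `make_training_arguments`, and the `output_dir` value are hypothetical, and all other settings are left at library defaults.

```python
# Hyperparameters exactly as listed in the card.
SFT_KWARGS = {
    "optim": "adafactor",
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": True,
    "fp16": False,
}


def make_training_arguments(output_dir: str = "santafixer-sft"):
    """Build TrainingArguments from the listed values (output_dir is a placeholder)."""
    # Imported lazily so the dictionary above can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **SFT_KWARGS)
```

Gradient accumulation and gradient checkpointing both trade speed for memory, which fits training on a single Tesla P100.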

Evaluation

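The model was evaluated on the GitHub top 1000 projects vulnerabilities dataset (lambdasec/gh-top-1000-projects-vulns). Reported pass@1 / pass@10 scores are 0.26 / 0.48 for Java, 0.31 / 0.56 for Python, and 0.36 / 0.62 for JavaScript. On HumanEval single-line infilling, the model scores 0.47 pass@1 and 0.74 pass@10. None of these results are independently verified.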
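
The metadata reports pass@1 and pass@10, but the card does not say how they were computed. A common choice is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass. The sketch below implements that estimator as an assumption about the metric, not as the authors' evaluation script; `pass_at_k` and `mean_pass_at_k` are illustrative names.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(10, 3), (10, 0)], k=1)` averages a problem with 3 of 10 passing samples and a problem with none.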