Large Language Models for Instructed and Effective Code Generation Using Documentation of APIs

This thesis explores how Large Language Models, specifically the 16-billion-parameter Instruct CodeT5+ model, can generate multi-line, ready-to-execute Python code. Rather than relying solely on the knowledge an LLM acquires during pre-training, we use API documentation to improve the correctness of generated code, both for APIs the model has seen and for APIs it has not. We apply Retrieval-Augmented Generation: the user's intent, expressed in English and targeting a specific API, is used to select the most suitable segments from the relevant API documentation. The intent and the retrieved segments are then used in prompt engineering and fine-tuning. We build a newly synthesized dataset of 938 data points covering 46 distinct APIs. With this approach, code generation accuracy and utility improve, with a 0.2 increase in ICE score and a 0.33% increase in CodeBLEU. Our experimental evaluation examines the impact of seen and unseen API documentation on model performance and the effectiveness of different prompt engineering strategies. The work underscores the value of natural language processing techniques for real-world software engineering problems, with implications for automated software development and developer productivity.
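The retrieval step described above can be illustrated with a minimal sketch. The abstract does not say which retriever or prompt template the thesis uses, so everything here is an assumption: a simple bag-of-words cosine similarity stands in for the retriever, the `client.*` documentation segments are invented for illustration, and the `### ...` prompt headers are a hypothetical template.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(intent, segments, k=2):
    """Return the k documentation segments most similar to the user intent."""
    query = Counter(tokenize(intent))
    scored = [
        (cosine(query, Counter(tokenize(seg))), idx, seg)
        for idx, seg in enumerate(segments)
    ]
    # Highest similarity first; ties keep the original segment order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [seg for _, _, seg in scored[:k]]


def build_prompt(intent, retrieved):
    """Assemble a prompt from retrieved segments and the intent (hypothetical template)."""
    docs = "\n".join(f"- {seg}" for seg in retrieved)
    return (
        f"### API documentation:\n{docs}\n\n"
        f"### Task:\n{intent}\n\n"
        f"### Python code:\n"
    )


# Hypothetical documentation segments for an invented storage API.
segments = [
    "client.upload_file(path, bucket): upload a local file to the given bucket.",
    "client.list_buckets(): return the names of all buckets.",
    "client.delete_file(name, bucket): delete a file from a bucket.",
]
intent = "Upload a local file report.csv to the bucket named reports"

top_segments = retrieve(intent, segments, k=2)
prompt = build_prompt(intent, top_segments)
print(prompt)
```

The resulting `prompt` string is what would be passed to the code generation model. A production setup would more likely use dense embeddings than word counts, but the flow is the same: score segments against the intent, keep the top few, and place them in the prompt ahead of the task.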

Model size: 16.7B params (Safetensors). Tensor types: FP16, U8.

Dataset used to train IslamMesabah/CoderAPI