arxiv:2410.01215

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Published on Oct 2
· Submitted by YerbaPage on Oct 3
#3 Paper of the day
Authors:

Abstract

While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked by subtle errors that often require human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce the Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at multiple levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations on HumanEval and a 97.6% repair success rate on HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
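For a concrete sense of the bottom-up loop described above, here is a minimal sketch. It is an illustration under assumptions, not the paper's implementation: the flat AST-based decomposition, the prompt wording, and the `llm` callable below stand in for MGDebugger's hierarchical tree construction, subfunction test generation, and LLM-simulated executor.

```python
import ast
from typing import Callable


def extract_subfunctions(code: str) -> list[str]:
    """Collect function definitions, most deeply nested first, as a crude
    stand-in for the paper's hierarchical (tree-structured) decomposition."""
    tree = ast.parse(code)
    found: list[tuple[int, str]] = []

    def visit(node: ast.AST, depth: int) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                src = ast.get_source_segment(code, child) or ast.unparse(child)
                found.append((depth, src))
                visit(child, depth + 1)
            else:
                visit(child, depth)

    visit(tree, 0)
    # Deepest functions first -> bottom-up order.
    return [src for _, src in sorted(found, key=lambda pair: -pair[0])]


def debug_bottom_up(code: str, tests: str, llm: Callable[[str], str]) -> str:
    """Visit subfunctions bottom-up; for each, ask the LLM to simulate its
    execution on the tests, track variable states, and return a fixed version."""
    for sub in extract_subfunctions(code):
        prompt = (
            "Simulate executing this Python function on the tests below, "
            "tracking important variable states. If you find a bug, return a "
            "corrected function; otherwise return it unchanged.\n\n"
            f"Function:\n{sub}\n\nTests:\n{tests}"
        )
        fixed = llm(prompt).strip()
        if fixed and fixed != sub.strip():
            # If a nested function was already rewritten, the enclosing
            # function's original text no longer appears verbatim and this
            # replace is a no-op; the real system rebuilds the tree instead.
            code = code.replace(sub, fixed)
    return code
```

In practice, `llm` would wrap a chat-completion call; everything else here is deliberately simplified relative to the method in the paper.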

Community

Paper author · Paper submitter • edited Oct 3

MGDebugger, a hierarchical bottom-up LLM code debugger 🔥 that can fix bugs from low-level syntax errors to high-level algorithmic flaws.

It achieves an ⭐️ 18.9% improvement in accuracy over seed generations on HumanEval and a ⭐️ 97.6% repair success rate on HumanEvalFix.

Code and demo available at https://github.com/YerbaPage/MGDebugger.

Brilliant 🤗

Approximately, what is the overhead? I.e., the ratio between the subtotal tokens (finished code) and the total tokens (debugging steps + finished code)?

Paper author

Great question 👍

Most debugging methods like Self-Debugging, LDB, Reflexion, etc., tend to have a high ratio of debugging tokens to finished code tokens (often > 5), as they perform extensive analyses to identify and resolve bugs. Despite this, they sometimes struggle to detect and fix subtle issues.

In our approach, MGDebugger might incur slightly higher token costs due to the hierarchical decomposition process, where we isolate and debug subfunctions separately. However, the method's effectiveness justifies this overhead since it addresses errors at multiple levels of granularity, allowing it to debug issues that other methods might overlook.
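To make the metric from the question concrete, here is a purely illustrative calculation; the token counts below are made-up placeholders, not numbers reported in the paper.

```python
# Illustrative arithmetic only: the token counts are placeholders, not
# measurements from the paper.
finished_code_tokens = 300                    # tokens in the final, fixed code
debugging_tokens = 5 * finished_code_tokens   # the ">5" regime mentioned above

total_tokens = debugging_tokens + finished_code_tokens
code_fraction = finished_code_tokens / total_tokens  # subtotal / total ~= 0.17
overhead = total_tokens / finished_code_tokens       # 6.0x

print(f"code fraction: {code_fraction:.2f}, overhead: {overhead:.1f}x")
```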

Hope that clarifies things!

Paper author · Paper submitter • edited Oct 9

We have released our demo on Hugging Face here: https://huggingface.co/spaces/learnmlf/MGDebugger 🚀✨

Thanks to LDB for the inspiration!

