Parse Function Names from LLVM Bitcode
Large Language Models (LLMs) are increasingly used for coding tasks, but handling large codebases effectively requires special techniques. Simply feeding the entire codebase to an LLM can lead to hallucinations due to excessive context, and it significantly increases costs without yielding meaningful results. We'll discuss this at length in upcoming posts. This post focuses on one simple technique.
One effective technique is to avoid providing the entire code file, which often spans thousands of lines. Instead, extract and share only the skeleton of the file—retaining global variables, classes, and their member functions. This approach minimizes context while preserving essential information. You can explore this concept further in the research paper.
But how can we achieve this? How do we generate a skeleton of a source file, essentially a list of exported symbols along with their hierarchy? The paper mentioned above suggests using `ctags`, a simple reference lookup tool. However, `ctags` lacks the ability to provide hierarchy information. A better alternative lies in leveraging compilers. In this blog post, we'll explore how to use LLVM to list all the functions in a C program.
Note: As this is the very first post on LLVM, it also covers the basics of the LLVM command-line parser and how to compile an LLVM program using a Makefile. This post will act as an onboarding document for subsequent LLVM-related posts.
Setup
To follow along, you need to have `llvm` and `clang` installed on your system.
[Listing: Install llvm and clang]
Test Program
Let's take a simple program that does arithmetic operations.
[Listing: test.c]
Here we have three functions: `main`, `add`, and `sub`. Our parser will list these functions along with some additional information.
The LLVM Parser
The complete parser code can be found at the bottom of the post. Here, let me explain the important parts in detail. We start with the header inclusions for the different LLVM libraries. Then we use the `llvm` namespace for easier typing.
Then we declare a static variable `FileName` to gather the argument passed.
[Listing: parse_function_names.cpp]
Though it is not directly related to parsing or compiling code, command-line handling is a feature LLVM provides under the namespace `llvm::cl`. Using it, we declare a static variable called `FileName` and mark it as a required positional argument. Then, when we call the `cl::ParseCommandLineOptions` function inside `main`, the static variable is automatically filled with the user-provided argument. Pretty handy, right?
Then, load the input file into memory.
[Listing: parse_function_names.cpp]
Lexing and parsing are resource-intensive tasks. Reading code from disk one token at a time can severely impact performance. To address this, we use a memory buffer to load the entire file into memory. Most interfaces, such as `MemoryBuffer::getFile`, report failures through their return value (here, an `ErrorOr` wrapper) rather than by throwing exceptions. Always check the return value for errors before proceeding. This defensive coding style is essential when working with LLVM APIs.
Please remember that `test.c` itself will not be passed as an argument. We'll compile it first, and the resulting bitcode file will be passed as an argument to this parser. So, the memory buffer contains LLVM bitcode, not C code.
Now parse the functions from the loaded bitcode file.
[Listing: parse_function_names.cpp]
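In LLVM terms, a module is the top-level container that holds the functions and global variables of a translation unit. So, we parse the bitcode into a module. Then we iterate over its functions one by one and print only the function definitions, skipping functions that are merely declared. Very simple and straightforward.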
Parser In Action
Makefile
First, we compile the original source into bitcode. Next, we compile the LLVM parser. Finally, we pass the bitcode to the parser. Compiling the LLVM parser by hand is not recommended because of the lengthy flags it requires. So, let's write a Makefile. The complete file can be seen at the bottom of the post. Here I'll explain the important snippets.
[Listing: Makefile]
`llvm-config` is a very helpful tool. It provides a standardized and portable way to get the necessary compiler and linker flags. Then we have an option to be verbose or quiet.
With the help of `llvm-config`, we construct our compiler flags.
[Listing: Makefile]
Below is a typical multi-level Makefile with targets going from source (`.cpp`) --> object (`.o`) --> binary, along with a `clean` target.
[Listing: Makefile]
If you look closely, we didn't override `CXX`. So, it falls back to the default, `g++`. That is okay. We don't need to compile our parser with LLVM's own compiler. We just need to compile it against the LLVM headers and link it with the LLVM libraries. So, we compile our parser with GCC itself.
Execution
First, compile the input source into LLVM bitcode. This is the only reason we installed `clang`; the parser itself does not use any `clang` libraries. `clang` is the preferred LLVM front-end for C, C++, and Objective-C programs. So, we will use it to convert our `test.c` into LLVM bitcode.
Note: LLVM employs a front-end --> IR --> back-end compilation model. This means that LLVM uses an Intermediate Representation (IR) as an object representation. During compilation, source code is first converted to IR by a front-end. The IR is then transformed into machine code by a back-end. This design allows developers to implement a front-end without needing to account for the target machine, enabling the same front-end to work with multiple architectures by pairing it with the appropriate back-end.
[Listing: Compile the test code]
Now compile our parser and pass `test.bc` to it.
[Listing: Build and run the parser]
Congratulations! We have successfully extracted and listed the functions defined in a source file by parsing its LLVM bitcode. The complete code of `parse_function_names.cpp` and the `Makefile` can be found below. Also, I've explained an alternate method to do the same task at the end of this article.
Complete Code
[Listing: parse_function_names.cpp]
[Listing: Makefile]
Alternative way
There is a tool called `llvm-dis`, which converts LLVM bitcode into human-readable textual IR. When the bitcode is in human-readable form, we can simply list the function definitions by searching for the keyword `define`.
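The same idea can be sketched in C++ without any LLVM libraries, assuming the textual IR produced by `llvm-dis` has already been read into a string. The function name `definedFunctions` is hypothetical, and the scan is deliberately naive: it only looks at lines starting with `define` and takes the text between `@` and the next `(` as the function name, so quoted names containing parentheses would not be handled.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Scans textual LLVM IR and returns the names of all defined functions.
// Lines starting with "declare" (declarations only) are ignored.
std::vector<std::string> definedFunctions(const std::string &IRText) {
  std::vector<std::string> Names;
  std::istringstream In(IRText);
  std::string Line;
  while (std::getline(In, Line)) {
    // Function definitions start at column 0 with the keyword "define".
    if (Line.rfind("define ", 0) != 0)
      continue;
    // The function name follows '@' and ends at the parameter list.
    std::size_t At = Line.find('@');
    if (At == std::string::npos)
      continue;
    std::size_t Paren = Line.find('(', At);
    if (Paren == std::string::npos)
      continue;
    Names.push_back(Line.substr(At + 1, Paren - At - 1));
  }
  return Names;
}
```

This keeps the appeal of the alternative approach, which is that plain text search is enough, while making the result available as a list that other code can consume.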
While there are plenty of tools out there to automate parsing, rolling your own parser has real advantages. You get to tailor the logic to your exact needs—something off-the-shelf solutions often can't do. This flexibility is especially useful when you're building systems like IDEs designed for LLM workflows, where generic tools just don't cut it. Plus, understanding the nuts and bolts of parsing gives you more control and insight into how your tooling works.