Talk with your code in GitHub using llama-index
Author: Wenzhuo Zhao
Inspired by several posts about RAG with code:
Let's build a "Chat with your code" RAG application, step-by-step:
— Akshay 🚀 (@akshay_pachaar) March 7, 2024
And this post: "github リポジトリを Embedding して質問に答えてもらう" (embedding a GitHub repository so it can answer questions about it).
I am also very interested in this topic and want to improve the solution on two main points:
- When embedding a large code file, how can we split it into reasonably sized smaller parts, for example by dividing a Python file into its classes and functions and embedding each part separately? We can use the AST to split the code into smaller chunks and create an embedding for each chunk (see the sketch after this list).
- For a repository containing files in several languages, we need to pick the right AST-based splitter for each language.
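To make the first point concrete, here is a minimal sketch of AST-based chunking using Python's built-in ast module. It is only an illustration of the idea; later in this post I rely on llama-index's CodeSplitter, which uses tree-sitter under the hood, rather than this hand-rolled helper.

```python
import ast


def split_python_by_ast(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast gives 1-based start/end line numbers (end_lineno needs Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```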
I will use llama-index to implement this idea in a Jupyter notebook in this post.
By default, the code below uses the OpenAI API: the text-embedding-ada-002 model to generate embeddings for the code files, and the gpt-3.5-turbo model to generate answers to the questions.
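If you prefer other models, llama-index lets you override the defaults globally through its Settings object. A small sketch, where the model names are only examples; keep the embedding dimension consistent with the vector store configured later (1536 here):

```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# example models only; text-embedding-3-small also produces 1536-dimensional
# vectors, so it matches the embed_dim used for the PGVectorStore below
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
```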
I will create embeddings for my repository semantic-search, which uses FastAPI to build a semantic search engine, and then ask technical questions about the code in the repository.
Prepare the environment
- To read the code from your repository, you need a GitHub personal access token with the repo scope. Go to your GitHub settings to create a new token with the repo scope, then save it as the environment variable GITHUB_TOKEN.
- You also need an OPENAI_API_KEY to use the OpenAI API.
- Launch a local PostgreSQL database to store the embeddings of the code. Use Docker Compose to start it:
services:
  db:
    image: ankane/pgvector
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: talk_with_code
    ports:
      - '5432:5432'
    healthcheck:
      test: pg_isready -U postgres
      interval: 2s
      timeout: 3s
      retries: 40
    volumes:
      - ./postgres-data:/var/lib/postgresql/data
- Install llama-index and the integration packages used below in your virtual environment (an example command follows).
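For reference, a possible setup; the exact package names are an assumption based on the integrations imported later in the notebook:

```bash
pip install llama-index llama-index-readers-github llama-index-vector-stores-postgres

export GITHUB_TOKEN=<your GitHub token with repo scope>
export OPENAI_API_KEY=<your OpenAI API key>
```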
Load the code from the repository
Load your GITHUB_TOKEN from the environment variable and choose the repository you want to load the code from.
import os

from llama_index.readers.github import GithubRepositoryReader
from llama_index.readers.github.repository.github_client import GithubClient

github_token = os.environ.get("GITHUB_TOKEN")
owner = "valeeraZ"  # the owner of the repository
repo = "semantic-search"  # the name of the repository
branch = "main"  # the branch of the repository, or a commit

client = GithubClient(github_token)
documents = GithubRepositoryReader(
    github_client=client,
    owner=owner,
    repo=repo,
    use_parser=False,
    verbose=False,
).load_data(branch=branch)  # commit or branch
See more details in the GithubRepositoryReader demo.
With len(documents) we can see that 41 documents are loaded from the repository.
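If your repository contains directories or file types you do not want to index, the reader also accepts filters. A sketch, assuming you want to exclude a docs folder and image files:

```python
filtered_documents = GithubRepositoryReader(
    github_client=client,
    owner=owner,
    repo=repo,
    filter_directories=(["docs"], GithubRepositoryReader.FilterType.EXCLUDE),
    filter_file_extensions=(
        [".png", ".jpg", ".svg"],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
).load_data(branch=branch)
```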
Let's look at one Document object:
from IPython.display import JSON, display
display(documents[1].to_dict())
{'id_': '5d8c952370594f48a2b50a9654fcb979d46166c4',
'embedding': None,
'metadata': {'file_path': 'api/__main__.py',
'file_name': '__main__.py',
'url': 'https://github.com/valeeraZ/semantic-search/blob/main/api/__main__.py'},
'excluded_embed_metadata_keys': [],
'excluded_llm_metadata_keys': [],
'relationships': {},
'text': 'import logging\nimport os\nimport sys\n\nfrom asgi_correlation_id.context import correlation_id\nfrom loguru import logger\nfrom uvicorn import Config, Server\n\nfrom api.settings import settings\n\nLOG_LEVEL = logging.getLevelName(settings.log_level.value.upper())\nJSON_LOGS = True if os.environ.get("JSON_LOGS", "0") == "1" else False\n\n\ndef correlation_id_filter(record):\n record["correlation_id"] = correlation_id.get()\n return True\n\n\nclass InterceptHandler(logging.Handler):\n def emit(self, record):\n # get corresponding Loguru level if it exists\n try:\n level = logger.level(record.levelname).name\n except ValueError:\n level = record.levelno\n\n # find caller from where originated the logged message\n frame, depth = sys._getframe(6), 6\n while frame and frame.f_code.co_filename == logging.__file__:\n frame = frame.f_back\n depth += 1\n\n logger.opt(depth=depth, exception=record.exc_info).log(\n level,\n record.getMessage(),\n )\n\n\ndef setup_logging():\n # intercept everything at the root logger\n logging.root.handlers = [InterceptHandler()]\n logging.root.setLevel(LOG_LEVEL)\n\n # remove every other logger\'s handlers\n # and propagate to root logger\n for name in logging.root.manager.loggerDict.keys():\n logging.getLogger(name).handlers = []\n logging.getLogger(name).propagate = True\n\n # configure loguru\n fmt = "<green>{time:YYYY-MM-DD HH:mm:ss.SSS}</green> | <level>{level: <8}</level> | <red> {correlation_id} </red> | <cyan>{name}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> - <level>{message}</level>"\n logger.remove()\n logger.add(\n sink=sys.stdout,\n format=fmt,\n filter=correlation_id_filter,\n serialize=JSON_LOGS,\n )\n\n\ndef main() -> None:\n """Entrypoint of the application."""\n server = Server(\n Config(\n "api.web.application:get_app",\n host=settings.host,\n port=settings.port,\n log_level=settings.log_level.value.lower(),\n reload=settings.reload,\n workers=settings.workers_count,\n ),\n )\n setup_logging()\n logger.info("Starting server...")\n server.run()\n\n\nif __name__ == "__main__":\n main()\n',
'start_char_idx': None,
'end_char_idx': None,
'text_template': '{metadata_str}\n\n{content}',
'metadata_template': '{key}: {value}',
'metadata_seperator': '\n',
'class_name': 'Document'}
The Document object contains the text of the code file and its metadata. From it we can get:
- the URL of the code file on GitHub, as well as its file path and file name (in the metadata);
- the text of the code file (in the text field).
Get the main programming language of each code file
A large code base may contain files written in different programming and markup languages. We will use the file extension to determine the language of each file.
I use a JSON file that maps each programming language to its file extensions, like this:
[
  {
    "name": "Python",
    "type": "programming",
    "extensions": [
      ".py",
      ".bzl",
      ".cgi",
      ".fcgi",
      ".gyp",
      ".lmi",
      ".pyde",
      ".pyp",
      ".pyt",
      ".pyw",
      ".rpy",
      ".tac",
      ".wsgi",
      ".xpy"
    ]
  },
  {
    "name": "Java",
    "type": "programming",
    "extensions": [
      ".java"
    ]
  },
  ...
]
Given the file extension, we can get the programming language of the file.
import json

with open("languages.json", "r") as f:
    language_list = json.load(f)


def get_name_by_extension(extension: str, language_list: list[dict]) -> str | None:
    for item in language_list:
        if extension.lower() in item.get("extensions", []):
            return item["name"].lower()
    return None
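A quick check of the helper; the expected outputs assume the languages.json entries shown above:

```python
print(get_name_by_extension(".py", language_list))    # python
print(get_name_by_extension(".java", language_list))  # java
print(get_name_by_extension(".xyz", language_list))   # None
```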
Now that we have a function to get the programming language name from a file extension, we can group the documents by language.
# group documents by the file extension found in the file_path metadata field
from llama_index.core.schema import Document


def arrange_documents_by_language(documents: list[Document]) -> dict:
    language_dict = {}
    for doc in documents:
        file_extension = "." + doc.metadata["file_path"].split(".")[-1]
        # get the programming language name from the file extension
        language_name = get_name_by_extension(file_extension, language_list)
        if language_name:
            if language_name in language_dict:
                language_dict[language_name].append(doc)
            else:
                language_dict[language_name] = [doc]
    return language_dict


file_extension_dict = arrange_documents_by_language(documents)
The file_extension_dict now contains the documents grouped by programming language.
file_extension_dict["python"]
# [Document(_id='a'), Document(_id='b'), ...]
file_extension_dict.keys()
# dict_keys(['yaml', 'markdown', 'python', 'dockerfile', 'json', 'toml'])
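To see how many documents fall into each language, a quick one-liner:

```python
print({language: len(docs) for language, docs in file_extension_dict.items()})
```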
Start the RAG pipeline
Prepare a vector store for embedding data
Embeddings created by the embedding model are stored in a vector store, which offers fast retrieval and similarity search by building an index over our data. We will use the PostgreSQL database to store the embeddings.
from sqlalchemy import make_url
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

connection_string = "postgresql://postgres:postgres@localhost:5432"
db_name = "talk_with_code"
url = make_url(connection_string)

vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port="5432",
    user=url.username,
    table_name="valeeraz_github_repo",  # the table name to store the embeddings, you can change it
    embed_dim=1536,  # OpenAI embedding dimension
)

# the storage context wraps the vector store and is passed to the index below
storage_context = StorageContext.from_defaults(vector_store=vector_store)
Create embeddings for the code files
We can then create embeddings for the Python documents and store them in the vector_store.
from llama_index.core.node_parser import CodeSplitter

splitter = CodeSplitter(language="python")

index = VectorStoreIndex.from_documents(
    documents=file_extension_dict["python"],
    transformations=[splitter],
    storage_context=storage_context,
    show_progress=True,
)
We use the CodeSplitter to split the code into smaller parts and embed each part. The resulting index object contains the embeddings of the Python files in the repository.
See more details in the llama index doc
You can connect to the database and see that the embedding data is stored in the table data_valeeraz_github_repo in your PostgreSQL database (PGVectorStore prefixes the configured table name with data_).
Query the index
Let's use this index to ask a question about a Python file in the repository.
query_engine = index.as_query_engine(streaming=True, similarity_top_k=5) # search top 5 similar documents and generate response
response = query_engine.query('What is the lifetime of this application?')
print(response)
The lifetime of this application involves running startup and shutdown events using a context manager defined in the lifetime.py
file. The startup event involves setting up the database connection and creating tables, while the shutdown event disposes of the database engine. These events are managed by the lifespan
context manager which ensures proper initialization and cleanup of resources when the FastAPI application starts and stops.
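To check which chunks the answer was grounded on, you can inspect the source nodes attached to the response, a quick sketch:

```python
# print the file path and similarity score of each retrieved chunk
for node_with_score in response.source_nodes:
    print(node_with_score.node.metadata["file_path"], node_with_score.score)
```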
However, since this index does not contain embeddings for all the files in the repository, asking a question like "What does the Dockerfile do in this repository?" will not return a reasonable result.
response = query_engine.query('What does the Dockerfile do in this repository?')
The provided context information does not mention anything about a Dockerfile in the repository.
For the above code, you can find documentation with similar usage in the llama index doc.
Files in multiple programming languages
As you can see in the section "Create embeddings for the code files", we have only created embeddings for the Python files. We also need to create embeddings for the other types of files and store them in the vector store, so that we can ask any question about the code in the repository through a single index.
from llama_index.core.node_parser import CodeSplitter

all_nodes = []
for language, docs in file_extension_dict.items():
    try:
        splitter = CodeSplitter(language=language)
        print(f"Splitting {language} documents")
        nodes = splitter.get_nodes_from_documents(docs)
        all_nodes.extend(nodes)
    except Exception as e:
        # languages without a supported grammar (or that fail to split) are skipped
        print(f"Skipping {language} documents: {e}")
65 nodes are created from 41 documents loaded above.
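To sanity-check the splitting, you can peek at one of the nodes:

```python
print(all_nodes[0].metadata["file_path"])
print(all_nodes[0].get_content()[:200])  # the first characters of the chunk
```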
Use these nodes to create a new VectorStoreIndex that stores the embeddings of all the files in the repository.
index = VectorStoreIndex(nodes=all_nodes, storage_context=storage_context)
See more details about using nodes to create a vector store index in the llama index doc
Now we can ask any question about the code in the repository.
query_engine = index.as_query_engine(streaming=True, similarity_top_k=5) # search top 5 similar documents and generate response
response = query_engine.query('What does the Dockerfile do in this repository?')
print(response)
The Dockerfile in this repository installs necessary dependencies, configures Poetry, copies the project requirements and application code, and then installs the project dependencies using Poetry. Finally, it sets the command to run the application.
Add chat memory to the conversation
We want the chat to have memory, that is, the LLM should answer questions using the context of the previous questions. We can use a ChatMemoryBuffer to store the chat history.
See more details about ChatMemoryBuffer
from llama_index.core.agent import ReActAgent
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.core.tools import QueryEngineTool

chat_store = SimpleChatStore()
chat_memory = ChatMemoryBuffer.from_defaults(
    token_limit=3000,
    chat_store=chat_store,
    chat_store_key="user1",
)

query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
)

react_agent = ReActAgent.from_tools([query_engine_tool], verbose=True, memory=chat_memory)
In the above code, I use an in-memory chat store that keeps the conversation messages in memory, and I create an agent that uses the query engine and the memory to answer questions.
See more details about ReActAgent
react_agent.chat('In this application, what does the function split_text_into_chunks in api/api/web/service/file_chunk.py do?')
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: query_engine_tool
Action Input: {'input': 'What does the function split_text_into_chunks in api/api/web/service/file_chunk.py do?'}
Observation: The function split_text_into_chunks in api/web/service/file_chunk.py splits a given text into chunks based on an ideal token size. It calculates the ideal size for each chunk, divides the text into chunks of that size, and returns a list of these chunks.
Thought: I can answer without using any more tools. I'll use the user's language to answer
Answer: The function `split_text_into_chunks` in `file_chunk.py` splits a given text into chunks based on an ideal token size. It calculates the ideal size for each chunk, divides the text into chunks of that size, and returns a list of these chunks.
react_agent.chat("What is my last question?")
Thought: (Implicit) I can answer without any more tools!
Answer: Your last question was "In this application, what does the function split_text_into_chunks in api/api/web/service/file_chunk.py do?"
You can see more details in the output above, which shows that the agent retrieved the relevant documents for the question as part of the RAG process.
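If you want the conversation to survive a notebook restart, the SimpleChatStore can be persisted to disk and reloaded later; the file path here is arbitrary:

```python
# save the conversation history
chat_store.persist(persist_path="chat_store.json")

# later: rebuild the memory from the persisted store
loaded_store = SimpleChatStore.from_persist_path(persist_path="chat_store.json")
chat_memory = ChatMemoryBuffer.from_defaults(
    token_limit=3000,
    chat_store=loaded_store,
    chat_store_key="user1",
)
```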
Conclusion
In this post, I have shown how to use llama-index to create embeddings for the code files of a GitHub repository and use them to answer questions about the code. I also showed how to use a ChatMemoryBuffer to store the chat history and answer questions with that context.
To improve:
- Use a factory pattern to pick the right CodeSplitter for each programming language in production (see the sketch below).
- Explore other agents better suited to answering questions about the code in the repository, or to searching for extra information on the internet.
- Use a stronger OpenAI model to generate the responses.
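As a starting point for the first improvement, a minimal sketch of such a factory; make_splitter is a hypothetical helper, and falling back to SentenceSplitter for languages without a supported grammar is my own choice, not something llama-index prescribes:

```python
from llama_index.core.node_parser import CodeSplitter, SentenceSplitter


def make_splitter(language: str):
    """Return a CodeSplitter when the language is supported, else a plain text splitter."""
    try:
        return CodeSplitter(language=language)
    except Exception:
        # e.g. 'dockerfile' or 'toml' may have no tree-sitter grammar available
        return SentenceSplitter()
```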