Creating a Chatbot for Developer Documentation with Open‑Source LLMs

Introduction

Developer documentation is critical, but searching through hundreds of pages of API references, tutorials, and wikis can be frustrating. Developers often spend more time finding the right documentation than actually implementing features. A chatbot powered by open-source LLMs provides a faster, more intuitive way to access documentation.

Instead of reading entire docs, developers can ask questions in natural language and receive context-aware answers instantly. Companies like Stripe, MongoDB, and Supabase have implemented documentation chatbots, seeing significant improvements in developer onboarding time and reduced support ticket volume.

In this comprehensive guide, we’ll build a complete documentation chatbot using open-source large language models, covering architecture design, implementation details, and production deployment strategies.

Why Use Open-Source LLMs?

While proprietary models like GPT-4 or Claude are powerful, open-source LLMs offer unique advantages for documentation chatbots:

  • Privacy control: Run the model locally or on your own infrastructure—sensitive internal docs never leave your network
  • Customization: Fine-tune on your team’s documentation, coding style, and domain terminology
  • Cost efficiency: Avoid recurring API fees that scale with usage
  • Transparency: Full access to model architecture, weights, and training decisions
  • Latency control: Self-hosted models can provide faster response times for high-volume use cases

Popular options include Llama 3, Mistral, Mixtral, and Qwen, which can be hosted with libraries like Hugging Face Transformers, Ollama, or vLLM.

System Architecture

A documentation chatbot uses Retrieval-Augmented Generation (RAG) to provide accurate, contextual answers:

// Architecture overview

┌─────────────────────────────────────────────────────────────────┐
│                    Documentation Chatbot                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │   User   │───▶│   Chat UI    │───▶│    Query Processor   │   │
│  │  Query   │    │  (Next.js)   │    │   (Reformulation)    │   │
│  └──────────┘    └──────────────┘    └──────────┬───────────┘   │
│                                                  │                │
│                                                  ▼                │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                   RAG Pipeline                            │   │
│  │  ┌─────────────┐   ┌─────────────┐   ┌────────────────┐  │   │
│  │  │  Embedding  │──▶│   Vector    │──▶│   Retriever    │  │   │
│  │  │   Model     │   │   Store     │   │  (Top-K docs)  │  │   │
│  │  └─────────────┘   └─────────────┘   └───────┬────────┘  │   │
│  │                                               │           │   │
│  │                                               ▼           │   │
│  │  ┌─────────────┐   ┌─────────────┐   ┌────────────────┐  │   │
│  │  │  Response   │◀──│  Open-Source│◀──│   Reranker     │  │   │
│  │  │  Generator  │   │     LLM     │   │  (Optional)    │  │   │
│  │  └─────────────┘   └─────────────┘   └────────────────┘  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Step 1: Collect and Prepare Documentation

Gather all documentation sources and convert them into a consistent format:

# src/ingestion/doc_loader.py
import os
import re
from pathlib import Path
from typing import List, Dict, Any
from dataclasses import dataclass
import markdown
from bs4 import BeautifulSoup
import yaml

@dataclass
class Document:
    """Represents a documentation chunk"""
    content: str
    metadata: Dict[str, Any]
    source: str
    doc_type: str  # 'api', 'guide', 'tutorial', 'readme'

class DocumentationLoader:
    """Load and process documentation from multiple sources"""
    
    def __init__(self, docs_path: str):
        self.docs_path = Path(docs_path)
        self.documents: List[Document] = []
    
    def load_all(self) -> List[Document]:
        """Load all documentation files"""
        # Load Markdown files
        for md_file in self.docs_path.rglob("*.md"):
            self.documents.extend(self._load_markdown(md_file))
        
        # Load OpenAPI specs
        for spec_file in self.docs_path.rglob("openapi*.yaml"):
            self.documents.extend(self._load_openapi(spec_file))
        
        # Load code examples
        for code_file in self.docs_path.rglob("examples/**/*.py"):
            self.documents.extend(self._load_code_example(code_file))
        
        return self.documents
    
    def _load_markdown(self, file_path: Path) -> List[Document]:
        """Load and chunk markdown files"""
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        
        # Extract frontmatter
        metadata = {}
        if content.startswith('---'):
            parts = content.split('---', 2)
            if len(parts) >= 3:
                metadata = yaml.safe_load(parts[1]) or {}  # tolerate empty frontmatter
                content = parts[2]
        
        # Determine doc type from path
        doc_type = self._infer_doc_type(file_path)
        
        # Chunk by headers
        chunks = self._chunk_by_headers(content)
        
        documents = []
        for i, chunk in enumerate(chunks):
            documents.append(Document(
                content=chunk['content'],
                metadata={
                    **metadata,
                    'section': chunk['header'],
                    'chunk_index': i,
                    'file_path': str(file_path),
                },
                source=str(file_path.relative_to(self.docs_path)),
                doc_type=doc_type,
            ))
        
        return documents
    
    def _chunk_by_headers(self, content: str) -> List[Dict[str, str]]:
        """Split content into chunks based on markdown headers"""
        # Match headers at different levels
        header_pattern = r'^(#{1,3})\s+(.+)$'
        lines = content.split('\n')
        chunks = []
        current_chunk = {'header': 'Introduction', 'content': ''}
        
        for line in lines:
            header_match = re.match(header_pattern, line)
            if header_match:
                # Save current chunk if it has content
                if current_chunk['content'].strip():
                    chunks.append(current_chunk)
                
                # Start new chunk
                current_chunk = {
                    'header': header_match.group(2),
                    'content': line + '\n'
                }
            else:
                current_chunk['content'] += line + '\n'
        
        # Add final chunk
        if current_chunk['content'].strip():
            chunks.append(current_chunk)
        
        return chunks
    
    def _load_openapi(self, file_path: Path) -> List[Document]:
        """Load OpenAPI spec and create documents for each endpoint"""
        with open(file_path, 'r') as f:
            spec = yaml.safe_load(f)
        
        documents = []
        for path, methods in spec.get('paths', {}).items():
            for method, details in methods.items():
                if method in ['get', 'post', 'put', 'delete', 'patch']:
                    content = self._format_endpoint_doc(path, method, details)
                    documents.append(Document(
                        content=content,
                        metadata={
                            'endpoint': path,
                            'method': method.upper(),
                            'tags': details.get('tags', []),
                            'operation_id': details.get('operationId'),
                        },
                        source=str(file_path.relative_to(self.docs_path)),
                        doc_type='api',
                    ))
        
        return documents
    
    def _format_endpoint_doc(self, path: str, method: str, details: Dict) -> str:
        """Format API endpoint as readable documentation"""
        doc = f"# {method.upper()} {path}\n\n"
        doc += f"{details.get('summary', '')}\n\n"
        doc += f"{details.get('description', '')}\n\n"
        
        # Parameters
        if params := details.get('parameters'):
            doc += "## Parameters\n\n"
            for param in params:
                required = '(required)' if param.get('required') else '(optional)'
                doc += f"- **{param['name']}** {required}: {param.get('description', '')}\n"
        
        # Request body
        if request_body := details.get('requestBody'):
            doc += "\n## Request Body\n\n"
            doc += f"{request_body.get('description', '')}\n"
        
        # Responses
        if responses := details.get('responses'):
            doc += "\n## Responses\n\n"
            for status, response in responses.items():
                doc += f"- **{status}**: {response.get('description', '')}\n"
        
        return doc
    
    def _load_code_example(self, file_path: Path) -> List[Document]:
        """Load code examples as documentation"""
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        
        # Extract docstring as description
        description = ""
        if '"""' in content:
            match = re.search(r'"""(.+?)"""', content, re.DOTALL)
            if match:
                description = match.group(1).strip()
        
        return [Document(
            content=f"# Code Example: {file_path.stem}\n\n{description}\n\n```python\n{content}\n```",
            metadata={
                'file_name': file_path.name,
                'language': 'python',
            },
            source=str(file_path.relative_to(self.docs_path)),
            doc_type='example',
        )]
    
    def _infer_doc_type(self, file_path: Path) -> str:
        """Infer document type from file path"""
        path_str = str(file_path).lower()
        if 'api' in path_str:
            return 'api'
        elif 'guide' in path_str or 'tutorial' in path_str:
            return 'guide'
        elif 'readme' in path_str:
            return 'readme'
        return 'general'

Step 2: Index Documentation with Embeddings

Create embeddings and store them in a vector database for fast retrieval:

# src/indexing/vector_store.py
from typing import List, Optional
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings

class DocumentIndexer:
    """Index documents in a vector database"""
    
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-large-en-v1.5",
        collection_name: str = "documentation",
        persist_directory: str = "./chroma_db"
    ):
        # Load embedding model
        self.embedder = SentenceTransformer(embedding_model)
        
        # Initialize ChromaDB
        self.chroma_client = chromadb.PersistentClient(
            path=persist_directory,
            settings=Settings(anonymized_telemetry=False)
        )
        
        # Get or create collection
        self.collection = self.chroma_client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}  # Use cosine similarity
        )
    
    def index_documents(self, documents: List[Document], batch_size: int = 100):
        """Index documents in batches"""
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            
            # Generate embeddings
            texts = [doc.content for doc in batch]
            embeddings = self.embedder.encode(texts, show_progress_bar=True)
            
            # Prepare data for ChromaDB
            ids = [f"{doc.source}_{doc.metadata.get('chunk_index', i)}" 
                   for i, doc in enumerate(batch)]
            metadatas = [
                {
                    "source": doc.source,
                    "doc_type": doc.doc_type,
                    **{k: str(v) for k, v in doc.metadata.items()}
                }
                for doc in batch
            ]
            
            # Add to collection
            self.collection.add(
                ids=ids,
                embeddings=embeddings.tolist(),
                documents=texts,
                metadatas=metadatas
            )
        
        print(f"Indexed {len(documents)} documents")
    
    def search(
        self,
        query: str,
        n_results: int = 5,
        doc_type_filter: Optional[str] = None
    ) -> List[dict]:
        """Search for relevant documents"""
        # Generate query embedding
        query_embedding = self.embedder.encode([query])[0]
        
        # Build filter
        where_filter = None
        if doc_type_filter:
            where_filter = {"doc_type": doc_type_filter}
        
        # Query collection
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            where=where_filter,
            include=["documents", "metadatas", "distances"]
        )
        
        # Format results
        formatted_results = []
        for i in range(len(results['ids'][0])):
            formatted_results.append({
                'id': results['ids'][0][i],
                'content': results['documents'][0][i],
                'metadata': results['metadatas'][0][i],
                'score': 1 - results['distances'][0][i],  # Convert distance to similarity
            })
        
        return formatted_results

# Alternative: Using Qdrant for production
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

class QdrantIndexer:
    """Production-ready indexer using Qdrant"""
    
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-large-en-v1.5",
        collection_name: str = "documentation",
        qdrant_url: str = "http://localhost:6333"
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.embedding_dim = self.embedder.get_sentence_embedding_dimension()
        self.client = QdrantClient(url=qdrant_url)
        self.collection_name = collection_name
        
        # Create collection if not exists
        collections = [c.name for c in self.client.get_collections().collections]
        if collection_name not in collections:
            self.client.create_collection(
                collection_name=collection_name,
                vectors_config=VectorParams(
                    size=self.embedding_dim,
                    distance=Distance.COSINE
                )
            )
    
    def index_documents(self, documents: List[Document], batch_size: int = 100):
        """Index documents with Qdrant"""
        points = []
        for i, doc in enumerate(documents):
            embedding = self.embedder.encode(doc.content)
            points.append(PointStruct(
                id=i,
                vector=embedding.tolist(),
                payload={
                    "content": doc.content,
                    "source": doc.source,
                    "doc_type": doc.doc_type,
                    **doc.metadata
                }
            ))
        
        # Upsert in batches
        for i in range(0, len(points), batch_size):
            batch = points[i:i + batch_size]
            self.client.upsert(
                collection_name=self.collection_name,
                points=batch
            )
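
Wiring Steps 1 and 2 together looks roughly like the sketch below; the module paths follow the file comments above and assume src is importable as a package, and the sample query is illustrative:

# index_docs.py — illustrative wiring of the loader and indexer
from src.ingestion.doc_loader import DocumentationLoader
from src.indexing.vector_store import DocumentIndexer

loader = DocumentationLoader("./docs")
documents = loader.load_all()
print(f"Loaded {len(documents)} chunks")

indexer = DocumentIndexer()  # defaults: BGE embeddings + local ChromaDB
indexer.index_documents(documents)

# Sanity-check retrieval before adding the LLM
for hit in indexer.search("How do I authenticate API requests?", n_results=3):
    print(f"{hit['score']:.2f}  {hit['metadata']['source']}")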

Step 3: Connect an Open-Source LLM

Set up the LLM for generating responses based on retrieved documentation:

# src/llm/model_server.py
from typing import List, Optional, Generator
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread

class DocumentationLLM:
    """LLM wrapper for documentation Q&A"""
    
    def __init__(
        self,
        model_name: str = "mistralai/Mistral-7B-Instruct-v0.3",
        device: str = "cuda",
        max_new_tokens: int = 1024,
    ):
        self.device = device
        self.max_new_tokens = max_new_tokens
        
        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        
        # Set padding token
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
    
    def generate_response(
        self,
        query: str,
        context_docs: List[dict],
        conversation_history: Optional[List[dict]] = None,
    ) -> str:
        """Generate a response based on retrieved documents"""
        # Build context from retrieved documents
        context = self._format_context(context_docs)
        
        # Build prompt
        prompt = self._build_prompt(query, context, conversation_history)
        
        # Tokenize
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=4096
        ).to(self.device)
        
        # Generate
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=self.max_new_tokens,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
                pad_token_id=self.tokenizer.pad_token_id,
            )
        
        # Decode response
        response = self.tokenizer.decode(
            outputs[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True
        )
        
        return response.strip()
    
    def generate_stream(
        self,
        query: str,
        context_docs: List[dict],
        conversation_history: Optional[List[dict]] = None,
    ) -> Generator[str, None, None]:
        """Generate response with streaming"""
        context = self._format_context(context_docs)
        prompt = self._build_prompt(query, context, conversation_history)
        
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=4096
        ).to(self.device)
        
        # Set up streamer
        streamer = TextIteratorStreamer(
            self.tokenizer,
            skip_special_tokens=True,
            skip_prompt=True
        )
        
        # Generate in separate thread
        generation_kwargs = {
            **inputs,
            "max_new_tokens": self.max_new_tokens,
            "temperature": 0.7,
            "top_p": 0.9,
            "do_sample": True,
            "streamer": streamer,
        }
        
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()
        
        # Yield tokens as they're generated
        for token in streamer:
            yield token
    
    def _format_context(self, docs: List[dict]) -> str:
        """Format retrieved documents as context"""
        context_parts = []
        for i, doc in enumerate(docs, 1):
            source = doc.get('metadata', {}).get('source', 'Unknown')
            context_parts.append(
                f"[Document {i}]\nSource: {source}\n{doc['content']}\n"
            )
        return "\n---\n".join(context_parts)
    
    def _build_prompt(self, query: str, context: str, history: Optional[List[dict]]) -> str:
        """Build the full prompt with system instructions"""
        system_prompt = """You are a helpful documentation assistant. Your role is to answer 
questions about the codebase and APIs based on the provided documentation context.

Guidelines:
- Answer based ONLY on the provided documentation context
- If the answer is not in the context, say "I don't have information about that in the documentation"
- Include code examples when relevant
- Reference the source document when appropriate
- Be concise but comprehensive
- Format code blocks with appropriate language tags"""
        
        # Mistral-style prompt format
        prompt = f"[INST] {system_prompt}\n\n"
        prompt += f"Documentation Context:\n{context}\n\n"
        
        # Add conversation history
        if history:
            for msg in history[-4:]:  # Keep the last 4 messages (2 exchanges)
                if msg['role'] == 'user':
                    prompt += f"User: {msg['content']}\n"
                else:
                    prompt += f"Assistant: {msg['content']}\n"
        
        prompt += f"User Question: {query} [/INST]"
        
        return prompt

# Using Ollama for simpler deployment
import ollama

class OllamaLLM:
    """Use Ollama for easy local LLM deployment"""
    
    def __init__(self, model: str = "llama3:8b"):
        self.model = model
    
    def generate_response(
        self,
        query: str,
        context_docs: List[dict],
        conversation_history: Optional[List[dict]] = None,
    ) -> str:
        context = "\n---\n".join([doc['content'] for doc in context_docs])
        
        messages = [
            {
                "role": "system",
                "content": "You are a documentation assistant. Answer questions based only on the provided context. Include code examples when helpful."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ]
        
        response = ollama.chat(model=self.model, messages=messages)
        return response['message']['content']
    
    def generate_stream(
        self,
        query: str,
        context_docs: List[dict],
        conversation_history: Optional[List[dict]] = None,
    ) -> Generator[str, None, None]:
        """Stream the response token by token (used by DocumentationChatbot.chat_stream)"""
        context = "\n---\n".join([doc['content'] for doc in context_docs])
        messages = [
            {
                "role": "system",
                "content": "You are a documentation assistant. Answer questions based only on the provided context. Include code examples when helpful."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ]
        
        # ollama.chat with stream=True yields incremental response chunks
        for chunk in ollama.chat(model=self.model, messages=messages, stream=True):
            yield chunk['message']['content']
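
Both wrappers expose the same generate_response interface, so they can be swapped freely. A quick, self-contained check of the generation layer (the retrieved document here is hard-coded purely for illustration):

# Quick manual check of the generation layer in isolation
llm = OllamaLLM(model="llama3:8b")  # or DocumentationLLM(...) on a GPU machine

fake_docs = [{
    "content": "## Authentication\nPass your API key in the Authorization: Bearer <key> header.",
    "metadata": {"source": "guides/authentication.md"},
}]

print(llm.generate_response(
    query="How do I authenticate requests?",
    context_docs=fake_docs,
))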

Step 4: Build the Complete Chatbot

Combine all components into a complete RAG chatbot:

# src/chatbot/documentation_chatbot.py
from typing import List, Optional, Generator
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)

@dataclass
class ChatMessage:
    role: str  # 'user' or 'assistant'
    content: str
    sources: Optional[List[str]] = None

class DocumentationChatbot:
    """Complete RAG chatbot for documentation"""
    
    def __init__(
        self,
        indexer: DocumentIndexer,
        llm: DocumentationLLM,
        reranker: Optional["Reranker"] = None,
        retrieval_k: int = 5,
        rerank_k: int = 3,
    ):
        self.indexer = indexer
        self.llm = llm
        self.reranker = reranker
        self.retrieval_k = retrieval_k
        self.rerank_k = rerank_k
        self.conversation_history: List[ChatMessage] = []
    
    def chat(self, query: str) -> ChatMessage:
        """Process a user query and return a response"""
        # Step 1: Retrieve relevant documents
        retrieved_docs = self.indexer.search(
            query=query,
            n_results=self.retrieval_k
        )
        
        logger.info(f"Retrieved {len(retrieved_docs)} documents")
        
        # Step 2: Rerank if reranker is available
        if self.reranker and len(retrieved_docs) > self.rerank_k:
            retrieved_docs = self.reranker.rerank(
                query=query,
                documents=retrieved_docs,
                top_k=self.rerank_k
            )
        
        # Step 3: Generate response
        history = [
            {"role": msg.role, "content": msg.content}
            for msg in self.conversation_history[-4:]
        ]
        
        response_text = self.llm.generate_response(
            query=query,
            context_docs=retrieved_docs,
            conversation_history=history
        )
        
        # Step 4: Extract sources
        sources = [
            doc.get('metadata', {}).get('source', 'Unknown')
            for doc in retrieved_docs
        ]
        
        # Step 5: Update conversation history
        self.conversation_history.append(ChatMessage(
            role='user',
            content=query
        ))
        
        response_message = ChatMessage(
            role='assistant',
            content=response_text,
            sources=list(set(sources))  # Deduplicate
        )
        self.conversation_history.append(response_message)
        
        return response_message
    
    def chat_stream(self, query: str) -> Generator[str, None, None]:
        """Stream response for real-time display"""
        retrieved_docs = self.indexer.search(
            query=query,
            n_results=self.retrieval_k
        )
        
        if self.reranker and len(retrieved_docs) > self.rerank_k:
            retrieved_docs = self.reranker.rerank(
                query=query,
                documents=retrieved_docs,
                top_k=self.rerank_k
            )
        
        history = [
            {"role": msg.role, "content": msg.content}
            for msg in self.conversation_history[-4:]
        ]
        
        full_response = ""
        for token in self.llm.generate_stream(
            query=query,
            context_docs=retrieved_docs,
            conversation_history=history
        ):
            full_response += token
            yield token
        
        # Update history after streaming completes
        self.conversation_history.append(ChatMessage(role='user', content=query))
        self.conversation_history.append(ChatMessage(
            role='assistant',
            content=full_response,
            sources=[doc.get('metadata', {}).get('source') for doc in retrieved_docs]
        ))
    
    def clear_history(self):
        """Clear conversation history"""
        self.conversation_history = []
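
The Reranker type referenced above is left abstract in this guide. A minimal sketch of one common approach, a cross-encoder that scores each (query, document) pair, assuming the sentence-transformers CrossEncoder class and the BAAI/bge-reranker-large checkpoint:

# src/chatbot/reranker.py — cross-encoder reranker sketch (model choice is an assumption)
from typing import List
from sentence_transformers import CrossEncoder

class Reranker:
    """Rerank retrieved documents with a cross-encoder"""
    
    def __init__(self, model_name: str = "BAAI/bge-reranker-large"):
        self.model = CrossEncoder(model_name)
    
    def rerank(self, query: str, documents: List[dict], top_k: int = 3) -> List[dict]:
        """Score each (query, document) pair and keep the top_k highest-scoring documents"""
        pairs = [(query, doc["content"]) for doc in documents]
        scores = self.model.predict(pairs)
        
        for doc, score in zip(documents, scores):
            doc["rerank_score"] = float(score)
        return sorted(documents, key=lambda d: d["rerank_score"], reverse=True)[:top_k]

The rerank signature matches the keyword arguments used in chat() and chat_stream() above, so it can be passed directly to DocumentationChatbot.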

Step 5: API and UI

Expose the chatbot through a FastAPI backend; the chat UI (the Next.js front end from the architecture diagram) calls these endpoints:

# src/api/main.py
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import List, Optional
import json

app = FastAPI(title="Documentation Chatbot API")

# Initialize chatbot (in production, use dependency injection)
chatbot = None

def get_chatbot():
    global chatbot
    if chatbot is None:
        loader = DocumentationLoader("./docs")
        docs = loader.load_all()
        
        indexer = DocumentIndexer()
        indexer.index_documents(docs)
        
        llm = OllamaLLM(model="llama3:8b")
        chatbot = DocumentationChatbot(indexer=indexer, llm=llm)
    return chatbot

class ChatRequest(BaseModel):
    query: str
    stream: bool = False

class ChatResponse(BaseModel):
    response: str
    sources: List[str]

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Chat endpoint for documentation queries"""
    bot = get_chatbot()
    
    if request.stream:
        async def generate():
            for token in bot.chat_stream(request.query):
                yield f"data: {json.dumps({'token': token})}\n\n"
            yield "data: [DONE]\n\n"
        
        return StreamingResponse(
            generate(),
            media_type="text/event-stream"
        )
    
    response = bot.chat(request.query)
    return ChatResponse(
        response=response.content,
        sources=response.sources or []
    )

@app.post("/clear")
async def clear_history():
    """Clear conversation history"""
    get_chatbot().clear_history()
    return {"status": "cleared"}

@app.get("/health")
async def health():
    return {"status": "healthy"}

Common Mistakes to Avoid

Poor chunking strategy: Splitting documents at arbitrary character limits breaks context. Always chunk at logical boundaries (headers, paragraphs, code blocks).

Ignoring metadata: Source attribution and document types enable better filtering and user trust. Always preserve and expose this information.

Not handling out-of-scope queries: Your prompt must instruct the LLM to acknowledge when it doesn’t have relevant information rather than hallucinating answers.

Skipping reranking: Initial retrieval often returns marginally relevant documents. A reranker (like the cross-encoder sketch in Step 4) significantly improves response quality at the cost of a little extra latency.

Stale embeddings: Documentation changes frequently. Set up automated re-indexing when docs are updated.
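
One lightweight way to do this is to hash each source file and re-embed only what changed; a sketch of that idea (the manifest path and hashing scheme are illustrative):

# scripts/reindex_changed.py — detect changed docs and re-index only those (illustrative)
import hashlib
import json
from pathlib import Path

MANIFEST = Path("./index_manifest.json")  # hypothetical manifest of file-content hashes

def changed_files(docs_path: str) -> list:
    """Return doc files whose content hash differs from the last indexing run"""
    previous = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    current, changed = {}, []
    for path in Path(docs_path).rglob("*.md"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        current[str(path)] = digest
        if previous.get(str(path)) != digest:
            changed.append(path)
    MANIFEST.write_text(json.dumps(current, indent=2))
    return changed

if __name__ == "__main__":
    # Run from CI or a cron job whenever the docs repo updates
    for path in changed_files("./docs"):
        print(f"Re-indexing {path}")
        # load, chunk, and upsert just this file via DocumentationLoader / DocumentIndexer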

Ignoring evaluation: Track metrics like answer relevance, retrieval accuracy, and user feedback to continuously improve your chatbot.
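
A simple starting point is a hand-written set of question/expected-source pairs plus a retrieval hit-rate check; a sketch (the test cases are placeholders):

# scripts/eval_retrieval.py — crude retrieval hit-rate check (illustrative)
test_cases = [
    {"query": "How do I authenticate requests?", "expected_source": "guides/authentication.md"},
    {"query": "How do I paginate list endpoints?", "expected_source": "api/pagination.md"},
]

def retrieval_hit_rate(indexer, cases, k: int = 5) -> float:
    """Fraction of test queries whose expected source appears in the top-k results"""
    hits = 0
    for case in cases:
        results = indexer.search(case["query"], n_results=k)
        sources = {r["metadata"]["source"] for r in results}
        if case["expected_source"] in sources:
            hits += 1
    return hits / len(cases)

# Example: print(f"Hit rate @5: {retrieval_hit_rate(DocumentIndexer(), test_cases):.0%}")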

Conclusion

Building a chatbot for developer documentation with open-source LLMs can transform how teams interact with their knowledge base. It saves time, reduces frustration, and makes onboarding smoother: instead of hunting through pages of docs, developers get a direct, sourced answer in seconds.

The best approach combines an open-source LLM with a well-structured RAG pipeline: smart chunking, quality embeddings, optional reranking, and clear prompting. This ensures your chatbot becomes a reliable assistant that developers actually want to use.

Start with a simple setup using Ollama and ChromaDB, then scale to production-grade solutions like vLLM and Qdrant as your needs grow. The key is iterating based on real user queries and continuously improving retrieval quality.

If you’re interested in applying AI to your workflow, see our post on Automating Documentation with AI. For hands-on tutorials, check out LangChain’s documentation.
