Knowledge Hub: Technical Deep Dive

A companion to "Building a Knowledge Hub in 13 Days"

What It Is

Knowledge Hub is an enterprise knowledge aggregation platform that syncs data from multiple sources into a searchable vector database with AI-powered retrieval-augmented generation (RAG).

Primary use case: Ask natural language questions across all company data and get AI-generated answers with source citations.

Tech Stack

Layer	Technology
Backend	Python 3.11, Flask 3.1, Gunicorn
Database	PostgreSQL (production), SQLite (development)
ORM	Flask-SQLAlchemy with Flask-Migrate (Alembic)
Vector Store	Qdrant (1536-dimensional embeddings)
Embeddings	OpenAI text-embedding-3-small
LLM	Anthropic Claude (claude-sonnet-4-20250514)
Reranking	Cohere API with local fallback
Search	Hybrid: semantic + BM25 keyword + cross-encoder reranking
Spell Correction	SymSpellPy
Auth	Google OAuth 2.0 with encrypted credential storage
Background Jobs	APScheduler
Slack	Slack Bolt SDK with Socket Mode
MCP	FastMCP (stdio + SSE transports)
Document Processing	PDFPlumber, python-docx, openpyxl, python-pptx, Pillow
Deployment	Replit

Architecture

flowchart TB
    subgraph SUPERVISOR[" ☑ SUPERVISOR "]
        health["Health Checks<br/>Auto-Restart"]
    end

    subgraph SERVICES[" ⚡ SERVICES "]
        flask["Flask App<br/>119 Endpoints"]
        mcp["MCP Server<br/>Claude Integration"]
        slack["Slackbot<br/>Team Q&A"]
    end

    subgraph CORE[" ⚙ CORE - 36 Modules "]
        sync["Sync<br/>Manager"]
        rag["RAG<br/>Engine"]
        query["Query<br/>Processor"]
        search["Hybrid<br/>Search"]
        rerank["Reranker"]
        circuit["Circuit<br/>Breaker"]
    end

    subgraph CONNECTORS[" 🔌 CONNECTORS "]
        sources["Gmail • Drive • Slack • Zendesk<br/>Attio • Granola • ChatGPT • Dropbox"]
    end

    subgraph STORAGE[" 💾 STORAGE "]
        postgres[("PostgreSQL<br/>Users, OAuth<br/>Sync State")]
        qdrant[("Qdrant<br/>Vectors<br/>BM25 Index")]
    end

    subgraph EXTERNAL[" 🌐 EXTERNAL APIs "]
        openai["OpenAI<br/>Embeddings"]
        anthropic["Anthropic<br/>Claude LLM"]
        cohere["Cohere<br/>Reranking"]
    end

    SUPERVISOR --> flask & mcp & slack
    flask & mcp & slack --> CORE
    CORE --> CONNECTORS
    CORE --> postgres & qdrant
    CORE --> openai & anthropic & cohere

Component Overview

Component	Purpose
Supervisor	Process manager with health checks and auto-restart
Flask App	Web dashboard, 119 REST API endpoints, Google OAuth
MCP Server	Enables Claude Desktop/Web to query the knowledge base
Slackbot	Team members can ask questions from Slack
Sync Manager	Orchestrates parallel data sync (up to 6 concurrent sources)
Query Processor	Spell correction, intent classification, query optimization
RAG Engine	Answer synthesis with source citations
Hybrid Search	Combines semantic vectors + BM25 keyword search
Reranker	Cross-encoder reranking via Cohere with local fallback
Circuit Breaker	Resilience pattern for external service failures

Data Sources

Source	What's Synced	Special Features
Gmail	Emails, attachments (PDF, DOCX, images)	20 parallel workers, attachment extraction
Google Drive	Docs, Sheets, Slides, PDFs	Format conversion, 10MB file limit
Slack	Messages, threads, channels	User enrichment, rate-limit aware
Zendesk	Investment opportunities, deal tracking	HTML stripping, 5 parallel workers
Attio	Companies, contacts, notes, lists	Configurable object filtering, 365-day recency
Granola	AI meeting notes, transcripts	ProseMirror JSON parsing
ChatGPT	Exported conversation history	Staging DB with approval workflow
Dropbox	Documents, PDFs, text files	10+ file types, 10MB limit

Base Connector Features:

HTTP session pooling for connection reuse
Retry logic with exponential backoff (max 3 retries)
Rate limit handling (429 status code detection)
Request timeout handling (30 seconds default)

Key Features

Search & Retrieval

Hybrid search – Vector similarity + BM25 keyword matching
Query intent classification – Factual, Exploratory, Navigational, Troubleshooting, Person Lookup, Temporal
Spell correction – SymSpellPy integration
Cross-encoder reranking – Cohere API with local fallback
HyDE – Hypothetical Document Embeddings (optional advanced retrieval)
Dynamic relevancy thresholding – Adjusts cutoff based on result quality
Source-specific weighting – Freshness decay per source type
Query caching – LRU cache with TTL

RAG (Retrieval-Augmented Generation)

Multi-turn conversation context (up to 10 turns)
Session-based history tracking
Entity mention tracking across turns
Source citations in responses

Data Sync

Parallel multi-source syncing (up to 6 concurrent)
Full and incremental sync modes
Automatic stall detection (watchdog)
Scheduled syncing (hourly/daily/weekly/monthly)
Comprehensive sync logging to database

Resilience

Circuit breaker pattern – Auto state transitions (CLOSED → OPEN → HALF_OPEN)
Retry queue – Failed sync items automatically retried
Health monitoring – Database, Qdrant, OpenAI API checks
Auto-restart – Failed services automatically recovered

Security & Privacy

Multi-user with data isolation (user-scoped vector queries)
Role-based access control (user/admin)
GDPR-compliant user deletion with verification
OAuth credential encryption
API key management with scopes and rate limiting

Integrations

Slack – @mention handling, company-specific search, context-aware answers
Claude Desktop – Local MCP server (stdio transport)
Claude.ai – Remote MCP server (SSE transport)

Codebase Statistics

Lines of Code

Category	Lines
Total Python	~43,800
Main application (app.py)	6,290
Sync Manager	1,515
Vector DB wrapper	1,352
API v1 endpoints	1,053
Auth module	708
Database models	193
Templates (HTML)	~12,400

File Counts

Type	Count
Python files	95
HTML templates	18
Core modules	36
Data connectors	8
Test files	12
API endpoints	119

Project Structure

knowledgehub/
├── app.py                    # Main Flask app (6,290 LOC, 119 endpoints)
├── supervisor.py             # Process manager
├── run_slackbot.py          # Slack bot runner
├── mcp_server.py            # Claude Desktop MCP (stdio)
├── remote_mcp_server.py     # Claude Web MCP (SSE)
├── src/
│   ├── core/                # 36 core modules
│   │   ├── sync_manager.py  # Parallel sync orchestration
│   │   ├── vector_db.py     # Qdrant wrapper
│   │   ├── embeddings.py    # OpenAI embeddings with cache
│   │   ├── rag_generator.py # Answer synthesis
│   │   ├── query_processor.py # Intent classification
│   │   ├── hybrid_search.py # Vector + BM25 fusion
│   │   ├── reranker.py      # Cross-encoder reranking
│   │   ├── circuit_breaker.py # Resilience pattern
│   │   ├── health_monitor.py # Component health checks
│   │   ├── user_deletion.py # GDPR compliance
│   │   └── ...
│   ├── connectors/          # 8 data source connectors
│   │   ├── base.py          # Common patterns
│   │   ├── gmail.py
│   │   ├── google_drive.py
│   │   ├── slack.py
│   │   ├── zendesk.py
│   │   ├── attio.py
│   │   ├── granola.py
│   │   ├── chatgpt/staging_db.py
│   │   └── dropbox.py
│   └── api/                 # REST API
│       ├── v1/__init__.py   # API v1 endpoints
│       └── openapi.py       # OpenAPI spec
├── templates/               # 18 HTML templates
├── static/                  # CSS, JS assets
└── tests/                   # 12 test files

Performance Optimizations

Optimization	Implementation
Parallel embedding generation	20 workers for Gmail, configurable per connector
Batched vector writes	4 parallel workers for Qdrant operations
Embedding cache	LRU cache (1000 entries, 1-hour TTL)
Lazy module loading	Heavy modules imported only when needed
Dashboard cache	5-second TTL on expensive calculations
HTTP connection pooling	Session reuse across all connectors
Stall detection	Watchdog timer with auto-recovery

Development Statistics

Timeline

Milestone	Date
Project started	January 13, 2026
Current state	January 25, 2026
Total duration	13 days

Git Activity

Metric	Count
Total commits	~275
Pull requests	95
Avg commits/day	~21

Authorship

All code was generated by AI:

Tool	Percentage
Replit Agent	~77%
Claude Code	~23%

A human directed the work through prompts and reviewed pull requests, but wrote zero lines of code directly.

Dependencies

Production dependencies: 42
Development dependencies: 4

Key Challenges

MCP Server Reliability

Required multiple iterations to achieve stable process supervision, health monitoring, and automatic recovery. The MCP protocol's transport mechanisms and authentication requirements were learned through extensive debugging.

External API Integration

Handling timeouts, rate limits, and varying API response formats across 8 different services. Each connector required custom retry logic and error handling.

Sync State Management

Ensuring reliable state persistence and recovery after interruptions. Implemented watchdog timer to detect stalled syncs and automatic restart capabilities.

Performance at Scale

Initial sync was extremely slow with sequential processing. Achieved 10-50x speedup through parallel processing, batching, and connection pooling.

Query Quality

Building an effective search pipeline required combining multiple techniques: semantic search, keyword matching, spell correction, intent classification, and cross-encoder reranking.

Summary

Metric	Value
Development time	13 days
Lines of code	~43,800
Data sources	8
API endpoints	119
Core modules	36
Commits	~275
Human-written code	0 lines

Last updated: January 25, 2026

chrija76/KH_TECH.md

Select an option

No results found