A companion to "Building a Knowledge Hub in 13 Days"
Knowledge Hub is an enterprise knowledge aggregation platform that syncs data from multiple sources into a searchable vector database with AI-powered retrieval-augmented generation (RAG).
Primary use case: Ask natural language questions across all company data and get AI-generated answers with source citations.
| Layer | Technology |
|---|---|
| Backend | Python 3.11, Flask 3.1, Gunicorn |
| Database | PostgreSQL (production), SQLite (development) |
| ORM | Flask-SQLAlchemy with Flask-Migrate (Alembic) |
| Vector Store | Qdrant (1536-dimensional embeddings) |
| Embeddings | OpenAI text-embedding-3-small |
| LLM | Anthropic Claude (claude-sonnet-4-20250514) |
| Reranking | Cohere API with local fallback |
| Search | Hybrid: semantic + BM25 keyword + cross-encoder reranking |
| Spell Correction | SymSpellPy |
| Auth | Google OAuth 2.0 with encrypted credential storage |
| Background Jobs | APScheduler |
| Slack | Slack Bolt SDK with Socket Mode |
| MCP | FastMCP (stdio + SSE transports) |
| Document Processing | PDFPlumber, python-docx, openpyxl, python-pptx, Pillow |
| Deployment | Replit |
```mermaid
flowchart TB
    subgraph SUPERVISOR[" ① SUPERVISOR "]
        health["Health Checks<br/>Auto-Restart"]
    end
    subgraph SERVICES[" ② SERVICES "]
        flask["Flask App<br/>119 Endpoints"]
        mcp["MCP Server<br/>Claude Integration"]
        slack["Slackbot<br/>Team Q&A"]
    end
    subgraph CORE[" ③ CORE - 36 Modules "]
        sync["Sync<br/>Manager"]
        rag["RAG<br/>Engine"]
        query["Query<br/>Processor"]
        search["Hybrid<br/>Search"]
        rerank["Reranker"]
        circuit["Circuit<br/>Breaker"]
    end
    subgraph CONNECTORS[" 🔌 CONNECTORS "]
        sources["Gmail • Drive • Slack • Zendesk<br/>Attio • Granola • ChatGPT • Dropbox"]
    end
    subgraph STORAGE[" 💾 STORAGE "]
        postgres[("PostgreSQL<br/>Users, OAuth<br/>Sync State")]
        qdrant[("Qdrant<br/>Vectors<br/>BM25 Index")]
    end
    subgraph EXTERNAL[" 🌐 EXTERNAL APIs "]
        openai["OpenAI<br/>Embeddings"]
        anthropic["Anthropic<br/>Claude LLM"]
        cohere["Cohere<br/>Reranking"]
    end
    SUPERVISOR --> flask & mcp & slack
    flask & mcp & slack --> CORE
    CORE --> CONNECTORS
    CORE --> postgres & qdrant
    CORE --> openai & anthropic & cohere
```
| Component | Purpose |
|---|---|
| Supervisor | Process manager with health checks and auto-restart |
| Flask App | Web dashboard, 119 REST API endpoints, Google OAuth |
| MCP Server | Enables Claude Desktop/Web to query the knowledge base |
| Slackbot | Lets team members ask questions directly from Slack |
| Sync Manager | Orchestrates parallel data sync (up to 6 concurrent sources) |
| Query Processor | Spell correction, intent classification, query optimization |
| RAG Engine | Answer synthesis with source citations |
| Hybrid Search | Combines semantic vectors + BM25 keyword search |
| Reranker | Cross-encoder reranking via Cohere with local fallback |
| Circuit Breaker | Resilience pattern for external service failures |
| Source | What's Synced | Special Features |
|---|---|---|
| Gmail | Emails, attachments (PDF, DOCX, images) | 20 parallel workers, attachment extraction |
| Google Drive | Docs, Sheets, Slides, PDFs | Format conversion, 10MB file limit |
| Slack | Messages, threads, channels | User enrichment, rate-limit aware |
| Zendesk | Investment opportunities, deal tracking | HTML stripping, 5 parallel workers |
| Attio | Companies, contacts, notes, lists | Configurable object filtering, 365-day recency |
| Granola | AI meeting notes, transcripts | ProseMirror JSON parsing |
| ChatGPT | Exported conversation history | Staging DB with approval workflow |
| Dropbox | Documents, PDFs, text files | 10+ file types, 10MB limit |
Base Connector Features:
- HTTP session pooling for connection reuse
- Retry logic with exponential backoff (max 3 retries)
- Rate limit handling (429 status code detection)
- Request timeout handling (30 seconds default)
- Hybrid search — Vector similarity + BM25 keyword matching
- Query intent classification — Factual, Exploratory, Navigational, Troubleshooting, Person Lookup, Temporal
- Spell correction — SymSpellPy integration
- Cross-encoder reranking — Cohere API with local fallback
- HyDE — Hypothetical Document Embeddings (optional advanced retrieval)
- Dynamic relevancy thresholding — Adjusts cutoff based on result quality
- Source-specific weighting — Freshness decay per source type
- Query caching — LRU cache with TTL
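The document doesn't specify how the semantic and BM25 result lists are fused, but reciprocal rank fusion (RRF) is one common approach, shown here as a minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs (e.g. semantic + BM25) into one.

    Each document scores sum(1 / (k + rank)) across the lists it appears in,
    so items ranked highly by either retriever bubble to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["a", "b", "c"]   # top hits from vector similarity
keyword = ["b", "d", "a"]    # top hits from BM25
fused = reciprocal_rank_fusion([semantic, keyword])  # → ["b", "a", "d", "c"]
```

"b" wins because both retrievers rank it near the top, even though neither ranks it first.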
- Multi-turn conversation context (up to 10 turns)
- Session-based history tracking
- Entity mention tracking across turns
- Source citations in responses
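A 10-turn conversation window like the one above can be kept with a bounded deque; this is a hypothetical sketch of the idea, not the app's session code:

```python
from collections import deque

class ConversationContext:
    """Keep the last N question/answer turns (10 in the app) and render
    them into the prompt so follow-up questions have context."""

    def __init__(self, max_turns: int = 10):
        # deque with maxlen drops the oldest turn automatically
        self.turns: deque[tuple[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_prompt(self) -> str:
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns)
```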
- Parallel multi-source syncing (up to 6 concurrent)
- Full and incremental sync modes
- Automatic stall detection (watchdog)
- Scheduled syncing (hourly/daily/weekly/monthly)
- Comprehensive sync logging to database
- Circuit breaker pattern — Auto state transitions (CLOSED → OPEN → HALF_OPEN)
- Retry queue — Failed sync items automatically retried
- Health monitoring — Database, Qdrant, OpenAI API checks
- Auto-restart — Failed services automatically recovered
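The CLOSED → OPEN → HALF_OPEN transitions can be illustrated with a minimal circuit breaker; thresholds and cooldown values here are illustrative, not the project's settings:

```python
import time

class CircuitBreaker:
    """Minimal sketch of the circuit breaker pattern.

    After `threshold` consecutive failures the circuit opens and calls are
    rejected; once `cooldown` seconds pass, one trial call is allowed
    (HALF_OPEN), and a success closes the circuit again.
    """

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if time.monotonic() - self.opened_at >= self.cooldown:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            raise RuntimeError("circuit open: rejecting call")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Trip on repeated failures, or immediately if the trial call fails
            if self.failures >= self.threshold or self.state == "HALF_OPEN":
                self.opened_at = time.monotonic()
            raise
        self.failures = 0          # success resets the breaker
        self.opened_at = None
        return result
```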
- Multi-user with data isolation (user-scoped vector queries)
- Role-based access control (user/admin)
- GDPR-compliant user deletion with verification
- OAuth credential encryption
- API key management with scopes and rate limiting
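User-scoped vector queries generally work by attaching a payload filter to every Qdrant search. A sketch using Qdrant's JSON filter schema; the `user_id` payload field name is an assumption for illustration:

```python
def user_scoped_filter(user_id: str) -> dict:
    """Payload filter restricting a vector search to one user's documents,
    the basis of per-user data isolation in vector queries."""
    return {"must": [{"key": "user_id", "match": {"value": user_id}}]}

# With the qdrant-client library, the filter would be passed on every
# similarity search so results are scoped to the requesting user, e.g.:
#   client.search(collection_name="documents",
#                 query_vector=embedding,
#                 query_filter=user_scoped_filter(current_user.id),
#                 limit=10)
```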
- Slack — @mention handling, company-specific search, context-aware answers
- Claude Desktop — Local MCP server (stdio transport)
- Claude.ai — Remote MCP server (SSE transport)
| Category | Lines |
|---|---|
| Total Python | ~43,800 |
| Main application (app.py) | 6,290 |
| Sync Manager | 1,515 |
| Vector DB wrapper | 1,352 |
| API v1 endpoints | 1,053 |
| Auth module | 708 |
| Database models | 193 |
| Templates (HTML) | ~12,400 |
| Type | Count |
|---|---|
| Python files | 95 |
| HTML templates | 18 |
| Core modules | 36 |
| Data connectors | 8 |
| Test files | 12 |
| API endpoints | 119 |
```
knowledgehub/
├── app.py                     # Main Flask app (6,290 LOC, 119 endpoints)
├── supervisor.py              # Process manager
├── run_slackbot.py            # Slack bot runner
├── mcp_server.py              # Claude Desktop MCP (stdio)
├── remote_mcp_server.py       # Claude Web MCP (SSE)
├── src/
│   ├── core/                  # 36 core modules
│   │   ├── sync_manager.py    # Parallel sync orchestration
│   │   ├── vector_db.py       # Qdrant wrapper
│   │   ├── embeddings.py      # OpenAI embeddings with cache
│   │   ├── rag_generator.py   # Answer synthesis
│   │   ├── query_processor.py # Intent classification
│   │   ├── hybrid_search.py   # Vector + BM25 fusion
│   │   ├── reranker.py        # Cross-encoder reranking
│   │   ├── circuit_breaker.py # Resilience pattern
│   │   ├── health_monitor.py  # Component health checks
│   │   ├── user_deletion.py   # GDPR compliance
│   │   └── ...
│   ├── connectors/            # 8 data source connectors
│   │   ├── base.py            # Common patterns
│   │   ├── gmail.py
│   │   ├── google_drive.py
│   │   ├── slack.py
│   │   ├── zendesk.py
│   │   ├── attio.py
│   │   ├── granola.py
│   │   ├── chatgpt/staging_db.py
│   │   └── dropbox.py
│   └── api/                   # REST API
│       ├── v1/__init__.py     # API v1 endpoints
│       └── openapi.py         # OpenAPI spec
├── templates/                 # 18 HTML templates
├── static/                    # CSS, JS assets
└── tests/                     # 12 test files
```
| Optimization | Implementation |
|---|---|
| Parallel embedding generation | 20 workers for Gmail, configurable per connector |
| Batched vector writes | 4 parallel workers for Qdrant operations |
| Embedding cache | LRU cache (1000 entries, 1-hour TTL) |
| Lazy module loading | Heavy modules imported only when needed |
| Dashboard cache | 5-second TTL on expensive calculations |
| HTTP connection pooling | Session reuse across all connectors |
| Stall detection | Watchdog timer with auto-recovery |
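The embedding and dashboard caches in the table combine LRU eviction with a TTL. A minimal sketch of that combination (not the project's actual cache class):

```python
import time
from collections import OrderedDict

class TTLCache:
    """LRU cache with a per-entry time-to-live, as described above for
    embeddings (1000 entries, 1-hour TTL) and the dashboard (5-second TTL)."""

    def __init__(self, maxsize: int = 1000, ttl: float = 3600.0):
        self.maxsize = maxsize
        self.ttl = ttl
        self._data: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]          # expired: drop and miss
            return None
        self._data.move_to_end(key)      # mark as recently used
        return value

    def put(self, key: str, value) -> None:
        self._data[key] = (time.monotonic(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least-recently-used
```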
| Milestone | Date |
|---|---|
| Project started | January 13, 2026 |
| Current state | January 25, 2026 |
| Total duration | 13 days |
| Metric | Count |
|---|---|
| Total commits | ~275 |
| Pull requests | 95 |
| Avg commits/day | ~21 |
All code was generated by AI:
| Tool | Percentage |
|---|---|
| Replit Agent | ~77% |
| Claude Code | ~23% |
A human directed the work through prompts and reviewed pull requests, but wrote zero lines of code directly.
- Production dependencies: 42
- Development dependencies: 4
- Supervisor & MCP integration: Stable process supervision, health monitoring, and automatic recovery took multiple iterations; the MCP protocol's transport mechanisms and authentication requirements were learned through extensive debugging.
- Connector robustness: Handling timeouts, rate limits, and varying API response formats across 8 different services meant each connector needed custom retry logic and error handling.
- Sync reliability: Ensuring reliable state persistence and recovery after interruptions led to a watchdog timer that detects stalled syncs and restarts them automatically.
- Sync performance: Initial sync was extremely slow with sequential processing; parallel processing, batching, and connection pooling achieved a 10-50x speedup.
- Search quality: An effective search pipeline required combining semantic search, keyword matching, spell correction, intent classification, and cross-encoder reranking.
| Metric | Value |
|---|---|
| Development time | 13 days |
| Lines of code | ~43,800 |
| Data sources | 8 |
| API endpoints | 119 |
| Core modules | 36 |
| Commits | ~275 |
| Human-written code | 0 lines |
Last updated: January 25, 2026