@swalke16
Last active February 12, 2026 14:32
Secure MongoDB Query Agent for Bizy - RubyLLM/Agent Framework approach
name: MongoDB Query Agent
overview: Build a secure, company-scoped MongoDB query capability for Bizy using RubyLLM's native tool pattern, with defense-in-depth security architecture.
isProject: false

todos:

| id | content | status |
| --- | --- | --- |
| infrastructure | Create read-only MongoDB user and configure bizy_readonly Mongoid client | pending |
| security-executor | Create MongodbQueryExecutor with read-only enforcement and secondary-only mode | pending |
| security-registry | Create MongodbCollectionRegistry with scope types for company/user collections | pending |
| security-validator | Create MongodbSecurityValidator with all 11 attack vector mitigations | pending |
| security-scoper | Create MongodbQueryScoper for company/user filter injection | pending |
| tools-find | Create MongodbFind RubyLLM tool with scoping integration | pending |
| tools-aggregate | Create MongodbAggregate RubyLLM tool with pipeline validation | pending |
| tools-discovery | Create ListCollections, DescribeCollection, and Docs tools | pending |
| agent | Create Bizy::Agents::Data and register in Bizy::Chat | pending |
| documentation | Create curated collection docs, freshness checker, generator, and CI safety net | pending |

Secure MongoDB Query Agent

1. Overview

What We're Building

A secure MongoDB query capability for Bizy that allows company admins to explore their data through natural language. The LLM can formulate and execute MongoDB queries, with all results automatically scoped to the user's company.

Goals

| Goal | Why It Matters |
| --- | --- |
| Admin-only access | Only company admins can query raw data |
| Company scoping | All queries automatically filtered to the user's company |
| Read-only | Zero ability to modify data, enforced at multiple layers |
| Flexible queries | Support find, aggregate, and count operations |
| LLM-friendly | Documentation and discovery tools help the LLM formulate queries |

Integration with Bizy Agent Framework

This capability integrates with the Bizy Agent Framework as a specialized sub-agent:

┌─────────────────────────────────────────────────────────────┐
│ Bizy::Chat (Main Orchestrator)                               │
│   Core tools: GetOrgContext, GetRecognition, GetMilestones   │
│   Agents:     MeetingAgent, AnalyticsAgent, DataAgent        │
│                                                              │
│   User (admin): "How many bonuses were given last month      │
│                  grouped by department?"                     │
│         ↓                                                    │
│   Bizy decides: This needs raw data → invoke DataAgent       │
│         ↓                                                    │
│   ┌─────────────────────────────────────────┐                │
│   │ DataAgent                                │                │
│   │   Tools: MongodbFind, MongodbAggregate   │                │
│   │   Runs aggregation pipeline              │                │
│   │   Returns: Grouped results               │                │
│   └─────────────────────────────────────────┘                │
└─────────────────────────────────────────────────────────────┘

2. Architecture

High-Level Flow

sequenceDiagram
    participant User
    participant BizyChat as Bizy::Chat
    participant DataAgent as DataAgent
    participant Tool as MongodbFind/Aggregate
    participant Scoper as QueryScoper
    participant Validator as SecurityValidator
    participant Executor as QueryExecutor
    participant DB as MongoDB Secondary

    User->>BizyChat: "Show me top recognizers"
    BizyChat->>BizyChat: Check admin? Yes
    BizyChat->>DataAgent: Delegate question
    DataAgent->>Tool: execute(collection, pipeline)
    Tool->>Scoper: scope_aggregate(collection, pipeline)
    Scoper->>Validator: validate_pipeline!
    Validator-->>Scoper: Valid
    Scoper-->>Tool: Scoped pipeline
    Tool->>Executor: aggregate(collection, pipeline)
    Executor->>DB: Execute on secondary
    DB-->>Executor: Results
    Executor-->>Tool: Documents
    Tool-->>DataAgent: Formatted results
    DataAgent-->>BizyChat: Answer
    BizyChat-->>User: "Here are your top recognizers..."

Design Approach: Allowlisted Collections with Query Rewriting

This approach balances flexibility with security:

  1. Admin-only access - Only company admins can use the DataAgent
  2. Allowlist collections - Only permit queries against pre-defined collections
  3. Auto-inject scoping filter - Based on collection type:
    • Company-scoped: Inject company_id = user.company_id
    • User-scoped: Validate user_id IN (all company user IDs)
  4. Validate aggregation pipelines - Ensure $match is first stage, inject scoping filter
  5. Block dangerous operators - Disallow $out, $merge, and unrestricted $lookup
  6. Read-only - Only support find, aggregate, count operations

Two Scoping Strategies

flowchart TD
    subgraph auth [Authorization]
        A[Request from Bizy::Chat] --> B[DataAgent.available?]
        B --> C{Is user admin?}
        C -->|No| D[Agent not available]
        C -->|Yes| E[Get company from user]
    end

    subgraph scope [Scoping]
        E --> F{Collection type?}
        F -->|company_id field| G["Inject: company_id = X"]
        F -->|user_id field| H[Validate all user_ids in company]
        H --> I["Keep filter if valid"]
    end

    G --> J[Execute query]
    I --> J

Company-scoped collections (most collections):

  • Have a company_id field
  • Scoper injects: { company_id: company.id }
  • Any user-provided company_id is overwritten (prevents injection)

User-scoped collections (e.g., app_notifications, llm_runs):

  • Have a user_id field but no company_id
  • Caller must specify user_id or { user_id: { $in: [...] } }
  • Scoper validates ALL specified user_ids belong to the company
  • Rejects entire query if any user_id is outside the company
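The two strategies above can be sketched as a single standalone helper. This is an illustrative re-implementation in plain Ruby, not the production code (which lives in `MongodbQueryScoper` below); the parameter names are assumptions standing in for the registry config and the authenticated user's company, and the company branch shown here only overwrites the top-level field (the real scoper also sanitizes nested occurrences):

```ruby
# Sketch of the two scoping strategies: company scope is forcibly injected,
# user scope is validated against the set of company member IDs.
def scope_filter(filter, scope_type:, scope_field:, company_id:, company_user_ids:)
  case scope_type
  when :company
    # Overwrite any caller-supplied value so it cannot be injected.
    filter.merge(scope_field => company_id)
  when :user
    ids = Array(filter[scope_field])  # handles a single id or an array of ids
    raise ArgumentError, "user_id is required" if ids.empty?
    # Reject the whole query if ANY requested id is outside the company.
    raise ArgumentError, "Query not permitted" unless (ids - company_user_ids).empty?
    filter
  end
end
```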

3. Security Architecture

Defense in Depth

CRITICAL: The query capability must have ZERO ability to modify data. This is enforced at multiple layers:

Layer 1: Database-Level (Strongest)

Create a dedicated MongoDB user with read-only permissions:

// MongoDB shell command to create read-only user
db.createUser({
  user: "bizy_readonly",
  pwd: "<secure-password>",
  roles: [
    { role: "read", db: "bonusly_production" }
  ]
})

Configure a separate Mongoid client with secondary-only reads:

# config/mongoid.yml
production:
  clients:
    default:
      # ... normal read/write connection
    bizy_readonly:
      uri: <%= ENV['MONGODB_BIZY_READONLY_URI'] %>
      options:
        read:
          mode: :secondary  # NOT secondary_preferred - ONLY secondary replicas
        max_pool_size: 5
        server_selection_timeout: 10  # Fail fast if no secondary available

Important: Use :secondary mode, NOT :secondary_preferred. This ensures:

  • Queries NEVER touch the primary database
  • If no secondary is available, the query fails (rather than falling back to primary)
  • Production primary is completely isolated from query load

Why this is the strongest layer: Even if application code has bugs, MongoDB itself will reject any write operations with a permission error.

Layer 2: Code-Level (API Surface)

Only expose read operations - no insert, update, delete, or bulk write methods:

# lib/bizy/mongodb_query_executor.rb
module Bizy
  class MongodbQueryExecutor
    MAX_LIMIT = 100
    DEFAULT_LIMIT = 20
    QUERY_TIMEOUT_MS = 30_000

    def initialize
      @client = Mongoid.client(:bizy_readonly)
    end

    # The ONLY three public methods - no write operations exist
    def find(collection:, filter:, projection: nil, sort: nil, limit: DEFAULT_LIMIT)
      limit = limit.to_i.clamp(1, MAX_LIMIT)  # never 0: limit(0) means "no limit" in MongoDB

      cursor = @client[collection]
        .find(filter)
        .projection(projection || {})
        .sort(sort || { _id: -1 })
        .limit(limit)
        .max_time_ms(QUERY_TIMEOUT_MS)

      cursor.to_a
    end

    def aggregate(collection:, pipeline:)
      @client[collection].aggregate(
        pipeline,
        max_time_ms: QUERY_TIMEOUT_MS
      ).to_a
    end

    def count(collection:, filter:)
      @client[collection]
        .count_documents(filter, max_time_ms: QUERY_TIMEOUT_MS)
    end

    # NO insert, update, delete, bulk_write methods exist
    # The class literally cannot write - it only has read methods

    private

    attr_reader :client
  end
end
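One subtlety in the `find` method: MongoDB treats a cursor limit of 0 as "no limit", so a naive `[limit, MAX_LIMIT].min` can be tricked into returning unlimited results by passing `limit: 0`. A defensive clamp (hypothetical standalone helper, extracted for illustration) falls back to the default for zero or negative values:

```ruby
MAX_LIMIT = 100
DEFAULT_LIMIT = 20

# Clamp a caller-supplied limit into 1..MAX_LIMIT; nil, zero, and negative
# values fall back to the default rather than becoming "unlimited".
def clamp_limit(limit)
  value = (limit || DEFAULT_LIMIT).to_i
  value = DEFAULT_LIMIT if value <= 0
  value.clamp(1, MAX_LIMIT)
end
```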

Layer 3: Pipeline Validation

Block ALL aggregation stages that can write or modify data:

# lib/bizy/mongodb_pipeline_validator.rb
module Bizy
  class MongodbPipelineValidator
    # $out          - writes to a collection
    # $merge        - writes/updates a collection
    # $lookup       - cross-collection join (data leakage)
    # $graphLookup  - recursive cross-collection join
    # $unionWith    - unions in unfiltered collection data
    # $documents    - creates documents from scratch
    # (Comments cannot appear inside a %w literal, so they are listed above.)
    FORBIDDEN_STAGES = %w[$out $merge $lookup $graphLookup $unionWith $documents].freeze

    def self.validate!(pipeline)
      new.validate!(pipeline)
    end

    def validate!(pipeline)
      validate_pipeline_recursive!(pipeline)
    end

    private

    def validate_pipeline_recursive!(pipeline)
      pipeline.each do |stage|
        stage_name = stage.keys.first.to_s

        if FORBIDDEN_STAGES.include?(stage_name)
          raise Bizy::ForbiddenOperatorError,
            "Aggregation stage '#{stage_name}' is not allowed"
        end

        # Recursively validate $facet sub-pipelines
        if stage_name == "$facet"
          stage["$facet"].each_value do |sub_pipeline|
            validate_pipeline_recursive!(sub_pipeline)
          end
        end
      end
    end
  end
end
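The validator's behavior can be exercised with a condensed, dependency-free version of the same check (a standalone re-implementation for illustration only; the in-repo class raises `Bizy::ForbiddenOperatorError` rather than `ArgumentError`):

```ruby
FORBIDDEN = %w[$out $merge $lookup $graphLookup $unionWith $documents].freeze

# Walk the pipeline, rejecting forbidden stages and recursing into $facet,
# whose values are themselves sub-pipelines.
def validate_pipeline!(pipeline)
  pipeline.each do |stage|
    name = stage.keys.first.to_s
    raise ArgumentError, "stage '#{name}' is not allowed" if FORBIDDEN.include?(name)
    stage["$facet"].each_value { |sub| validate_pipeline!(sub) } if name == "$facet"
  end
end
```

A plain `$group` pipeline passes, while both a top-level `$out` and a `$unionWith` hidden inside a `$facet` branch are rejected before any query runs.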

Layer 4: Query Validation

Reject any query operators that could bypass security:

# lib/bizy/mongodb_security_validator.rb
module Bizy
  class MongodbSecurityValidator
    # Filter operators that can bypass field-level security:
    # $where and $function both execute JavaScript, which can bypass any filter.
    # (Comments cannot appear inside a %w literal, so they are listed above.)
    FORBIDDEN_FILTER_OPERATORS = %w[$where $function].freeze
    # NOTE: $expr is ALLOWED - it cannot bypass top-level field matches
    #       and is required for existing analytics pipelines

    MAX_REGEX_LENGTH = 100
    VALID_FIELD_NAME = /\A[\w.$]+\z/

    def self.validate_filter!(filter)
      new.validate_filter!(filter)
    end

    def validate_filter!(filter)
      validate_filter_recursive!(filter)
    end

    private

    def validate_filter_recursive!(filter)
      return unless filter.is_a?(Hash)

      filter.each do |key, value|
        validate_field_name!(key)
        validate_not_forbidden_operator!(key)
        validate_regex_length!(key, value)

        validate_filter_recursive!(value) if value.is_a?(Hash)
        value.each { |v| validate_filter_recursive!(v) if v.is_a?(Hash) } if value.is_a?(Array)
      end
    end

    def validate_field_name!(key)
      unless key.to_s.match?(VALID_FIELD_NAME)
        raise Bizy::InvalidFilterError, "Invalid field name format"
      end
    end

    def validate_not_forbidden_operator!(key)
      if FORBIDDEN_FILTER_OPERATORS.include?(key.to_s)
        raise Bizy::ForbiddenOperatorError,
          "Operator '#{key}' is not allowed"
      end
    end

    def validate_regex_length!(key, value)
      if key.to_s == "$regex" && value.to_s.length > MAX_REGEX_LENGTH
        raise Bizy::InvalidFilterError, "Regex pattern too long"
      end
    end
  end
end
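A dependency-free sketch of the same recursive walk shows the two checks in action, blocking JavaScript operators at any nesting depth and capping regex length (standalone illustration; errors simplified to `ArgumentError`):

```ruby
FORBIDDEN_OPS = %w[$where $function].freeze
MAX_REGEX_LENGTH = 100

# Recursively validate a filter hash: reject JS-executing operators anywhere
# in the structure and cap $regex pattern length to limit ReDoS risk.
def validate_filter!(filter)
  return unless filter.is_a?(Hash)
  filter.each do |key, value|
    raise ArgumentError, "operator not allowed" if FORBIDDEN_OPS.include?(key.to_s)
    if key.to_s == "$regex" && value.to_s.length > MAX_REGEX_LENGTH
      raise ArgumentError, "regex too long"
    end
    validate_filter!(value) if value.is_a?(Hash)
    value.each { |v| validate_filter!(v) if v.is_a?(Hash) } if value.is_a?(Array)
  end
end
```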

Summary: Defense in Depth

| Layer | What it blocks | Failure mode |
| --- | --- | --- |
| Read-only DB user | All writes | MongoDB permission error |
| Secondary-only mode | Primary DB access | Query fails if no secondary available |
| API surface | Write operations (no write methods exposed) | Method doesn't exist |
| Pipeline validation | $out, $merge, $lookup | Application error before query runs |
| Query validation | $where, $function | Application error before query runs |

All layers are mandatory:

  • Read-only DB user: Prevents any data modification at the database level
  • Secondary-only mode: Isolates primary from query load; queries can never affect write performance
  • API surface: Defense against code bugs - write methods don't exist
  • Pipeline validation: Defense against LLM-crafted malicious aggregation pipelines
  • Query validation: Defense against JavaScript injection and regex attacks

4. Security Audit: Attack Vectors and Mitigations

This section documents all considered attack vectors and their mitigations.

Attack 1: Nested Filter Injection (CRITICAL)

Attack:

filter: {
  "$or": [
    { "company_id": "other_company_id", "amount": { "$gt": 0 } },
    { "amount": { "$lt": 0 } }
  ]
}

Problem: Simple filter.merge("company_id" => X) only sets top-level company_id. The $or clause still contains the attacker's company_id and will match documents from other companies.

Mitigation: Recursively sanitize company_id from entire filter structure:

def sanitize_scope_field!(filter, scope_field)
  filter.keys.each do |key|  # snapshot keys: don't delete from a Hash mid-iteration
    value = filter[key]
    if key.to_s == scope_field
      filter.delete(key)  # Remove any user-provided scope field
    elsif value.is_a?(Hash)
      sanitize_scope_field!(value, scope_field)
    elsif value.is_a?(Array)
      value.each { |v| sanitize_scope_field!(v, scope_field) if v.is_a?(Hash) }
    end
  end
end

# Then inject our scope field at top level
filter[scope_field] = @company_id
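Run against the attack payload above, the sanitizer strips every nested `company_id` before the trusted value is injected. A runnable standalone sketch of the mitigation (the in-repo version raises Bizy-specific errors and lives in the scoper):

```ruby
# Recursively delete any attacker-supplied scope field from the filter tree.
def sanitize_scope_field!(filter, scope_field)
  filter.keys.each do |key|  # snapshot keys before deleting entries
    value = filter[key]
    if key.to_s == scope_field
      filter.delete(key)
    elsif value.is_a?(Hash)
      sanitize_scope_field!(value, scope_field)
    elsif value.is_a?(Array)
      value.each { |v| sanitize_scope_field!(v, scope_field) if v.is_a?(Hash) }
    end
  end
end

# The nested-injection payload from the attack above:
attack = {
  "$or" => [
    { "company_id" => "other_company_id", "amount" => { "$gt" => 0 } },
    { "amount" => { "$lt" => 0 } }
  ]
}
sanitize_scope_field!(attack, "company_id")
attack["company_id"] = "my_company_id"  # trusted value injected at top level
```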

Attack 2: $where JavaScript Injection (CRITICAL)

Attack:

filter: { "$where": "this.company_id == 'other_company_id'" }

Problem: $where executes JavaScript and can bypass any field-level filters.

Mitigation: Block dangerous filter operators (see MongodbSecurityValidator above).

Attack 3: $unionWith Collection Bypass (CRITICAL)

Attack:

pipeline: [
  { "$match": { "company_id": my_company } },
  { "$unionWith": { "coll": "bonuses" } },  # Unions ALL bonuses, unfiltered!
  { "$match": { "company_id": "other_company" } }
]

Problem: $unionWith brings in data from another collection without applying the company filter.

Mitigation: Block in forbidden stages list (see MongodbPipelineValidator above).

Attack 4: $expr Filter Override (RE-EVALUATED: SAFE TO ALLOW)

Initial concern:

filter: { "$expr": { "$eq": ["$company_id", "other_company_id"] } }

Analysis: After deeper review, $expr cannot bypass top-level field matches.

MongoDB query semantics treat all top-level conditions as an implicit AND:

# After our injection, the filter contains both conditions:
{
  "company_id" => my_company,                         # our top-level field match
  "$expr" => { "$eq" => ["$company_id", "other"] }    # attacker's $expr
}

Both conditions must be true:

  1. company_id == my_company (TRUE for company's documents)
  2. $company_id == "other", i.e. my_company == "other" (FALSE)

Result: No documents match. The attacker's $expr cannot override our field match.

Decision: $expr is ALLOWED - removed from FORBIDDEN_FILTER_OPERATORS.

Attack 5: Type Confusion in User ID Validation (MEDIUM)

Attack:

# Attacker sends string ID
filter: { "user_id": "507f1f77bcf86cd799439011" }
# Validation might check ObjectId, but filter uses string

Problem: Mismatch between validation type and query type could allow bypass.

Mitigation: Normalize IDs in the filter itself:

def normalize_user_ids_in_filter!(filter, scope_field)
  user_id_value = filter[scope_field]
  return unless user_id_value

  filter[scope_field] = case user_id_value
  when String
    BSON::ObjectId(user_id_value)
  when BSON::ObjectId
    user_id_value
  when Hash
    if user_id_value["$in"]
      { "$in" => user_id_value["$in"].map { |id| BSON::ObjectId(id.to_s) } }
    else
      user_id_value
    end
  when Array
    { "$in" => user_id_value.map { |id| BSON::ObjectId(id.to_s) } }
  else
    # Never fall through to nil - that would silently match nil user_ids
    raise Bizy::InvalidFilterError, "Query not permitted"
  end
end

Attack 6: Regex Denial of Service (MEDIUM)

Attack:

filter: { "reason": { "$regex": "(a+)+$" } }

Problem: Catastrophic backtracking in regex causes CPU exhaustion (ReDoS).

Mitigation: Limit regex length (see MongodbSecurityValidator above).

Attack 7: Information Disclosure via Errors (MEDIUM)

Attack:

# Probe for collection names
collection: "secret_internal_collection"
# Error reveals: "Collection 'secret_internal_collection' is not available"

Problem: Detailed error messages leak information about data existence.

Mitigation: Use generic error messages:

# Bad - reveals information
raise "User ID 'abc' does not belong to this company"
raise "Collection 'xyz' is not available"

# Good - generic errors
raise Bizy::NotAuthorizedError, "Query not permitted"

Attack 8: Data Enumeration via Rate Abuse (MEDIUM)

Attack:

# Systematically extract all company data
loop do
  results = query(filter: { "_id": { "$gt": last_id } }, limit: 100)
  break if results.empty?
  last_id = results.last["_id"]
end

Problem: Even with proper scoping, attackers can extract entire datasets.

Mitigation: Audit logging for security review:

# lib/bizy/mongodb_audit_logger.rb
module Bizy
  class MongodbAuditLogger
    def self.log_query(user:, company:, collection:, filter:)
      # Logger#info takes a single message, so serialize the payload ourselves
      Rails.logger.info(
        "[Bizy MongoDB Query] " + {
          user_id: user.id.to_s,
          company_id: company.id.to_s,
          collection: collection,
          filter_keys: filter.keys,  # Log structure, not values (PII)
          timestamp: Time.current.iso8601
        }.to_json
      )
    end
  end
end
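The key property of the audit log is that only the filter's *keys* are recorded, never its values. A dependency-free illustration using the stdlib `Logger` in place of `Rails.logger`:

```ruby
require "logger"
require "json"
require "stringio"

out = StringIO.new
logger = Logger.new(out)

# A filter whose values may contain sensitive text:
filter = { "amount" => { "$gt" => 10 }, "reason" => "confidential text" }

# Log the structure (keys) only - the values never reach the log line.
logger.info("[Bizy MongoDB Query] " + { filter_keys: filter.keys }.to_json)

line = out.string
```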

Attack 9: $facet Nested Pipeline Bypass (MEDIUM)

Attack:

pipeline: [
  { "$match": { "company_id": my_company } },
  { "$facet": {
    "legitimate": [{ "$match": { "amount": 10 } }],
    "attack": [{ "$unionWith": { "coll": "users" } }]  # Nested forbidden stage!
  }}
]

Problem: $facet contains nested pipelines that might contain forbidden stages.

Mitigation: Recursively validate nested pipelines (see MongodbPipelineValidator above).

Attack 10: Field Name Injection (LOW)

Attack:

filter: { "company_id\x00": "other" }      # Null byte
filter: { "company_id\u200B": "other" }    # Zero-width space

Problem: Unicode tricks might bypass string matching for field names.

Mitigation: Validate field name characters (see MongodbSecurityValidator above).

Attack 11: Token Theft (LOW)

Attack: Steal or replay a valid session token to impersonate another user.

Mitigation: Already handled by existing authentication:

  • Users authenticated via standard Rails session
  • Company context derived from authenticated user
  • No separate token system needed (unlike MCP approach)

Attack Vector Summary

| Priority | Attack Vector | Mitigation | Status |
| --- | --- | --- | --- |
| CRITICAL | Nested filter injection | Recursive sanitization of scope field | Required |
| CRITICAL | $where JavaScript | Block $where, $function | Required |
| CRITICAL | $unionWith bypass | Block $unionWith, $documents | Required |
| SAFE | $expr override | ALLOWED - cannot bypass field matches | Safe to allow |
| MEDIUM | Type confusion | Normalize IDs in filter | Required |
| MEDIUM | ReDoS | Limit regex length | Required |
| MEDIUM | Info disclosure | Generic error messages | Required |
| MEDIUM | Enumeration | Audit logging | Required |
| MEDIUM | $facet nesting | Recursive pipeline validation | Required |
| LOW | Field name tricks | Validate field name characters | Required |
| LOW | Token theft | Existing authentication | Already handled |

5. Core Components

Collection Registry

Defines which collections are queryable and their scoping rules:

# lib/bizy/mongodb_collection_registry.rb
module Bizy
  class MongodbCollectionRegistry
    # Scoping types:
    # - :company - Filter by company_id field
    # - :user    - Filter by user_id IN (validated company user IDs)

    COLLECTIONS = {
      # Company-scoped collections (have company_id field)
      "bonuses" => {
        scope_type: :company,
        scope_field: "company_id",
        description: "Recognition given between employees",
        docs: ["collections/bonuses"]
      },
      "users" => {
        scope_type: :company,
        scope_field: "company_id",
        description: "Employee records",
        docs: ["collections/users"]
      },
      "check_ins" => {
        scope_type: :company,
        scope_field: "company_id",
        description: "Employee check-ins",
        docs: ["collections/check_ins"]
      },
      "monthly_user_statistics" => {
        scope_type: :company,
        scope_field: "company_id",
        description: "Monthly aggregated user metrics",
        docs: ["collections/monthly_user_statistics"]
      },

      # User-scoped collections (have user_id but no company_id)
      "app_notifications" => {
        scope_type: :user,
        scope_field: "user_id",
        description: "User notifications",
        docs: ["collections/app_notifications"]
      },
      "llm_runs" => {
        scope_type: :user,
        scope_field: "user_id",
        description: "LLM query history",
        docs: ["collections/llm_runs"]
      },
      "user_content_dismissals" => {
        scope_type: :user,
        scope_field: "user_id",
        description: "Dismissed content by user",
        docs: ["collections/user_content_dismissals"]
      },
      # ... add more as needed
    }.freeze

    class << self
      def allowed?(collection_name)
        COLLECTIONS.key?(collection_name)
      end

      def config(collection_name)
        COLLECTIONS[collection_name]
      end

      def scope_type(collection_name)
        COLLECTIONS.dig(collection_name, :scope_type)
      end

      def scope_field(collection_name)
        COLLECTIONS.dig(collection_name, :scope_field)
      end

      def available_names
        COLLECTIONS.keys
      end

      def all_with_descriptions
        COLLECTIONS.map do |name, config|
          {
            name: name,
            description: config[:description],
            scope_type: config[:scope_type],
            doc_paths: config[:docs] || []
          }
        end
      end
    end
  end
end
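The registry is just a frozen hash behind a lookup API. A minimal stand-in (two illustrative entries only) shows how callers use it:

```ruby
# Miniature stand-in for MongodbCollectionRegistry, for illustration.
COLLECTIONS = {
  "bonuses" => { scope_type: :company, scope_field: "company_id" },
  "app_notifications" => { scope_type: :user, scope_field: "user_id" }
}.freeze

# Unknown collections simply aren't present, so allowed? is a key check.
def allowed?(name)
  COLLECTIONS.key?(name)
end

def scope_type(name)
  COLLECTIONS.dig(name, :scope_type)
end
```

Because the allowlist is a compile-time constant, a collection that was never registered can never be queried, regardless of what the LLM asks for.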

Query Scoper

Handles query rewriting for both company-scoped and user-scoped collections:

# lib/bizy/mongodb_query_scoper.rb
module Bizy
  class MongodbQueryScoper
    def initialize(company:)
      @company = company
      @company_id = company.id
    end

    def scope_find(collection:, filter:)
      validate_collection!(collection)
      filter = filter.deep_dup
      MongodbSecurityValidator.validate_filter!(filter)
      merge_scope_filter(collection, filter)
    end

    def scope_aggregate(collection:, pipeline:)
      validate_collection!(collection)
      pipeline = pipeline.deep_dup
      MongodbPipelineValidator.validate!(pipeline)
      inject_scope_into_pipeline(collection, pipeline)
    end

    private

    def validate_collection!(collection)
      unless MongodbCollectionRegistry.allowed?(collection)
        raise Bizy::UnauthorizedCollectionError, "Query not permitted"
      end
    end

    def merge_scope_filter(collection, filter)
      config = MongodbCollectionRegistry.config(collection)

      case config[:scope_type]
      when :company
        scope_company_filter(config[:scope_field], filter)
      when :user
        scope_user_filter(config[:scope_field], filter)
      else
        raise "Unknown scope type: #{config[:scope_type]}"
      end
    end

    def scope_company_filter(scope_field, filter)
      # Remove any nested company_id to prevent injection
      sanitize_scope_field!(filter, scope_field)
      # Inject our company_id at top level
      filter.merge(scope_field => @company_id)
    end

    def scope_user_filter(scope_field, filter)
      requested_user_ids = extract_user_ids(filter[scope_field])

      if requested_user_ids.present?
        # Validate ALL requested user_ids belong to this company
        validate_user_ids_in_company!(requested_user_ids)
        # Normalize IDs in the filter
        normalize_user_ids_in_filter!(filter, scope_field)
        filter
      else
        # No user_ids specified - require them for user-scoped queries
        raise Bizy::InvalidFilterError,
          "user_id is required for this collection"
      end
    end

    def sanitize_scope_field!(filter, scope_field)
      filter.keys.each do |key|  # snapshot keys: don't delete from a Hash mid-iteration
        value = filter[key]
        if key.to_s == scope_field
          filter.delete(key)
        elsif value.is_a?(Hash)
          sanitize_scope_field!(value, scope_field)
        elsif value.is_a?(Array)
          value.each { |v| sanitize_scope_field!(v, scope_field) if v.is_a?(Hash) }
        end
      end
    end

    def extract_user_ids(user_id_filter)
      case user_id_filter
      when BSON::ObjectId, String
        [normalize_id(user_id_filter)]
      when Hash
        user_id_filter["$in"]&.map { |id| normalize_id(id) }
      when Array
        user_id_filter.map { |id| normalize_id(id) }
      else
        nil
      end
    end

    def validate_user_ids_in_company!(user_ids)
      valid_count = @company.users.where(:_id.in => user_ids).count
      unless valid_count == user_ids.length
        raise Bizy::UnauthorizedUserError, "Query not permitted"
      end
    end

    def normalize_id(id)
      id.is_a?(BSON::ObjectId) ? id : BSON::ObjectId(id.to_s)
    end

    def normalize_user_ids_in_filter!(filter, scope_field)
      user_id_value = filter[scope_field]
      return unless user_id_value

      filter[scope_field] = case user_id_value
      when String
        BSON::ObjectId(user_id_value)
      when BSON::ObjectId
        user_id_value
      when Hash
        if user_id_value["$in"]
          { "$in" => user_id_value["$in"].map { |id| BSON::ObjectId(id.to_s) } }
        else
          user_id_value
        end
      when Array
        { "$in" => user_id_value.map { |id| BSON::ObjectId(id.to_s) } }
      else
        # Never fall through to nil - that would silently match nil user_ids
        raise Bizy::InvalidFilterError, "Query not permitted"
      end
    end

    def inject_scope_into_pipeline(collection, pipeline)
      config = MongodbCollectionRegistry.config(collection)

      case config[:scope_type]
      when :company
        inject_company_filter_into_pipeline(config[:scope_field], pipeline)
      when :user
        validate_user_filter_in_pipeline(config[:scope_field], pipeline)
      end

      pipeline
    end

    def inject_company_filter_into_pipeline(scope_field, pipeline)
      if pipeline.first&.key?("$match")
        # Sanitize and inject into existing $match
        sanitize_scope_field!(pipeline.first["$match"], scope_field)
        pipeline.first["$match"][scope_field] = @company_id
      else
        # Prepend new $match stage
        pipeline.unshift({ "$match" => { scope_field => @company_id } })
      end
    end

    def validate_user_filter_in_pipeline(scope_field, pipeline)
      first_match = pipeline.first&.dig("$match")
      unless first_match
        raise Bizy::InvalidFilterError,
          "Pipeline must start with $match containing user_id"
      end

      user_ids = extract_user_ids(first_match[scope_field])
      unless user_ids.present?
        raise Bizy::InvalidFilterError,
          "user_id is required in first $match stage"
      end

      validate_user_ids_in_company!(user_ids)
    end
  end
end
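The pipeline branch of the scoper has two cases: force the company filter into an existing first `$match`, or prepend a new one. A standalone sketch of `inject_company_filter_into_pipeline` (simplified: the real code recursively sanitizes the `$match` rather than deleting only the top-level key):

```ruby
# Inject the trusted company filter into an aggregation pipeline:
# - if the pipeline starts with $match, overwrite the scope field there;
# - otherwise prepend a new $match stage.
def inject_company_filter!(pipeline, scope_field, company_id)
  if pipeline.first&.key?("$match")
    match = pipeline.first["$match"]
    match.delete(scope_field)        # drop any attacker-supplied value
    match[scope_field] = company_id
  else
    pipeline.unshift({ "$match" => { scope_field => company_id } })
  end
  pipeline
end
```

Either way, every pipeline the executor sees begins with a `$match` that pins `company_id` to the authenticated user's company.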

Error Classes

# lib/bizy/errors.rb
module Bizy
  class UnauthorizedCollectionError < StandardError; end
  class UnauthorizedUserError < StandardError; end
  class InvalidFilterError < StandardError; end
  class ForbiddenOperatorError < StandardError; end
end

6. RubyLLM Tools

Data Agent

# app/lib/bizy/agents/data.rb
class Bizy::Agents::Data < Bizy::Agents::Base
  description <<~DESC
    Query MongoDB collections for data analysis and reporting.

    Use this agent when the user asks about:
    - Raw data from collections (bonuses, users, check_ins)
    - Custom aggregations, groupings, or counts
    - Data exploration and discovery

    Only available to company admins. Results are automatically
    scoped to the user's company. Start with list_collections
    or describe_collection to understand the data model.
  DESC

  def self.available?(user:, metadata:)
    user.company_admin? || user.can_administer?(:global)
  end

  private

  def tools
    [
      Bizy::Tools::MongodbFind.new(company: @user.company),
      Bizy::Tools::MongodbAggregate.new(company: @user.company),
      Bizy::Tools::MongodbCount.new(company: @user.company),
      Bizy::Tools::MongodbListCollections.new,
      Bizy::Tools::MongodbDescribeCollection.new(company: @user.company),
      Bizy::Tools::MongodbDocs.new,
    ]
  end

  def instructions
    <<~PROMPT
      You are a data query assistant. Help admins explore their company's data.

      Available collections: #{Bizy::MongodbCollectionRegistry.available_names.join(', ')}

      All queries are automatically scoped to the user's company.
      For user-scoped collections (app_notifications, llm_runs), you must specify user_id(s).

      Workflow:
      1. Use list_collections to see available collections
      2. Use describe_collection to understand fields
      3. Use mongodb_docs for detailed field documentation
      4. Use mongodb_find for simple queries, mongodb_aggregate for complex analysis
    PROMPT
  end
end

MongoDB Find Tool

# app/lib/bizy/tools/mongodb_find.rb
class Bizy::Tools::MongodbFind < RubyLLM::Tool
  description <<~DESC
    Query documents from a MongoDB collection. Results are automatically
    scoped to your company. Maximum 100 documents per query.

    Use this for simple lookups and filtered queries. For complex
    aggregations with grouping or calculations, use mongodb_aggregate instead.

    Example filters:
    - { "amount": { "$gt": 10 } } - bonuses over 10 points
    - { "created_at": { "$gte": "2025-01-01" } } - recent records
    - { "hashtags": "#teamwork" } - records with specific hashtag
  DESC

  param :collection, desc: "Collection name (e.g., 'bonuses', 'users')"
  param :filter, type: :object, desc: "MongoDB query filter", required: false
  param :projection, type: :object, desc: "Fields to include/exclude (e.g., { 'amount': 1, 'reason': 1 })", required: false
  param :sort, type: :object, desc: "Sort order (e.g., { 'created_at': -1 })", required: false
  param :limit, type: :integer, desc: "Max documents to return (default: 20, max: 100)", required: false

  def initialize(company:)
    @company = company
  end

  def execute(collection:, filter: {}, projection: nil, sort: nil, limit: 20)
    scoper = Bizy::MongodbQueryScoper.new(company: @company)
    scoped_filter = scoper.scope_find(collection: collection, filter: filter)

    executor = Bizy::MongodbQueryExecutor.new
    results = executor.find(
      collection: collection,
      filter: scoped_filter,
      projection: projection,
      sort: sort,
      limit: limit
    )

    Bizy::MongodbAuditLogger.log_query(
      user: @company.users.first, # Will be replaced with actual user
      company: @company,
      collection: collection,
      filter: filter
    )

    { collection: collection, count: results.length, documents: results }
  rescue Bizy::UnauthorizedCollectionError, Bizy::InvalidFilterError,
         Bizy::ForbiddenOperatorError => e
    { error: "Query not permitted" }
  end
end

### MongoDB Aggregate Tool

```ruby
# app/lib/bizy/tools/mongodb_aggregate.rb
class Bizy::Tools::MongodbAggregate < RubyLLM::Tool
  description <<~DESC
    Run an aggregation pipeline on a MongoDB collection. The first $match
    stage is automatically scoped to your company.

    Forbidden stages: $out, $merge, $lookup, $graphLookup, $unionWith, $documents

    Use for grouping, counting, and complex data analysis.

    Example pipeline:
    [
      { "$match": { "created_at": { "$gte": "2025-01-01" } } },
      { "$group": { "_id": "$giver_id", "total": { "$sum": "$amount" } } },
      { "$sort": { "total": -1 } },
      { "$limit": 10 }
    ]
  DESC

  param :collection, desc: "Collection name"
  param :pipeline, type: :array, desc: "MongoDB aggregation pipeline stages"

  def initialize(company:)
    @company = company
  end

  def execute(collection:, pipeline:)
    scoper = Bizy::MongodbQueryScoper.new(company: @company)
    scoped_pipeline = scoper.scope_aggregate(collection: collection, pipeline: pipeline)

    executor = Bizy::MongodbQueryExecutor.new
    results = executor.aggregate(collection: collection, pipeline: scoped_pipeline)

    { collection: collection, result_count: results.length, results: results }
  rescue Bizy::UnauthorizedCollectionError, Bizy::InvalidFilterError,
         Bizy::ForbiddenOperatorError
    { error: "Query not permitted" }
  end
end
```

### MongoDB Count Tool

```ruby
# app/lib/bizy/tools/mongodb_count.rb
class Bizy::Tools::MongodbCount < RubyLLM::Tool
  description <<~DESC
    Count documents in a MongoDB collection matching a filter.
    Results are automatically scoped to your company.

    Use this for quick counts without fetching documents.
  DESC

  param :collection, desc: "Collection name"
  param :filter, type: :object, desc: "MongoDB query filter", required: false

  def initialize(company:)
    @company = company
  end

  def execute(collection:, filter: {})
    scoper = Bizy::MongodbQueryScoper.new(company: @company)
    scoped_filter = scoper.scope_find(collection: collection, filter: filter)

    executor = Bizy::MongodbQueryExecutor.new
    count = executor.count(collection: collection, filter: scoped_filter)

    { collection: collection, count: count }
  rescue Bizy::UnauthorizedCollectionError, Bizy::InvalidFilterError,
         Bizy::ForbiddenOperatorError
    { error: "Query not permitted" }
  end
end
```

### Discovery Tools

```ruby
# app/lib/bizy/tools/mongodb_list_collections.rb
class Bizy::Tools::MongodbListCollections < RubyLLM::Tool
  description <<~DESC
    List available MongoDB collections with descriptions and scope types.
    Use this first to discover what data is available to query.
  DESC

  def execute
    collections = Bizy::MongodbCollectionRegistry.all_with_descriptions
    { collections: collections }
  end
end
```

```ruby
# app/lib/bizy/tools/mongodb_describe_collection.rb
class Bizy::Tools::MongodbDescribeCollection < RubyLLM::Tool
  description <<~DESC
    Get detailed information about a collection including sample fields
    and links to related documentation. Use this to understand the
    structure before writing queries.
  DESC

  param :collection, desc: "Collection name to describe"

  def initialize(company:)
    @company = company
  end

  def execute(collection:)
    config = Bizy::MongodbCollectionRegistry.config(collection)
    return { error: "Collection not available" } unless config

    # Get sample document to show field structure
    scoper = Bizy::MongodbQueryScoper.new(company: @company)
    executor = Bizy::MongodbQueryExecutor.new

    begin
      scoped_filter = scoper.scope_find(collection: collection, filter: {})
      sample = executor.find(collection: collection, filter: scoped_filter, limit: 1).first
      sample_fields = sample&.keys || []
    rescue Bizy::InvalidFilterError
      # User-scoped collection without user_id - can't get sample
      sample_fields = ["(requires user_id to sample)"]
    end

    {
      name: collection,
      description: config[:description],
      scope_type: config[:scope_type],
      scope_field: config[:scope_field],
      sample_fields: sample_fields,
      doc_paths: config[:docs] || []
    }
  end
end
```

```ruby
# app/lib/bizy/tools/mongodb_docs.rb
class Bizy::Tools::MongodbDocs < RubyLLM::Tool
  description <<~DESC
    Get detailed documentation about a MongoDB collection's fields,
    semantics, and common query patterns. Use this to understand
    what data means and how to query it effectively.
  DESC

  DOCS_PATH = Rails.root.join("doc/bizy_data")

  param :collection, desc: "Collection name to get documentation for"

  def execute(collection:)
    # Validate collection name to prevent path traversal
    unless collection.match?(/\A\w+\z/)
      return { error: "Invalid collection name" }
    end

    doc_path = DOCS_PATH.join("collections/#{collection}.md")

    # Ensure resolved path is within DOCS_PATH (prevent traversal)
    unless doc_path.to_s.start_with?(DOCS_PATH.to_s)
      return { error: "Invalid collection name" }
    end

    return { error: "No documentation available for #{collection}" } unless File.exist?(doc_path)

    { documentation: File.read(doc_path) }
  end
end
```

## 7. Testing Strategy

### Query Scoper Specs

```ruby
# spec/lib/bizy/mongodb_query_scoper_spec.rb
RSpec.describe Bizy::MongodbQueryScoper do
  let(:company) { create(:company) }
  let(:other_company) { create(:company) }
  let(:user1) { create(:user, company: company) }
  let(:user2) { create(:user, company: company) }
  let(:other_user) { create(:user, company: other_company) }

  subject { described_class.new(company: company) }

  describe "#scope_find" do
    context "company-scoped collections" do
      it "injects company_id into filter" do
        result = subject.scope_find(collection: "bonuses", filter: { "amount" => 10 })
        expect(result).to eq({ "amount" => 10, "company_id" => company.id })
      end

      it "overwrites any user-provided company_id" do
        result = subject.scope_find(
          collection: "bonuses",
          filter: { "company_id" => other_company.id }
        )
        expect(result["company_id"]).to eq(company.id)
      end

      it "sanitizes nested company_id in $or clauses" do
        result = subject.scope_find(
          collection: "bonuses",
          filter: {
            "$or" => [
              { "company_id" => other_company.id, "amount" => 10 },
              { "amount" => 20 }
            ]
          }
        )
        expect(result["company_id"]).to eq(company.id)
        expect(result["$or"].first).not_to have_key("company_id")
      end
    end

    context "user-scoped collections" do
      it "allows querying with valid user_id" do
        result = subject.scope_find(
          collection: "app_notifications",
          filter: { "user_id" => user1.id, "seen_at" => nil }
        )
        expect(result["user_id"]).to eq(user1.id)
      end

      it "allows querying with array of valid user_ids" do
        result = subject.scope_find(
          collection: "app_notifications",
          filter: { "user_id" => { "$in" => [user1.id, user2.id] } }
        )
        expect(result["user_id"]["$in"]).to contain_exactly(user1.id, user2.id)
      end

      it "rejects user_id from another company" do
        expect {
          subject.scope_find(
            collection: "app_notifications",
            filter: { "user_id" => other_user.id }
          )
        }.to raise_error(Bizy::UnauthorizedUserError)
      end

      it "rejects if any user_id in array is from another company" do
        expect {
          subject.scope_find(
            collection: "app_notifications",
            filter: { "user_id" => { "$in" => [user1.id, other_user.id] } }
          )
        }.to raise_error(Bizy::UnauthorizedUserError)
      end

      it "requires user_id for user-scoped collections" do
        expect {
          subject.scope_find(collection: "app_notifications", filter: {})
        }.to raise_error(Bizy::InvalidFilterError, /user_id is required/)
      end
    end

    it "rejects non-allowlisted collections" do
      expect {
        subject.scope_find(collection: "secret_data", filter: {})
      }.to raise_error(Bizy::UnauthorizedCollectionError)
    end
  end
end
```

### Security Validator Specs

```ruby
# spec/lib/bizy/mongodb_security_validator_spec.rb
RSpec.describe Bizy::MongodbSecurityValidator do
  describe ".validate_filter!" do
    it "blocks $where operator" do
      expect {
        described_class.validate_filter!({ "$where" => "this.x == 1" })
      }.to raise_error(Bizy::ForbiddenOperatorError, /\$where/)
    end

    it "blocks $function operator" do
      expect {
        described_class.validate_filter!({ "$function" => { body: "..." } })
      }.to raise_error(Bizy::ForbiddenOperatorError, /\$function/)
    end

    it "allows $expr operator" do
      expect {
        described_class.validate_filter!({ "$expr" => { "$eq" => ["$a", "$b"] } })
      }.not_to raise_error
    end

    it "blocks overly long regex patterns" do
      long_regex = "a" * 200
      expect {
        described_class.validate_filter!({ "field" => { "$regex" => long_regex } })
      }.to raise_error(Bizy::InvalidFilterError, /Regex pattern too long/)
    end

    it "blocks field names with null bytes" do
      expect {
        described_class.validate_filter!({ "field\x00" => "value" })
      }.to raise_error(Bizy::InvalidFilterError, /Invalid field name/)
    end

    it "validates nested filters recursively" do
      expect {
        described_class.validate_filter!({
          "$or" => [{ "$where" => "1==1" }]
        })
      }.to raise_error(Bizy::ForbiddenOperatorError, /\$where/)
    end
  end
end
```

### Pipeline Validator Specs

```ruby
# spec/lib/bizy/mongodb_pipeline_validator_spec.rb
RSpec.describe Bizy::MongodbPipelineValidator do
  describe ".validate!" do
    it "blocks $out stage" do
      pipeline = [{ "$match" => {} }, { "$out" => "output_collection" }]
      expect { described_class.validate!(pipeline) }
        .to raise_error(Bizy::ForbiddenOperatorError, /\$out/)
    end

    it "blocks $merge stage" do
      pipeline = [{ "$match" => {} }, { "$merge" => { into: "target" } }]
      expect { described_class.validate!(pipeline) }
        .to raise_error(Bizy::ForbiddenOperatorError, /\$merge/)
    end

    it "blocks $lookup stage" do
      pipeline = [{ "$lookup" => { from: "other", as: "joined" } }]
      expect { described_class.validate!(pipeline) }
        .to raise_error(Bizy::ForbiddenOperatorError, /\$lookup/)
    end

    it "blocks $unionWith stage" do
      pipeline = [{ "$match" => {} }, { "$unionWith" => { coll: "other" } }]
      expect { described_class.validate!(pipeline) }
        .to raise_error(Bizy::ForbiddenOperatorError, /\$unionWith/)
    end

    it "blocks forbidden stages nested in $facet" do
      pipeline = [
        { "$match" => {} },
        { "$facet" => {
          "good" => [{ "$match" => { "x" => 1 } }],
          "bad" => [{ "$unionWith" => { coll: "other" } }]
        }}
      ]
      expect { described_class.validate!(pipeline) }
        .to raise_error(Bizy::ForbiddenOperatorError, /\$unionWith/)
    end

    it "allows safe stages like $match, $group, $project" do
      pipeline = [
        { "$match" => { status: "active" } },
        { "$group" => { _id: "$type", count: { "$sum" => 1 } } },
        { "$project" => { type: "$_id", count: 1 } }
      ]
      expect { described_class.validate!(pipeline) }.not_to raise_error
    end
  end
end
```

### Read-Only Enforcement Specs

```ruby
# spec/lib/bizy/mongodb_query_executor_spec.rb
RSpec.describe Bizy::MongodbQueryExecutor do
  describe "read-only enforcement" do
    subject { described_class.new }

    it "does not expose insert methods" do
      expect(subject).not_to respond_to(:insert)
      expect(subject).not_to respond_to(:insert_one)
      expect(subject).not_to respond_to(:insert_many)
    end

    it "does not expose update methods" do
      expect(subject).not_to respond_to(:update)
      expect(subject).not_to respond_to(:update_one)
      expect(subject).not_to respond_to(:update_many)
    end

    it "does not expose delete methods" do
      expect(subject).not_to respond_to(:delete)
      expect(subject).not_to respond_to(:delete_one)
      expect(subject).not_to respond_to(:delete_many)
    end

    it "uses the bizy_readonly MongoDB client" do
      client = subject.send(:client)
      expect(client).to eq(Mongoid.client(:bizy_readonly))
    end

    it "enforces secondary-only read mode" do
      client = subject.send(:client)
      read_preference = client.options[:read]
      expect(read_preference[:mode]).to eq(:secondary)
    end
  end
end
```

### Agent Availability Specs

```ruby
# spec/lib/bizy/agents/data_spec.rb
RSpec.describe Bizy::Agents::Data do
  describe ".available?" do
    let(:company) { create(:company) }

    context "with non-admin user" do
      let(:user) { create(:user, company: company, company_admin: false) }

      it "returns false" do
        expect(described_class.available?(user: user, metadata: {})).to be false
      end
    end

    context "with company admin" do
      let(:user) { create(:user, company: company, company_admin: true) }

      it "returns true" do
        expect(described_class.available?(user: user, metadata: {})).to be true
      end
    end

    context "with global admin" do
      let(:user) { create(:user, company: company, company_admin: false) }

      before do
        allow(user).to receive(:can_administer?).with(:global).and_return(true)
      end

      it "returns true" do
        expect(described_class.available?(user: user, metadata: {})).to be true
      end
    end
  end
end
```

## 8. Documentation Strategy

### Overview

The DataAgent's effectiveness depends on the LLM having enough context to formulate meaningful queries. Collection and field names alone don't explain what the data means or how it's calculated. This section describes the full documentation infrastructure.

### Curated Documentation Folder

Create a dedicated folder for LLM-consumable documentation, separate from engineering docs:

```
doc/
├── internal/              # Engineering docs (NOT exposed to LLM)
│   └── security/
│       └── how_scoping_works.md
│
└── bizy_data/             # Curated for LLM consumption
    ├── collections/
    │   ├── bonuses.md
    │   ├── users.md
    │   ├── check_ins.md
    │   ├── monthly_user_statistics.md
    │   ├── app_notifications.md
    │   ├── llm_runs.md
    │   └── user_content_dismissals.md
    └── examples/
        └── common_queries.md
```

### Security Considerations for Documentation

Exposing internal docs to the LLM introduces risks:

| Risk | Example | Mitigation |
| --- | --- | --- |
| Implementation detail leakage | Class names, file paths, dependencies | Curated docs only, not the entire `/doc` folder |
| Security mechanism exposure | "Filter by company_id at query layer" | Separate user-facing docs from engineering docs |
| Accidental secrets | API keys, hostnames in examples | Review process + automated scanning |
| Collection enumeration | References to non-allowlisted collections | Docs only reference allowlisted collections |

### Doc Frontmatter (Tracking Metadata)

Each collection doc includes YAML frontmatter for automated freshness tracking:

```markdown
---
collection: bonuses
documented_fields: [giver_id, receiver_ids, amount, reason, created_at, hashtags]
registry_description: "Recognition given between employees"
last_generated: 2025-02-01T14:30:00Z
---

# Bonuses Collection
...
```

The frontmatter enables the freshness checker to detect when:

- New fields appear in the collection that aren't documented
- Documented fields have been removed from the collection
- The registry description has changed

### Documentation Content Guidelines

Curated docs should:

1. **Be written for an admin audience** - not engineering implementation details
2. **Focus on data meaning** - what does this field represent? What questions can it answer?
3. **Include example queries** - show common use cases with actual query patterns
4. **Reference only allowlisted collections** - don't mention internal collections
5. **Avoid implementation details** - no class names, file paths, or internal architecture

### Example Collection Documentation

````markdown
---
collection: bonuses
documented_fields: [_id, giver_id, receiver_ids, amount, reason, created_at, hashtags, via]
registry_description: "Recognition given between employees"
last_generated: 2025-02-01T14:30:00Z
---

# Bonuses Collection

Recognition events where employees appreciate each other's contributions.

## Key Fields

- `giver_id` - User who gave the recognition (ObjectId)
- `receiver_ids` - Users who received the recognition (array of ObjectIds)
- `amount` - Points awarded (integer)
- `reason` - Message explaining the recognition (string)
- `created_at` - When the recognition was given (datetime)
- `hashtags` - Company values tagged (array of strings, e.g., "#teamwork")
- `via` - Source of the recognition (e.g., "web", "slack", "ms_teams")

## Common Query Patterns

**Recognition given by a specific user:**

```json
{ "giver_id": "<user_id>" }
```

**Recognition in a date range:**

```json
{ "created_at": { "$gte": "2025-01-01", "$lte": "2025-01-31" } }
```

**Recognition mentioning a specific value:**

```json
{ "hashtags": "#teamwork" }
```

**Top givers aggregation:**

```json
[
  { "$match": { "created_at": { "$gte": "2025-01-01" } } },
  { "$group": { "_id": "$giver_id", "total_given": { "$sum": "$amount" } } },
  { "$sort": { "total_given": -1 } },
  { "$limit": 10 }
]
```
````

### Review Process for Documentation

Before any doc is added to `doc/bizy_data/`:

1. **Security review** - No internal paths, class names, dependencies, or security mechanisms
2. **No secrets** - Scan for API keys, tokens, hostnames, real user/company data
3. **Allowlisted references only** - Only mention collections in the registry
4. **User-facing language** - Written for admins, not engineers
5. **PR approval required** - Standard review process
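
The secret scan in step 2 can be partially automated. The helper below is illustrative only (it is not part of this plan's file list), and the patterns are assumptions about what "secret-like" means here; it would run over `doc/bizy_data/` before review:

```ruby
# Hypothetical pre-review scan for secret-like strings in curated docs.
SUSPECT_PATTERNS = {
  aws_access_key: /AKIA[0-9A-Z]{16}/,          # AWS access key IDs
  bearer_token:   /Bearer\s+[A-Za-z0-9._\-]{20,}/,
  mongodb_uri:    %r{mongodb(\+srv)?://\S+},   # connection strings (may embed creds)
  internal_host:  /\b[\w-]+\.internal\b/       # internal hostnames
}.freeze

# Returns [{ line:, kind: }] for every suspicious line in the given text.
def suspect_lines(text)
  text.each_line.with_index(1).flat_map do |line, number|
    SUSPECT_PATTERNS.filter_map do |kind, pattern|
      { line: number, kind: kind } if line.match?(pattern)
    end
  end
end
```

A rake task or CI step could run this over every file under `doc/bizy_data/` and fail when any hits are found, leaving the human review to focus on semantics.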

### Documentation Freshness Automation

Keep docs in sync with collection schemas using **schema-based detection** (not source file tracking). The process runs locally first, with CI as a backup safety net.

#### Detection Strategy

Only regenerate docs when the **actual data structure** changes:

| Trigger | Detected By | Frequency |
| -------------------------------- | ----------------------- | --------- |
| New field added to collection | Sample doc comparison | Rare |
| Field removed from collection | Sample doc comparison | Rare |
| Registry description changed | Registry vs frontmatter | Rare |
| New collection added to registry | Missing doc file | Rare |
| Source code refactoring | **Not tracked** | N/A |

#### Doc Parser

Parses YAML frontmatter from documentation files:

```ruby
# lib/bizy/mongodb_doc_parser.rb
module Bizy
  class MongodbDocParser
    Result = Struct.new(:metadata, :content, keyword_init: true)

    def self.parse(path)
      new.parse(path)
    end

    def parse(path)
      raw = File.read(path)

      if raw.start_with?("---")
        parts = raw.split("---", 3)
        metadata = YAML.safe_load(parts[1], symbolize_names: true) || {}
        content = parts[2]&.strip || ""
      else
        metadata = {}
        content = raw
      end

      Result.new(metadata: metadata, content: content)
    end
  end
end
```

#### Freshness Checker

Detects stale documentation by comparing frontmatter against actual collection state:

```ruby
# lib/bizy/mongodb_doc_freshness_checker.rb
module Bizy
  class MongodbDocFreshnessChecker
    DOCS_PATH = Rails.root.join("doc/bizy_data/collections")

    def stale_docs
      stale = []

      MongodbCollectionRegistry::COLLECTIONS.each do |name, config|
        doc_path = DOCS_PATH.join("#{name}.md")

        unless File.exist?(doc_path)
          stale << { collection: name, reason: { missing: true } }
          next
        end

        doc = MongodbDocParser.parse(doc_path)
        sample = fetch_sample_document(name)

        current_fields = sample.keys.sort
        documented_fields = doc.metadata[:documented_fields]&.sort || []

        added = current_fields - documented_fields
        removed = documented_fields - current_fields
        description_changed = config[:description] != doc.metadata[:registry_description]

        if added.any? || removed.any? || description_changed
          stale << {
            collection: name,
            reason: {
              added_fields: added,
              removed_fields: removed,
              description_changed: description_changed
            }
          }
        end
      end

      stale
    end

    private

    def fetch_sample_document(name)
      Mongoid.client(:bizy_readonly)[name].find({}).limit(1).first || {}
    end
  end
end
```

#### LLM-Assisted Doc Generator

Generates or regenerates documentation using an LLM, with PII-scrubbed sample data:

```ruby
# lib/bizy/mongodb_doc_generator.rb
module Bizy
  class MongodbDocGenerator
    DOCS_PATH = Rails.root.join("doc/bizy_data/collections")

    def generate(collection:, config:)
      sample = fetch_sample_document(collection)

      content = generate_with_llm(
        collection: collection,
        description: config[:description],
        fields: sample.keys,
        sample: sanitize_sample(sample)
      )

      frontmatter = {
        collection: collection,
        documented_fields: sample.keys.sort,
        registry_description: config[:description],
        last_generated: Time.current.iso8601
      }

      write_doc(collection, frontmatter, content)
    end

    private

    def fetch_sample_document(collection)
      Mongoid.client(:bizy_readonly)[collection].find({}).limit(1).first || {}
    end

    def generate_with_llm(collection:, description:, fields:, sample:)
      prompt = <<~PROMPT
        Generate user-facing documentation for a MongoDB collection.

        Collection: #{collection}
        Description: #{description}
        Fields: #{fields.join(', ')}
        Sample document (sanitized): #{sample.to_json}

        Guidelines:
        - Write for admin users, not engineers
        - Explain what each field means semantically
        - Include 3-5 common query patterns with example filters
        - Do not mention internal class names, file paths, or implementation details
        - Use markdown format with ## headers for sections

        Required sections:
        1. Brief description (1-2 sentences)
        2. Key Fields (list with descriptions)
        3. Common Query Patterns (with example filters)
      PROMPT

      Bizy::LlmClient.complete(prompt)
    end

    def sanitize_sample(sample)
      # Remove PII, replace IDs with placeholders
      sample.transform_values do |v|
        case v
        when BSON::ObjectId then "<ObjectId>"
        when String then v.truncate(50)
        when Array then v.first(3)
        else v
        end
      end
    end

    def write_doc(collection, frontmatter, content)
      path = DOCS_PATH.join("#{collection}.md")
      FileUtils.mkdir_p(DOCS_PATH)

      File.write(path, <<~DOC)
        ---
        #{frontmatter.to_yaml.lines[1..].join}---

        #{content}
      DOC
    end
  end
end
```

#### Rake Tasks

Local workflow commands for managing documentation:

```ruby
# lib/tasks/bizy_docs.rake
namespace :bizy do
  desc "Check for stale MongoDB documentation"
  task check_docs: :environment do
    checker = Bizy::MongodbDocFreshnessChecker.new
    stale = checker.stale_docs

    if stale.empty?
      puts "All Bizy data docs are up to date."
    else
      puts "Stale documentation detected:"
      stale.each do |s|
        reasons = []
        if s[:reason].is_a?(Hash)
          reasons << "new fields: #{s[:reason][:added_fields].join(', ')}" if s[:reason][:added_fields]&.any?
          reasons << "removed fields: #{s[:reason][:removed_fields].join(', ')}" if s[:reason][:removed_fields]&.any?
          reasons << "description changed" if s[:reason][:description_changed]
          reasons << "doc missing" if s[:reason][:missing]
        else
          reasons << s[:reason].to_s
        end
        puts "  #{s[:collection]}: #{reasons.join(', ')}"
      end
      puts "\nRun 'rake bizy:generate_docs' to update."
      exit 1
    end
  end

  desc "Generate/update documentation for collections"
  task :generate_docs, [:collection] => :environment do |_, args|
    generator = Bizy::MongodbDocGenerator.new

    collections = if args[:collection]
      config = Bizy::MongodbCollectionRegistry.config(args[:collection])
      abort "Unknown collection: #{args[:collection]}" if config.nil?
      { args[:collection] => config }
    else
      Bizy::MongodbCollectionRegistry::COLLECTIONS
    end

    collections.each do |name, config|
      puts "Generating docs for #{name}..."
      generator.generate(collection: name, config: config)
    end

    puts "Done. Review changes and commit."
  end
end
```

#### Local Workflow

```bash
# Check which docs are stale
rake bizy:check_docs

# Regenerate all stale docs using LLM
rake bizy:generate_docs

# Regenerate a specific collection
rake bizy:generate_docs[bonuses]
```

#### CI Safety Net

A spec that fails CI if docs are stale, ensuring developers don't forget to update docs when collection schemas change:

```ruby
# spec/lib/bizy/mongodb_documentation_freshness_spec.rb
RSpec.describe "Bizy Data Documentation Freshness" do
  it "has up-to-date documentation for all collections" do
    checker = Bizy::MongodbDocFreshnessChecker.new
    stale = checker.stale_docs

    if stale.any?
      message = stale.map do |s|
        if s[:reason].is_a?(Hash) && s[:reason][:missing]
          "#{s[:collection]}: missing doc file"
        elsif s[:reason].is_a?(Hash)
          reasons = []
          reasons << "new fields: #{s[:reason][:added_fields]}" if s[:reason][:added_fields]&.any?
          reasons << "removed fields: #{s[:reason][:removed_fields]}" if s[:reason][:removed_fields]&.any?
          reasons << "description changed" if s[:reason][:description_changed]
          "#{s[:collection]}: #{reasons.join(', ')}"
        else
          "#{s[:collection]}: #{s[:reason]}"
        end
      end.join("\n")

      fail "Stale Bizy data docs detected:\n#{message}\n\nRun locally: rake bizy:generate_docs"
    end
  end
end
```

#### Workflow Summary

```mermaid
flowchart TD
    A[Developer adds/changes collection] --> B[Run rake bizy:check_docs locally]
    B --> C{Stale docs?}
    C -->|No| D[Continue with PR]
    C -->|Yes| E[Run rake bizy:generate_docs]
    E --> F[LLM generates updated docs]
    F --> G[Developer reviews & edits]
    G --> H[Commit changes]
    H --> D
    D --> I[CI runs freshness spec]
    I --> J{Passes?}
    J -->|Yes| K[PR mergeable]
    J -->|No| L[CI fails with instructions]
    L --> B
```

## 9. Alternative Approaches Considered

### Template-Based Queries (Rejected)

Pre-define parameterized query templates that the LLM fills in:

```ruby
TEMPLATES = {
  "recognition_by_user" => {
    collection: "bonuses",
    filter: ->(params) { { receiver_ids: params[:user_id] } }
  }
}
```

- **Pros:** Maximum security, no query parsing
- **Cons:** Very limited flexibility; new use cases require code changes

Rejected because it defeats the purpose of flexible queries and requires constant maintenance.

### MongoDB Views (Rejected)

Create company-filtered views in MongoDB itself.

- **Pros:** Database-level security
- **Cons:** Per-company views are impractical to create and manage, and views are static

### Full MCP Server (Original Approach, Superseded)

The original plan used MCP (Model Context Protocol) with JSON-RPC endpoints.

- **Pros:** Standard protocol, credential separation
- **Cons:** Extra network hop, more infrastructure, doesn't align with the new Bizy framework

Superseded by RubyLLM native tools, which provide the same security with less complexity.


## 10. Implementation Order

### Phase 1: Infrastructure

1. **Create read-only MongoDB user** (ops/infra)
   - Create `bizy_readonly` user with the `read` role only
   - Store credentials in an environment variable
2. **Configure Mongoid client**
   - Add a `bizy_readonly` client to `config/mongoid.yml`
   - Configure secondary-only reads
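
A possible shape for that client entry (the URI env var name and the surrounding config are placeholders, not the real configuration):

```yaml
# config/mongoid.yml (excerpt) -- hypothetical values
production:
  clients:
    # ... existing default client ...
    bizy_readonly:
      # Credentials for the read-only user created in step 1
      uri: <%= ENV["BIZY_READONLY_MONGODB_URI"] %>
      options:
        # Never route Bizy queries to the primary
        read:
          mode: :secondary
```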

### Phase 2: Security Layer

1. **MongodbQueryExecutor + spec**
   - Read-only client; only find/aggregate/count methods
   - Spec: read-only enforcement, secondary-only mode
2. **MongodbCollectionRegistry + spec**
   - Collection allowlist with scope types and doc references
   - Spec: allowed/disallowed collections
3. **MongodbSecurityValidator + spec**
   - Forbidden operators, recursive validation, field validation
   - Spec: all attack vectors ($where, $function, regex, field names)
4. **MongodbPipelineValidator + spec**
   - Forbidden stages, recursive $facet validation
   - Spec: $out, $merge, $lookup, $unionWith, nested stages
5. **MongodbQueryScoper + spec**
   - Company/user filter injection using validators
   - Spec: company scoping, user scoping, nested injection protection
6. **Error classes and audit logger**
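
The error classes in step 6 could be as simple as a shared hierarchy; the `QueryError` base class below is an assumption, but the leaf names match the ones raised and rescued throughout this plan:

```ruby
# lib/bizy/errors.rb -- sketch; the QueryError base class is an assumption
module Bizy
  # Common ancestor so tools can rescue all query-security failures at once
  class QueryError < StandardError; end

  # Collection is not in the MongodbCollectionRegistry allowlist
  class UnauthorizedCollectionError < QueryError; end

  # Malformed filter (bad field names, missing required user_id, ...)
  class InvalidFilterError < QueryError; end

  # Blocked operator or stage ($where, $function, $out, $merge, ...)
  class ForbiddenOperatorError < QueryError; end

  # user_id in a user-scoped filter belongs to a different company
  class UnauthorizedUserError < QueryError; end
end
```

A shared base class keeps the tools' `rescue` lists short without catching unrelated `StandardError`s.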

### Phase 3: RubyLLM Tools

1. **MongodbFind tool + spec**
   - Uses scoper and executor
   - Spec: scoped results, error handling
2. **MongodbAggregate tool + spec**
   - Pipeline validation
   - Spec: pipeline validation, scoped aggregation
3. **MongodbCount tool + spec**
   - Scoped count queries
4. **Discovery tools + specs**
   - MongodbListCollections
   - MongodbDescribeCollection
   - MongodbDocs (with path traversal protection)

### Phase 4: Agent Integration

1. **Bizy::Agents::Data + spec**
   - Admin availability check
   - Tool registration
   - Instructions
2. **Register in Bizy::Chat**
   - Add to SUB_AGENTS array
3. **Integration tests**
   - End-to-end query execution
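
A minimal sketch of the availability gate, matching the behavior the agent specs pin down (tool registration, instructions, and the framework base class are omitted; the `company_admin?` predicate is assumed from the `company_admin` attribute used in the specs):

```ruby
module Bizy
  module Agents
    # Data exploration agent; only admins may use it.
    class Data
      # Company admins and global admins get access; everyone else is refused.
      def self.available?(user:, metadata: {})
        user.company_admin? || user.can_administer?(:global)
      end
    end
  end
end
```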

### Phase 5: Documentation

1. **Create MongodbDocParser + spec**
   - Parse YAML frontmatter from doc files
   - Spec: parsing with/without frontmatter
2. **Create MongodbDocFreshnessChecker + spec**
   - Schema-based staleness detection using frontmatter
   - Spec: missing docs, new fields, removed fields, description changes
3. **Create MongodbDocGenerator + spec**
   - LLM-assisted documentation generation with PII scrubbing
   - Spec: sample sanitization, frontmatter generation
4. **Create rake tasks (`bizy:check_docs`, `bizy:generate_docs`)**
   - Local workflow for checking and regenerating docs
   - Support single-collection targeting
5. **Generate initial curated docs in `doc/bizy_data/collections/`**
   - One file per allowlisted collection, with frontmatter
   - Example queries per collection
6. **Create common query examples in `doc/bizy_data/examples/`**
   - Cross-collection query patterns
7. **Create CI freshness spec as a safety net**
   - Fails CI with actionable instructions if docs are stale

## 11. Files to Create

### Security Infrastructure

| File | Purpose |
| --- | --- |
| `lib/bizy/mongodb_query_executor.rb` | Read-only query execution |
| `lib/bizy/mongodb_collection_registry.rb` | Collection allowlist |
| `lib/bizy/mongodb_security_validator.rb` | Filter validation |
| `lib/bizy/mongodb_pipeline_validator.rb` | Pipeline validation |
| `lib/bizy/mongodb_query_scoper.rb` | Company/user scoping |
| `lib/bizy/mongodb_audit_logger.rb` | Query audit logging |
| `lib/bizy/errors.rb` | Error classes |

### RubyLLM Tools

| File | Purpose |
| --- | --- |
| `app/lib/bizy/agents/data.rb` | Data query agent |
| `app/lib/bizy/tools/mongodb_find.rb` | Find tool |
| `app/lib/bizy/tools/mongodb_aggregate.rb` | Aggregate tool |
| `app/lib/bizy/tools/mongodb_count.rb` | Count tool |
| `app/lib/bizy/tools/mongodb_list_collections.rb` | Discovery tool |
| `app/lib/bizy/tools/mongodb_describe_collection.rb` | Discovery tool |
| `app/lib/bizy/tools/mongodb_docs.rb` | Documentation tool |

### Documentation Infrastructure

| File | Purpose |
| --- | --- |
| `lib/bizy/mongodb_doc_parser.rb` | Parse frontmatter from doc files |
| `lib/bizy/mongodb_doc_freshness_checker.rb` | Schema-based staleness detection |
| `lib/bizy/mongodb_doc_generator.rb` | LLM-assisted doc generation |
| `lib/tasks/bizy_docs.rake` | Rake tasks for doc management |

### Configuration

| File | Purpose |
| --- | --- |
| `config/mongoid.yml` | Add `bizy_readonly` client |

### Curated Documentation

| File | Purpose |
| --- | --- |
| `doc/bizy_data/collections/bonuses.md` | Bonuses documentation |
| `doc/bizy_data/collections/users.md` | Users documentation |
| `doc/bizy_data/collections/check_ins.md` | Check-ins documentation |
| `doc/bizy_data/collections/monthly_user_statistics.md` | Monthly metrics documentation |
| `doc/bizy_data/collections/app_notifications.md` | Notifications documentation |
| `doc/bizy_data/collections/llm_runs.md` | LLM runs documentation |
| `doc/bizy_data/collections/user_content_dismissals.md` | Dismissals documentation |
| `doc/bizy_data/examples/common_queries.md` | Cross-collection query examples |

### Specs

| File | Purpose |
| --- | --- |
| `spec/lib/bizy/mongodb_query_executor_spec.rb` | Read-only tests |
| `spec/lib/bizy/mongodb_collection_registry_spec.rb` | Allowlist tests |
| `spec/lib/bizy/mongodb_security_validator_spec.rb` | Attack vector tests |
| `spec/lib/bizy/mongodb_pipeline_validator_spec.rb` | Pipeline tests |
| `spec/lib/bizy/mongodb_query_scoper_spec.rb` | Scoping tests |
| `spec/lib/bizy/mongodb_doc_parser_spec.rb` | Doc parser tests |
| `spec/lib/bizy/mongodb_doc_freshness_checker_spec.rb` | Freshness detection tests |
| `spec/lib/bizy/mongodb_doc_generator_spec.rb` | Doc generator tests |
| `spec/lib/bizy/mongodb_documentation_freshness_spec.rb` | CI safety net spec |
| `spec/lib/bizy/agents/data_spec.rb` | Agent availability tests |
| `spec/lib/bizy/tools/mongodb_find_spec.rb` | Tool integration tests |
| `spec/lib/bizy/tools/mongodb_aggregate_spec.rb` | Tool integration tests |