@huytd
Last active February 5, 2026 03:35
Running Qwen3-Coder-Next (Q3_K_M) on 16 GB RTX 5060 Ti

Hardware:

  • AMD Ryzen 7 7700X
  • 32 GB DDR5-6000 RAM
  • RTX 5060 Ti 16 GB

Model: unsloth/Qwen3-Coder-Next-GGUF Q3_K_M

Llama.cpp version: llama.cpp@b7940

The command:

llama-server -m ./Qwen3-Coder-Next-Q3_K_M.gguf -c 32768 -np 1 -t 8 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --jinja --fit on -fa 1

Benchmark:

| Metric                | Task 2 (Large Context) | Task 190 (Med Context) | Task 327 (Small Context) |
|-----------------------|------------------------|------------------------|--------------------------|
| Prompt Eval (Prefill) | 154.08 t/s             | 225.14 t/s             | 118.98 t/s               |
| Generation (Decode)   | 16.90 t/s              | 16.82 t/s              | 18.46 t/s                |

Some screenshots:

╭─── Claude Code v2.1.20 ──────────────────────────────────────────────────────╮
│ │ Tips for getting started │
│ Welcome back! │ Run /init to create a CLAUDE.md file w… │
│ │ ─────────────────────────────────────── │
│ ▐▛███▜▌ │ Recent activity │
│ ▝▜█████▛▘ │ No recent activity │
│ ▘▘ ▝▝ │ │
│ │ │
│ Sonnet 4.5 · API Usage Billing │ │
│ ~/code/clichess │ │
╰──────────────────────────────────────────────────────────────────────────────╯
/model to try Opus 4.5
╭──────────────────────────────────────────────────────────────────────────────╮
│ Plan to implement │
│ │
│ Chess API & Frontend Implementation Plan │
│ │
│ Project Structure │
│ │
│ clichess/ │
│ ├── backend/ │
│ │ ├── main.py # FastAPI application │
│ │ ├── api/ │
│ │ │ ├── __init__.py │
│ │ │ ├── routes.py # API endpoints │
│ │ │ └── schemas.py # Pydantic models │
│ │ ├── game/ │
│ │ │ ├── __init__.py │
│ │ │ └── engine.py # AI engine (refactored) │
│ │ └── requirements.txt │
│ │ │
│ ├── frontend/ │
│ │ ├── index.html │
│ │ ├── style/ │
│ │ │ └── main.css │
│ │ └── scripts/ │
│ │ ├── api.js │
│ │ ├── board.js │
│ │ └── game.js │
│ │ │
│ ├── main.py # Original CLI (backup) │
│ └── README.md │
│ │
│ Backend (FastAPI) │
│ │
│ Dependencies │
│ │
│ - fastapi - Web framework │
│ - uvicorn[standard] - ASGI server │
│ - python-chess - Board state & AI │
│ - websockets - For WebSocket support │
│ │
│ API Endpoints │
│ ┌───────────┬────────┬─────────────────────────────────┐ │
│ │ Endpoint │ Method │ Description │ │
│ ├───────────┼────────┼─────────────────────────────────┤ │
│ │ /health │ GET │ Health check │ │
│ ├───────────┼────────┼─────────────────────────────────┤ │
│ │ /status │ GET │ Current game state │ │
│ ├───────────┼────────┼─────────────────────────────────┤ │
│ │ /move │ POST │ Submit user move │ │
│ ├───────────┼────────┼─────────────────────────────────┤ │
│ │ /new-game │ POST │ Start new game │ │
│ ├───────────┼────────┼─────────────────────────────────┤ │
│ │ /ai-move │ POST │ Trigger AI move │ │
│ ├───────────┼────────┼─────────────────────────────────┤ │
│ │ /ws │ WS │ WebSocket for real-time updates │ │
│ └───────────┴────────┴─────────────────────────────────┘ │
│ Response Schema │
│ │
│ { │
│ "fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1", │
│ "turn": "white", │
│ "history": ["e2e4", "e7e5"], │
│ "game_over": false, │
│ "result": null, │
│ "check": false, │
│ "legal_moves": ["e2e4", ...] │
│ } │
│ │
│ Frontend (HTML/JS) │
│ │
│ Libraries │
│ │
│ - chess.js v1.0.0 - Board state & move validation │
│ - chessboard.js v1.0.0 - Drag-and-drop UI │
│ - jQuery (dependency of chessboard.js) │
│ │
│ Files │
│ ┌───────────────────────────┬─────────────────────────────────────┐ │
│ │ File │ Purpose │ │
│ ├───────────────────────────┼─────────────────────────────────────┤ │
│ │ frontend/index.html │ Page structure with board container │ │
│ ├───────────────────────────┼─────────────────────────────────────┤ │
│ │ frontend/style/main.css │ Responsive board styling │ │
│ ├───────────────────────────┼─────────────────────────────────────┤ │
│ │ frontend/scripts/api.js │ API client (fetch/WS) │ │
│ ├───────────────────────────┼─────────────────────────────────────┤ │
│ │ frontend/scripts/board.js │ Chessboard UI initialization │ │
│ ├───────────────────────────┼─────────────────────────────────────┤ │
│ │ frontend/scripts/game.js │ Game state management │ │
│ └───────────────────────────┴─────────────────────────────────────┘ │
│ Key Features │
│ │
│ - Interactive drag-and-drop board │
│ - Real-time updates via WebSocket │
│ - Move history log │
│ - Game over state display │
│ - Visual indicators for check/checkmate │
│ │
│ Implementation Steps │
│ │
│ 1. Create backend directory structure │
│ 2. Extract AI logic to backend/game/engine.py │
│ 3. Implement FastAPI endpoints with CORS │
│ 4. Set up WebSocket for real-time updates │
│ 5. Create frontend HTML structure │
│ 6. Integrate chess.js and chessboard.js │
│ 7. Connect frontend to backend API │
│ 8. Test end-to-end game flow │
│ │
│ Verification │
│ │
│ 1. Start backend: cd backend && uvicorn main:app --reload │
│ 2. Start frontend: Serve frontend/ directory │
│ 3. Test new game creation │
│ 4. Test user move submission │
│ 5. Verify AI response via WebSocket │
│ 6. Test game over conditions │
╰──────────────────────────────────────────────────────────────────────────────╯
● I'll implement this Chess API and Frontend plan. Let me start by exploring the
existing codebase structure and understanding what's already in place.
● Explore(Explore codebase structure)
⎿  Done (5 tool uses · 13.1k tokens · 2m 33s)
● Now let me read the existing main.py to understand the game logic:
● Read 1 file (ctrl+o to expand)
● Now I have a clear picture. Let me create the task list and start implementing
the plan:
● Let me start implementing the tasks. I'll work on multiple files in parallel:
● Let me try creating directories and files directly:
● Bash(mkdir -p /home/king/code/clichess/backend/api
/home/king/code/clichess/backend/game
/home/king/code/clichess/frontend/style
/home/king/code/clichess/frontend/sc…)
⎿  (No content)
● Write(backend/api/__init__.py)
⎿  Wrote 1 lines to backend/api/__init__.py
1 # API package
⎿  Error writing file
● Write(backend/game/__init__.py)
⎿  Wrote 1 lines to backend/game/__init__.py
1 # Game package
● Write(backend/game/engine.py)
⎿  Wrote 119 lines to backend/game/engine.py
1 """AI Chess Engine with Minimax algorithm."""
2 import chess
3
4
5 def evaluate_board(board: chess.Board) -> int:
6 """Evaluate the board position for the current player."""
7 if board.is_checkmate():
8 return -10000 if board.turn == chess.WHITE else 10000
9 if board.is_stalemate() or board.is_draw():
10 return 0
… +109 lines (ctrl+o to expand)
⎿  Error writing file
● Write(backend/api/schemas.py)
⎿  Wrote 31 lines to backend/api/schemas.py
1 """Pydantic models for API requests and responses."""
2 from pydantic import BaseModel
3 from typing import Optional, List
4
5
6 class GameStatus(BaseModel):
7 """Response model for game status."""
8 fen: str
9 turn: str
10 history: List[str]
… +21 lines (ctrl+o to expand)
● Write(backend/api/routes.py)
⎿  Wrote 190 lines to backend/api/routes.py
1 """FastAPI API routes for the Chess application."""
2 from fastapi import APIRouter, WebSocket, WebSocketDisconnect
3 from fastapi.responses import JSONResponse
4 import chess
5 import chess.engine
6 from typing import Dict, Any, Optional, List
7 import json
8 import asyncio
9
10 from .schemas import GameStatus, MoveRequest, NewGameRequest, Ne
wGameResponse
… +180 lines (ctrl+o to expand)
● Write(backend/main.py)
⎿  Wrote 46 lines to backend/main.py
1 """FastAPI application for Chess API."""
2 import uvicorn
3 from fastapi import FastAPI
4 from fastapi.middleware.cors import CORSMiddleware
5
6 from api.routes import router
7
8 app = FastAPI(
9 title="Chess API",
10 description="A chess API with AI opponent and real-time upda
tes",
… +36 lines (ctrl+o to expand)
● Write(backend/requirements.txt)
⎿  Wrote 5 lines to backend/requirements.txt
1 fastapi>=0.104.0
2 uvicorn[standard]>=0.24.0
3 python-chess>=1.10.0
4 websockets>=12.0
5 pydantic>=2.0.0
● Now let me create the frontend files:
● Write(frontend/index.html)
⎿  Wrote 64 lines to frontend/index.html
1 <!DOCTYPE html>
2 <html lang="en">
3 <head>
4 <meta charset="UTF-8">
5 <meta name="viewport" content="width=device-width, initial-s
cale=1.0">
6 <title>Chess vs AI</title>
7 <link rel="stylesheet" href="https://cdnjs.cloudflare.com/aj
ax/libs/chessboard-js/1.0.0/chessboard-1.0.0.min.css">
8 <link rel="stylesheet" href="https://code.jquery.com/ui/1.13
.0/themes/base/jquery-ui.css">
9 <link rel="stylesheet" href="style/main.css">
10 </head>
… +54 lines (ctrl+o to expand)
● Write(frontend/style/main.css)
⎿  Wrote 300 lines to frontend/style/main.css
1 /* Main stylesheet for Chess frontend */
2
3 * {
4 margin: 0;
5 padding: 0;
6 box-sizing: border-box;
7 }
8
9 body {
10 font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI',
Roboto, Oxygen, Ubuntu, sans-serif;
… +290 lines (ctrl+o to expand)
● Write(frontend/scripts/api.js)
⎿  Wrote 216 lines to frontend/scripts/api.js
1 /**
2 * API Client for Chess Application
3 * Handles HTTP requests and WebSocket connections
4 */
5 class ChessAPI {
6 constructor(baseUrl = 'http://localhost:8000') {
7 this.baseUrl = baseUrl;
8 this.ws = null;
9 this.wsUrl = null;
10 this.messageHandlers = {};
… +206 lines (ctrl+o to expand)
● Write(frontend/scripts/board.js)
⎿  Wrote 183 lines to frontend/scripts/board.js
1 /**
2 * Chessboard UI Initialization
3 * Sets up the chessboard.js board
4 */
5 $(document).ready(function() {
6 // Chessboard configuration
7 const boardConfig = {
8 draggable: true,
9 position: 'start',
10 onDragStart: onDragStart,
… +173 lines (ctrl+o to expand)
● Write(frontend/scripts/game.js)
⎿  Wrote 222 lines to frontend/scripts/game.js
1 /**
2 * Game State Management
3 * Coordinates between UI, API, and game logic
4 */
5 $(document).ready(function() {
6 // DOM Elements
7 const $newGameBtn = $('#new-game-btn');
8 const $aiMoveBtn = $('#ai-move-btn');
9 const $submitMoveBtn = $('#submit-move-btn');
10 const $moveInput = $('#move-input');
… +212 lines (ctrl+o to expand)
● Now let me update the existing main.py to backup and create the README:
● Bash(mv /home/king/code/clichess/main.py
/home/king/code/clichess/main.py.backup)
⎿  (No content)
● A web-based chess application with AI opponent, built with FastAPI backend and
modern frontend.
Project Structure
clichess/
├── backend/ # FastAPI application
│ ├── main.py # FastAPI server entry point
│ ├── api/ # API endpoints
│ │ ├── __init__.py
│ │ ├── routes.py # REST endpoints
│ │ └── schemas.py # Pydantic models
│ ├── game/ # Game logic
│ │ ├── __init__.py
│ │ └── engine.py # AI engine (Minimax)
│ └── requirements.txt
├── frontend/ # Web frontend
│ ├── index.html
│ ├── style/
│ │ └── main.css
│
✻ Cogitated for 15m 41s
qwen3-coder-next % ./run --log-file log.txt
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
build: 7939 (b536eb023) with GNU 13.3.0 for Linux x86_64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model './Qwen3-Coder-Next-Q3_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 37528 MiB of device memory vs. 15082 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 23470 MiB
llama_params_fit_impl: context size set by user to 32768 -> no change
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 11494 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 5060 Ti): 49 layers, 3355 MiB used, 11726 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 5060 Ti): 49 layers (34 overflowing), 13834 MiB used, 1247 MiB free
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 3.91 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5060 Ti) (0000:01:00.0) - 15158 MiB free
llama_model_loader: loaded meta data with 52 key-value pairs and 843 tensors from ./Qwen3-Coder-Next-Q3_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3next
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 40
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = Qwen3-Coder-Next
llama_model_loader: - kv 6: general.basename str = Qwen3-Coder-Next
llama_model_loader: - kv 7: general.quantized_by str = Unsloth
llama_model_loader: - kv 8: general.size_label str = 512x2.5B
llama_model_loader: - kv 9: general.license str = apache-2.0
llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 12: general.base_model.count u32 = 1
llama_model_loader: - kv 13: general.base_model.0.name str = Qwen3 Coder Next
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 17: qwen3next.block_count u32 = 48
llama_model_loader: - kv 18: qwen3next.context_length u32 = 262144
llama_model_loader: - kv 19: qwen3next.embedding_length u32 = 2048
llama_model_loader: - kv 20: qwen3next.feed_forward_length u32 = 5120
llama_model_loader: - kv 21: qwen3next.attention.head_count u32 = 16
llama_model_loader: - kv 22: qwen3next.attention.head_count_kv u32 = 2
llama_model_loader: - kv 23: qwen3next.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 24: qwen3next.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen3next.expert_used_count u32 = 10
llama_model_loader: - kv 26: qwen3next.attention.key_length u32 = 256
llama_model_loader: - kv 27: qwen3next.attention.value_length u32 = 256
llama_model_loader: - kv 28: qwen3next.expert_count u32 = 512
llama_model_loader: - kv 29: qwen3next.expert_feed_forward_length u32 = 512
llama_model_loader: - kv 30: qwen3next.expert_shared_feed_forward_length u32 = 512
llama_model_loader: - kv 31: qwen3next.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 32: qwen3next.ssm.state_size u32 = 128
llama_model_loader: - kv 33: qwen3next.ssm.group_count u32 = 16
llama_model_loader: - kv 34: qwen3next.ssm.time_step_rank u32 = 32
llama_model_loader: - kv 35: qwen3next.ssm.inner_size u32 = 4096
llama_model_loader: - kv 36: qwen3next.rope.dimension_count u32 = 64
llama_model_loader: - kv 37: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 38: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 39: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 40: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 41: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 42: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {% macro render_extra_keys(json_dict,...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 12
llama_model_loader: - kv 48: quantize.imatrix.file str = Qwen3-Coder-Next-GGUF/imatrix_unsloth...
llama_model_loader: - kv 49: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-Coder-Next.txt
llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 576
llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 154
llama_model_loader: - type f32: 313 tensors
llama_model_loader: - type q3_K: 229 tensors
llama_model_loader: - type q4_K: 245 tensors
llama_model_loader: - type q5_K: 7 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type bf16: 48 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 35.65 GiB (3.84 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3next
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 2
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5120
print_info: n_expert = 512
print_info: n_expert_used = 10
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 4096
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 32
print_info: ssm_n_group = 16
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 80B.A3B
print_info: model params = 79.67 B
print_info: general.name = Qwen3-Coder-Next
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 36263.77 MiB
load_tensors: CUDA0 model buffer size = 12571.13 MiB
....................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 768.00 MiB
llama_kv_cache: size = 768.00 MiB ( 32768 cells, 12 layers, 1/1 seqs), K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 75.38 MiB
llama_memory_recurrent: size = 75.38 MiB ( 1 cells, 48 layers, 1 seqs), R (f32): 3.38 MiB, S (f32): 72.00 MiB
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 420.06 MiB
sched_reserve: CUDA_Host compute buffer size = 72.01 MiB
sched_reserve: graph nodes = 14666 (with bs=512), 5918 (with bs=1)
sched_reserve: graph splits = 104 (with bs=512), 74 (with bs=1)
sched_reserve: reserve took 587.99 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 1
no implementations specified for speculative decoding
slot load_model: id 0 | task -1 | speculative decoding context not initialized
slot load_model: id 0 | task -1 | new slot, n_ctx = 32768
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
srv init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 158
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 94, batch.n_tokens = 94, progress = 0.594937
srv params_from_: Chat format: Qwen3 Coder
slot update_slots: id 0 | task 0 | n_tokens = 94, memory_seq_rm [94, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 158, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_tokens = 158, batch.n_tokens = 64
slot init_sampler: id 0 | task 0 | init sampler, took 0.02 ms, tokens: text = 158, total = 158
slot update_slots: id 0 | task 0 | created context checkpoint 1 of 8 (pos_min = 93, pos_max = 93, size = 75.376 MiB)
slot print_timing: id 0 | task 0 |
prompt eval time = 20427.03 ms / 158 tokens ( 129.29 ms per token, 7.73 tokens per second)
eval time = 852.28 ms / 11 tokens ( 77.48 ms per token, 12.91 tokens per second)
total time = 21279.31 ms / 169 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 168, truncated = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 10688094175
srv get_availabl: updating prompt cache
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv prompt_save: - saving prompt with length 168, total state size = 79.316 MiB
srv load: - looking for better prompt, base f_keep = 0.125, sim = 0.001
srv update: - cache state: 1 prompts, 154.691 MiB (limits: 8192.000 MiB, 32768 tokens, 32768 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv get_availabl: prompt cache update took 422.29 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 2 | processing task, is_child = 0
slot update_slots: id 0 | task 2 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 16962
slot update_slots: id 0 | task 2 | n_past = 21, slot.prompt.tokens.size() = 168, seq_id = 0, pos_min = 167, n_swa = 1
slot update_slots: id 0 | task 2 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 2 | erased invalidated context checkpoint (pos_min = 93, pos_max = 93, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 2 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.120740
slot update_slots: id 0 | task 2 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.241481
slot update_slots: id 0 | task 2 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.362221
slot update_slots: id 0 | task 2 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.482962
slot update_slots: id 0 | task 2 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 2048, progress = 0.603702
slot update_slots: id 0 | task 2 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 2048, progress = 0.724443
slot update_slots: id 0 | task 2 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 2048, progress = 0.845183
slot update_slots: id 0 | task 2 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 2048, progress = 0.965924
slot update_slots: id 0 | task 2 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 16898, batch.n_tokens = 514, progress = 0.996227
slot update_slots: id 0 | task 2 | n_tokens = 16898, memory_seq_rm [16898, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_tokens = 16962, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 2 | prompt done, n_tokens = 16962, batch.n_tokens = 64
slot init_sampler: id 0 | task 2 | init sampler, took 1.31 ms, tokens: text = 16962, total = 16962
slot update_slots: id 0 | task 2 | created context checkpoint 1 of 8 (pos_min = 16897, pos_max = 16897, size = 75.376 MiB)
slot print_timing: id 0 | task 2 |
prompt eval time = 110084.95 ms / 16962 tokens ( 6.49 ms per token, 154.08 tokens per second)
eval time = 9879.50 ms / 167 tokens ( 59.16 ms per token, 16.90 tokens per second)
total time = 119964.45 ms / 17129 tokens
slot release: id 0 | task 2 | stop processing: n_tokens = 17128, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 10808634380
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 17128, total state size = 477.010 MiB
srv load: - looking for better prompt, base f_keep = 0.002, sim = 0.004
srv update: - cache state: 2 prompts, 707.077 MiB (limits: 8192.000 MiB, 32768 tokens, 200386 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e42624d83a0: 17128 tokens, checkpoints: 1, 552.385 MiB
srv get_availabl: prompt cache update took 3980.01 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 190 | processing task, is_child = 0
slot update_slots: id 0 | task 190 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 9779
slot update_slots: id 0 | task 190 | n_past = 40, slot.prompt.tokens.size() = 17128, seq_id = 0, pos_min = 17127, n_swa = 1
slot update_slots: id 0 | task 190 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 190 | erased invalidated context checkpoint (pos_min = 16897, pos_max = 16897, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 190 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 190 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.209428
slot update_slots: id 0 | task 190 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 0 | task 190 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.418857
slot update_slots: id 0 | task 190 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 0 | task 190 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.628285
slot update_slots: id 0 | task 190 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 0 | task 190 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.837713
slot update_slots: id 0 | task 190 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 0 | task 190 | prompt processing progress, n_tokens = 9715, batch.n_tokens = 1523, progress = 0.993455
slot update_slots: id 0 | task 190 | n_tokens = 9715, memory_seq_rm [9715, end)
slot update_slots: id 0 | task 190 | prompt processing progress, n_tokens = 9779, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 190 | prompt done, n_tokens = 9779, batch.n_tokens = 64
slot init_sampler: id 0 | task 190 | init sampler, took 0.80 ms, tokens: text = 9779, total = 9779
slot update_slots: id 0 | task 190 | created context checkpoint 1 of 8 (pos_min = 9714, pos_max = 9714, size = 75.376 MiB)
slot print_timing: id 0 | task 190 |
prompt eval time = 43435.43 ms / 9779 tokens ( 4.44 ms per token, 225.14 tokens per second)
eval time = 2675.64 ms / 45 tokens ( 59.46 ms per token, 16.82 tokens per second)
total time = 46111.07 ms / 9824 tokens
slot release: id 0 | task 190 | stop processing: n_tokens = 9823, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 10859055595
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 9823, total state size = 305.715 MiB
srv params_from_: Chat format: Qwen3 Coder
srv load: - looking for better prompt, base f_keep = 0.002, sim = 0.031
srv update: - cache state: 3 prompts, 1088.168 MiB (limits: 8192.000 MiB, 32768 tokens, 204158 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e42624d83a0: 17128 tokens, checkpoints: 1, 552.385 MiB
srv update: - prompt 0x5e42681ad3f0: 9823 tokens, checkpoints: 1, 381.091 MiB
srv get_availabl: prompt cache update took 1279.15 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 241 | processing task, is_child = 0
slot update_slots: id 0 | task 241 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 639
slot update_slots: id 0 | task 241 | n_past = 20, slot.prompt.tokens.size() = 9823, seq_id = 0, pos_min = 9822, n_swa = 1
slot update_slots: id 0 | task 241 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 241 | erased invalidated context checkpoint (pos_min = 9714, pos_max = 9714, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 241 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 241 | prompt processing progress, n_tokens = 575, batch.n_tokens = 575, progress = 0.899844
slot update_slots: id 0 | task 241 | n_tokens = 575, memory_seq_rm [575, end)
slot update_slots: id 0 | task 241 | prompt processing progress, n_tokens = 639, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 241 | prompt done, n_tokens = 639, batch.n_tokens = 64
slot init_sampler: id 0 | task 241 | init sampler, took 0.06 ms, tokens: text = 639, total = 639
slot update_slots: id 0 | task 241 | created context checkpoint 1 of 8 (pos_min = 574, pos_max = 574, size = 75.376 MiB)
slot print_timing: id 0 | task 241 |
prompt eval time = 5353.33 ms / 639 tokens ( 8.38 ms per token, 119.36 tokens per second)
eval time = 1218.51 ms / 23 tokens ( 52.98 ms per token, 18.88 tokens per second)
total time = 6571.85 ms / 662 tokens
slot release: id 0 | task 241 | stop processing: n_tokens = 661, truncated = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 10867785632
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 661, total state size = 90.876 MiB
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv load: - looking for better prompt, base f_keep = 0.030, sim = 0.002
srv load: - found better prompt with f_keep = 1.000, sim = 0.964
srv update: - cache state: 3 prompts, 873.329 MiB (limits: 8192.000 MiB, 32768 tokens, 168440 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e42624d83a0: 17128 tokens, checkpoints: 1, 552.385 MiB
srv update: - prompt 0x5e4267889560: 661 tokens, checkpoints: 1, 166.252 MiB
srv get_availabl: prompt cache update took 1476.50 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 242 | processing task, is_child = 0
slot update_slots: id 0 | task 242 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 10190
slot update_slots: id 0 | task 242 | n_tokens = 9823, memory_seq_rm [9823, end)
slot update_slots: id 0 | task 242 | prompt processing progress, n_tokens = 10126, batch.n_tokens = 303, progress = 0.993719
slot update_slots: id 0 | task 242 | n_tokens = 10126, memory_seq_rm [10126, end)
slot update_slots: id 0 | task 242 | prompt processing progress, n_tokens = 10190, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 242 | prompt done, n_tokens = 10190, batch.n_tokens = 64
slot init_sampler: id 0 | task 242 | init sampler, took 0.81 ms, tokens: text = 10190, total = 10190
slot update_slots: id 0 | task 242 | created context checkpoint 2 of 8 (pos_min = 10125, pos_max = 10125, size = 75.376 MiB)
slot print_timing: id 0 | task 242 |
prompt eval time = 3482.72 ms / 367 tokens ( 9.49 ms per token, 105.38 tokens per second)
eval time = 3469.33 ms / 59 tokens ( 58.80 ms per token, 17.01 tokens per second)
total time = 6952.05 ms / 426 tokens
slot release: id 0 | task 242 | stop processing: n_tokens = 10248, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 10876215074
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 10248, total state size = 315.681 MiB
srv params_from_: Chat format: Qwen3 Coder
srv load: - looking for better prompt, base f_keep = 0.002, sim = 0.035
srv update: - cache state: 4 prompts, 1339.761 MiB (limits: 8192.000 MiB, 32768 tokens, 172460 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e42624d83a0: 17128 tokens, checkpoints: 1, 552.385 MiB
srv update: - prompt 0x5e4267889560: 661 tokens, checkpoints: 1, 166.252 MiB
srv update: - prompt 0x5e42699ac270: 10248 tokens, checkpoints: 2, 466.433 MiB
srv get_availabl: prompt cache update took 1718.98 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 327 | processing task, is_child = 0
slot update_slots: id 0 | task 327 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 573
slot update_slots: id 0 | task 327 | n_past = 20, slot.prompt.tokens.size() = 10248, seq_id = 0, pos_min = 10247, n_swa = 1
slot update_slots: id 0 | task 327 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 327 | erased invalidated context checkpoint (pos_min = 9714, pos_max = 9714, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 327 | erased invalidated context checkpoint (pos_min = 10125, pos_max = 10125, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 327 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 327 | prompt processing progress, n_tokens = 509, batch.n_tokens = 509, progress = 0.888307
slot update_slots: id 0 | task 327 | n_tokens = 509, memory_seq_rm [509, end)
slot update_slots: id 0 | task 327 | prompt processing progress, n_tokens = 573, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 327 | prompt done, n_tokens = 573, batch.n_tokens = 64
slot init_sampler: id 0 | task 327 | init sampler, took 0.05 ms, tokens: text = 573, total = 573
slot update_slots: id 0 | task 327 | created context checkpoint 1 of 8 (pos_min = 508, pos_max = 508, size = 75.376 MiB)
slot print_timing: id 0 | task 327 |
prompt eval time = 4815.77 ms / 573 tokens ( 8.40 ms per token, 118.98 tokens per second)
eval time = 1245.78 ms / 23 tokens ( 54.16 ms per token, 18.46 tokens per second)
total time = 6061.55 ms / 596 tokens
slot release: id 0 | task 327 | stop processing: n_tokens = 595, truncated = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 10884113885
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 595, total state size = 89.328 MiB
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv load: - looking for better prompt, base f_keep = 0.034, sim = 0.002
srv load: - found better prompt with f_keep = 1.000, sim = 0.972
srv update: - cache state: 4 prompts, 1038.033 MiB (limits: 8192.000 MiB, 32768 tokens, 146409 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e42624d83a0: 17128 tokens, checkpoints: 1, 552.385 MiB
srv update: - prompt 0x5e4267889560: 661 tokens, checkpoints: 1, 166.252 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv get_availabl: prompt cache update took 908.73 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 328 | processing task, is_child = 0
slot update_slots: id 0 | task 328 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 10538
slot update_slots: id 0 | task 328 | n_tokens = 10248, memory_seq_rm [10248, end)
slot update_slots: id 0 | task 328 | prompt processing progress, n_tokens = 10474, batch.n_tokens = 226, progress = 0.993927
slot update_slots: id 0 | task 328 | n_tokens = 10474, memory_seq_rm [10474, end)
slot update_slots: id 0 | task 328 | prompt processing progress, n_tokens = 10538, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 328 | prompt done, n_tokens = 10538, batch.n_tokens = 64
slot init_sampler: id 0 | task 328 | init sampler, took 0.82 ms, tokens: text = 10538, total = 10538
slot update_slots: id 0 | task 328 | created context checkpoint 3 of 8 (pos_min = 10473, pos_max = 10473, size = 75.376 MiB)
slot print_timing: id 0 | task 328 |
prompt eval time = 3459.38 ms / 290 tokens ( 11.93 ms per token, 83.83 tokens per second)
eval time = 2195.82 ms / 39 tokens ( 56.30 ms per token, 17.76 tokens per second)
total time = 5655.20 ms / 329 tokens
slot release: id 0 | task 328 | stop processing: n_tokens = 10576, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.995 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 393 | processing task, is_child = 0
slot update_slots: id 0 | task 393 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 10631
slot update_slots: id 0 | task 393 | n_tokens = 10576, memory_seq_rm [10576, end)
slot update_slots: id 0 | task 393 | prompt processing progress, n_tokens = 10631, batch.n_tokens = 55, progress = 1.000000
slot update_slots: id 0 | task 393 | prompt done, n_tokens = 10631, batch.n_tokens = 55
slot init_sampler: id 0 | task 393 | init sampler, took 0.86 ms, tokens: text = 10631, total = 10631
slot update_slots: id 0 | task 393 | created context checkpoint 4 of 8 (pos_min = 10575, pos_max = 10575, size = 75.376 MiB)
slot print_timing: id 0 | task 393 |
prompt eval time = 1321.29 ms / 55 tokens ( 24.02 ms per token, 41.63 tokens per second)
eval time = 1632.40 ms / 30 tokens ( 54.41 ms per token, 18.38 tokens per second)
total time = 2953.69 ms / 85 tokens
slot release: id 0 | task 393 | stop processing: n_tokens = 10660, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.802 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 424 | processing task, is_child = 0
slot update_slots: id 0 | task 424 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13296
slot update_slots: id 0 | task 424 | n_tokens = 10660, memory_seq_rm [10660, end)
slot update_slots: id 0 | task 424 | prompt processing progress, n_tokens = 12708, batch.n_tokens = 2048, progress = 0.955776
slot update_slots: id 0 | task 424 | n_tokens = 12708, memory_seq_rm [12708, end)
slot update_slots: id 0 | task 424 | prompt processing progress, n_tokens = 13232, batch.n_tokens = 524, progress = 0.995187
slot update_slots: id 0 | task 424 | n_tokens = 13232, memory_seq_rm [13232, end)
slot update_slots: id 0 | task 424 | prompt processing progress, n_tokens = 13296, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 424 | prompt done, n_tokens = 13296, batch.n_tokens = 64
slot init_sampler: id 0 | task 424 | init sampler, took 1.08 ms, tokens: text = 13296, total = 13296
slot update_slots: id 0 | task 424 | created context checkpoint 5 of 8 (pos_min = 13231, pos_max = 13231, size = 75.376 MiB)
slot print_timing: id 0 | task 424 |
prompt eval time = 19034.14 ms / 2636 tokens ( 7.22 ms per token, 138.49 tokens per second)
eval time = 1692.73 ms / 32 tokens ( 52.90 ms per token, 18.90 tokens per second)
total time = 20726.87 ms / 2668 tokens
slot release: id 0 | task 424 | stop processing: n_tokens = 13327, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.985 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 459 | processing task, is_child = 0
slot update_slots: id 0 | task 459 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13529
slot update_slots: id 0 | task 459 | n_tokens = 13327, memory_seq_rm [13327, end)
slot update_slots: id 0 | task 459 | prompt processing progress, n_tokens = 13465, batch.n_tokens = 138, progress = 0.995269
slot update_slots: id 0 | task 459 | n_tokens = 13465, memory_seq_rm [13465, end)
slot update_slots: id 0 | task 459 | prompt processing progress, n_tokens = 13529, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 459 | prompt done, n_tokens = 13529, batch.n_tokens = 64
slot init_sampler: id 0 | task 459 | init sampler, took 1.10 ms, tokens: text = 13529, total = 13529
slot update_slots: id 0 | task 459 | created context checkpoint 6 of 8 (pos_min = 13464, pos_max = 13464, size = 75.376 MiB)
slot print_timing: id 0 | task 459 |
prompt eval time = 3199.59 ms / 202 tokens ( 15.84 ms per token, 63.13 tokens per second)
eval time = 3666.16 ms / 62 tokens ( 59.13 ms per token, 16.91 tokens per second)
total time = 6865.74 ms / 264 tokens
slot release: id 0 | task 459 | stop processing: n_tokens = 13590, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.994 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 523 | processing task, is_child = 0
slot update_slots: id 0 | task 523 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13668
slot update_slots: id 0 | task 523 | n_tokens = 13590, memory_seq_rm [13590, end)
slot update_slots: id 0 | task 523 | prompt processing progress, n_tokens = 13604, batch.n_tokens = 14, progress = 0.995318
slot update_slots: id 0 | task 523 | n_tokens = 13604, memory_seq_rm [13604, end)
slot update_slots: id 0 | task 523 | prompt processing progress, n_tokens = 13668, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 523 | prompt done, n_tokens = 13668, batch.n_tokens = 64
slot init_sampler: id 0 | task 523 | init sampler, took 1.31 ms, tokens: text = 13668, total = 13668
slot update_slots: id 0 | task 523 | created context checkpoint 7 of 8 (pos_min = 13603, pos_max = 13603, size = 75.376 MiB)
slot print_timing: id 0 | task 523 |
prompt eval time = 1302.44 ms / 78 tokens ( 16.70 ms per token, 59.89 tokens per second)
eval time = 3741.01 ms / 67 tokens ( 55.84 ms per token, 17.91 tokens per second)
total time = 5043.44 ms / 145 tokens
slot release: id 0 | task 523 | stop processing: n_tokens = 13734, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.994 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 592 | processing task, is_child = 0
slot update_slots: id 0 | task 592 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13823
slot update_slots: id 0 | task 592 | n_tokens = 13734, memory_seq_rm [13734, end)
slot update_slots: id 0 | task 592 | prompt processing progress, n_tokens = 13759, batch.n_tokens = 25, progress = 0.995370
slot update_slots: id 0 | task 592 | n_tokens = 13759, memory_seq_rm [13759, end)
slot update_slots: id 0 | task 592 | prompt processing progress, n_tokens = 13823, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 592 | prompt done, n_tokens = 13823, batch.n_tokens = 64
slot init_sampler: id 0 | task 592 | init sampler, took 1.36 ms, tokens: text = 13823, total = 13823
slot update_slots: id 0 | task 592 | created context checkpoint 8 of 8 (pos_min = 13758, pos_max = 13758, size = 75.376 MiB)
slot print_timing: id 0 | task 592 |
prompt eval time = 2094.38 ms / 89 tokens ( 23.53 ms per token, 42.49 tokens per second)
eval time = 4097.58 ms / 73 tokens ( 56.13 ms per token, 17.82 tokens per second)
total time = 6191.97 ms / 162 tokens
slot release: id 0 | task 592 | stop processing: n_tokens = 13895, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.994 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 667 | processing task, is_child = 0
slot update_slots: id 0 | task 667 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13976
slot update_slots: id 0 | task 667 | n_tokens = 13895, memory_seq_rm [13895, end)
slot update_slots: id 0 | task 667 | prompt processing progress, n_tokens = 13912, batch.n_tokens = 17, progress = 0.995421
slot update_slots: id 0 | task 667 | n_tokens = 13912, memory_seq_rm [13912, end)
slot update_slots: id 0 | task 667 | prompt processing progress, n_tokens = 13976, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 667 | prompt done, n_tokens = 13976, batch.n_tokens = 64
slot init_sampler: id 0 | task 667 | init sampler, took 1.12 ms, tokens: text = 13976, total = 13976
slot update_slots: id 0 | task 667 | erasing old context checkpoint (pos_min = 9714, pos_max = 9714, size = 75.376 MiB)
slot update_slots: id 0 | task 667 | created context checkpoint 8 of 8 (pos_min = 13911, pos_max = 13911, size = 75.376 MiB)
slot print_timing: id 0 | task 667 |
prompt eval time = 1541.54 ms / 81 tokens ( 19.03 ms per token, 52.54 tokens per second)
eval time = 2121.52 ms / 37 tokens ( 57.34 ms per token, 17.44 tokens per second)
total time = 3663.06 ms / 118 tokens
slot release: id 0 | task 667 | stop processing: n_tokens = 14012, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.991 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 706 | processing task, is_child = 0
slot update_slots: id 0 | task 706 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 14137
slot update_slots: id 0 | task 706 | n_tokens = 14012, memory_seq_rm [14012, end)
slot update_slots: id 0 | task 706 | prompt processing progress, n_tokens = 14073, batch.n_tokens = 61, progress = 0.995473
slot update_slots: id 0 | task 706 | n_tokens = 14073, memory_seq_rm [14073, end)
slot update_slots: id 0 | task 706 | prompt processing progress, n_tokens = 14137, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 706 | prompt done, n_tokens = 14137, batch.n_tokens = 64
slot init_sampler: id 0 | task 706 | init sampler, took 1.06 ms, tokens: text = 14137, total = 14137
slot update_slots: id 0 | task 706 | erasing old context checkpoint (pos_min = 10125, pos_max = 10125, size = 75.376 MiB)
slot update_slots: id 0 | task 706 | created context checkpoint 8 of 8 (pos_min = 14072, pos_max = 14072, size = 75.376 MiB)
slot print_timing: id 0 | task 706 |
prompt eval time = 2541.72 ms / 125 tokens ( 20.33 ms per token, 49.18 tokens per second)
eval time = 1790.02 ms / 31 tokens ( 57.74 ms per token, 17.32 tokens per second)
total time = 4331.74 ms / 156 tokens
slot release: id 0 | task 706 | stop processing: n_tokens = 14167, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.987 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 739 | processing task, is_child = 0
slot update_slots: id 0 | task 739 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 14349
slot update_slots: id 0 | task 739 | n_tokens = 14167, memory_seq_rm [14167, end)
slot update_slots: id 0 | task 739 | prompt processing progress, n_tokens = 14285, batch.n_tokens = 118, progress = 0.995540
slot update_slots: id 0 | task 739 | n_tokens = 14285, memory_seq_rm [14285, end)
slot update_slots: id 0 | task 739 | prompt processing progress, n_tokens = 14349, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 739 | prompt done, n_tokens = 14349, batch.n_tokens = 64
slot init_sampler: id 0 | task 739 | init sampler, took 1.27 ms, tokens: text = 14349, total = 14349
slot update_slots: id 0 | task 739 | erasing old context checkpoint (pos_min = 10473, pos_max = 10473, size = 75.376 MiB)
slot update_slots: id 0 | task 739 | created context checkpoint 8 of 8 (pos_min = 14284, pos_max = 14284, size = 75.376 MiB)
slot print_timing: id 0 | task 739 |
prompt eval time = 2861.86 ms / 182 tokens ( 15.72 ms per token, 63.59 tokens per second)
eval time = 1670.05 ms / 31 tokens ( 53.87 ms per token, 18.56 tokens per second)
total time = 4531.91 ms / 213 tokens
slot release: id 0 | task 739 | stop processing: n_tokens = 14379, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.987 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 772 | processing task, is_child = 0
slot update_slots: id 0 | task 772 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 14573
slot update_slots: id 0 | task 772 | n_tokens = 14379, memory_seq_rm [14379, end)
slot update_slots: id 0 | task 772 | prompt processing progress, n_tokens = 14509, batch.n_tokens = 130, progress = 0.995608
slot update_slots: id 0 | task 772 | n_tokens = 14509, memory_seq_rm [14509, end)
slot update_slots: id 0 | task 772 | prompt processing progress, n_tokens = 14573, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 772 | prompt done, n_tokens = 14573, batch.n_tokens = 64
slot init_sampler: id 0 | task 772 | init sampler, took 1.18 ms, tokens: text = 14573, total = 14573
slot update_slots: id 0 | task 772 | erasing old context checkpoint (pos_min = 10575, pos_max = 10575, size = 75.376 MiB)
slot update_slots: id 0 | task 772 | created context checkpoint 8 of 8 (pos_min = 14508, pos_max = 14508, size = 75.376 MiB)
slot print_timing: id 0 | task 772 |
prompt eval time = 2477.22 ms / 194 tokens ( 12.77 ms per token, 78.31 tokens per second)
eval time = 2595.48 ms / 48 tokens ( 54.07 ms per token, 18.49 tokens per second)
total time = 5072.70 ms / 242 tokens
slot release: id 0 | task 772 | stop processing: n_tokens = 14620, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 10957331969
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 14620, total state size = 418.200 MiB
srv params_from_: Chat format: Qwen3 Coder
srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.061
srv load: - found better prompt with f_keep = 0.451, sim = 0.903
srv update: - cache state: 4 prompts, 1892.987 MiB (limits: 8192.000 MiB, 32768 tokens, 140693 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e42624d83a0: 17128 tokens, checkpoints: 1, 552.385 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e424eb36820: 14620 tokens, checkpoints: 8, 1021.206 MiB
srv get_availabl: prompt cache update took 7377.53 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 822 | processing task, is_child = 0
slot update_slots: id 0 | task 822 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 330
slot update_slots: id 0 | task 822 | n_past = 298, slot.prompt.tokens.size() = 661, seq_id = 0, pos_min = 660, n_swa = 1
slot update_slots: id 0 | task 822 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 822 | erased invalidated context checkpoint (pos_min = 574, pos_max = 574, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 822 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 822 | prompt processing progress, n_tokens = 266, batch.n_tokens = 266, progress = 0.806061
slot update_slots: id 0 | task 822 | n_tokens = 266, memory_seq_rm [266, end)
slot update_slots: id 0 | task 822 | prompt processing progress, n_tokens = 330, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 822 | prompt done, n_tokens = 330, batch.n_tokens = 64
slot init_sampler: id 0 | task 822 | init sampler, took 0.03 ms, tokens: text = 330, total = 330
slot update_slots: id 0 | task 822 | created context checkpoint 1 of 8 (pos_min = 265, pos_max = 265, size = 75.376 MiB)
slot print_timing: id 0 | task 822 |
prompt eval time = 4005.07 ms / 330 tokens ( 12.14 ms per token, 82.40 tokens per second)
eval time = 1264.52 ms / 23 tokens ( 54.98 ms per token, 18.19 tokens per second)
total time = 5269.59 ms / 353 tokens
slot release: id 0 | task 822 | stop processing: n_tokens = 352, truncated = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 10970068296
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 352, total state size = 83.630 MiB
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv load: - looking for better prompt, base f_keep = 0.057, sim = 0.001
srv load: - found better prompt with f_keep = 1.000, sim = 0.996
srv update: - cache state: 4 prompts, 1030.787 MiB (limits: 8192.000 MiB, 32768 tokens, 144983 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e42624d83a0: 17128 tokens, checkpoints: 1, 552.385 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv get_availabl: prompt cache update took 283.02 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 823 | processing task, is_child = 0
slot update_slots: id 0 | task 823 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 14674
slot update_slots: id 0 | task 823 | n_tokens = 14620, memory_seq_rm [14620, end)
slot update_slots: id 0 | task 823 | prompt processing progress, n_tokens = 14674, batch.n_tokens = 54, progress = 1.000000
slot update_slots: id 0 | task 823 | prompt done, n_tokens = 14674, batch.n_tokens = 54
slot init_sampler: id 0 | task 823 | init sampler, took 1.45 ms, tokens: text = 14674, total = 14674
slot update_slots: id 0 | task 823 | erasing old context checkpoint (pos_min = 13231, pos_max = 13231, size = 75.376 MiB)
slot update_slots: id 0 | task 823 | created context checkpoint 8 of 8 (pos_min = 14619, pos_max = 14619, size = 75.376 MiB)
slot print_timing: id 0 | task 823 |
prompt eval time = 1103.75 ms / 54 tokens ( 20.44 ms per token, 48.92 tokens per second)
eval time = 37767.49 ms / 662 tokens ( 57.05 ms per token, 17.53 tokens per second)
total time = 38871.24 ms / 716 tokens
slot release: id 0 | task 823 | stop processing: n_tokens = 15335, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 11009224197
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 15335, total state size = 434.966 MiB
srv load: - looking for better prompt, base f_keep = 0.003, sim = 0.002
srv load: - found better prompt with f_keep = 1.000, sim = 0.961
srv update: - cache state: 4 prompts, 1516.374 MiB (limits: 8192.000 MiB, 32768 tokens, 88868 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426a0950c0: 15335 tokens, checkpoints: 8, 1037.972 MiB
srv get_availabl: prompt cache update took 6211.05 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 1510 | processing task, is_child = 0
slot update_slots: id 0 | task 1510 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 17825
slot update_slots: id 0 | task 1510 | n_tokens = 17128, memory_seq_rm [17128, end)
slot update_slots: id 0 | task 1510 | prompt processing progress, n_tokens = 17761, batch.n_tokens = 633, progress = 0.996410
slot update_slots: id 0 | task 1510 | n_tokens = 17761, memory_seq_rm [17761, end)
slot update_slots: id 0 | task 1510 | prompt processing progress, n_tokens = 17825, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 1510 | prompt done, n_tokens = 17825, batch.n_tokens = 64
slot init_sampler: id 0 | task 1510 | init sampler, took 1.38 ms, tokens: text = 17825, total = 17825
slot update_slots: id 0 | task 1510 | created context checkpoint 2 of 8 (pos_min = 17760, pos_max = 17760, size = 75.376 MiB)
slot print_timing: id 0 | task 1510 |
prompt eval time = 8561.37 ms / 697 tokens ( 12.28 ms per token, 81.41 tokens per second)
eval time = 2397.15 ms / 44 tokens ( 54.48 ms per token, 18.36 tokens per second)
total time = 10958.52 ms / 741 tokens
slot release: id 0 | task 1510 | stop processing: n_tokens = 17868, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.873 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 1556 | processing task, is_child = 0
slot update_slots: id 0 | task 1556 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 20474
slot update_slots: id 0 | task 1556 | n_tokens = 17868, memory_seq_rm [17868, end)
slot update_slots: id 0 | task 1556 | prompt processing progress, n_tokens = 19916, batch.n_tokens = 2048, progress = 0.972746
slot update_slots: id 0 | task 1556 | n_tokens = 19916, memory_seq_rm [19916, end)
slot update_slots: id 0 | task 1556 | prompt processing progress, n_tokens = 20410, batch.n_tokens = 494, progress = 0.996874
slot update_slots: id 0 | task 1556 | n_tokens = 20410, memory_seq_rm [20410, end)
slot update_slots: id 0 | task 1556 | prompt processing progress, n_tokens = 20474, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 1556 | prompt done, n_tokens = 20474, batch.n_tokens = 64
slot init_sampler: id 0 | task 1556 | init sampler, took 1.64 ms, tokens: text = 20474, total = 20474
slot update_slots: id 0 | task 1556 | created context checkpoint 3 of 8 (pos_min = 20409, pos_max = 20409, size = 75.376 MiB)
slot print_timing: id 0 | task 1556 |
prompt eval time = 22336.34 ms / 2606 tokens ( 8.57 ms per token, 116.67 tokens per second)
eval time = 1963.70 ms / 34 tokens ( 57.76 ms per token, 17.31 tokens per second)
total time = 24300.04 ms / 2640 tokens
slot release: id 0 | task 1556 | stop processing: n_tokens = 20507, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.951 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 1593 | processing task, is_child = 0
slot update_slots: id 0 | task 1593 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21552
slot update_slots: id 0 | task 1593 | n_past = 20503, slot.prompt.tokens.size() = 20507, seq_id = 0, pos_min = 20506, n_swa = 1
slot update_slots: id 0 | task 1593 | restored context checkpoint (pos_min = 20409, pos_max = 20409, size = 75.376 MiB)
slot update_slots: id 0 | task 1593 | n_tokens = 20410, memory_seq_rm [20410, end)
slot update_slots: id 0 | task 1593 | prompt processing progress, n_tokens = 21488, batch.n_tokens = 1078, progress = 0.997030
slot update_slots: id 0 | task 1593 | n_tokens = 21488, memory_seq_rm [21488, end)
slot update_slots: id 0 | task 1593 | prompt processing progress, n_tokens = 21552, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 1593 | prompt done, n_tokens = 21552, batch.n_tokens = 64
slot init_sampler: id 0 | task 1593 | init sampler, took 1.66 ms, tokens: text = 21552, total = 21552
slot update_slots: id 0 | task 1593 | created context checkpoint 4 of 8 (pos_min = 21487, pos_max = 21487, size = 75.376 MiB)
slot print_timing: id 0 | task 1593 |
prompt eval time = 6580.21 ms / 1142 tokens ( 5.76 ms per token, 173.55 tokens per second)
eval time = 20194.20 ms / 380 tokens ( 53.14 ms per token, 18.82 tokens per second)
total time = 26774.40 ms / 1522 tokens
slot release: id 0 | task 1593 | stop processing: n_tokens = 21931, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 11078044457
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 21931, total state size = 589.635 MiB
srv load: - looking for better prompt, base f_keep = 0.002, sim = 0.004
srv update: - cache state: 5 prompts, 2407.512 MiB (limits: 8192.000 MiB, 32768 tokens, 130598 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426a0950c0: 15335 tokens, checkpoints: 8, 1037.972 MiB
srv update: - prompt 0x5e426bba13c0: 21931 tokens, checkpoints: 4, 891.138 MiB
srv get_availabl: prompt cache update took 3427.35 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 1975 | processing task, is_child = 0
slot update_slots: id 0 | task 1975 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 10107
slot update_slots: id 0 | task 1975 | n_past = 40, slot.prompt.tokens.size() = 21931, seq_id = 0, pos_min = 21930, n_swa = 1
slot update_slots: id 0 | task 1975 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 1975 | erased invalidated context checkpoint (pos_min = 16897, pos_max = 16897, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 1975 | erased invalidated context checkpoint (pos_min = 17760, pos_max = 17760, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 1975 | erased invalidated context checkpoint (pos_min = 20409, pos_max = 20409, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 1975 | erased invalidated context checkpoint (pos_min = 21487, pos_max = 21487, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 1975 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 1975 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.202632
slot update_slots: id 0 | task 1975 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 0 | task 1975 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.405264
slot update_slots: id 0 | task 1975 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 0 | task 1975 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.607895
slot update_slots: id 0 | task 1975 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 0 | task 1975 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.810527
slot update_slots: id 0 | task 1975 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 0 | task 1975 | prompt processing progress, n_tokens = 10043, batch.n_tokens = 1851, progress = 0.993668
slot update_slots: id 0 | task 1975 | n_tokens = 10043, memory_seq_rm [10043, end)
slot update_slots: id 0 | task 1975 | prompt processing progress, n_tokens = 10107, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 1975 | prompt done, n_tokens = 10107, batch.n_tokens = 64
slot init_sampler: id 0 | task 1975 | init sampler, took 0.81 ms, tokens: text = 10107, total = 10107
slot update_slots: id 0 | task 1975 | created context checkpoint 1 of 8 (pos_min = 10042, pos_max = 10042, size = 75.376 MiB)
slot print_timing: id 0 | task 1975 |
prompt eval time = 48887.82 ms / 10107 tokens ( 4.84 ms per token, 206.74 tokens per second)
eval time = 2005.96 ms / 37 tokens ( 54.22 ms per token, 18.45 tokens per second)
total time = 50893.78 ms / 10144 tokens
slot release: id 0 | task 1975 | stop processing: n_tokens = 10143, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.967 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 2018 | processing task, is_child = 0
slot update_slots: id 0 | task 2018 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 10493
slot update_slots: id 0 | task 2018 | n_tokens = 10143, memory_seq_rm [10143, end)
slot update_slots: id 0 | task 2018 | prompt processing progress, n_tokens = 10429, batch.n_tokens = 286, progress = 0.993901
slot update_slots: id 0 | task 2018 | n_tokens = 10429, memory_seq_rm [10429, end)
slot update_slots: id 0 | task 2018 | prompt processing progress, n_tokens = 10493, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 2018 | prompt done, n_tokens = 10493, batch.n_tokens = 64
slot init_sampler: id 0 | task 2018 | init sampler, took 0.82 ms, tokens: text = 10493, total = 10493
slot update_slots: id 0 | task 2018 | created context checkpoint 2 of 8 (pos_min = 10428, pos_max = 10428, size = 75.376 MiB)
slot print_timing: id 0 | task 2018 |
prompt eval time = 4169.57 ms / 350 tokens ( 11.91 ms per token, 83.94 tokens per second)
eval time = 2379.56 ms / 43 tokens ( 55.34 ms per token, 18.07 tokens per second)
total time = 6549.12 ms / 393 tokens
slot release: id 0 | task 2018 | stop processing: n_tokens = 10535, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.800 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 2063 | processing task, is_child = 0
slot update_slots: id 0 | task 2063 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13171
slot update_slots: id 0 | task 2063 | n_tokens = 10535, memory_seq_rm [10535, end)
slot update_slots: id 0 | task 2063 | prompt processing progress, n_tokens = 12583, batch.n_tokens = 2048, progress = 0.955356
slot update_slots: id 0 | task 2063 | n_tokens = 12583, memory_seq_rm [12583, end)
slot update_slots: id 0 | task 2063 | prompt processing progress, n_tokens = 13107, batch.n_tokens = 524, progress = 0.995141
slot update_slots: id 0 | task 2063 | n_tokens = 13107, memory_seq_rm [13107, end)
slot update_slots: id 0 | task 2063 | prompt processing progress, n_tokens = 13171, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 2063 | prompt done, n_tokens = 13171, batch.n_tokens = 64
slot init_sampler: id 0 | task 2063 | init sampler, took 1.07 ms, tokens: text = 13171, total = 13171
slot update_slots: id 0 | task 2063 | created context checkpoint 3 of 8 (pos_min = 13106, pos_max = 13106, size = 75.376 MiB)
slot print_timing: id 0 | task 2063 |
prompt eval time = 15695.50 ms / 2636 tokens ( 5.95 ms per token, 167.95 tokens per second)
eval time = 1647.42 ms / 32 tokens ( 51.48 ms per token, 19.42 tokens per second)
total time = 17342.92 ms / 2668 tokens
slot release: id 0 | task 2063 | stop processing: n_tokens = 13202, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.985 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 2098 | processing task, is_child = 0
slot update_slots: id 0 | task 2098 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13404
slot update_slots: id 0 | task 2098 | n_tokens = 13202, memory_seq_rm [13202, end)
slot update_slots: id 0 | task 2098 | prompt processing progress, n_tokens = 13340, batch.n_tokens = 138, progress = 0.995225
slot update_slots: id 0 | task 2098 | n_tokens = 13340, memory_seq_rm [13340, end)
slot update_slots: id 0 | task 2098 | prompt processing progress, n_tokens = 13404, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 2098 | prompt done, n_tokens = 13404, batch.n_tokens = 64
slot init_sampler: id 0 | task 2098 | init sampler, took 1.07 ms, tokens: text = 13404, total = 13404
slot update_slots: id 0 | task 2098 | created context checkpoint 4 of 8 (pos_min = 13339, pos_max = 13339, size = 75.376 MiB)
slot print_timing: id 0 | task 2098 |
prompt eval time = 2317.17 ms / 202 tokens ( 11.47 ms per token, 87.18 tokens per second)
eval time = 2776.01 ms / 49 tokens ( 56.65 ms per token, 17.65 tokens per second)
total time = 5093.19 ms / 251 tokens
slot release: id 0 | task 2098 | stop processing: n_tokens = 13452, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.978 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 2149 | processing task, is_child = 0
slot update_slots: id 0 | task 2149 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13753
slot update_slots: id 0 | task 2149 | n_tokens = 13452, memory_seq_rm [13452, end)
slot update_slots: id 0 | task 2149 | prompt processing progress, n_tokens = 13689, batch.n_tokens = 237, progress = 0.995346
slot update_slots: id 0 | task 2149 | n_tokens = 13689, memory_seq_rm [13689, end)
slot update_slots: id 0 | task 2149 | prompt processing progress, n_tokens = 13753, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 2149 | prompt done, n_tokens = 13753, batch.n_tokens = 64
slot init_sampler: id 0 | task 2149 | init sampler, took 1.14 ms, tokens: text = 13753, total = 13753
slot update_slots: id 0 | task 2149 | created context checkpoint 5 of 8 (pos_min = 13688, pos_max = 13688, size = 75.376 MiB)
slot print_timing: id 0 | task 2149 |
prompt eval time = 3427.18 ms / 301 tokens ( 11.39 ms per token, 87.83 tokens per second)
eval time = 2794.28 ms / 46 tokens ( 60.75 ms per token, 16.46 tokens per second)
total time = 6221.46 ms / 347 tokens
slot release: id 0 | task 2149 | stop processing: n_tokens = 13798, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 11171476651
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 13798, total state size = 398.925 MiB
srv params_from_: Chat format: Qwen3 Coder
srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.063
srv update: - cache state: 6 prompts, 3183.316 MiB (limits: 8192.000 MiB, 32768 tokens, 134278 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426a0950c0: 15335 tokens, checkpoints: 8, 1037.972 MiB
srv update: - prompt 0x5e426bba13c0: 21931 tokens, checkpoints: 4, 891.138 MiB
srv update: - prompt 0x5e42692b4330: 13798 tokens, checkpoints: 5, 775.804 MiB
srv get_availabl: prompt cache update took 2569.31 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 2197 | processing task, is_child = 0
slot update_slots: id 0 | task 2197 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 319
slot update_slots: id 0 | task 2197 | n_past = 20, slot.prompt.tokens.size() = 13798, seq_id = 0, pos_min = 13797, n_swa = 1
slot update_slots: id 0 | task 2197 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 2197 | erased invalidated context checkpoint (pos_min = 10042, pos_max = 10042, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 2197 | erased invalidated context checkpoint (pos_min = 10428, pos_max = 10428, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 2197 | erased invalidated context checkpoint (pos_min = 13106, pos_max = 13106, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 2197 | erased invalidated context checkpoint (pos_min = 13339, pos_max = 13339, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 2197 | erased invalidated context checkpoint (pos_min = 13688, pos_max = 13688, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 2197 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 2197 | prompt processing progress, n_tokens = 255, batch.n_tokens = 255, progress = 0.799373
slot update_slots: id 0 | task 2197 | n_tokens = 255, memory_seq_rm [255, end)
slot update_slots: id 0 | task 2197 | prompt processing progress, n_tokens = 319, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 2197 | prompt done, n_tokens = 319, batch.n_tokens = 64
slot init_sampler: id 0 | task 2197 | init sampler, took 0.03 ms, tokens: text = 319, total = 319
slot update_slots: id 0 | task 2197 | created context checkpoint 1 of 8 (pos_min = 254, pos_max = 254, size = 75.376 MiB)
slot print_timing: id 0 | task 2197 |
prompt eval time = 6916.62 ms / 319 tokens ( 21.68 ms per token, 46.12 tokens per second)
eval time = 2074.63 ms / 34 tokens ( 61.02 ms per token, 16.39 tokens per second)
total time = 8991.25 ms / 353 tokens
slot release: id 0 | task 2197 | stop processing: n_tokens = 352, truncated = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 11183297025
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 352, total state size = 83.630 MiB
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv load: - looking for better prompt, base f_keep = 0.057, sim = 0.001
srv load: - found better prompt with f_keep = 1.000, sim = 0.997
srv update: - cache state: 6 prompts, 2566.518 MiB (limits: 8192.000 MiB, 32768 tokens, 123630 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426a0950c0: 15335 tokens, checkpoints: 8, 1037.972 MiB
srv update: - prompt 0x5e426bba13c0: 21931 tokens, checkpoints: 4, 891.138 MiB
srv update: - prompt 0x5e426ad9d890: 352 tokens, checkpoints: 1, 159.006 MiB
srv get_availabl: prompt cache update took 589.95 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 2198 | processing task, is_child = 0
slot update_slots: id 0 | task 2198 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13843
slot update_slots: id 0 | task 2198 | n_tokens = 13798, memory_seq_rm [13798, end)
slot update_slots: id 0 | task 2198 | prompt processing progress, n_tokens = 13843, batch.n_tokens = 45, progress = 1.000000
slot update_slots: id 0 | task 2198 | prompt done, n_tokens = 13843, batch.n_tokens = 45
slot init_sampler: id 0 | task 2198 | init sampler, took 1.12 ms, tokens: text = 13843, total = 13843
slot update_slots: id 0 | task 2198 | created context checkpoint 6 of 8 (pos_min = 13797, pos_max = 13797, size = 75.376 MiB)
slot print_timing: id 0 | task 2198 |
prompt eval time = 1294.61 ms / 45 tokens ( 28.77 ms per token, 34.76 tokens per second)
eval time = 1716.28 ms / 31 tokens ( 55.36 ms per token, 18.06 tokens per second)
total time = 3010.89 ms / 76 tokens
slot release: id 0 | task 2198 | stop processing: n_tokens = 13873, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.986 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 2265 | processing task, is_child = 0
slot update_slots: id 0 | task 2265 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 14067
slot update_slots: id 0 | task 2265 | n_tokens = 13873, memory_seq_rm [13873, end)
slot update_slots: id 0 | task 2265 | prompt processing progress, n_tokens = 14003, batch.n_tokens = 130, progress = 0.995450
slot update_slots: id 0 | task 2265 | n_tokens = 14003, memory_seq_rm [14003, end)
slot update_slots: id 0 | task 2265 | prompt processing progress, n_tokens = 14067, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 2265 | prompt done, n_tokens = 14067, batch.n_tokens = 64
slot init_sampler: id 0 | task 2265 | init sampler, took 1.34 ms, tokens: text = 14067, total = 14067
slot update_slots: id 0 | task 2265 | created context checkpoint 7 of 8 (pos_min = 14002, pos_max = 14002, size = 75.376 MiB)
slot print_timing: id 0 | task 2265 |
prompt eval time = 3538.61 ms / 194 tokens ( 18.24 ms per token, 54.82 tokens per second)
eval time = 98183.48 ms / 1680 tokens ( 58.44 ms per token, 17.11 tokens per second)
total time = 101722.09 ms / 1874 tokens
slot release: id 0 | task 2265 | stop processing: n_tokens = 15746, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 11288798023
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 15746, total state size = 444.603 MiB
srv load: - looking for better prompt, base f_keep = 0.003, sim = 0.002
srv load: - found better prompt with f_keep = 1.000, sim = 0.928
srv update: - cache state: 6 prompts, 2647.614 MiB (limits: 8192.000 MiB, 32768 tokens, 100706 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426a0950c0: 15335 tokens, checkpoints: 8, 1037.972 MiB
srv update: - prompt 0x5e426ad9d890: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e424eb6dbb0: 15746 tokens, checkpoints: 7, 972.234 MiB
srv get_availabl: prompt cache update took 11613.77 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 3947 | processing task, is_child = 0
slot update_slots: id 0 | task 3947 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 23642
slot update_slots: id 0 | task 3947 | n_tokens = 21931, memory_seq_rm [21931, end)
slot update_slots: id 0 | task 3947 | prompt processing progress, n_tokens = 23578, batch.n_tokens = 1647, progress = 0.997293
slot update_slots: id 0 | task 3947 | n_tokens = 23578, memory_seq_rm [23578, end)
slot update_slots: id 0 | task 3947 | prompt processing progress, n_tokens = 23642, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 3947 | prompt done, n_tokens = 23642, batch.n_tokens = 64
slot init_sampler: id 0 | task 3947 | init sampler, took 2.13 ms, tokens: text = 23642, total = 23642
slot update_slots: id 0 | task 3947 | created context checkpoint 5 of 8 (pos_min = 23577, pos_max = 23577, size = 75.376 MiB)
slot print_timing: id 0 | task 3947 |
prompt eval time = 15749.75 ms / 1711 tokens ( 9.21 ms per token, 108.64 tokens per second)
eval time = 11339.72 ms / 196 tokens ( 57.86 ms per token, 17.28 tokens per second)
total time = 27089.48 ms / 1907 tokens
slot release: id 0 | task 3947 | stop processing: n_tokens = 23837, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.998 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 4145 | processing task, is_child = 0
slot update_slots: id 0 | task 4145 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 23894
slot update_slots: id 0 | task 4145 | n_tokens = 23837, memory_seq_rm [23837, end)
slot update_slots: id 0 | task 4145 | prompt processing progress, n_tokens = 23894, batch.n_tokens = 57, progress = 1.000000
slot update_slots: id 0 | task 4145 | prompt done, n_tokens = 23894, batch.n_tokens = 57
slot init_sampler: id 0 | task 4145 | init sampler, took 1.93 ms, tokens: text = 23894, total = 23894
slot update_slots: id 0 | task 4145 | created context checkpoint 6 of 8 (pos_min = 23836, pos_max = 23836, size = 75.376 MiB)
slot print_timing: id 0 | task 4145 |
prompt eval time = 8031.84 ms / 57 tokens ( 140.91 ms per token, 7.10 tokens per second)
eval time = 53588.53 ms / 802 tokens ( 66.82 ms per token, 14.97 tokens per second)
total time = 61620.37 ms / 859 tokens
slot release: id 0 | task 4145 | stop processing: n_tokens = 24695, truncated = 0
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv update_slots: all slots are idle
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.994 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 4948 | processing task, is_child = 0
slot update_slots: id 0 | task 4948 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 24838
slot update_slots: id 0 | task 4948 | n_tokens = 24695, memory_seq_rm [24695, end)
slot update_slots: id 0 | task 4948 | prompt processing progress, n_tokens = 24774, batch.n_tokens = 79, progress = 0.997423
slot update_slots: id 0 | task 4948 | n_tokens = 24774, memory_seq_rm [24774, end)
slot update_slots: id 0 | task 4948 | prompt processing progress, n_tokens = 24838, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 4948 | prompt done, n_tokens = 24838, batch.n_tokens = 64
slot init_sampler: id 0 | task 4948 | init sampler, took 2.04 ms, tokens: text = 24838, total = 24838
slot update_slots: id 0 | task 4948 | created context checkpoint 7 of 8 (pos_min = 24773, pos_max = 24773, size = 75.376 MiB)
slot print_timing: id 0 | task 4948 |
prompt eval time = 5046.80 ms / 143 tokens ( 35.29 ms per token, 28.33 tokens per second)
eval time = 841.27 ms / 14 tokens ( 60.09 ms per token, 16.64 tokens per second)
total time = 5888.07 ms / 157 tokens
slot release: id 0 | task 4948 | stop processing: n_tokens = 24851, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.952 (> 0.100 thold), f_keep = 0.681
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 4964 | processing task, is_child = 0
slot update_slots: id 0 | task 4964 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 17765
slot update_slots: id 0 | task 4964 | n_past = 16917, slot.prompt.tokens.size() = 24851, seq_id = 0, pos_min = 24850, n_swa = 1
slot update_slots: id 0 | task 4964 | restored context checkpoint (pos_min = 16897, pos_max = 16897, size = 75.376 MiB)
slot update_slots: id 0 | task 4964 | erased invalidated context checkpoint (pos_min = 17760, pos_max = 17760, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 4964 | erased invalidated context checkpoint (pos_min = 20409, pos_max = 20409, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 4964 | erased invalidated context checkpoint (pos_min = 21487, pos_max = 21487, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 4964 | erased invalidated context checkpoint (pos_min = 23577, pos_max = 23577, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 4964 | erased invalidated context checkpoint (pos_min = 23836, pos_max = 23836, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 4964 | erased invalidated context checkpoint (pos_min = 24773, pos_max = 24773, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 4964 | n_tokens = 16898, memory_seq_rm [16898, end)
slot update_slots: id 0 | task 4964 | prompt processing progress, n_tokens = 17701, batch.n_tokens = 803, progress = 0.996397
slot update_slots: id 0 | task 4964 | n_tokens = 17701, memory_seq_rm [17701, end)
slot update_slots: id 0 | task 4964 | prompt processing progress, n_tokens = 17765, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 4964 | prompt done, n_tokens = 17765, batch.n_tokens = 64
slot init_sampler: id 0 | task 4964 | init sampler, took 1.44 ms, tokens: text = 17765, total = 17765
slot update_slots: id 0 | task 4964 | created context checkpoint 2 of 8 (pos_min = 17700, pos_max = 17700, size = 75.376 MiB)
slot print_timing: id 0 | task 4964 |
prompt eval time = 14860.42 ms / 867 tokens ( 17.14 ms per token, 58.34 tokens per second)
eval time = 8429.19 ms / 144 tokens ( 58.54 ms per token, 17.08 tokens per second)
total time = 23289.60 ms / 1011 tokens
slot release: id 0 | task 4964 | stop processing: n_tokens = 17908, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 12109752620
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 17908, total state size = 495.300 MiB
srv load: - looking for better prompt, base f_keep = 0.002, sim = 0.004
srv load: - found better prompt with f_keep = 0.631, sim = 0.992
srv update: - cache state: 6 prompts, 2255.693 MiB (limits: 8192.000 MiB, 32768 tokens, 127548 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426ad9d890: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e424eb6dbb0: 15746 tokens, checkpoints: 7, 972.234 MiB
srv update: - prompt 0x5e426baf03e0: 17908 tokens, checkpoints: 2, 646.052 MiB
srv get_availabl: prompt cache update took 5872.87 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5110 | processing task, is_child = 0
slot update_slots: id 0 | task 5110 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 9752
slot update_slots: id 0 | task 5110 | n_past = 9678, slot.prompt.tokens.size() = 15335, seq_id = 0, pos_min = 15334, n_swa = 1
slot update_slots: id 0 | task 5110 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 5110 | erased invalidated context checkpoint (pos_min = 13464, pos_max = 13464, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 5110 | erased invalidated context checkpoint (pos_min = 13603, pos_max = 13603, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 5110 | erased invalidated context checkpoint (pos_min = 13758, pos_max = 13758, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 5110 | erased invalidated context checkpoint (pos_min = 13911, pos_max = 13911, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 5110 | erased invalidated context checkpoint (pos_min = 14072, pos_max = 14072, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 5110 | erased invalidated context checkpoint (pos_min = 14284, pos_max = 14284, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 5110 | erased invalidated context checkpoint (pos_min = 14508, pos_max = 14508, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 5110 | erased invalidated context checkpoint (pos_min = 14619, pos_max = 14619, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 5110 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 5110 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.210008
slot update_slots: id 0 | task 5110 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 0 | task 5110 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.420016
slot update_slots: id 0 | task 5110 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 0 | task 5110 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.630025
slot update_slots: id 0 | task 5110 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 0 | task 5110 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.840033
slot update_slots: id 0 | task 5110 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 0 | task 5110 | prompt processing progress, n_tokens = 9688, batch.n_tokens = 1496, progress = 0.993437
slot update_slots: id 0 | task 5110 | n_tokens = 9688, memory_seq_rm [9688, end)
slot update_slots: id 0 | task 5110 | prompt processing progress, n_tokens = 9752, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 5110 | prompt done, n_tokens = 9752, batch.n_tokens = 64
slot init_sampler: id 0 | task 5110 | init sampler, took 0.78 ms, tokens: text = 9752, total = 9752
slot update_slots: id 0 | task 5110 | created context checkpoint 1 of 8 (pos_min = 9687, pos_max = 9687, size = 75.376 MiB)
slot print_timing: id 0 | task 5110 |
prompt eval time = 73465.76 ms / 9752 tokens ( 7.53 ms per token, 132.74 tokens per second)
eval time = 2107.18 ms / 37 tokens ( 56.95 ms per token, 17.56 tokens per second)
total time = 75572.94 ms / 9789 tokens
slot release: id 0 | task 5110 | stop processing: n_tokens = 9788, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.987 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5153 | processing task, is_child = 0
slot update_slots: id 0 | task 5153 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 9913
slot update_slots: id 0 | task 5153 | n_tokens = 9788, memory_seq_rm [9788, end)
slot update_slots: id 0 | task 5153 | prompt processing progress, n_tokens = 9849, batch.n_tokens = 61, progress = 0.993544
slot update_slots: id 0 | task 5153 | n_tokens = 9849, memory_seq_rm [9849, end)
slot update_slots: id 0 | task 5153 | prompt processing progress, n_tokens = 9913, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 5153 | prompt done, n_tokens = 9913, batch.n_tokens = 64
slot init_sampler: id 0 | task 5153 | init sampler, took 0.75 ms, tokens: text = 9913, total = 9913
slot update_slots: id 0 | task 5153 | created context checkpoint 2 of 8 (pos_min = 9848, pos_max = 9848, size = 75.376 MiB)
slot print_timing: id 0 | task 5153 |
prompt eval time = 2714.39 ms / 125 tokens ( 21.72 ms per token, 46.05 tokens per second)
eval time = 1830.79 ms / 30 tokens ( 61.03 ms per token, 16.39 tokens per second)
total time = 4545.19 ms / 155 tokens
slot release: id 0 | task 5153 | stop processing: n_tokens = 9942, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.790 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5185 | processing task, is_child = 0
slot update_slots: id 0 | task 5185 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 12578
slot update_slots: id 0 | task 5185 | n_tokens = 9942, memory_seq_rm [9942, end)
slot update_slots: id 0 | task 5185 | prompt processing progress, n_tokens = 11990, batch.n_tokens = 2048, progress = 0.953252
slot update_slots: id 0 | task 5185 | n_tokens = 11990, memory_seq_rm [11990, end)
slot update_slots: id 0 | task 5185 | prompt processing progress, n_tokens = 12514, batch.n_tokens = 524, progress = 0.994912
slot update_slots: id 0 | task 5185 | n_tokens = 12514, memory_seq_rm [12514, end)
slot update_slots: id 0 | task 5185 | prompt processing progress, n_tokens = 12578, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 5185 | prompt done, n_tokens = 12578, batch.n_tokens = 64
slot init_sampler: id 0 | task 5185 | init sampler, took 1.07 ms, tokens: text = 12578, total = 12578
slot update_slots: id 0 | task 5185 | created context checkpoint 3 of 8 (pos_min = 12513, pos_max = 12513, size = 75.376 MiB)
slot print_timing: id 0 | task 5185 |
prompt eval time = 22554.48 ms / 2636 tokens ( 8.56 ms per token, 116.87 tokens per second)
eval time = 1803.50 ms / 31 tokens ( 58.18 ms per token, 17.19 tokens per second)
total time = 24357.98 ms / 2667 tokens
slot release: id 0 | task 5185 | stop processing: n_tokens = 12608, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.994 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5219 | processing task, is_child = 0
slot update_slots: id 0 | task 5219 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 12684
slot update_slots: id 0 | task 5219 | n_tokens = 12608, memory_seq_rm [12608, end)
slot update_slots: id 0 | task 5219 | prompt processing progress, n_tokens = 12620, batch.n_tokens = 12, progress = 0.994954
slot update_slots: id 0 | task 5219 | n_tokens = 12620, memory_seq_rm [12620, end)
slot update_slots: id 0 | task 5219 | prompt processing progress, n_tokens = 12684, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 5219 | prompt done, n_tokens = 12684, batch.n_tokens = 64
slot init_sampler: id 0 | task 5219 | init sampler, took 1.01 ms, tokens: text = 12684, total = 12684
slot update_slots: id 0 | task 5219 | created context checkpoint 4 of 8 (pos_min = 12619, pos_max = 12619, size = 75.376 MiB)
slot print_timing: id 0 | task 5219 |
prompt eval time = 1333.55 ms / 76 tokens ( 17.55 ms per token, 56.99 tokens per second)
eval time = 1785.36 ms / 32 tokens ( 55.79 ms per token, 17.92 tokens per second)
total time = 3118.91 ms / 108 tokens
slot release: id 0 | task 5219 | stop processing: n_tokens = 12715, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.984 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5253 | processing task, is_child = 0
slot update_slots: id 0 | task 5253 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 12917
slot update_slots: id 0 | task 5253 | n_tokens = 12715, memory_seq_rm [12715, end)
slot update_slots: id 0 | task 5253 | prompt processing progress, n_tokens = 12853, batch.n_tokens = 138, progress = 0.995045
slot update_slots: id 0 | task 5253 | n_tokens = 12853, memory_seq_rm [12853, end)
slot update_slots: id 0 | task 5253 | prompt processing progress, n_tokens = 12917, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 5253 | prompt done, n_tokens = 12917, batch.n_tokens = 64
slot init_sampler: id 0 | task 5253 | init sampler, took 1.08 ms, tokens: text = 12917, total = 12917
slot update_slots: id 0 | task 5253 | created context checkpoint 5 of 8 (pos_min = 12852, pos_max = 12852, size = 75.376 MiB)
slot print_timing: id 0 | task 5253 |
prompt eval time = 2505.56 ms / 202 tokens ( 12.40 ms per token, 80.62 tokens per second)
eval time = 2117.64 ms / 37 tokens ( 57.23 ms per token, 17.47 tokens per second)
total time = 4623.20 ms / 239 tokens
slot release: id 0 | task 5253 | stop processing: n_tokens = 12953, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.990 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5292 | processing task, is_child = 0
slot update_slots: id 0 | task 5292 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13078
slot update_slots: id 0 | task 5292 | n_tokens = 12953, memory_seq_rm [12953, end)
slot update_slots: id 0 | task 5292 | prompt processing progress, n_tokens = 13014, batch.n_tokens = 61, progress = 0.995106
slot update_slots: id 0 | task 5292 | n_tokens = 13014, memory_seq_rm [13014, end)
slot update_slots: id 0 | task 5292 | prompt processing progress, n_tokens = 13078, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 5292 | prompt done, n_tokens = 13078, batch.n_tokens = 64
slot init_sampler: id 0 | task 5292 | init sampler, took 1.23 ms, tokens: text = 13078, total = 13078
slot update_slots: id 0 | task 5292 | created context checkpoint 6 of 8 (pos_min = 13013, pos_max = 13013, size = 75.376 MiB)
slot print_timing: id 0 | task 5292 |
prompt eval time = 1997.29 ms / 125 tokens ( 15.98 ms per token, 62.58 tokens per second)
eval time = 31247.08 ms / 538 tokens ( 58.08 ms per token, 17.22 tokens per second)
total time = 33244.36 ms / 663 tokens
slot release: id 0 | task 5292 | stop processing: n_tokens = 13615, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 12266760647
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 13615, total state size = 394.634 MiB
srv load: - looking for better prompt, base f_keep = 0.003, sim = 0.002
srv load: - found better prompt with f_keep = 1.000, sim = 0.969
srv update: - cache state: 6 prompts, 2456.530 MiB (limits: 8192.000 MiB, 32768 tokens, 102804 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426ad9d890: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e424eb6dbb0: 15746 tokens, checkpoints: 7, 972.234 MiB
srv update: - prompt 0x5e426816e6d0: 13615 tokens, checkpoints: 6, 846.889 MiB
srv get_availabl: prompt cache update took 7365.40 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5832 | processing task, is_child = 0
slot update_slots: id 0 | task 5832 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 18481
slot update_slots: id 0 | task 5832 | n_tokens = 17908, memory_seq_rm [17908, end)
slot update_slots: id 0 | task 5832 | prompt processing progress, n_tokens = 18417, batch.n_tokens = 509, progress = 0.996537
slot update_slots: id 0 | task 5832 | n_tokens = 18417, memory_seq_rm [18417, end)
slot update_slots: id 0 | task 5832 | prompt processing progress, n_tokens = 18481, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 5832 | prompt done, n_tokens = 18481, batch.n_tokens = 64
slot init_sampler: id 0 | task 5832 | init sampler, took 1.45 ms, tokens: text = 18481, total = 18481
slot update_slots: id 0 | task 5832 | created context checkpoint 3 of 8 (pos_min = 18416, pos_max = 18416, size = 75.376 MiB)
slot print_timing: id 0 | task 5832 |
prompt eval time = 7734.73 ms / 573 tokens ( 13.50 ms per token, 74.08 tokens per second)
eval time = 2791.53 ms / 44 tokens ( 63.44 ms per token, 15.76 tokens per second)
total time = 10526.26 ms / 617 tokens
slot release: id 0 | task 5832 | stop processing: n_tokens = 18524, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.877 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5878 | processing task, is_child = 0
slot update_slots: id 0 | task 5878 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21130
slot update_slots: id 0 | task 5878 | n_tokens = 18524, memory_seq_rm [18524, end)
slot update_slots: id 0 | task 5878 | prompt processing progress, n_tokens = 20572, batch.n_tokens = 2048, progress = 0.973592
slot update_slots: id 0 | task 5878 | n_tokens = 20572, memory_seq_rm [20572, end)
slot update_slots: id 0 | task 5878 | prompt processing progress, n_tokens = 21066, batch.n_tokens = 494, progress = 0.996971
slot update_slots: id 0 | task 5878 | n_tokens = 21066, memory_seq_rm [21066, end)
slot update_slots: id 0 | task 5878 | prompt processing progress, n_tokens = 21130, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 5878 | prompt done, n_tokens = 21130, batch.n_tokens = 64
slot init_sampler: id 0 | task 5878 | init sampler, took 1.72 ms, tokens: text = 21130, total = 21130
slot update_slots: id 0 | task 5878 | created context checkpoint 4 of 8 (pos_min = 21065, pos_max = 21065, size = 75.376 MiB)
slot print_timing: id 0 | task 5878 |
prompt eval time = 21446.58 ms / 2606 tokens ( 8.23 ms per token, 121.51 tokens per second)
eval time = 1975.53 ms / 33 tokens ( 59.86 ms per token, 16.70 tokens per second)
total time = 23422.12 ms / 2639 tokens
slot release: id 0 | task 5878 | stop processing: n_tokens = 21162, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5914 | processing task, is_child = 0
slot update_slots: id 0 | task 5914 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21178
slot update_slots: id 0 | task 5914 | n_past = 21157, slot.prompt.tokens.size() = 21162, seq_id = 0, pos_min = 21161, n_swa = 1
slot update_slots: id 0 | task 5914 | restored context checkpoint (pos_min = 21065, pos_max = 21065, size = 75.376 MiB)
slot update_slots: id 0 | task 5914 | n_tokens = 21066, memory_seq_rm [21066, end)
slot update_slots: id 0 | task 5914 | prompt processing progress, n_tokens = 21114, batch.n_tokens = 48, progress = 0.996978
slot update_slots: id 0 | task 5914 | n_tokens = 21114, memory_seq_rm [21114, end)
slot update_slots: id 0 | task 5914 | prompt processing progress, n_tokens = 21178, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 5914 | prompt done, n_tokens = 21178, batch.n_tokens = 64
slot init_sampler: id 0 | task 5914 | init sampler, took 1.64 ms, tokens: text = 21178, total = 21178
slot print_timing: id 0 | task 5914 |
prompt eval time = 1830.46 ms / 112 tokens ( 16.34 ms per token, 61.19 tokens per second)
eval time = 3386.87 ms / 58 tokens ( 58.39 ms per token, 17.12 tokens per second)
total time = 5217.32 ms / 170 tokens
slot release: id 0 | task 5914 | stop processing: n_tokens = 21235, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 5974 | processing task, is_child = 0
slot update_slots: id 0 | task 5974 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21259
slot update_slots: id 0 | task 5974 | n_tokens = 21235, memory_seq_rm [21235, end)
slot update_slots: id 0 | task 5974 | prompt processing progress, n_tokens = 21259, batch.n_tokens = 24, progress = 1.000000
slot update_slots: id 0 | task 5974 | prompt done, n_tokens = 21259, batch.n_tokens = 24
slot init_sampler: id 0 | task 5974 | init sampler, took 1.72 ms, tokens: text = 21259, total = 21259
slot update_slots: id 0 | task 5974 | created context checkpoint 5 of 8 (pos_min = 21234, pos_max = 21234, size = 75.376 MiB)
slot print_timing: id 0 | task 5974 |
prompt eval time = 375.50 ms / 24 tokens ( 15.65 ms per token, 63.91 tokens per second)
eval time = 4046.66 ms / 67 tokens ( 60.40 ms per token, 16.56 tokens per second)
total time = 4422.16 ms / 91 tokens
slot release: id 0 | task 5974 | stop processing: n_tokens = 21325, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6042 | processing task, is_child = 0
slot update_slots: id 0 | task 6042 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21353
slot update_slots: id 0 | task 6042 | n_tokens = 21325, memory_seq_rm [21325, end)
slot update_slots: id 0 | task 6042 | prompt processing progress, n_tokens = 21353, batch.n_tokens = 28, progress = 1.000000
slot update_slots: id 0 | task 6042 | prompt done, n_tokens = 21353, batch.n_tokens = 28
slot init_sampler: id 0 | task 6042 | init sampler, took 1.75 ms, tokens: text = 21353, total = 21353
slot update_slots: id 0 | task 6042 | created context checkpoint 6 of 8 (pos_min = 21324, pos_max = 21324, size = 75.376 MiB)
slot print_timing: id 0 | task 6042 |
prompt eval time = 435.48 ms / 28 tokens ( 15.55 ms per token, 64.30 tokens per second)
eval time = 3561.99 ms / 62 tokens ( 57.45 ms per token, 17.41 tokens per second)
total time = 3997.47 ms / 90 tokens
slot release: id 0 | task 6042 | stop processing: n_tokens = 21414, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6105 | processing task, is_child = 0
slot update_slots: id 0 | task 6105 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21441
slot update_slots: id 0 | task 6105 | n_tokens = 21414, memory_seq_rm [21414, end)
slot update_slots: id 0 | task 6105 | prompt processing progress, n_tokens = 21441, batch.n_tokens = 27, progress = 1.000000
slot update_slots: id 0 | task 6105 | prompt done, n_tokens = 21441, batch.n_tokens = 27
slot init_sampler: id 0 | task 6105 | init sampler, took 1.71 ms, tokens: text = 21441, total = 21441
slot update_slots: id 0 | task 6105 | created context checkpoint 7 of 8 (pos_min = 21413, pos_max = 21413, size = 75.376 MiB)
slot print_timing: id 0 | task 6105 |
prompt eval time = 409.14 ms / 27 tokens ( 15.15 ms per token, 65.99 tokens per second)
eval time = 3569.16 ms / 61 tokens ( 58.51 ms per token, 17.09 tokens per second)
total time = 3978.29 ms / 88 tokens
slot release: id 0 | task 6105 | stop processing: n_tokens = 21501, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6167 | processing task, is_child = 0
slot update_slots: id 0 | task 6167 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21527
slot update_slots: id 0 | task 6167 | n_tokens = 21501, memory_seq_rm [21501, end)
slot update_slots: id 0 | task 6167 | prompt processing progress, n_tokens = 21527, batch.n_tokens = 26, progress = 1.000000
slot update_slots: id 0 | task 6167 | prompt done, n_tokens = 21527, batch.n_tokens = 26
slot init_sampler: id 0 | task 6167 | init sampler, took 1.91 ms, tokens: text = 21527, total = 21527
slot update_slots: id 0 | task 6167 | created context checkpoint 8 of 8 (pos_min = 21500, pos_max = 21500, size = 75.376 MiB)
slot print_timing: id 0 | task 6167 |
prompt eval time = 390.64 ms / 26 tokens ( 15.02 ms per token, 66.56 tokens per second)
eval time = 3449.90 ms / 59 tokens ( 58.47 ms per token, 17.10 tokens per second)
total time = 3840.54 ms / 85 tokens
slot release: id 0 | task 6167 | stop processing: n_tokens = 21585, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6227 | processing task, is_child = 0
slot update_slots: id 0 | task 6227 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21610
slot update_slots: id 0 | task 6227 | n_tokens = 21585, memory_seq_rm [21585, end)
slot update_slots: id 0 | task 6227 | prompt processing progress, n_tokens = 21610, batch.n_tokens = 25, progress = 1.000000
slot update_slots: id 0 | task 6227 | prompt done, n_tokens = 21610, batch.n_tokens = 25
slot init_sampler: id 0 | task 6227 | init sampler, took 1.74 ms, tokens: text = 21610, total = 21610
slot update_slots: id 0 | task 6227 | erasing old context checkpoint (pos_min = 16897, pos_max = 16897, size = 75.376 MiB)
slot update_slots: id 0 | task 6227 | created context checkpoint 8 of 8 (pos_min = 21584, pos_max = 21584, size = 75.376 MiB)
slot print_timing: id 0 | task 6227 |
prompt eval time = 621.28 ms / 25 tokens ( 24.85 ms per token, 40.24 tokens per second)
eval time = 3198.08 ms / 56 tokens ( 57.11 ms per token, 17.51 tokens per second)
total time = 3819.36 ms / 81 tokens
slot release: id 0 | task 6227 | stop processing: n_tokens = 21665, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6284 | processing task, is_child = 0
slot update_slots: id 0 | task 6284 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21690
slot update_slots: id 0 | task 6284 | n_tokens = 21665, memory_seq_rm [21665, end)
slot update_slots: id 0 | task 6284 | prompt processing progress, n_tokens = 21690, batch.n_tokens = 25, progress = 1.000000
slot update_slots: id 0 | task 6284 | prompt done, n_tokens = 21690, batch.n_tokens = 25
slot init_sampler: id 0 | task 6284 | init sampler, took 1.72 ms, tokens: text = 21690, total = 21690
slot update_slots: id 0 | task 6284 | erasing old context checkpoint (pos_min = 17700, pos_max = 17700, size = 75.376 MiB)
slot update_slots: id 0 | task 6284 | created context checkpoint 8 of 8 (pos_min = 21664, pos_max = 21664, size = 75.376 MiB)
slot print_timing: id 0 | task 6284 |
prompt eval time = 750.75 ms / 25 tokens ( 30.03 ms per token, 33.30 tokens per second)
eval time = 3213.78 ms / 55 tokens ( 58.43 ms per token, 17.11 tokens per second)
total time = 3964.53 ms / 80 tokens
slot release: id 0 | task 6284 | stop processing: n_tokens = 21744, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6340 | processing task, is_child = 0
slot update_slots: id 0 | task 6340 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21768
slot update_slots: id 0 | task 6340 | n_tokens = 21744, memory_seq_rm [21744, end)
slot update_slots: id 0 | task 6340 | prompt processing progress, n_tokens = 21768, batch.n_tokens = 24, progress = 1.000000
slot update_slots: id 0 | task 6340 | prompt done, n_tokens = 21768, batch.n_tokens = 24
slot init_sampler: id 0 | task 6340 | init sampler, took 1.77 ms, tokens: text = 21768, total = 21768
slot update_slots: id 0 | task 6340 | erasing old context checkpoint (pos_min = 18416, pos_max = 18416, size = 75.376 MiB)
slot update_slots: id 0 | task 6340 | created context checkpoint 8 of 8 (pos_min = 21743, pos_max = 21743, size = 75.376 MiB)
slot print_timing: id 0 | task 6340 |
prompt eval time = 363.60 ms / 24 tokens ( 15.15 ms per token, 66.01 tokens per second)
eval time = 3446.97 ms / 56 tokens ( 61.55 ms per token, 16.25 tokens per second)
total time = 3810.56 ms / 80 tokens
slot release: id 0 | task 6340 | stop processing: n_tokens = 21823, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6397 | processing task, is_child = 0
slot update_slots: id 0 | task 6397 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21847
slot update_slots: id 0 | task 6397 | n_tokens = 21823, memory_seq_rm [21823, end)
slot update_slots: id 0 | task 6397 | prompt processing progress, n_tokens = 21847, batch.n_tokens = 24, progress = 1.000000
slot update_slots: id 0 | task 6397 | prompt done, n_tokens = 21847, batch.n_tokens = 24
slot init_sampler: id 0 | task 6397 | init sampler, took 1.80 ms, tokens: text = 21847, total = 21847
slot update_slots: id 0 | task 6397 | erasing old context checkpoint (pos_min = 21065, pos_max = 21065, size = 75.376 MiB)
slot update_slots: id 0 | task 6397 | created context checkpoint 8 of 8 (pos_min = 21822, pos_max = 21822, size = 75.376 MiB)
slot print_timing: id 0 | task 6397 |
prompt eval time = 374.63 ms / 24 tokens ( 15.61 ms per token, 64.06 tokens per second)
eval time = 3437.24 ms / 58 tokens ( 59.26 ms per token, 16.87 tokens per second)
total time = 3811.86 ms / 82 tokens
slot release: id 0 | task 6397 | stop processing: n_tokens = 21904, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6456 | processing task, is_child = 0
slot update_slots: id 0 | task 6456 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 21928
slot update_slots: id 0 | task 6456 | n_tokens = 21904, memory_seq_rm [21904, end)
slot update_slots: id 0 | task 6456 | prompt processing progress, n_tokens = 21928, batch.n_tokens = 24, progress = 1.000000
slot update_slots: id 0 | task 6456 | prompt done, n_tokens = 21928, batch.n_tokens = 24
slot init_sampler: id 0 | task 6456 | init sampler, took 1.68 ms, tokens: text = 21928, total = 21928
slot update_slots: id 0 | task 6456 | erasing old context checkpoint (pos_min = 21234, pos_max = 21234, size = 75.376 MiB)
slot update_slots: id 0 | task 6456 | created context checkpoint 8 of 8 (pos_min = 21903, pos_max = 21903, size = 75.376 MiB)
slot print_timing: id 0 | task 6456 |
prompt eval time = 422.02 ms / 24 tokens ( 17.58 ms per token, 56.87 tokens per second)
eval time = 3360.58 ms / 58 tokens ( 57.94 ms per token, 17.26 tokens per second)
total time = 3782.60 ms / 82 tokens
slot release: id 0 | task 6456 | stop processing: n_tokens = 21985, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6515 | processing task, is_child = 0
slot update_slots: id 0 | task 6515 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 22010
slot update_slots: id 0 | task 6515 | n_tokens = 21985, memory_seq_rm [21985, end)
slot update_slots: id 0 | task 6515 | prompt processing progress, n_tokens = 22010, batch.n_tokens = 25, progress = 1.000000
slot update_slots: id 0 | task 6515 | prompt done, n_tokens = 22010, batch.n_tokens = 25
slot init_sampler: id 0 | task 6515 | init sampler, took 1.79 ms, tokens: text = 22010, total = 22010
slot update_slots: id 0 | task 6515 | erasing old context checkpoint (pos_min = 21324, pos_max = 21324, size = 75.376 MiB)
slot update_slots: id 0 | task 6515 | created context checkpoint 8 of 8 (pos_min = 21984, pos_max = 21984, size = 75.376 MiB)
slot print_timing: id 0 | task 6515 |
prompt eval time = 383.56 ms / 25 tokens ( 15.34 ms per token, 65.18 tokens per second)
eval time = 3391.38 ms / 57 tokens ( 59.50 ms per token, 16.81 tokens per second)
total time = 3774.94 ms / 82 tokens
slot release: id 0 | task 6515 | stop processing: n_tokens = 22066, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6573 | processing task, is_child = 0
slot update_slots: id 0 | task 6573 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 22092
slot update_slots: id 0 | task 6573 | n_tokens = 22066, memory_seq_rm [22066, end)
slot update_slots: id 0 | task 6573 | prompt processing progress, n_tokens = 22092, batch.n_tokens = 26, progress = 1.000000
slot update_slots: id 0 | task 6573 | prompt done, n_tokens = 22092, batch.n_tokens = 26
slot init_sampler: id 0 | task 6573 | init sampler, took 1.82 ms, tokens: text = 22092, total = 22092
slot update_slots: id 0 | task 6573 | erasing old context checkpoint (pos_min = 21413, pos_max = 21413, size = 75.376 MiB)
slot update_slots: id 0 | task 6573 | created context checkpoint 8 of 8 (pos_min = 22065, pos_max = 22065, size = 75.376 MiB)
slot print_timing: id 0 | task 6573 |
prompt eval time = 413.38 ms / 26 tokens ( 15.90 ms per token, 62.90 tokens per second)
eval time = 3271.08 ms / 54 tokens ( 60.58 ms per token, 16.51 tokens per second)
total time = 3684.46 ms / 80 tokens
slot release: id 0 | task 6573 | stop processing: n_tokens = 22145, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6628 | processing task, is_child = 0
slot update_slots: id 0 | task 6628 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 22171
slot update_slots: id 0 | task 6628 | n_tokens = 22145, memory_seq_rm [22145, end)
slot update_slots: id 0 | task 6628 | prompt processing progress, n_tokens = 22171, batch.n_tokens = 26, progress = 1.000000
slot update_slots: id 0 | task 6628 | prompt done, n_tokens = 22171, batch.n_tokens = 26
slot init_sampler: id 0 | task 6628 | init sampler, took 1.83 ms, tokens: text = 22171, total = 22171
slot update_slots: id 0 | task 6628 | erasing old context checkpoint (pos_min = 21500, pos_max = 21500, size = 75.376 MiB)
slot update_slots: id 0 | task 6628 | created context checkpoint 8 of 8 (pos_min = 22144, pos_max = 22144, size = 75.376 MiB)
slot print_timing: id 0 | task 6628 |
prompt eval time = 417.88 ms / 26 tokens ( 16.07 ms per token, 62.22 tokens per second)
eval time = 2850.32 ms / 49 tokens ( 58.17 ms per token, 17.19 tokens per second)
total time = 3268.19 ms / 75 tokens
slot release: id 0 | task 6628 | stop processing: n_tokens = 22219, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6678 | processing task, is_child = 0
slot update_slots: id 0 | task 6678 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 22272
slot update_slots: id 0 | task 6678 | n_past = 22209, slot.prompt.tokens.size() = 22219, seq_id = 0, pos_min = 22218, n_swa = 1
slot update_slots: id 0 | task 6678 | restored context checkpoint (pos_min = 22144, pos_max = 22144, size = 75.376 MiB)
slot update_slots: id 0 | task 6678 | n_tokens = 22145, memory_seq_rm [22145, end)
slot update_slots: id 0 | task 6678 | prompt processing progress, n_tokens = 22208, batch.n_tokens = 63, progress = 0.997126
slot update_slots: id 0 | task 6678 | n_tokens = 22208, memory_seq_rm [22208, end)
slot update_slots: id 0 | task 6678 | prompt processing progress, n_tokens = 22272, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 6678 | prompt done, n_tokens = 22272, batch.n_tokens = 64
slot init_sampler: id 0 | task 6678 | init sampler, took 1.71 ms, tokens: text = 22272, total = 22272
slot print_timing: id 0 | task 6678 |
prompt eval time = 2156.74 ms / 127 tokens ( 16.98 ms per token, 58.89 tokens per second)
eval time = 1729.32 ms / 33 tokens ( 52.40 ms per token, 19.08 tokens per second)
total time = 3886.06 ms / 160 tokens
slot release: id 0 | task 6678 | stop processing: n_tokens = 22304, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6713 | processing task, is_child = 0
slot update_slots: id 0 | task 6713 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 22357
slot update_slots: id 0 | task 6713 | n_past = 22294, slot.prompt.tokens.size() = 22304, seq_id = 0, pos_min = 22303, n_swa = 1
slot update_slots: id 0 | task 6713 | restored context checkpoint (pos_min = 22144, pos_max = 22144, size = 75.376 MiB)
slot update_slots: id 0 | task 6713 | n_tokens = 22145, memory_seq_rm [22145, end)
slot update_slots: id 0 | task 6713 | prompt processing progress, n_tokens = 22293, batch.n_tokens = 148, progress = 0.997137
slot update_slots: id 0 | task 6713 | n_tokens = 22293, memory_seq_rm [22293, end)
slot update_slots: id 0 | task 6713 | prompt processing progress, n_tokens = 22357, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 6713 | prompt done, n_tokens = 22357, batch.n_tokens = 64
slot init_sampler: id 0 | task 6713 | init sampler, took 1.81 ms, tokens: text = 22357, total = 22357
slot update_slots: id 0 | task 6713 | erasing old context checkpoint (pos_min = 21584, pos_max = 21584, size = 75.376 MiB)
slot update_slots: id 0 | task 6713 | created context checkpoint 8 of 8 (pos_min = 22292, pos_max = 22292, size = 75.376 MiB)
slot print_timing: id 0 | task 6713 |
prompt eval time = 2757.89 ms / 212 tokens ( 13.01 ms per token, 76.87 tokens per second)
eval time = 2032.58 ms / 34 tokens ( 59.78 ms per token, 16.73 tokens per second)
total time = 4790.47 ms / 246 tokens
slot release: id 0 | task 6713 | stop processing: n_tokens = 22390, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.998 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6749 | processing task, is_child = 0
slot update_slots: id 0 | task 6749 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 22443
slot update_slots: id 0 | task 6749 | n_tokens = 22390, memory_seq_rm [22390, end)
slot update_slots: id 0 | task 6749 | prompt processing progress, n_tokens = 22443, batch.n_tokens = 53, progress = 1.000000
slot update_slots: id 0 | task 6749 | prompt done, n_tokens = 22443, batch.n_tokens = 53
slot init_sampler: id 0 | task 6749 | init sampler, took 1.76 ms, tokens: text = 22443, total = 22443
slot update_slots: id 0 | task 6749 | erasing old context checkpoint (pos_min = 21664, pos_max = 21664, size = 75.376 MiB)
slot update_slots: id 0 | task 6749 | created context checkpoint 8 of 8 (pos_min = 22389, pos_max = 22389, size = 75.376 MiB)
slot print_timing: id 0 | task 6749 |
prompt eval time = 1048.08 ms / 53 tokens ( 19.78 ms per token, 50.57 tokens per second)
eval time = 5528.37 ms / 88 tokens ( 62.82 ms per token, 15.92 tokens per second)
total time = 6576.45 ms / 141 tokens
slot release: id 0 | task 6749 | stop processing: n_tokens = 22530, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 12378895599
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 22530, total state size = 603.681 MiB
srv params_from_: Chat format: Qwen3 Coder
srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.057
srv update: - cache state: 7 prompts, 3663.218 MiB (limits: 8192.000 MiB, 32768 tokens, 119323 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426ad9d890: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e424eb6dbb0: 15746 tokens, checkpoints: 7, 972.234 MiB
srv update: - prompt 0x5e426816e6d0: 13615 tokens, checkpoints: 6, 846.889 MiB
srv update: - prompt 0x5e426850d0e0: 22530 tokens, checkpoints: 8, 1206.688 MiB
srv get_availabl: prompt cache update took 6022.52 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6838 | processing task, is_child = 0
slot update_slots: id 0 | task 6838 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 350
slot update_slots: id 0 | task 6838 | n_past = 20, slot.prompt.tokens.size() = 22530, seq_id = 0, pos_min = 22529, n_swa = 1
slot update_slots: id 0 | task 6838 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 6838 | erased invalidated context checkpoint (pos_min = 21743, pos_max = 21743, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 6838 | erased invalidated context checkpoint (pos_min = 21822, pos_max = 21822, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 6838 | erased invalidated context checkpoint (pos_min = 21903, pos_max = 21903, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 6838 | erased invalidated context checkpoint (pos_min = 21984, pos_max = 21984, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 6838 | erased invalidated context checkpoint (pos_min = 22065, pos_max = 22065, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 6838 | erased invalidated context checkpoint (pos_min = 22144, pos_max = 22144, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 6838 | erased invalidated context checkpoint (pos_min = 22292, pos_max = 22292, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 6838 | erased invalidated context checkpoint (pos_min = 22389, pos_max = 22389, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 6838 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 6838 | prompt processing progress, n_tokens = 286, batch.n_tokens = 286, progress = 0.817143
slot update_slots: id 0 | task 6838 | n_tokens = 286, memory_seq_rm [286, end)
slot update_slots: id 0 | task 6838 | prompt processing progress, n_tokens = 350, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 6838 | prompt done, n_tokens = 350, batch.n_tokens = 64
slot init_sampler: id 0 | task 6838 | init sampler, took 0.03 ms, tokens: text = 350, total = 350
slot update_slots: id 0 | task 6838 | created context checkpoint 1 of 8 (pos_min = 285, pos_max = 285, size = 75.376 MiB)
slot print_timing: id 0 | task 6838 |
prompt eval time = 11793.55 ms / 350 tokens ( 33.70 ms per token, 29.68 tokens per second)
eval time = 1458.87 ms / 23 tokens ( 63.43 ms per token, 15.77 tokens per second)
total time = 13252.42 ms / 373 tokens
slot release: id 0 | task 6838 | stop processing: n_tokens = 372, truncated = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 12398394801
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 372, total state size = 84.099 MiB
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv load: - looking for better prompt, base f_keep = 0.054, sim = 0.001
srv load: - found better prompt with f_keep = 1.000, sim = 0.999
srv update: - cache state: 7 prompts, 2616.005 MiB (limits: 8192.000 MiB, 32768 tokens, 97702 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426ad9d890: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e424eb6dbb0: 15746 tokens, checkpoints: 7, 972.234 MiB
srv update: - prompt 0x5e426816e6d0: 13615 tokens, checkpoints: 6, 846.889 MiB
srv update: - prompt 0x5e424eb0ef50: 372 tokens, checkpoints: 1, 159.475 MiB
srv get_availabl: prompt cache update took 846.78 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6839 | processing task, is_child = 0
slot update_slots: id 0 | task 6839 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 22543
slot update_slots: id 0 | task 6839 | n_tokens = 22530, memory_seq_rm [22530, end)
slot update_slots: id 0 | task 6839 | prompt processing progress, n_tokens = 22543, batch.n_tokens = 13, progress = 1.000000
slot update_slots: id 0 | task 6839 | prompt done, n_tokens = 22543, batch.n_tokens = 13
slot init_sampler: id 0 | task 6839 | init sampler, took 1.71 ms, tokens: text = 22543, total = 22543
slot update_slots: id 0 | task 6839 | erasing old context checkpoint (pos_min = 21743, pos_max = 21743, size = 75.376 MiB)
slot update_slots: id 0 | task 6839 | created context checkpoint 8 of 8 (pos_min = 22529, pos_max = 22529, size = 75.376 MiB)
slot print_timing: id 0 | task 6839 |
prompt eval time = 480.65 ms / 13 tokens ( 36.97 ms per token, 27.05 tokens per second)
eval time = 5347.05 ms / 81 tokens ( 66.01 ms per token, 15.15 tokens per second)
total time = 5827.69 ms / 94 tokens
slot release: id 0 | task 6839 | stop processing: n_tokens = 22623, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.995 (> 0.100 thold), f_keep = 0.998
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6945 | processing task, is_child = 0
slot update_slots: id 0 | task 6945 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 22688
slot update_slots: id 0 | task 6945 | n_past = 22580, slot.prompt.tokens.size() = 22623, seq_id = 0, pos_min = 22622, n_swa = 1
slot update_slots: id 0 | task 6945 | restored context checkpoint (pos_min = 22529, pos_max = 22529, size = 75.376 MiB)
slot update_slots: id 0 | task 6945 | n_tokens = 22530, memory_seq_rm [22530, end)
slot update_slots: id 0 | task 6945 | prompt processing progress, n_tokens = 22624, batch.n_tokens = 94, progress = 0.997179
slot update_slots: id 0 | task 6945 | n_tokens = 22624, memory_seq_rm [22624, end)
slot update_slots: id 0 | task 6945 | prompt processing progress, n_tokens = 22688, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 6945 | prompt done, n_tokens = 22688, batch.n_tokens = 64
slot init_sampler: id 0 | task 6945 | init sampler, took 1.76 ms, tokens: text = 22688, total = 22688
slot update_slots: id 0 | task 6945 | erasing old context checkpoint (pos_min = 21822, pos_max = 21822, size = 75.376 MiB)
slot update_slots: id 0 | task 6945 | created context checkpoint 8 of 8 (pos_min = 22623, pos_max = 22623, size = 75.376 MiB)
slot print_timing: id 0 | task 6945 |
prompt eval time = 2796.67 ms / 158 tokens ( 17.70 ms per token, 56.50 tokens per second)
eval time = 2716.46 ms / 46 tokens ( 59.05 ms per token, 16.93 tokens per second)
total time = 5513.13 ms / 204 tokens
slot release: id 0 | task 6945 | stop processing: n_tokens = 22733, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 6993 | processing task, is_child = 0
slot update_slots: id 0 | task 6993 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 22766
slot update_slots: id 0 | task 6993 | n_tokens = 22733, memory_seq_rm [22733, end)
slot update_slots: id 0 | task 6993 | prompt processing progress, n_tokens = 22766, batch.n_tokens = 33, progress = 1.000000
slot update_slots: id 0 | task 6993 | prompt done, n_tokens = 22766, batch.n_tokens = 33
slot init_sampler: id 0 | task 6993 | init sampler, took 1.88 ms, tokens: text = 22766, total = 22766
slot update_slots: id 0 | task 6993 | erasing old context checkpoint (pos_min = 21903, pos_max = 21903, size = 75.376 MiB)
slot update_slots: id 0 | task 6993 | created context checkpoint 8 of 8 (pos_min = 22732, pos_max = 22732, size = 75.376 MiB)
slot print_timing: id 0 | task 6993 |
prompt eval time = 1262.12 ms / 33 tokens ( 38.25 ms per token, 26.15 tokens per second)
eval time = 55410.56 ms / 902 tokens ( 61.43 ms per token, 16.28 tokens per second)
total time = 56672.67 ms / 935 tokens
slot release: id 0 | task 6993 | stop processing: n_tokens = 23667, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.996 (> 0.100 thold), f_keep = 0.998
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 7896 | processing task, is_child = 0
slot update_slots: id 0 | task 7896 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 23730
slot update_slots: id 0 | task 7896 | n_past = 23625, slot.prompt.tokens.size() = 23667, seq_id = 0, pos_min = 23666, n_swa = 1
slot update_slots: id 0 | task 7896 | restored context checkpoint (pos_min = 22732, pos_max = 22732, size = 75.376 MiB)
slot update_slots: id 0 | task 7896 | n_tokens = 22733, memory_seq_rm [22733, end)
slot update_slots: id 0 | task 7896 | prompt processing progress, n_tokens = 23666, batch.n_tokens = 933, progress = 0.997303
slot update_slots: id 0 | task 7896 | n_tokens = 23666, memory_seq_rm [23666, end)
slot update_slots: id 0 | task 7896 | prompt processing progress, n_tokens = 23730, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 7896 | prompt done, n_tokens = 23730, batch.n_tokens = 64
slot init_sampler: id 0 | task 7896 | init sampler, took 1.85 ms, tokens: text = 23730, total = 23730
slot update_slots: id 0 | task 7896 | erasing old context checkpoint (pos_min = 21984, pos_max = 21984, size = 75.376 MiB)
slot update_slots: id 0 | task 7896 | created context checkpoint 8 of 8 (pos_min = 23665, pos_max = 23665, size = 75.376 MiB)
slot print_timing: id 0 | task 7896 |
prompt eval time = 5967.17 ms / 997 tokens ( 5.99 ms per token, 167.08 tokens per second)
eval time = 11417.55 ms / 196 tokens ( 58.25 ms per token, 17.17 tokens per second)
total time = 17384.72 ms / 1193 tokens
slot release: id 0 | task 7896 | stop processing: n_tokens = 23925, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 8094 | processing task, is_child = 0
slot update_slots: id 0 | task 8094 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 23957
slot update_slots: id 0 | task 8094 | n_tokens = 23925, memory_seq_rm [23925, end)
slot update_slots: id 0 | task 8094 | prompt processing progress, n_tokens = 23957, batch.n_tokens = 32, progress = 1.000000
slot update_slots: id 0 | task 8094 | prompt done, n_tokens = 23957, batch.n_tokens = 32
slot init_sampler: id 0 | task 8094 | init sampler, took 1.89 ms, tokens: text = 23957, total = 23957
slot update_slots: id 0 | task 8094 | erasing old context checkpoint (pos_min = 22065, pos_max = 22065, size = 75.376 MiB)
slot update_slots: id 0 | task 8094 | created context checkpoint 8 of 8 (pos_min = 23924, pos_max = 23924, size = 75.376 MiB)
slot print_timing: id 0 | task 8094 |
prompt eval time = 811.30 ms / 32 tokens ( 25.35 ms per token, 39.44 tokens per second)
eval time = 77493.44 ms / 1302 tokens ( 59.52 ms per token, 16.80 tokens per second)
total time = 78304.74 ms / 1334 tokens
slot release: id 0 | task 8094 | stop processing: n_tokens = 25258, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 9397 | processing task, is_child = 0
slot update_slots: id 0 | task 9397 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 25289
slot update_slots: id 0 | task 9397 | n_tokens = 25258, memory_seq_rm [25258, end)
slot update_slots: id 0 | task 9397 | prompt processing progress, n_tokens = 25289, batch.n_tokens = 31, progress = 1.000000
slot update_slots: id 0 | task 9397 | prompt done, n_tokens = 25289, batch.n_tokens = 31
slot init_sampler: id 0 | task 9397 | init sampler, took 2.03 ms, tokens: text = 25289, total = 25289
slot update_slots: id 0 | task 9397 | erasing old context checkpoint (pos_min = 22144, pos_max = 22144, size = 75.376 MiB)
slot update_slots: id 0 | task 9397 | created context checkpoint 8 of 8 (pos_min = 25257, pos_max = 25257, size = 75.376 MiB)
slot print_timing: id 0 | task 9397 |
prompt eval time = 454.77 ms / 31 tokens ( 14.67 ms per token, 68.17 tokens per second)
eval time = 17713.42 ms / 309 tokens ( 57.32 ms per token, 17.44 tokens per second)
total time = 18168.19 ms / 340 tokens
slot release: id 0 | task 9397 | stop processing: n_tokens = 25597, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.989 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 9707 | processing task, is_child = 0
slot update_slots: id 0 | task 9707 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 25887
slot update_slots: id 0 | task 9707 | n_tokens = 25597, memory_seq_rm [25597, end)
slot update_slots: id 0 | task 9707 | prompt processing progress, n_tokens = 25823, batch.n_tokens = 226, progress = 0.997528
slot update_slots: id 0 | task 9707 | n_tokens = 25823, memory_seq_rm [25823, end)
slot update_slots: id 0 | task 9707 | prompt processing progress, n_tokens = 25887, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 9707 | prompt done, n_tokens = 25887, batch.n_tokens = 64
slot init_sampler: id 0 | task 9707 | init sampler, took 2.05 ms, tokens: text = 25887, total = 25887
slot update_slots: id 0 | task 9707 | erasing old context checkpoint (pos_min = 22292, pos_max = 22292, size = 75.376 MiB)
slot update_slots: id 0 | task 9707 | created context checkpoint 8 of 8 (pos_min = 25822, pos_max = 25822, size = 75.376 MiB)
slot print_timing: id 0 | task 9707 |
prompt eval time = 3431.64 ms / 290 tokens ( 11.83 ms per token, 84.51 tokens per second)
eval time = 5636.36 ms / 92 tokens ( 61.26 ms per token, 16.32 tokens per second)
total time = 9068.00 ms / 382 tokens
slot release: id 0 | task 9707 | stop processing: n_tokens = 25978, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 9801 | processing task, is_child = 0
slot update_slots: id 0 | task 9801 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 26009
slot update_slots: id 0 | task 9801 | n_tokens = 25978, memory_seq_rm [25978, end)
slot update_slots: id 0 | task 9801 | prompt processing progress, n_tokens = 26009, batch.n_tokens = 31, progress = 1.000000
slot update_slots: id 0 | task 9801 | prompt done, n_tokens = 26009, batch.n_tokens = 31
slot init_sampler: id 0 | task 9801 | init sampler, took 2.08 ms, tokens: text = 26009, total = 26009
slot update_slots: id 0 | task 9801 | erasing old context checkpoint (pos_min = 22389, pos_max = 22389, size = 75.376 MiB)
slot update_slots: id 0 | task 9801 | created context checkpoint 8 of 8 (pos_min = 25977, pos_max = 25977, size = 75.376 MiB)
slot print_timing: id 0 | task 9801 |
prompt eval time = 374.53 ms / 31 tokens ( 12.08 ms per token, 82.77 tokens per second)
eval time = 42089.66 ms / 709 tokens ( 59.36 ms per token, 16.84 tokens per second)
total time = 42464.19 ms / 740 tokens
slot release: id 0 | task 9801 | stop processing: n_tokens = 26717, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 10511 | processing task, is_child = 0
slot update_slots: id 0 | task 10511 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 26747
slot update_slots: id 0 | task 10511 | n_tokens = 26717, memory_seq_rm [26717, end)
slot update_slots: id 0 | task 10511 | prompt processing progress, n_tokens = 26747, batch.n_tokens = 30, progress = 1.000000
slot update_slots: id 0 | task 10511 | prompt done, n_tokens = 26747, batch.n_tokens = 30
slot init_sampler: id 0 | task 10511 | init sampler, took 2.25 ms, tokens: text = 26747, total = 26747
slot update_slots: id 0 | task 10511 | erasing old context checkpoint (pos_min = 22529, pos_max = 22529, size = 75.376 MiB)
slot update_slots: id 0 | task 10511 | created context checkpoint 8 of 8 (pos_min = 26716, pos_max = 26716, size = 75.376 MiB)
slot print_timing: id 0 | task 10511 |
prompt eval time = 623.52 ms / 30 tokens ( 20.78 ms per token, 48.11 tokens per second)
eval time = 102561.43 ms / 1737 tokens ( 59.05 ms per token, 16.94 tokens per second)
total time = 103184.96 ms / 1767 tokens
slot release: id 0 | task 10511 | stop processing: n_tokens = 28483, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 12249 | processing task, is_child = 0
slot update_slots: id 0 | task 12249 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 28514
slot update_slots: id 0 | task 12249 | n_tokens = 28483, memory_seq_rm [28483, end)
slot update_slots: id 0 | task 12249 | prompt processing progress, n_tokens = 28514, batch.n_tokens = 31, progress = 1.000000
slot update_slots: id 0 | task 12249 | prompt done, n_tokens = 28514, batch.n_tokens = 31
slot init_sampler: id 0 | task 12249 | init sampler, took 2.26 ms, tokens: text = 28514, total = 28514
slot update_slots: id 0 | task 12249 | erasing old context checkpoint (pos_min = 22623, pos_max = 22623, size = 75.376 MiB)
slot update_slots: id 0 | task 12249 | created context checkpoint 8 of 8 (pos_min = 28482, pos_max = 28482, size = 75.376 MiB)
slot print_timing: id 0 | task 12249 |
prompt eval time = 566.94 ms / 31 tokens ( 18.29 ms per token, 54.68 tokens per second)
eval time = 74037.91 ms / 1242 tokens ( 59.61 ms per token, 16.78 tokens per second)
total time = 74604.85 ms / 1273 tokens
slot release: id 0 | task 12249 | stop processing: n_tokens = 29755, truncated = 0
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv update_slots: all slots are idle
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 13492 | processing task, is_child = 0
slot update_slots: id 0 | task 13492 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 29786
slot update_slots: id 0 | task 13492 | n_tokens = 29755, memory_seq_rm [29755, end)
slot update_slots: id 0 | task 13492 | prompt processing progress, n_tokens = 29786, batch.n_tokens = 31, progress = 1.000000
slot update_slots: id 0 | task 13492 | prompt done, n_tokens = 29786, batch.n_tokens = 31
slot init_sampler: id 0 | task 13492 | init sampler, took 2.40 ms, tokens: text = 29786, total = 29786
slot update_slots: id 0 | task 13492 | erasing old context checkpoint (pos_min = 22732, pos_max = 22732, size = 75.376 MiB)
slot update_slots: id 0 | task 13492 | created context checkpoint 8 of 8 (pos_min = 29754, pos_max = 29754, size = 75.376 MiB)
slot print_timing: id 0 | task 13492 |
prompt eval time = 739.31 ms / 31 tokens ( 23.85 ms per token, 41.93 tokens per second)
eval time = 69617.41 ms / 1153 tokens ( 60.38 ms per token, 16.56 tokens per second)
total time = 70356.72 ms / 1184 tokens
slot release: id 0 | task 13492 | stop processing: n_tokens = 30938, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 14646 | processing task, is_child = 0
slot update_slots: id 0 | task 14646 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 30969
slot update_slots: id 0 | task 14646 | n_tokens = 30938, memory_seq_rm [30938, end)
slot update_slots: id 0 | task 14646 | prompt processing progress, n_tokens = 30969, batch.n_tokens = 31, progress = 1.000000
slot update_slots: id 0 | task 14646 | prompt done, n_tokens = 30969, batch.n_tokens = 31
slot init_sampler: id 0 | task 14646 | init sampler, took 2.55 ms, tokens: text = 30969, total = 30969
slot update_slots: id 0 | task 14646 | erasing old context checkpoint (pos_min = 23665, pos_max = 23665, size = 75.376 MiB)
slot update_slots: id 0 | task 14646 | created context checkpoint 8 of 8 (pos_min = 30937, pos_max = 30937, size = 75.376 MiB)
slot print_timing: id 0 | task 14646 |
prompt eval time = 869.43 ms / 31 tokens ( 28.05 ms per token, 35.66 tokens per second)
eval time = 91165.07 ms / 1465 tokens ( 62.23 ms per token, 16.07 tokens per second)
total time = 92034.51 ms / 1496 tokens
slot release: id 0 | task 14646 | stop processing: n_tokens = 32433, truncated = 0
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv update_slots: all slots are idle
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 16112 | processing task, is_child = 0
slot update_slots: id 0 | task 16112 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 32464
slot update_slots: id 0 | task 16112 | n_tokens = 32433, memory_seq_rm [32433, end)
slot update_slots: id 0 | task 16112 | prompt processing progress, n_tokens = 32464, batch.n_tokens = 31, progress = 1.000000
slot update_slots: id 0 | task 16112 | prompt done, n_tokens = 32464, batch.n_tokens = 31
slot init_sampler: id 0 | task 16112 | init sampler, took 2.57 ms, tokens: text = 32464, total = 32464
slot update_slots: id 0 | task 16112 | erasing old context checkpoint (pos_min = 23924, pos_max = 23924, size = 75.376 MiB)
slot update_slots: id 0 | task 16112 | created context checkpoint 8 of 8 (pos_min = 32432, pos_max = 32432, size = 75.376 MiB)
slot print_timing: id 0 | task 16112 |
prompt eval time = 1005.12 ms / 31 tokens ( 32.42 ms per token, 30.84 tokens per second)
eval time = 4757.12 ms / 71 tokens ( 67.00 ms per token, 14.92 tokens per second)
total time = 5762.25 ms / 102 tokens
slot release: id 0 | task 16112 | stop processing: n_tokens = 32534, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 12983208127
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 32534, total state size = 838.264 MiB
srv params_from_: Chat format: Qwen3 Coder
srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.061
srv update: - cache state: 8 prompts, 4057.276 MiB (limits: 8192.000 MiB, 32768 tokens, 128684 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426ad9d890: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e424eb6dbb0: 15746 tokens, checkpoints: 7, 972.234 MiB
srv update: - prompt 0x5e426816e6d0: 13615 tokens, checkpoints: 6, 846.889 MiB
srv update: - prompt 0x5e424eb0ef50: 372 tokens, checkpoints: 1, 159.475 MiB
srv update: - prompt 0x5e426af39220: 32534 tokens, checkpoints: 8, 1441.271 MiB
srv get_availabl: prompt cache update took 13719.02 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 16184 | processing task, is_child = 0
slot update_slots: id 0 | task 16184 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 329
slot update_slots: id 0 | task 16184 | n_past = 20, slot.prompt.tokens.size() = 32534, seq_id = 0, pos_min = 32533, n_swa = 1
slot update_slots: id 0 | task 16184 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 16184 | erased invalidated context checkpoint (pos_min = 25257, pos_max = 25257, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 16184 | erased invalidated context checkpoint (pos_min = 25822, pos_max = 25822, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 16184 | erased invalidated context checkpoint (pos_min = 25977, pos_max = 25977, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 16184 | erased invalidated context checkpoint (pos_min = 26716, pos_max = 26716, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 16184 | erased invalidated context checkpoint (pos_min = 28482, pos_max = 28482, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 16184 | erased invalidated context checkpoint (pos_min = 29754, pos_max = 29754, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 16184 | erased invalidated context checkpoint (pos_min = 30937, pos_max = 30937, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 16184 | erased invalidated context checkpoint (pos_min = 32432, pos_max = 32432, n_swa = 1, size = 75.376 MiB)
slot update_slots: id 0 | task 16184 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 16184 | prompt processing progress, n_tokens = 265, batch.n_tokens = 265, progress = 0.805471
slot update_slots: id 0 | task 16184 | n_tokens = 265, memory_seq_rm [265, end)
slot update_slots: id 0 | task 16184 | prompt processing progress, n_tokens = 329, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 16184 | prompt done, n_tokens = 329, batch.n_tokens = 64
slot init_sampler: id 0 | task 16184 | init sampler, took 0.03 ms, tokens: text = 329, total = 329
slot update_slots: id 0 | task 16184 | created context checkpoint 1 of 8 (pos_min = 264, pos_max = 264, size = 75.376 MiB)
slot print_timing: id 0 | task 16184 |
prompt eval time = 23115.51 ms / 329 tokens ( 70.26 ms per token, 14.23 tokens per second)
eval time = 1521.45 ms / 23 tokens ( 66.15 ms per token, 15.12 tokens per second)
total time = 24636.95 ms / 352 tokens
slot release: id 0 | task 16184 | stop processing: n_tokens = 351, truncated = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 13021897998
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 351, total state size = 83.607 MiB
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv load: - looking for better prompt, base f_keep = 0.057, sim = 0.001
srv load: - found better prompt with f_keep = 1.000, sim = 1.000
srv update: - cache state: 8 prompts, 2774.988 MiB (limits: 8192.000 MiB, 32768 tokens, 93141 est)
srv update: - prompt 0x5e426789a450: 168 tokens, checkpoints: 1, 154.691 MiB
srv update: - prompt 0x5e4268394340: 595 tokens, checkpoints: 1, 164.704 MiB
srv update: - prompt 0x5e426ad8a3f0: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e426ad9d890: 352 tokens, checkpoints: 1, 159.006 MiB
srv update: - prompt 0x5e424eb6dbb0: 15746 tokens, checkpoints: 7, 972.234 MiB
srv update: - prompt 0x5e426816e6d0: 13615 tokens, checkpoints: 6, 846.889 MiB
srv update: - prompt 0x5e424eb0ef50: 372 tokens, checkpoints: 1, 159.475 MiB
srv update: - prompt 0x5e426af2f1c0: 351 tokens, checkpoints: 1, 158.983 MiB
srv get_availabl: prompt cache update took 1311.45 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 16185 | processing task, is_child = 0
slot update_slots: id 0 | task 16185 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 32547
slot update_slots: id 0 | task 16185 | n_tokens = 32534, memory_seq_rm [32534, end)
slot update_slots: id 0 | task 16185 | prompt processing progress, n_tokens = 32547, batch.n_tokens = 13, progress = 1.000000
slot update_slots: id 0 | task 16185 | prompt done, n_tokens = 32547, batch.n_tokens = 13
slot init_sampler: id 0 | task 16185 | init sampler, took 2.61 ms, tokens: text = 32547, total = 32547
slot update_slots: id 0 | task 16185 | erasing old context checkpoint (pos_min = 25257, pos_max = 25257, size = 75.376 MiB)
slot update_slots: id 0 | task 16185 | created context checkpoint 8 of 8 (pos_min = 32533, pos_max = 32533, size = 75.376 MiB)
slot print_timing: id 0 | task 16185 |
prompt eval time = 519.86 ms / 13 tokens ( 39.99 ms per token, 25.01 tokens per second)
eval time = 14544.86 ms / 221 tokens ( 65.81 ms per token, 15.19 tokens per second)
total time = 15064.72 ms / 234 tokens
slot release: id 0 | task 16185 | stop processing: n_tokens = 32767, truncated = 1
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv stop: cancel task, id_task = 16185
srv update_slots: all slots are idle
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.993
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 16432 | processing task, is_child = 0
slot update_slots: id 0 | task 16432 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 32547
slot update_slots: id 0 | task 16432 | n_past = 32547, slot.prompt.tokens.size() = 32767, seq_id = 0, pos_min = 32766, n_swa = 1
slot update_slots: id 0 | task 16432 | restored context checkpoint (pos_min = 32533, pos_max = 32533, size = 75.376 MiB)
slot update_slots: id 0 | task 16432 | n_tokens = 32534, memory_seq_rm [32534, end)
slot update_slots: id 0 | task 16432 | prompt processing progress, n_tokens = 32547, batch.n_tokens = 13, progress = 1.000000
slot update_slots: id 0 | task 16432 | prompt done, n_tokens = 32547, batch.n_tokens = 13
slot init_sampler: id 0 | task 16432 | init sampler, took 2.51 ms, tokens: text = 32547, total = 32547
slot print_timing: id 0 | task 16432 |
prompt eval time = 240.75 ms / 13 tokens ( 18.52 ms per token, 54.00 tokens per second)
eval time = 13605.62 ms / 221 tokens ( 61.56 ms per token, 16.24 tokens per second)
total time = 13846.37 ms / 234 tokens
slot release: id 0 | task 16432 | stop processing: n_tokens = 32767, truncated = 1
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv params_from_: Chat format: Qwen3 Coder
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.992 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 16654 | processing task, is_child = 0
slot update_slots: id 0 | task 16654 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 33042
srv send_error: task id = 16654, error: request (33042 tokens) exceeds the available context size (32768 tokens), try increasing it
slot release: id 0 | task 16654 | stop processing: n_tokens = 32767, truncated = 0
srv update_slots: no tokens to decode
srv update_slots: all slots are idle
srv stop: cancel task, id_task = 16654
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/messages 127.0.0.1 400
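The final 400 above is the request outgrowing the launch-time context window: the prompt reached 33042 tokens while the server was started with `-c 32768`. If VRAM allows, relaunching with a larger context avoids this; the command below is a sketch that only changes `-c` from the original command (the larger KV cache may not fit alongside the model on a 16 GB GPU, so treat the value as a starting point).

```shell
# Same launch command as before, with the context window raised from
# 32768 to 49152 tokens. All other flags are unchanged; the extra
# KV-cache memory this requires may exceed 16 GB of VRAM depending on
# quantization and offload settings.
llama-server -m ./Qwen3-Coder-Next-Q3_K_M.gguf -c 49152 -np 1 -t 8 \
  --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --jinja --fit on -fa 1
```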