@raz334
Created September 12, 2025 06:18
[2025-09-12 00:09:17][DEBUG] Received request: GET to /v1/chat/completions/api/tags
[2025-09-12 00:09:17][ERROR] Unexpected endpoint or method. (GET /v1/chat/completions/api/tags). Returning 200 anyway
[2025-09-12 00:09:30][DEBUG] Received request: GET to /v1/embeddings/api/tags
[2025-09-12 00:09:30][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway
[2025-09-12 00:09:38][DEBUG] Received request: GET to /v1/embeddings/api/tags
[2025-09-12 00:09:38][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway
[2025-09-12 00:09:46][DEBUG] Received request: GET to /v1/chat/completions/api/v1/models
[2025-09-12 00:09:46][ERROR] Unexpected endpoint or method. (GET /v1/chat/completions/api/v1/models). Returning 200 anyway
[2025-09-12 00:10:40][DEBUG] Received request: GET to /v1/embeddings/api/tags
[2025-09-12 00:10:40][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway
[2025-09-12 00:10:47][DEBUG] Received request: GET to /v1/embeddings/api/tags
[2025-09-12 00:10:47][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway
[2025-09-12 00:10:56][DEBUG] Received request: GET to /v1/api/tags
[2025-09-12 00:10:56][ERROR] Unexpected endpoint or method. (GET /v1/api/tags). Returning 200 anyway
[2025-09-12 00:10:58][DEBUG] Received request: GET to /api/tags
[2025-09-12 00:10:58][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
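The errors above show Ollama-style paths (/api/tags, /api/v1/models) being appended to OpenAI-style endpoints, which suggests a client whose base URL points at a full endpoint rather than at the /v1 root. A minimal sketch of the intended setup, assuming the OpenAI Python SDK and LM Studio's default address http://localhost:1234 (host, port, and the placeholder API key are assumptions, not taken from this log):

    # Point the client at the /v1 root so the SDK appends its own endpoint
    # paths (/chat/completions, /embeddings, /models) instead of stacking them.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:1234/v1",  # assumed default LM Studio address
        api_key="lm-studio",                  # placeholder; LM Studio ignores the key
    )
    print([m.id for m in client.models.list().data])  # resolves to GET /v1/models
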
[2025-09-12 00:12:07][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:12:07][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:12:07][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '8192':
Model: 11.16 GB
Context: 1.13 GB
Total: 12.29 GB
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[2025-09-12 00:12:07][DEBUG][LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:12:07][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:12:07][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:12:07][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
[2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:12:07][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:12:07][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:12:07][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:12:11][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
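The logged sizes are mutually consistent; a quick arithmetic sketch using only figures already printed above (no new data):

    # Cross-check of the logged model sizes.
    params, bpw = 20.91e9, 8.72            # print_info: model params / BPW
    print(params * bpw / 8 / 2**30)        # ~21.23 GiB, matches "file size = 21.23 GiB"
    print((9967.55 + 11768.32) / 1024)     # Vulkan0 + CPU_Mapped buffers ~= 21.23 GiB
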
[2025-09-12 00:12:19][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:12:19][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:12:19][DEBUG] llama_kv_cache: CPU KV buffer size = 561.00 MiB
[2025-09-12 00:12:19][DEBUG] llama_kv_cache: size = 561.00 MiB ( 8192 cells, 44 layers, 1/1 seqs), K (q8_0): 280.50 MiB, V (q8_0): 280.50 MiB
[2025-09-12 00:12:19][DEBUG] llama_context: Vulkan0 compute buffer size = 1384.00 MiB
llama_context: Vulkan_Host compute buffer size = 28.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:12:19][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:12:20][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
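The 561.00 MiB KV buffer above follows from the logged shape: 44 layers, 8192 cells, n_embd_k_gqa = n_embd_v_gqa = 768, with K and V stored as q8_0 (34 bytes per 32-value block). A worked check using only those logged values:

    # KV cache size check: K and V are each 280.5 MiB, 561.0 MiB total.
    layers, cells, kv_dim = 44, 8192, 768        # n_layer, n_ctx, n_embd_k_gqa
    bytes_per_value = 34 / 32                    # q8_0: 32 int8 values + 2-byte scale
    k_mib = layers * cells * kv_dim * bytes_per_value / 2**20
    print(k_mib, 2 * k_mib)                      # 280.5  561.0
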
[2025-09-12 00:14:05][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:14:05][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:14:05][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
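A hedged sketch of the two client calls that would produce the request bodies above, again assuming the OpenAI Python SDK against the local LM Studio server (the address and the typical_p passthrough via extra_body are assumptions):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    client.embeddings.create(
        model="reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
        input=["Test input"],
    )
    reply = client.chat.completions.create(
        model="reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
        temperature=0.7,
        top_p=1,
        max_tokens=8192,
        messages=[{"role": "user", "content": "Hi"}],
        extra_body={"typical_p": 1},  # non-standard field, passed through verbatim
    )
    print(reply.choices[0].message.content)
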
[2025-09-12 00:14:05][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:14:05][DEBUG] Sampling params:
[2025-09-12 00:14:05][DEBUG] repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:14:05][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
[2025-09-12 00:14:05][DEBUG] Total prompt tokens: 8
Prompt tokens to decode: 8
BeginProcessingPrompt
[2025-09-12 00:14:07][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:14:07][DEBUG] No tokens to output. Continuing generation
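The sampler chain logged above runs top-k, then top-p, then min-p, then temperature before drawing a token. A simplified sketch of just those four stages with the logged values (top_k=40, top_p=1.0, min_p=0.05, temp=0.7); penalties, DRY, XTC, and typical-p are omitted, and this is an illustration rather than llama.cpp's actual code:

    import numpy as np

    def sample(logits, top_k=40, top_p=1.0, min_p=0.05, temp=0.7):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # top-k: keep the top_k most likely tokens
        if top_k < len(probs):
            probs[probs < np.sort(probs)[-top_k]] = 0.0
        # top-p: keep the smallest set whose cumulative probability reaches top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order]) / probs.sum()
        probs[order[np.searchsorted(cum, top_p) + 1:]] = 0.0
        # min-p: drop tokens below min_p times the best token's probability
        probs[probs < min_p * probs.max()] = 0.0
        # temperature (equivalent to dividing the logits by temp), then sample
        probs = probs ** (1.0 / temp)
        return np.random.choice(len(probs), p=probs / probs.sum())

    token_id = sample(np.random.randn(100352))   # vocab size taken from the log
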
[2025-09-12 00:14:07][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:14:07][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[2025-09-12 00:14:07][DEBUG][LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:14:07][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:14:07][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:14:07][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:14:07][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:14:07][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:14:07][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:14:07][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:14:07][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:14:07][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
[2025-09-12 00:14:07][DEBUG] print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:14:11][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:14:20][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:14:21][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.12 ms / 35 runs ( 0.26 ms per token, 3836.46 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 1112.75 ms / 8 tokens ( 139.09 ms per token, 7.19 tokens per second)
llama_perf_context_print: eval time = 14062.28 ms / 26 runs ( 540.86 ms per token, 1.85 tokens per second)
llama_perf_context_print: total time = 15193.48 ms / 34 tokens
llama_perf_context_print: graphs reused = 25
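The throughput figures in the perf block follow directly from the timings; a quick check with the numbers as logged:

    # 8 prompt tokens in 1112.75 ms and 26 generated tokens in 14062.28 ms.
    print(8 / 1112.75 * 1000)       # ~7.19 prompt tokens per second
    print(26 / 14062.28 * 1000)     # ~1.85 generated tokens per second
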
[2025-09-12 00:14:21][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:14:21][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-ougj4mogwoqqolfz7ayoj",
"object": "chat.completion",
"created": 1757654045,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think.\n\nFirst, since they greeted",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 27,
"total_tokens": 35
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:14:22][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:14:22][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:14:22][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:14:23][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:14:23][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:14:24][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:14:36][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:14:36][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:14:36][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:14:36][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:14:36][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
[2025-09-12 00:14:36][DEBUG] Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:14:36][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:14:36][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:14:38][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:14:38][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:14:38][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:14:38][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:14:38][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:14:38][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:14:38][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:14:38][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:14:38][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:14:39][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:14:39][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:14:39][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
[2025-09-12 00:14:39][DEBUG] print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:14:41][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:14:51][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:14:51][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.53 ms / 36 runs ( 0.26 ms per token, 3777.94 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15096.70 ms / 28 runs ( 539.17 ms per token, 1.85 tokens per second)
[2025-09-12 00:14:51][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:14:51][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-28i1pkpa7owj92zk46b7s3p",
"object": "chat.completion",
"created": 1757654076,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation.\n\n",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:14:51][DEBUG] llama_perf_context_print: total time = 15111.45 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:14:52][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:14:52][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:14:52][DEBUG] llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:14:52][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:14:52][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:14:52][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:14:53][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:15:13][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:15:13][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:15:13][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:15:13][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:15:13][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:15:13][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:15:14][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:15:14][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:15:15][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:15:15][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:15:15][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:15:15][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:15:15][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:15:15][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:15:15][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:15:15][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:15:15][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:15:16][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:15:16][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:15:16][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
[2025-09-12 00:15:16][DEBUG] print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:15:18][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:15:28][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:15:28][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.51 ms / 36 runs ( 0.26 ms per token, 3786.29 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15154.53 ms / 28 runs ( 541.23 ms per token, 1.85 tokens per second)
llama_perf_context_print: total time = 15168.26 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:15:28][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:15:28][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-7fs0j338wau9n5bvptz1gc",
"object": "chat.completion",
"created": 1757654113,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about the best way to greet them",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:15:29][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:15:29][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:15:29][DEBUG] llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:15:29][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:15:29][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:15:29][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:15:31][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:16:01][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:16:01][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:16:01][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:16:01][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:16:01][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:16:02][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:16:02][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:16:03][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:16:03][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:16:03][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[2025-09-12 00:16:03][DEBUG][LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:16:04][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:16:04][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:16:04][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:16:04][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:16:04][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:16:04][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:16:04][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:16:04][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:16:04][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
[2025-09-12 00:16:04][DEBUG] print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:16:07][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:16:16][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:16:16][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 10.85 ms / 36 runs ( 0.30 ms per token, 3319.50 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15213.16 ms / 28 runs ( 543.33 ms per token, 1.84 tokens per second)
llama_perf_context_print: total time = 15228.41 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:16:16][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:16:16][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-phk3xmfkolzs9eisnnkw",
"object": "chat.completion",
"created": 1757654161,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about the best way to greet them",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:16:18][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:16:18][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:16:18][DEBUG] llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:16:18][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:16:19][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:16:19][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:16:20][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:16:54][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:16:54][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:16:54][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:16:54][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:16:54][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:16:55][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:16:55][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:16:56][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:16:56][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:16:56][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:16:56][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:16:57][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:16:57][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:16:57][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:16:57][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:16:57][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:16:57][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:16:57][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
[2025-09-12 00:16:57][DEBUG] print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:17:00][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:17:09][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:17:09][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.19 ms / 36 runs ( 0.26 ms per token, 3917.73 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15172.68 ms / 28 runs ( 541.88 ms per token, 1.85 tokens per second)
llama_perf_context_print: total time = 15186.49 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:17:09][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:17:09][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-cmi03v0wqdhixfj5bbn1om",
"object": "chat.completion",
"created": 1757654214,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me start by acknowledging their greeting.\n\nHmm,",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:17:10][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:17:10][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:17:11][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:17:11][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:17:11][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:17:12][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:17:48][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:17:48][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:17:48][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:17:48][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:17:48][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:17:48][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:17:49][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:17:49][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:17:51][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:17:51][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:17:51][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:17:51][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:17:51][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:17:51][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:17:51][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:17:51][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:17:51][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:17:52][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:17:52][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:17:52][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
[2025-09-12 00:17:52][DEBUG] print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:17:55][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:18:03][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:18:03][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.80 ms / 36 runs ( 0.27 ms per token, 3673.47 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15197.28 ms / 28 runs ( 542.76 ms per token, 1.84 tokens per second)
llama_perf_context_print: total time = 15211.42 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:18:03][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:18:03][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-sa2a6b2k1jg1btir6v3m",
"object": "chat.completion",
"created": 1757654268,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation.\n\n",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:18:05][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:18:05][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:18:05][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:18:05][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:18:05][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:18:07][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:18:33][INFO]
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:18:33][INFO]
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:18:45][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:18:45][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:18:45][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:18:45][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:18:45][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:18:45][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:18:46][DEBUG] FinishedProcessingPrompt. Progress: 100
No tokens to output. Continuing generation
[2025-09-12 00:18:48][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:18:48][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:18:48][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[2025-09-12 00:18:48][DEBUG][LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:18:48][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:18:48][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:18:48][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:18:48][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:18:48][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:18:48][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
[2025-09-12 00:18:48][DEBUG] print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:18:48][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:18:48][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:18:48][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
[2025-09-12 00:18:48][DEBUG] print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:18:51][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:19:01][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:19:01][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 10.06 ms / 36 runs ( 0.28 ms per token, 3579.60 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15165.60 ms / 28 runs ( 541.63 ms per token, 1.85 tokens per second)
llama_perf_context_print: total time = 15180.15 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:19:01][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:19:01][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-zcdn1g8f3pi21sby0xvp9q",
"object": "chat.completion",
"created": 1757654325,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. First, I should acknowledge their greeting. Maybe say",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:19:02][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:19:02][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:19:02][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:19:02][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:19:02][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:19:04][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:19:11][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing!
[2025-09-12 00:19:11][INFO]
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings
[2025-09-12 00:19:11][INFO]
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:19:16][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:19:16][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:19:16][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:19:16][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:19:16][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:19:16][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:19:17][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:19:17][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:19:18][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:19:18][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:19:18][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:19:19][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:19:19][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:19:19][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:19:19][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:19:19][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:19:19][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:19:19][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:19:19][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:19:19][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
[2025-09-12 00:19:19][DEBUG] print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:19:22][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:19:31][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:19:31][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.34 ms / 36 runs ( 0.26 ms per token, 3855.63 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15091.80 ms / 28 runs ( 538.99 ms per token, 1.86 tokens per second)
llama_perf_context_print: total time = 15105.49 ms / 29 tokens
[2025-09-12 00:19:31][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:19:31][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-ong1jptqvmfus4sqcmu9c",
"object": "chat.completion",
"created": 1757654356,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation here",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:19:31][DEBUG] llama_perf_context_print: graphs reused = 28
[2025-09-12 00:19:32][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:19:32][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:19:33][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:19:33][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:19:33][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:19:33][INFO]
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:19:33][INFO]
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:19:34][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:19:35][INFO]
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:19:35][INFO]
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:19:57][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:19:57][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "",
"input": [
"Test input"
]
}
[2025-09-12 00:20:03][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:20:03][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "",
"input": [
"Test input"
]
}
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:21:22][INFO]
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:21:22][INFO]
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:26:09][INFO]
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:26:09][INFO]
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:27:53][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
[2025-09-12 00:27:53][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:27:53][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '8192':
Model: 11.16 GB
Context: 1.43 GB
Total: 12.58 GB
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:27:53][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:27:53][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:27:53][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
[2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:27:54][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:27:54][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:27:54][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
[2025-09-12 00:27:54][DEBUG] print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:27:56][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:28:05][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:28:05][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:28:05][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 280.50 MiB
[2025-09-12 00:28:05][DEBUG] llama_kv_cache: CPU KV buffer size = 280.50 MiB
[2025-09-12 00:28:05][DEBUG] llama_kv_cache: size = 561.00 MiB ( 8192 cells, 44 layers, 1/1 seqs), K (q8_0): 280.50 MiB, V (q8_0): 280.50 MiB
[2025-09-12 00:28:05][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 28.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 246 (with bs=512), 3 (with bs=1)
[2025-09-12 00:28:05][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:28:06][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:28:47][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client created.
[2025-09-12 00:28:47][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor] Register with LM Studio
[2025-09-12 00:28:47][DEBUG][Client=plugin:installed:lmstudio/rag-v1][Endpoint=setPromptPreprocessor] Registering promptPreprocessor.
[2025-09-12 00:28:47][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor][Request (sKQs11)] New preprocess request received.
[2025-09-12 00:28:47][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor][Request (sKQs11)] Preprocess request completed.
[2025-09-12 00:28:47][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:28:47][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = -1, n_keep = 9
Total prompt tokens: 9
Prompt tokens to decode: 9
BeginProcessingPrompt
[2025-09-12 00:28:48][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:28:48][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:29:03][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 8.82 ms / 39 runs ( 0.23 ms per token, 4423.27 tokens per second)
llama_perf_context_print: load time = 12603.59 ms
llama_perf_context_print: prompt eval time = 1035.86 ms / 9 tokens ( 115.10 ms per token, 8.69 tokens per second)
llama_perf_context_print: eval time = 14731.45 ms / 29 runs ( 507.98 ms per token, 1.97 tokens per second)
llama_perf_context_print: total time = 15780.55 ms / 38 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:29:18][INFO]
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:29:18][INFO]
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
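The endpoint listing above can be sanity-checked with a plain OpenAI-style request. A minimal sketch using only the Python standard library, assuming the server is still listening on localhost:12345 as logged and that the response follows the usual OpenAI "data: [{id: ...}]" shape:

    import json
    import urllib.request

    BASE_URL = "http://localhost:12345/v1"  # assumption: same port the log shows the server listening on

    # GET /v1/models -- list the models the server exposes (OpenAI-compatible shape assumed)
    with urllib.request.urlopen(f"{BASE_URL}/models") as resp:
        models = json.load(resp)

    for entry in models.get("data", []):
        print(entry["id"])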
[2025-09-12 00:29:18][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client disconnected.
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:29:19][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing!
[2025-09-12 00:29:19][INFO]
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings
[2025-09-12 00:29:19][INFO]
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:29:50][INFO] Server stopped.
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:30:01][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing!
[2025-09-12 00:30:01][INFO]
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings
[2025-09-12 00:30:01][INFO]
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:30:01][INFO] Server started.
[2025-09-12 00:30:01][INFO] Just-in-time model loading active.
[2025-09-12 00:30:02][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox] Client created.
[2025-09-12 00:30:02][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client created.
[2025-09-12 00:30:02][INFO][Plugin(lmstudio/js-code-sandbox)] stdout: [Tools Prvdr.] Register with LM Studio
[2025-09-12 00:30:02][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox][Endpoint=setToolsProvider] Registering tools provider.
[2025-09-12 00:30:02][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor] Register with LM Studio
[2025-09-12 00:30:03][DEBUG][Client=plugin:installed:lmstudio/rag-v1][Endpoint=setPromptPreprocessor] Registering promptPreprocessor.
[2025-09-12 00:30:03][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client disconnected.
[2025-09-12 00:30:25][INFO] Server stopped.
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:31:07][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing!
[2025-09-12 00:31:07][INFO]
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings
[2025-09-12 00:31:07][INFO]
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:31:07][INFO] Server started.
[2025-09-12 00:31:07][INFO] Just-in-time model loading active.
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox] Client created.
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client created.
[2025-09-12 00:31:08][INFO][Plugin(lmstudio/js-code-sandbox)] stdout: [Tools Prvdr.] Register with LM Studio
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox][Endpoint=setToolsProvider] Registering tools provider.
[2025-09-12 00:31:08][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor] Register with LM Studio
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/rag-v1][Endpoint=setPromptPreprocessor] Registering promptPreprocessor.
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox] Client disconnected.
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client disconnected.
[2025-09-12 00:31:43][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
[2025-09-12 00:31:43][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:31:43][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '8192':
Model: 11.16 GB
Context: 1.43 GB
Total: 12.58 GB
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[2025-09-12 00:31:43][DEBUG][LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:31:44][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:31:44][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:31:44][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:31:44][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:31:44][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:31:44][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:31:44][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:31:44][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:31:44][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
[2025-09-12 00:31:44][DEBUG] print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:31:47][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:31:55][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:31:55][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:31:56][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 280.50 MiB
[2025-09-12 00:31:56][DEBUG] llama_kv_cache: CPU KV buffer size = 280.50 MiB
[2025-09-12 00:31:56][DEBUG] llama_kv_cache: size = 561.00 MiB ( 8192 cells, 44 layers, 1/1 seqs), K (q8_0): 280.50 MiB, V (q8_0): 280.50 MiB
[2025-09-12 00:31:56][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 28.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 246 (with bs=512), 3 (with bs=1)
[2025-09-12 00:31:56][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:31:56][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:33:47][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:33:47][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:33:47][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:33:47][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8
Total prompt tokens: 8
Prompt tokens to decode: 8
BeginProcessingPrompt
[2025-09-12 00:33:48][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:33:48][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:34:02][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:34:02][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.17 ms / 37 runs ( 0.25 ms per token, 4033.14 tokens per second)
llama_perf_context_print: load time = 12853.96 ms
llama_perf_context_print: prompt eval time = 820.59 ms / 8 tokens ( 102.57 ms per token, 9.75 tokens per second)
llama_perf_context_print: eval time = 14363.18 ms / 28 runs ( 512.97 ms per token, 1.95 tokens per second)
llama_perf_context_print: total time = 15197.21 ms / 36 tokens
llama_perf_context_print: graphs reused = 27
[2025-09-12 00:34:02][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:34:02][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-t5ql74xf1rih385fceeszc",
"object": "chat.completion",
"created": 1757655227,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation.\n\nFirst",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 29,
"total_tokens": 37
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:34:48][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[2025-09-12 00:34:48][DEBUG][WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:34:48][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:34:48][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:34:48][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:34:48][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:34:48][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:34:48][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:34:48][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
[2025-09-12 00:34:48][DEBUG] print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:34:48][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:34:48][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:34:48][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
[2025-09-12 00:34:48][DEBUG] llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:34:48][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:34:48][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:34:49][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:35:15][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:35:15][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now...
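The request above names an embedding model but is sent to /v1/chat/completions, which is likely why the server keeps JIT-loading nomic-embed-text-v1.5 without ever returning chat output. A minimal sketch of a call that matches the model type, using the /v1/embeddings endpoint advertised earlier in the log (standard library only; the model identifier is taken from the request body above):

    import json
    import urllib.request

    BASE_URL = "http://localhost:12345/v1"  # assumption: same server/port as in the log

    payload = {
        "model": "text-embedding-nomic-embed-text-v1.5",  # model id from the request body above
        "input": "Hi",
    }

    req = urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)

    # OpenAI-compatible responses put the vector at data[0].embedding
    print(len(result["data"][0]["embedding"]))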
[2025-09-12 00:35:16][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:35:16][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:35:16][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:35:16][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:35:16][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:35:16][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:35:16][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:35:16][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
[2025-09-12 00:35:16][DEBUG] print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:35:17][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:35:17][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:35:17][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:35:17][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:35:17][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:35:17][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:35:34][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:35:34][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now...
[2025-09-12 00:35:36][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:35:36][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:35:36][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:35:36][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:35:36][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:35:36][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:35:36][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:35:36][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
[2025-09-12 00:35:36][DEBUG] print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:35:36][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:35:36][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:35:36][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:35:36][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:35:36][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:35:36][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:35:58][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:35:58][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now...
[2025-09-12 00:36:00][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:36:00][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:36:00][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:36:00][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:36:00][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:36:00][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:36:00][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:36:00][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
[2025-09-12 00:36:00][DEBUG] print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:36:00][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:36:00][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:36:00][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:36:00][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:36:00][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:36:00][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:36:00][DEBUG]
[2025-09-12 00:36:20][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:36:20][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now...
[2025-09-12 00:36:22][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:36:22][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:36:22][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:36:22][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:36:23][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:36:23][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:36:23][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:36:23][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
[2025-09-12 00:36:23][DEBUG] print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:36:23][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:36:23][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:36:23][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:36:23][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:36:23][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:36:23][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:37:47][INFO]
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:37:47][INFO]
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:41:47][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
[2025-09-12 00:41:47][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:41:47][DEBUG][LM Studio] Model load size estimate with raw num offload layers 'max' and context length '2048':
Model: 13.81 GB
Context: 1.09 GB
Total: 14.91 GB
[2025-09-12 00:41:47][DEBUG][LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: max
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:41:48][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:41:48][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:41:48][DEBUG] llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from D:\AI-Models\__LMStudio\Random\Nethena-13B\Nethena-13B.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 7
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
[2025-09-12 00:41:48][DEBUG] llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
[2025-09-12 00:41:48][DEBUG] llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
[2025-09-12 00:41:48][DEBUG] llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 12.88 GiB (8.50 BPW)
[2025-09-12 00:41:48][DEBUG] load: bad special token: 'tokenizer.ggml.padding_token_id' = 32000, using default id -1
[2025-09-12 00:41:48][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('</s>')
load: special tokens cache size = 3
[2025-09-12 00:41:48][DEBUG] load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 4096
print_info: n_embd = 5120
print_info: n_layer = 40
print_info: n_head = 40
print_info: n_head_kv = 40
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
[2025-09-12 00:41:48][DEBUG] print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 5120
print_info: n_embd_v_gqa = 5120
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 13824
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 13B
print_info: model params = 13.02 B
print_info: general.name = LLaMA v2
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:41:59][DEBUG] load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: Vulkan0 model buffer size = 13023.85 MiB
load_tensors: CPU_Mapped model buffer size = 166.02 MiB
[2025-09-12 00:43:06][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
[2025-09-12 00:43:06][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
[2025-09-12 00:43:06][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 850.00 MiB
[2025-09-12 00:43:06][DEBUG] llama_kv_cache: size = 850.00 MiB ( 2048 cells, 40 layers, 1/1 seqs), K (q8_0): 425.00 MiB, V (q8_0): 425.00 MiB
[2025-09-12 00:43:06][DEBUG] llama_context: Vulkan0 compute buffer size = 117.01 MiB
llama_context: Vulkan_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 1247
llama_context: graph splits = 2
common_init_from_params: added </s> logit bias = -inf
[2025-09-12 00:43:06][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:43:07][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:43:28][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:43:28][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:43:28][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:43:28][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:43:28][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:43:28][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:43:28][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:43:28][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
[2025-09-12 00:43:28][DEBUG] print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:43:28][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:43:28][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:43:28][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:43:28][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:43:28][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:43:28][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:43:28][DEBUG]
[2025-09-12 00:44:03][INFO] Server stopped.
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:44:06][INFO]
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:44:06][INFO]
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:44:06][INFO] Server started.
[2025-09-12 00:44:06][INFO] Just-in-time model loading active.
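
A minimal sketch of how a client might hit the first endpoint listed above (GET /v1/models). The port 12345 comes from the log; the OpenAI-style response shape (a "data" list whose entries carry an "id") is an assumption based on the API convention LM Studio advertises here, not something shown verbatim in this log.

```python
# Sketch only: list models from the OpenAI-compatible endpoint logged above.
# Assumes the server is reachable at localhost:12345 as in the log.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:12345/v1/models") as resp:
    models = json.load(resp)

# Each entry's "id" is the name to pass as "model" in completion requests.
for entry in models.get("data", []):
    print(entry["id"])
```
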
[2025-09-12 00:45:26][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:45:26][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:45:26][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:45:26][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:45:26][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:45:26][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:45:26][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:45:26][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
[2025-09-12 00:45:26][DEBUG] print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:45:26][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:45:26][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:45:26][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:45:26][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:45:26][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:45:26][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[2025-09-12 00:45:26][DEBUG][INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:45:45][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
[2025-09-12 00:45:45][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:45:45][DEBUG][LM Studio] Model load size estimate with raw num offload layers 'max' and context length '2048':
Model: 13.81 GB
Context: 1.09 GB
Total: 14.91 GB
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[2025-09-12 00:45:45][DEBUG][LM Studio] Resolved GPU config options:
Num Offload Layers: max
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:45:45][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:45:45][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:45:45][DEBUG] llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from D:\AI-Models\__LMStudio\Random\Nethena-13B\Nethena-13B.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 7
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
[2025-09-12 00:45:45][DEBUG] llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
[2025-09-12 00:45:45][DEBUG] llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
[2025-09-12 00:45:45][DEBUG] llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 12.88 GiB (8.50 BPW)
[2025-09-12 00:45:45][DEBUG] load: bad special token: 'tokenizer.ggml.padding_token_id' = 32000, using default id -1
[2025-09-12 00:45:45][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('</s>')
load: special tokens cache size = 3
[2025-09-12 00:45:45][DEBUG] load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 4096
print_info: n_embd = 5120
print_info: n_layer = 40
print_info: n_head = 40
print_info: n_head_kv = 40
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 5120
print_info: n_embd_v_gqa = 5120
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 13824
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 13B
print_info: model params = 13.02 B
print_info: general.name = LLaMA v2
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:45:48][DEBUG] load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: Vulkan0 model buffer size = 13023.85 MiB
load_tensors: CPU_Mapped model buffer size = 166.02 MiB
[2025-09-12 00:45:53][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
[2025-09-12 00:45:53][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
[2025-09-12 00:45:53][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 850.00 MiB
[2025-09-12 00:45:54][DEBUG] llama_kv_cache: size = 850.00 MiB ( 2048 cells, 40 layers, 1/1 seqs), K (q8_0): 425.00 MiB, V (q8_0): 425.00 MiB
[2025-09-12 00:45:54][DEBUG] llama_context: Vulkan0 compute buffer size = 117.01 MiB
llama_context: Vulkan_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 1247
llama_context: graph splits = 2
common_init_from_params: added </s> logit bias = -inf
[2025-09-12 00:45:54][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:45:54][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:46:40][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "nethena-13b@q8_0",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:46:40][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:46:40][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:46:40][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14
Total prompt tokens: 14
Prompt tokens to decode: 14
BeginProcessingPrompt
[2025-09-12 00:46:41][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:46:46][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 0.91 ms / 23 runs ( 0.04 ms per token, 25358.32 tokens per second)
llama_perf_context_print: load time = 8894.17 ms
llama_perf_context_print: prompt eval time = 919.45 ms / 14 tokens ( 65.68 ms per token, 15.23 tokens per second)
llama_perf_context_print: eval time = 4928.79 ms / 8 runs ( 616.10 ms per token, 1.62 tokens per second)
llama_perf_context_print: total time = 5850.57 ms / 22 tokens
llama_perf_context_print: graphs reused = 7
[2025-09-12 00:46:46][INFO][nethena-13b@q8_0] Model generated tool calls: []
[2025-09-12 00:46:46][INFO][nethena-13b@q8_0] Generated prediction: {
"id": "chatcmpl-rdmujbhvp5pargg7qaws7",
"object": "chat.completion",
"created": 1757656000,
"model": "nethena-13b@q8_0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! How can I help you?",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 9,
"total_tokens": 23
},
"stats": {},
"system_fingerprint": "nethena-13b@q8_0"
}
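
For reference, a minimal sketch that reproduces the POST /v1/chat/completions request and reads the response exactly as logged above (model name, parameters, and the choices[0].message.content path are taken from the request and prediction bodies in the log; host and port are assumed from the server startup lines).

```python
# Sketch only: replay the chat completion request shown in the log above.
import json
import urllib.request

payload = {
    "model": "nethena-13b@q8_0",
    "temperature": 0.7,
    "top_p": 1,
    "typical_p": 1,
    "max_tokens": 2048,
    "messages": [{"role": "user", "content": "Hi"}],
}

req = urllib.request.Request(
    "http://localhost:12345/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

# The assistant text sits under choices[0].message.content, as in the logged prediction.
print(reply["choices"][0]["message"]["content"])
```
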
[2025-09-12 00:47:35][DEBUG] Received request: GET to /api/tags
[2025-09-12 00:47:35][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
[2025-09-12 00:48:08][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:48:08][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:48:34][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:48:34][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:48:40][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:48:40][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:49:04][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:49:04][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:49:08][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:49:08][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:49:43][DEBUG] Received request: GET to /api/v1/models
[2025-09-12 00:49:43][INFO] Returning 20 models from v1 API
[2025-09-12 00:49:58][DEBUG] Received request: GET to /api/v1/models
[2025-09-12 00:49:58][INFO] Returning 20 models from v1 API
[2025-09-12 00:52:37][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "nethena-13b@q8_0",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:52:37][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:52:37][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:52:37][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 14/14 of prompt (100%), 14 prefix, 0 non-prefix
Total prompt tokens: 14
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:52:38][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:52:43][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 1.01 ms / 24 runs ( 0.04 ms per token, 23692.00 tokens per second)
llama_perf_context_print: load time = 8894.17 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 6072.72 ms / 10 runs ( 607.27 ms per token, 1.65 tokens per second)
llama_perf_context_print: total time = 6077.33 ms / 11 tokens
llama_perf_context_print: graphs reused = 10
[2025-09-12 00:52:43][INFO][nethena-13b@q8_0] Model generated tool calls: []
[2025-09-12 00:52:43][INFO][nethena-13b@q8_0] Generated prediction: {
"id": "chatcmpl-r2s94r9fkeqa5nt1tvgna5",
"object": "chat.completion",
"created": 1757656357,
"model": "nethena-13b@q8_0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! How can I help you today?",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 10,
"total_tokens": 24
},
"stats": {},
"system_fingerprint": "nethena-13b@q8_0"
}
[2025-09-12 00:52:55][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "nethena-13b@q8_0",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:52:55][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:52:55][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 14/14 of prompt (100%), 14 prefix, 0 non-prefix
Total prompt tokens: 14
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:52:56][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:53:01][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 0.93 ms / 23 runs ( 0.04 ms per token, 24651.66 tokens per second)
llama_perf_context_print: load time = 8894.17 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 5397.13 ms / 9 runs ( 599.68 ms per token, 1.67 tokens per second)
llama_perf_context_print: total time = 5398.86 ms / 10 tokens
llama_perf_context_print: graphs reused = 9
[2025-09-12 00:53:01][INFO][nethena-13b@q8_0] Model generated tool calls: []
[2025-09-12 00:53:01][INFO][nethena-13b@q8_0] Generated prediction: {
"id": "chatcmpl-fe5qlymqnimr3izqhwkyqq",
"object": "chat.completion",
"created": 1757656375,
"model": "nethena-13b@q8_0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! How can I help you?",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 9,
"total_tokens": 23
},
"stats": {},
"system_fingerprint": "nethena-13b@q8_0"
}
[2025-09-12 00:53:26][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "nethena-13b@q8_0",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:53:26][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:53:26][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"input": [
"Test input"
]
}
[2025-09-12 00:53:26][INFO] Received request to embed multiple: [
"Test input"
]
[2025-09-12 00:53:26][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 14/14 of prompt (100%), 14 prefix, 0 non-prefix
Total prompt tokens: 14
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:53:27][INFO] Returning embeddings (not shown in logs)
[2025-09-12 00:53:27][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:53:32][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 1.01 ms / 24 runs ( 0.04 ms per token, 23762.38 tokens per second)
llama_perf_context_print: load time = 8894.17 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 5993.23 ms / 10 runs ( 599.32 ms per token, 1.67 tokens per second)
llama_perf_context_print: total time = 5995.08 ms / 11 tokens
llama_perf_context_print: graphs reused = 10
[2025-09-12 00:53:32][INFO][nethena-13b@q8_0] Model generated tool calls: []
[2025-09-12 00:53:32][INFO][nethena-13b@q8_0] Generated prediction: {
"id": "chatcmpl-ahoig45pfuw2yw3kjxjc2o",
"object": "chat.completion",
"created": 1757656406,
"model": "nethena-13b@q8_0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! How can I help you today?",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 10,
"total_tokens": 24
},
"stats": {},
"system_fingerprint": "nethena-13b@q8_0"
}
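
The embeddings call interleaved with the chat request above can be sketched the same way. The payload mirrors the logged request body; the response vectors themselves are not shown in the log ("Returning embeddings (not shown in logs)"), so the data[0].embedding access below assumes the usual OpenAI-style embeddings response, and the expected length of 768 is inferred from n_embd = 768 in the nomic-embed-text load output earlier in the log.

```python
# Sketch only: the /v1/embeddings request logged above.
import json
import urllib.request

payload = {
    "model": "text-embedding-nomic-embed-text-v1.5",
    "input": ["Test input"],
}

req = urllib.request.Request(
    "http://localhost:12345/v1/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# One vector per input; expected dimension 768 given n_embd = 768 in the load log.
print(len(result["data"][0]["embedding"]))
```
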
[2025-09-12 00:58:45][DEBUG] Received request: GET to /api/v1/models
[2025-09-12 00:58:45][INFO] Returning 20 models from v1 API
[2025-09-12 00:59:34][DEBUG] Received request: GET to /api/tags
[2025-09-12 00:59:34][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
[2025-09-12 00:59:37][DEBUG] Received request: GET to /api/tags
[2025-09-12 00:59:37][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
[2025-09-12 01:00:15][DEBUG] Received request: GET to /v1/completions/api/v1/models
[2025-09-12 01:00:15][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway
[2025-09-12 01:00:51][DEBUG] Received request: GET to /v1/completions/api/v1/models
[2025-09-12 01:00:51][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway
[2025-09-12 01:01:13][DEBUG] Received request: GET to /v1/completions/api/v1/models
[2025-09-12 01:01:13][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway
[2025-09-12 01:01:28][DEBUG] Received request: GET to /api/tags
[2025-09-12 01:01:28][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
[2025-09-12 01:01:30][DEBUG] Received request: GET to /v1/completions/api/v1/models
[2025-09-12 01:01:30][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway
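
Purely as a guess at why paths like /v1/chat/completions/api/tags and /v1/completions/api/v1/models keep appearing: /api/tags and /api/v1/models are Ollama-style discovery routes, and the doubled paths look like a client appending those probes onto a base URL that already contains a full OpenAI route. The sketch below only illustrates how such a path can be formed by naive concatenation; the log does not confirm which client is responsible.

```python
# Illustrative only: one way "/v1/chat/completions/api/tags" can be constructed,
# assuming a client joins an Ollama-style probe onto an over-specified base URL.
base_url = "http://localhost:12345/v1/chat/completions"  # full route mistakenly used as base
probe = "api/tags"                                        # Ollama-style model listing path

request_path = base_url.rstrip("/") + "/" + probe
print(request_path)  # -> http://localhost:12345/v1/chat/completions/api/tags
```
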