@raz334
Created September 12, 2025 06:18
[2025-09-12 00:09:17][DEBUG] Received request: GET to /v1/chat/completions/api/tags
[2025-09-12 00:09:17][ERROR] Unexpected endpoint or method. (GET /v1/chat/completions/api/tags). Returning 200 anyway
[2025-09-12 00:09:30][DEBUG] Received request: GET to /v1/embeddings/api/tags
[2025-09-12 00:09:30][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway
[2025-09-12 00:09:38][DEBUG] Received request: GET to /v1/embeddings/api/tags
[2025-09-12 00:09:38][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway
[2025-09-12 00:09:46][DEBUG] Received request: GET to /v1/chat/completions/api/v1/models
[2025-09-12 00:09:46][ERROR] Unexpected endpoint or method. (GET /v1/chat/completions/api/v1/models). Returning 200 anyway
[2025-09-12 00:10:40][DEBUG] Received request: GET to /v1/embeddings/api/tags
[2025-09-12 00:10:40][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway
[2025-09-12 00:10:47][DEBUG] Received request: GET to /v1/embeddings/api/tags
[2025-09-12 00:10:47][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway
[2025-09-12 00:10:56][DEBUG] Received request: GET to /v1/api/tags
[2025-09-12 00:10:56][ERROR] Unexpected endpoint or method. (GET /v1/api/tags). Returning 200 anyway
[2025-09-12 00:10:58][DEBUG] Received request: GET to /api/tags
[2025-09-12 00:10:58][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
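The errors above show Ollama-style paths (/api/tags, /api/v1/models) being appended to OpenAI-style endpoints, which suggests a client whose base URL points at a full endpoint rather than at the /v1 root. A minimal sketch of the intended setup, assuming the OpenAI Python SDK and LM Studio's default address http://localhost:1234 (host, port, and the placeholder API key are assumptions, not taken from this log):

    # Point the client at the /v1 root so the SDK appends its own endpoint
    # paths (/chat/completions, /embeddings, /models) instead of stacking them.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:1234/v1",  # assumed default LM Studio address
        api_key="lm-studio",                  # placeholder; LM Studio ignores the key
    )
    print([m.id for m in client.models.list().data])  # resolves to GET /v1/models
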
[2025-09-12 00:12:07][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:12:07][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:12:07][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '8192':
Model: 11.16 GB
Context: 1.13 GB
Total: 12.29 GB
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[2025-09-12 00:12:07][DEBUG][LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:12:07][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:12:07][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:12:07][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
[2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:12:07][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:12:07][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:12:07][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:12:11][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
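The logged sizes are mutually consistent; a quick arithmetic sketch using only figures already printed above (no new data):

    # Cross-check of the logged model sizes.
    params, bpw = 20.91e9, 8.72            # print_info: model params / BPW
    print(params * bpw / 8 / 2**30)        # ~21.23 GiB, matches "file size = 21.23 GiB"
    print((9967.55 + 11768.32) / 1024)     # Vulkan0 + CPU_Mapped buffers ~= 21.23 GiB
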
[2025-09-12 00:12:19][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:12:19][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:12:19][DEBUG] llama_kv_cache: CPU KV buffer size = 561.00 MiB
[2025-09-12 00:12:19][DEBUG] llama_kv_cache: size = 561.00 MiB ( 8192 cells, 44 layers, 1/1 seqs), K (q8_0): 280.50 MiB, V (q8_0): 280.50 MiB
[2025-09-12 00:12:19][DEBUG] llama_context: Vulkan0 compute buffer size = 1384.00 MiB
llama_context: Vulkan_Host compute buffer size = 28.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:12:19][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:12:20][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
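The 561.00 MiB KV buffer above follows from the logged shape: 44 layers, 8192 cells, n_embd_k_gqa = n_embd_v_gqa = 768, with K and V stored as q8_0 (34 bytes per 32-value block). A worked check using only those logged values:

    # KV cache size check: K and V are each 280.5 MiB, 561.0 MiB total.
    layers, cells, kv_dim = 44, 8192, 768        # n_layer, n_ctx, n_embd_k_gqa
    bytes_per_value = 34 / 32                    # q8_0: 32 int8 values + 2-byte scale
    k_mib = layers * cells * kv_dim * bytes_per_value / 2**20
    print(k_mib, 2 * k_mib)                      # 280.5  561.0
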
[2025-09-12 00:14:05][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:14:05][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:14:05][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
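A hedged sketch of the two client calls that would produce the request bodies above, again assuming the OpenAI Python SDK against the local LM Studio server (the address and the typical_p passthrough via extra_body are assumptions):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    client.embeddings.create(
        model="reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
        input=["Test input"],
    )
    reply = client.chat.completions.create(
        model="reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
        temperature=0.7,
        top_p=1,
        max_tokens=8192,
        messages=[{"role": "user", "content": "Hi"}],
        extra_body={"typical_p": 1},  # non-standard field, passed through verbatim
    )
    print(reply.choices[0].message.content)
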
[2025-09-12 00:14:05][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:14:05][DEBUG] Sampling params:
[2025-09-12 00:14:05][DEBUG] repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:14:05][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
[2025-09-12 00:14:05][DEBUG] Total prompt tokens: 8
Prompt tokens to decode: 8
BeginProcessingPrompt
[2025-09-12 00:14:07][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:14:07][DEBUG] No tokens to output. Continuing generation
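The sampler chain logged above runs top-k, then top-p, then min-p, then temperature before drawing a token. A simplified sketch of just those four stages with the logged values (top_k=40, top_p=1.0, min_p=0.05, temp=0.7); penalties, DRY, XTC, and typical-p are omitted, and this is an illustration rather than llama.cpp's actual code:

    import numpy as np

    def sample(logits, top_k=40, top_p=1.0, min_p=0.05, temp=0.7):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # top-k: keep the top_k most likely tokens
        if top_k < len(probs):
            probs[probs < np.sort(probs)[-top_k]] = 0.0
        # top-p: keep the smallest set whose cumulative probability reaches top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order]) / probs.sum()
        probs[order[np.searchsorted(cum, top_p) + 1:]] = 0.0
        # min-p: drop tokens below min_p times the best token's probability
        probs[probs < min_p * probs.max()] = 0.0
        # temperature (equivalent to dividing the logits by temp), then sample
        probs = probs ** (1.0 / temp)
        return np.random.choice(len(probs), p=probs / probs.sum())

    token_id = sample(np.random.randn(100352))   # vocab size taken from the log
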
[2025-09-12 00:14:07][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:14:07][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[2025-09-12 00:14:07][DEBUG][LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:14:07][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:14:07][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:14:07][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:14:07][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:14:07][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:14:07][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:14:07][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:14:07][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:14:07][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
[2025-09-12 00:14:07][DEBUG] print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:14:11][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:14:20][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:14:21][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.12 ms / 35 runs ( 0.26 ms per token, 3836.46 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 1112.75 ms / 8 tokens ( 139.09 ms per token, 7.19 tokens per second)
llama_perf_context_print: eval time = 14062.28 ms / 26 runs ( 540.86 ms per token, 1.85 tokens per second)
llama_perf_context_print: total time = 15193.48 ms / 34 tokens
llama_perf_context_print: graphs reused = 25
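The throughput figures in the perf block follow directly from the timings; a quick check with the numbers as logged:

    # 8 prompt tokens in 1112.75 ms and 26 generated tokens in 14062.28 ms.
    print(8 / 1112.75 * 1000)       # ~7.19 prompt tokens per second
    print(26 / 14062.28 * 1000)     # ~1.85 generated tokens per second
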
[2025-09-12 00:14:21][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:14:21][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-ougj4mogwoqqolfz7ayoj",
"object": "chat.completion",
"created": 1757654045,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think.\n\nFirst, since they greeted",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 27,
"total_tokens": 35
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:14:22][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:14:22][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:14:22][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:14:23][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:14:23][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:14:24][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:14:36][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:14:36][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:14:36][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:14:36][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:14:36][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
[2025-09-12 00:14:36][DEBUG] Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:14:36][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:14:36][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:14:38][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:14:38][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:14:38][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:14:38][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:14:38][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:14:38][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:14:38][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:14:38][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:14:38][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:14:39][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:14:39][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:14:39][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
[2025-09-12 00:14:39][DEBUG] print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:14:41][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:14:51][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:14:51][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.53 ms / 36 runs ( 0.26 ms per token, 3777.94 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15096.70 ms / 28 runs ( 539.17 ms per token, 1.85 tokens per second)
[2025-09-12 00:14:51][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:14:51][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-28i1pkpa7owj92zk46b7s3p",
"object": "chat.completion",
"created": 1757654076,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation.\n\n",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:14:51][DEBUG] llama_perf_context_print: total time = 15111.45 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:14:52][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:14:52][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:14:52][DEBUG] llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:14:52][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:14:52][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:14:52][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:14:53][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:15:13][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:15:13][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:15:13][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:15:13][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:15:13][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:15:13][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:15:14][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:15:14][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:15:15][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:15:15][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:15:15][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:15:15][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:15:15][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:15:15][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:15:15][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:15:15][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:15:15][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:15:16][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:15:16][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:15:16][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
[2025-09-12 00:15:16][DEBUG] print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:15:18][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:15:28][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:15:28][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.51 ms / 36 runs ( 0.26 ms per token, 3786.29 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15154.53 ms / 28 runs ( 541.23 ms per token, 1.85 tokens per second)
llama_perf_context_print: total time = 15168.26 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:15:28][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:15:28][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-7fs0j338wau9n5bvptz1gc",
"object": "chat.completion",
"created": 1757654113,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about the best way to greet them",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:15:29][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:15:29][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:15:29][DEBUG] llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:15:29][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:15:29][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:15:29][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:15:31][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:16:01][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:16:01][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:16:01][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:16:01][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:16:01][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:16:02][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:16:02][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:16:03][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:16:03][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:16:03][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[2025-09-12 00:16:03][DEBUG][LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:16:04][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:16:04][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:16:04][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:16:04][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:16:04][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:16:04][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:16:04][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:16:04][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:16:04][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
[2025-09-12 00:16:04][DEBUG] print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:16:07][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:16:16][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:16:16][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 10.85 ms / 36 runs ( 0.30 ms per token, 3319.50 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15213.16 ms / 28 runs ( 543.33 ms per token, 1.84 tokens per second)
llama_perf_context_print: total time = 15228.41 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:16:16][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:16:16][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-phk3xmfkolzs9eisnnkw",
"object": "chat.completion",
"created": 1757654161,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about the best way to greet them",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:16:18][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:16:18][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:16:18][DEBUG] llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:16:18][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:16:19][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:16:19][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:16:20][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:16:54][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 8192,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:16:54][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:16:54][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:16:54][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:16:54][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:16:55][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:16:55][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:16:56][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:16:56][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:16:56][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:16:56][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:16:57][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:16:57][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:16:57][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:16:57][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:16:57][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:16:57][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:16:57][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
[2025-09-12 00:16:57][DEBUG] print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:17:00][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:17:09][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:17:09][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.19 ms / 36 runs ( 0.26 ms per token, 3917.73 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15172.68 ms / 28 runs ( 541.88 ms per token, 1.85 tokens per second)
llama_perf_context_print: total time = 15186.49 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:17:09][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:17:09][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-cmi03v0wqdhixfj5bbn1om",
"object": "chat.completion",
"created": 1757654214,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me start by acknowledging their greeting.\n\nHmm,",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:17:10][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:17:10][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:17:11][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:17:11][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:17:11][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:17:12][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:17:48][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:17:48][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:17:48][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:17:48][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:17:48][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:17:48][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:17:49][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:17:49][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:17:51][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:17:51][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:17:51][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:17:51][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:17:51][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:17:51][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:17:51][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:17:51][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:17:51][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:17:52][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:17:52][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:17:52][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
[2025-09-12 00:17:52][DEBUG] print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:17:55][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:18:03][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:18:03][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.80 ms / 36 runs ( 0.27 ms per token, 3673.47 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15197.28 ms / 28 runs ( 542.76 ms per token, 1.84 tokens per second)
llama_perf_context_print: total time = 15211.42 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:18:03][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:18:03][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-sa2a6b2k1jg1btir6v3m",
"object": "chat.completion",
"created": 1757654268,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation.\n\n",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:18:05][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:18:05][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:18:05][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:18:05][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:18:05][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:18:07][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:18:33][INFO]
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:18:33][INFO]
[2025-09-12 00:18:33][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:18:45][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:18:45][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:18:45][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:18:45][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:18:45][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:18:45][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:18:46][DEBUG] FinishedProcessingPrompt. Progress: 100
No tokens to output. Continuing generation
[2025-09-12 00:18:48][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:18:48][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:18:48][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[2025-09-12 00:18:48][DEBUG][LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:18:48][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:18:48][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:18:48][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:18:48][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:18:48][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:18:48][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
[2025-09-12 00:18:48][DEBUG] print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:18:48][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:18:48][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:18:48][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
[2025-09-12 00:18:48][DEBUG] print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:18:51][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:19:01][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:19:01][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 10.06 ms / 36 runs ( 0.28 ms per token, 3579.60 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15165.60 ms / 28 runs ( 541.63 ms per token, 1.85 tokens per second)
llama_perf_context_print: total time = 15180.15 ms / 29 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:19:01][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:19:01][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-zcdn1g8f3pi21sby0xvp9q",
"object": "chat.completion",
"created": 1757654325,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. First, I should acknowledge their greeting. Maybe say",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:19:02][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:19:02][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:19:02][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:19:02][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:19:02][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:19:04][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:19:11][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing!
[2025-09-12 00:19:11][INFO]
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings
[2025-09-12 00:19:11][INFO]
[2025-09-12 00:19:11][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:19:16][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:19:16][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:19:16][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"input": [
"Test input"
]
}
[2025-09-12 00:19:16][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now...
[2025-09-12 00:19:16][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:19:16][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix
Total prompt tokens: 8
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:19:17][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:19:17][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:19:18][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: OFF
[2025-09-12 00:19:18][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:19:18][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771':
Model: 11.16 GB
Context: 2.26 GB
Total: 13.42 GB
[LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'.
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:19:19][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:19:19][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:19:19][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:19:19][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:19:19][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:19:19][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:19:19][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:19:19][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:19:19][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
[2025-09-12 00:19:19][DEBUG] print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:19:22][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:19:31][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:19:31][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.34 ms / 36 runs ( 0.26 ms per token, 3855.63 tokens per second)
llama_perf_context_print: load time = 13243.14 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 15091.80 ms / 28 runs ( 538.99 ms per token, 1.86 tokens per second)
llama_perf_context_print: total time = 15105.49 ms / 29 tokens
[2025-09-12 00:19:31][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:19:31][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-ong1jptqvmfus4sqcmu9c",
"object": "chat.completion",
"created": 1757654356,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation here",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:19:31][DEBUG] llama_perf_context_print: graphs reused = 28
[2025-09-12 00:19:32][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16771
llama_context: n_ctx_per_seq = 16771
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:19:32][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 1157.06 MiB
[2025-09-12 00:19:33][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB
[2025-09-12 00:19:33][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 45.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 290 (with bs=512), 47 (with bs=1)
[2025-09-12 00:19:33][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:19:33][INFO]
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:19:33][INFO]
[2025-09-12 00:19:33][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:19:34][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:19:35][INFO]
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:19:35][INFO]
[2025-09-12 00:19:35][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:19:57][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:19:57][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "",
"input": [
"Test input"
]
}
[2025-09-12 00:20:03][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:20:03][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "",
"input": [
"Test input"
]
}
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:21:22][INFO]
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:21:22][INFO]
[2025-09-12 00:21:22][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:26:09][INFO]
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:26:09][INFO]
[2025-09-12 00:26:09][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:27:53][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
[2025-09-12 00:27:53][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:27:53][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '8192':
Model: 11.16 GB
Context: 1.43 GB
Total: 12.58 GB
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:27:53][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:27:53][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:27:53][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
[2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:27:54][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:27:54][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:27:54][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
[2025-09-12 00:27:54][DEBUG] print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:27:56][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:28:05][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:28:05][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:28:05][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 280.50 MiB
[2025-09-12 00:28:05][DEBUG] llama_kv_cache: CPU KV buffer size = 280.50 MiB
[2025-09-12 00:28:05][DEBUG] llama_kv_cache: size = 561.00 MiB ( 8192 cells, 44 layers, 1/1 seqs), K (q8_0): 280.50 MiB, V (q8_0): 280.50 MiB
[2025-09-12 00:28:05][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 28.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 246 (with bs=512), 3 (with bs=1)
[2025-09-12 00:28:05][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:28:06][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:28:47][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client created.
[2025-09-12 00:28:47][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor] Register with LM Studio
[2025-09-12 00:28:47][DEBUG][Client=plugin:installed:lmstudio/rag-v1][Endpoint=setPromptPreprocessor] Registering promptPreprocessor.
[2025-09-12 00:28:47][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor][Request (sKQs11)] New preprocess request received.
[2025-09-12 00:28:47][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor][Request (sKQs11)] Preprocess request completed.
[2025-09-12 00:28:47][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:28:47][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = -1, n_keep = 9
Total prompt tokens: 9
Prompt tokens to decode: 9
BeginProcessingPrompt
[2025-09-12 00:28:48][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:28:48][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:29:03][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 8.82 ms / 39 runs ( 0.23 ms per token, 4423.27 tokens per second)
llama_perf_context_print: load time = 12603.59 ms
llama_perf_context_print: prompt eval time = 1035.86 ms / 9 tokens ( 115.10 ms per token, 8.69 tokens per second)
llama_perf_context_print: eval time = 14731.45 ms / 29 runs ( 507.98 ms per token, 1.97 tokens per second)
llama_perf_context_print: total time = 15780.55 ms / 38 tokens
llama_perf_context_print: graphs reused = 28
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:29:18][INFO]
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:29:18][INFO]
[2025-09-12 00:29:18][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
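The endpoint listing above can be sanity-checked with a plain OpenAI-style request. A minimal sketch using only the Python standard library, assuming the server is still listening on localhost:12345 as logged and that the response follows the usual OpenAI "data: [{id: ...}]" shape:

    import json
    import urllib.request

    BASE_URL = "http://localhost:12345/v1"  # assumption: same port the log shows the server listening on

    # GET /v1/models -- list the models the server exposes (OpenAI-compatible shape assumed)
    with urllib.request.urlopen(f"{BASE_URL}/models") as resp:
        models = json.load(resp)

    for entry in models.get("data", []):
        print(entry["id"])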
[2025-09-12 00:29:18][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client disconnected.
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:29:19][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing!
[2025-09-12 00:29:19][INFO]
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings
[2025-09-12 00:29:19][INFO]
[2025-09-12 00:29:19][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:29:50][INFO] Server stopped.
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:30:01][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing!
[2025-09-12 00:30:01][INFO]
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings
[2025-09-12 00:30:01][INFO]
[2025-09-12 00:30:01][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:30:01][INFO] Server started.
[2025-09-12 00:30:01][INFO] Just-in-time model loading active.
[2025-09-12 00:30:02][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox] Client created.
[2025-09-12 00:30:02][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client created.
[2025-09-12 00:30:02][INFO][Plugin(lmstudio/js-code-sandbox)] stdout: [Tools Prvdr.] Register with LM Studio
[2025-09-12 00:30:02][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox][Endpoint=setToolsProvider] Registering tools provider.
[2025-09-12 00:30:02][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor] Register with LM Studio
[2025-09-12 00:30:03][DEBUG][Client=plugin:installed:lmstudio/rag-v1][Endpoint=setPromptPreprocessor] Registering promptPreprocessor.
[2025-09-12 00:30:03][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client disconnected.
[2025-09-12 00:30:25][INFO] Server stopped.
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:31:07][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing!
[2025-09-12 00:31:07][INFO]
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings
[2025-09-12 00:31:07][INFO]
[2025-09-12 00:31:07][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:31:07][INFO] Server started.
[2025-09-12 00:31:07][INFO] Just-in-time model loading active.
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox] Client created.
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client created.
[2025-09-12 00:31:08][INFO][Plugin(lmstudio/js-code-sandbox)] stdout: [Tools Prvdr.] Register with LM Studio
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox][Endpoint=setToolsProvider] Registering tools provider.
[2025-09-12 00:31:08][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor] Register with LM Studio
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/rag-v1][Endpoint=setPromptPreprocessor] Registering promptPreprocessor.
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox] Client disconnected.
[2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client disconnected.
[2025-09-12 00:31:43][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
[2025-09-12 00:31:43][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:31:43][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '8192':
Model: 11.16 GB
Context: 1.43 GB
Total: 12.58 GB
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[2025-09-12 00:31:43][DEBUG][LM Studio] Resolved GPU config options:
Num Offload Layers: 22
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:31:44][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:31:44][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:31:44][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Reka Flash 3
llama_model_loader: - kv 3: general.version str = 3
llama_model_loader: - kv 4: general.basename str = reka-flash
llama_model_loader: - kv 5: general.size_label str = 21B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: llama.block_count u32 = 44
llama_model_loader: - kv 8: llama.context_length u32 = 32768
llama_model_loader: - kv 9: llama.embedding_length u32 = 6144
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.vocab_size u32 = 100352
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx
[2025-09-12 00:31:44][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
[2025-09-12 00:31:44][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-09-12 00:31:44][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 7
llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180
llama_model_loader: - type f32: 89 tensors
llama_model_loader: - type q8_0: 309 tensors
llama_model_loader: - type bf16: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 21.23 GiB (8.72 BPW)
[2025-09-12 00:31:44][DEBUG] load: printing all EOG tokens:
load: - 100257 ('<|endoftext|>')
[2025-09-12 00:31:44][DEBUG] load: special tokens cache size = 21
[2025-09-12 00:31:44][DEBUG] load: token to piece cache size = 0.6145 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_layer = 44
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 96
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 96
print_info: n_embd_head_v = 96
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 19648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
[2025-09-12 00:31:44][DEBUG] print_info: rope scaling = linear
print_info: freq_base_train = 8000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Reka Flash 3
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|endoftext|>'
print_info: EOS token = 100257 '<|endoftext|>'
print_info: EOT token = 100257 '<|endoftext|>'
print_info: UNK token = 100257 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: EOG token = 100257 '<|endoftext|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:31:47][DEBUG] load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/45 layers to GPU
load_tensors: Vulkan0 model buffer size = 9967.55 MiB
load_tensors: CPU_Mapped model buffer size = 11768.32 MiB
[2025-09-12 00:31:55][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 8000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[2025-09-12 00:31:55][DEBUG] llama_context: CPU output buffer size = 0.38 MiB
[2025-09-12 00:31:56][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 280.50 MiB
[2025-09-12 00:31:56][DEBUG] llama_kv_cache: CPU KV buffer size = 280.50 MiB
[2025-09-12 00:31:56][DEBUG] llama_kv_cache: size = 561.00 MiB ( 8192 cells, 44 layers, 1/1 seqs), K (q8_0): 280.50 MiB, V (q8_0): 280.50 MiB
[2025-09-12 00:31:56][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB
llama_context: Vulkan_Host compute buffer size = 28.01 MiB
llama_context: graph nodes = 1371
llama_context: graph splits = 246 (with bs=512), 3 (with bs=1)
[2025-09-12 00:31:56][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:31:56][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:33:47][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 254,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:33:47][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:33:47][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:33:47][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8
Total prompt tokens: 8
Prompt tokens to decode: 8
BeginProcessingPrompt
[2025-09-12 00:33:48][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:33:48][DEBUG] No tokens to output. Continuing generation
[2025-09-12 00:34:02][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.)
[2025-09-12 00:34:02][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 9.17 ms / 37 runs ( 0.25 ms per token, 4033.14 tokens per second)
llama_perf_context_print: load time = 12853.96 ms
llama_perf_context_print: prompt eval time = 820.59 ms / 8 tokens ( 102.57 ms per token, 9.75 tokens per second)
llama_perf_context_print: eval time = 14363.18 ms / 28 runs ( 512.97 ms per token, 1.95 tokens per second)
llama_perf_context_print: total time = 15197.21 ms / 36 tokens
llama_perf_context_print: graphs reused = 27
[2025-09-12 00:34:02][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: []
[2025-09-12 00:34:02][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: {
"id": "chatcmpl-t5ql74xf1rih385fceeszc",
"object": "chat.completion",
"created": 1757655227,
"model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation.\n\nFirst",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 29,
"total_tokens": 37
},
"stats": {},
"system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix"
}
[2025-09-12 00:34:48][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[2025-09-12 00:34:48][DEBUG][WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:34:48][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:34:48][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:34:48][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:34:48][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:34:48][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:34:48][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:34:48][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
[2025-09-12 00:34:48][DEBUG] print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:34:48][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:34:48][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:34:48][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
[2025-09-12 00:34:48][DEBUG] llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:34:48][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:34:48][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:34:49][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:35:15][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:35:15][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now...
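The request above names an embedding model but is sent to /v1/chat/completions, which is likely why the server keeps JIT-loading nomic-embed-text-v1.5 without ever returning chat output. A minimal sketch of a call that matches the model type, using the /v1/embeddings endpoint advertised earlier in the log (standard library only; the model identifier is taken from the request body above):

    import json
    import urllib.request

    BASE_URL = "http://localhost:12345/v1"  # assumption: same server/port as in the log

    payload = {
        "model": "text-embedding-nomic-embed-text-v1.5",  # model id from the request body above
        "input": "Hi",
    }

    req = urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)

    # OpenAI-compatible responses put the vector at data[0].embedding
    print(len(result["data"][0]["embedding"]))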
[2025-09-12 00:35:16][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:35:16][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:35:16][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:35:16][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:35:16][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:35:16][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:35:16][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:35:16][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
[2025-09-12 00:35:16][DEBUG] print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:35:17][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:35:17][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:35:17][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:35:17][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:35:17][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:35:17][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:35:34][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:35:34][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now...
[2025-09-12 00:35:36][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:35:36][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:35:36][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:35:36][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:35:36][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:35:36][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:35:36][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:35:36][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
[2025-09-12 00:35:36][DEBUG] print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:35:36][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:35:36][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:35:36][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:35:36][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:35:36][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:35:36][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:35:58][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:35:58][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now...
[2025-09-12 00:36:00][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:36:00][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:36:00][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:36:00][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:36:00][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:36:00][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:36:00][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:36:00][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
[2025-09-12 00:36:00][DEBUG] print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:36:00][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:36:00][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:36:00][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:36:00][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:36:00][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:36:00][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:36:00][DEBUG]
[2025-09-12 00:36:20][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:36:20][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now...
[2025-09-12 00:36:22][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:36:22][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:36:22][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:36:22][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:36:23][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:36:23][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:36:23][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:36:23][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
[2025-09-12 00:36:23][DEBUG] print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:36:23][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:36:23][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:36:23][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:36:23][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:36:23][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:36:23][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:37:47][INFO]
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:37:47][INFO]
[2025-09-12 00:37:47][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:41:47][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
[2025-09-12 00:41:47][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:41:47][DEBUG][LM Studio] Model load size estimate with raw num offload layers 'max' and context length '2048':
Model: 13.81 GB
Context: 1.09 GB
Total: 14.91 GB
[2025-09-12 00:41:47][DEBUG][LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[LM Studio] Resolved GPU config options:
Num Offload Layers: max
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:41:48][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:41:48][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:41:48][DEBUG] llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from D:\AI-Models\__LMStudio\Random\Nethena-13B\Nethena-13B.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 7
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
[2025-09-12 00:41:48][DEBUG] llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
[2025-09-12 00:41:48][DEBUG] llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
[2025-09-12 00:41:48][DEBUG] llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 12.88 GiB (8.50 BPW)
[2025-09-12 00:41:48][DEBUG] load: bad special token: 'tokenizer.ggml.padding_token_id' = 32000, using default id -1
[2025-09-12 00:41:48][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('</s>')
load: special tokens cache size = 3
[2025-09-12 00:41:48][DEBUG] load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 4096
print_info: n_embd = 5120
print_info: n_layer = 40
print_info: n_head = 40
print_info: n_head_kv = 40
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
[2025-09-12 00:41:48][DEBUG] print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 5120
print_info: n_embd_v_gqa = 5120
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 13824
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 13B
print_info: model params = 13.02 B
print_info: general.name = LLaMA v2
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:41:59][DEBUG] load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: Vulkan0 model buffer size = 13023.85 MiB
load_tensors: CPU_Mapped model buffer size = 166.02 MiB
[2025-09-12 00:43:06][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
[2025-09-12 00:43:06][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
[2025-09-12 00:43:06][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 850.00 MiB
[2025-09-12 00:43:06][DEBUG] llama_kv_cache: size = 850.00 MiB ( 2048 cells, 40 layers, 1/1 seqs), K (q8_0): 425.00 MiB, V (q8_0): 425.00 MiB
[2025-09-12 00:43:06][DEBUG] llama_context: Vulkan0 compute buffer size = 117.01 MiB
llama_context: Vulkan_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 1247
llama_context: graph splits = 2
common_init_from_params: added </s> logit bias = -inf
[2025-09-12 00:43:06][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:43:07][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:43:28][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:43:28][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:43:28][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:43:28][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:43:28][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:43:28][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:43:28][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:43:28][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
[2025-09-12 00:43:28][DEBUG] print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:43:28][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:43:28][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:43:28][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:43:28][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:43:28][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:43:28][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:43:28][DEBUG]
[2025-09-12 00:44:03][INFO] Server stopped.
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345
[2025-09-12 00:44:06][INFO]
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] Supported endpoints:
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings
[2025-09-12 00:44:06][INFO]
[2025-09-12 00:44:06][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs
[2025-09-12 00:44:06][INFO] Server started.
[2025-09-12 00:44:06][INFO] Just-in-time model loading active.
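
A minimal sketch of how a client might hit the first endpoint listed above (GET /v1/models). The port 12345 comes from the log; the OpenAI-style response shape (a "data" list whose entries carry an "id") is an assumption based on the API convention LM Studio advertises here, not something shown verbatim in this log.

```python
# Sketch only: list models from the OpenAI-compatible endpoint logged above.
# Assumes the server is reachable at localhost:12345 as in the log.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:12345/v1/models") as resp:
    models = json.load(resp)

# Each entry's "id" is the name to pass as "model" in completion requests.
for entry in models.get("data", []):
    print(entry["id"])
```
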
[2025-09-12 00:45:26][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine...
[WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior.
[INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
[2025-09-12 00:45:26][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:45:26][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 15
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
[2025-09-12 00:45:26][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
[2025-09-12 00:45:26][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
[2025-09-12 00:45:26][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type q4_K: 43 tensors
llama_model_loader: - type q5_K: 12 tensors
llama_model_loader: - type q6_K: 6 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 79.49 MiB (4.88 BPW)
[2025-09-12 00:45:26][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 102 ('[SEP]')
load: special tokens cache size = 5
[2025-09-12 00:45:26][DEBUG] load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
[2025-09-12 00:45:26][DEBUG] print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:45:26][DEBUG] load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: Vulkan0 model buffer size = 66.90 MiB
load_tensors: CPU_Mapped model buffer size = 12.59 MiB
[2025-09-12 00:45:26][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
[2025-09-12 00:45:26][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
llama_context: Flash Attention was auto, set to enabled
[2025-09-12 00:45:26][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB
llama_context: Vulkan_Host compute buffer size = 22.03 MiB
llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1)
llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1)
common_init_from_params: added [SEP] logit bias = -inf
[2025-09-12 00:45:26][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:45:26][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete!
[INFO] [PaniniRagEngine] Model loaded into embedding engine!
[2025-09-12 00:45:26][DEBUG][INFO] [PaniniRagEngine] Model loaded without an active session.
[2025-09-12 00:45:45][DEBUG][LM Studio] GPU Configuration:
Strategy: evenly
Priority: []
Disabled GPUs: []
Limit weight offload to dedicated GPU Memory: OFF
Offload KV Cache to GPU: ON
[2025-09-12 00:45:45][DEBUG][LM Studio] Live GPU memory info:
No live GPU info available
[2025-09-12 00:45:45][DEBUG][LM Studio] Model load size estimate with raw num offload layers 'max' and context length '2048':
Model: 13.81 GB
Context: 1.09 GB
Total: 14.91 GB
[LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment
[2025-09-12 00:45:45][DEBUG][LM Studio] Resolved GPU config options:
Num Offload Layers: max
Num CPU Expert Layers: 0
Main GPU: 0
Tensor Split: [0]
Disabled GPUs: []
[2025-09-12 00:45:45][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[2025-09-12 00:45:45][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free
[2025-09-12 00:45:45][DEBUG] llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from D:\AI-Models\__LMStudio\Random\Nethena-13B\Nethena-13B.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 7
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
[2025-09-12 00:45:45][DEBUG] llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
[2025-09-12 00:45:45][DEBUG] llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
[2025-09-12 00:45:45][DEBUG] llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 12.88 GiB (8.50 BPW)
[2025-09-12 00:45:45][DEBUG] load: bad special token: 'tokenizer.ggml.padding_token_id' = 32000, using default id -1
[2025-09-12 00:45:45][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('</s>')
load: special tokens cache size = 3
[2025-09-12 00:45:45][DEBUG] load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 4096
print_info: n_embd = 5120
print_info: n_layer = 40
print_info: n_head = 40
print_info: n_head_kv = 40
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 5120
print_info: n_embd_v_gqa = 5120
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 13824
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 13B
print_info: model params = 13.02 B
print_info: general.name = LLaMA v2
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
[2025-09-12 00:45:48][DEBUG] load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: Vulkan0 model buffer size = 13023.85 MiB
load_tensors: CPU_Mapped model buffer size = 166.02 MiB
[2025-09-12 00:45:53][DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
[2025-09-12 00:45:53][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB
[2025-09-12 00:45:53][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 850.00 MiB
[2025-09-12 00:45:54][DEBUG] llama_kv_cache: size = 850.00 MiB ( 2048 cells, 40 layers, 1/1 seqs), K (q8_0): 425.00 MiB, V (q8_0): 425.00 MiB
[2025-09-12 00:45:54][DEBUG] llama_context: Vulkan0 compute buffer size = 117.01 MiB
llama_context: Vulkan_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 1247
llama_context: graph splits = 2
common_init_from_params: added </s> logit bias = -inf
[2025-09-12 00:45:54][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[2025-09-12 00:45:54][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9
[2025-09-12 00:46:40][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "nethena-13b@q8_0",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:46:40][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:46:40][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:46:40][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14
Total prompt tokens: 14
Prompt tokens to decode: 14
BeginProcessingPrompt
[2025-09-12 00:46:41][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:46:46][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 0.91 ms / 23 runs ( 0.04 ms per token, 25358.32 tokens per second)
llama_perf_context_print: load time = 8894.17 ms
llama_perf_context_print: prompt eval time = 919.45 ms / 14 tokens ( 65.68 ms per token, 15.23 tokens per second)
llama_perf_context_print: eval time = 4928.79 ms / 8 runs ( 616.10 ms per token, 1.62 tokens per second)
llama_perf_context_print: total time = 5850.57 ms / 22 tokens
llama_perf_context_print: graphs reused = 7
[2025-09-12 00:46:46][INFO][nethena-13b@q8_0] Model generated tool calls: []
[2025-09-12 00:46:46][INFO][nethena-13b@q8_0] Generated prediction: {
"id": "chatcmpl-rdmujbhvp5pargg7qaws7",
"object": "chat.completion",
"created": 1757656000,
"model": "nethena-13b@q8_0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! How can I help you?",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 9,
"total_tokens": 23
},
"stats": {},
"system_fingerprint": "nethena-13b@q8_0"
}
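
For reference, a minimal sketch that reproduces the POST /v1/chat/completions request and reads the response exactly as logged above (model name, parameters, and the choices[0].message.content path are taken from the request and prediction bodies in the log; host and port are assumed from the server startup lines).

```python
# Sketch only: replay the chat completion request shown in the log above.
import json
import urllib.request

payload = {
    "model": "nethena-13b@q8_0",
    "temperature": 0.7,
    "top_p": 1,
    "typical_p": 1,
    "max_tokens": 2048,
    "messages": [{"role": "user", "content": "Hi"}],
}

req = urllib.request.Request(
    "http://localhost:12345/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

# The assistant text sits under choices[0].message.content, as in the logged prediction.
print(reply["choices"][0]["message"]["content"])
```
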
[2025-09-12 00:47:35][DEBUG] Received request: GET to /api/tags
[2025-09-12 00:47:35][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
[2025-09-12 00:48:08][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:48:08][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:48:34][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:48:34][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:48:40][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:48:40][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:49:04][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:49:04][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:49:08][DEBUG] Received request: GET to /v1/completions/api/tags
[2025-09-12 00:49:08][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway
[2025-09-12 00:49:43][DEBUG] Received request: GET to /api/v1/models
[2025-09-12 00:49:43][INFO] Returning 20 models from v1 API
[2025-09-12 00:49:58][DEBUG] Received request: GET to /api/v1/models
[2025-09-12 00:49:58][INFO] Returning 20 models from v1 API
[2025-09-12 00:52:37][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "nethena-13b@q8_0",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:52:37][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:52:37][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
[2025-09-12 00:52:37][DEBUG] Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 14/14 of prompt (100%), 14 prefix, 0 non-prefix
Total prompt tokens: 14
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:52:38][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:52:43][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 1.01 ms / 24 runs ( 0.04 ms per token, 23692.00 tokens per second)
llama_perf_context_print: load time = 8894.17 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 6072.72 ms / 10 runs ( 607.27 ms per token, 1.65 tokens per second)
llama_perf_context_print: total time = 6077.33 ms / 11 tokens
llama_perf_context_print: graphs reused = 10
[2025-09-12 00:52:43][INFO][nethena-13b@q8_0] Model generated tool calls: []
[2025-09-12 00:52:43][INFO][nethena-13b@q8_0] Generated prediction: {
"id": "chatcmpl-r2s94r9fkeqa5nt1tvgna5",
"object": "chat.completion",
"created": 1757656357,
"model": "nethena-13b@q8_0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! How can I help you today?",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 10,
"total_tokens": 24
},
"stats": {},
"system_fingerprint": "nethena-13b@q8_0"
}
[2025-09-12 00:52:55][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "nethena-13b@q8_0",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:52:55][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:52:55][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 14/14 of prompt (100%), 14 prefix, 0 non-prefix
Total prompt tokens: 14
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:52:56][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:53:01][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 0.93 ms / 23 runs ( 0.04 ms per token, 24651.66 tokens per second)
llama_perf_context_print: load time = 8894.17 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 5397.13 ms / 9 runs ( 599.68 ms per token, 1.67 tokens per second)
llama_perf_context_print: total time = 5398.86 ms / 10 tokens
llama_perf_context_print: graphs reused = 9
[2025-09-12 00:53:01][INFO][nethena-13b@q8_0] Model generated tool calls: []
[2025-09-12 00:53:01][INFO][nethena-13b@q8_0] Generated prediction: {
"id": "chatcmpl-fe5qlymqnimr3izqhwkyqq",
"object": "chat.completion",
"created": 1757656375,
"model": "nethena-13b@q8_0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! How can I help you?",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 9,
"total_tokens": 23
},
"stats": {},
"system_fingerprint": "nethena-13b@q8_0"
}
[2025-09-12 00:53:26][DEBUG] Received request: POST to /v1/chat/completions with body {
"model": "nethena-13b@q8_0",
"temperature": 0.7,
"top_p": 1,
"typical_p": 1,
"max_tokens": 2048,
"messages": [
{
"role": "user",
"content": "Hi"
}
]
}
[2025-09-12 00:53:26][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages.
[2025-09-12 00:53:26][DEBUG] Received request: POST to /v1/embeddings with body {
"model": "text-embedding-nomic-embed-text-v1.5",
"input": [
"Test input"
]
}
[2025-09-12 00:53:26][INFO] Received request to embed multiple: [
"Test input"
]
[2025-09-12 00:53:26][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
Sampling:
logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14
Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache
Cache reuse summary: 14/14 of prompt (100%), 14 prefix, 0 non-prefix
Total prompt tokens: 14
Prompt tokens to decode: 1
BeginProcessingPrompt
[2025-09-12 00:53:27][INFO] Returning embeddings (not shown in logs)
[2025-09-12 00:53:27][DEBUG] FinishedProcessingPrompt. Progress: 100
[2025-09-12 00:53:32][DEBUG] Target model llama_perf stats:
llama_perf_sampler_print: sampling time = 1.01 ms / 24 runs ( 0.04 ms per token, 23762.38 tokens per second)
llama_perf_context_print: load time = 8894.17 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 5993.23 ms / 10 runs ( 599.32 ms per token, 1.67 tokens per second)
llama_perf_context_print: total time = 5995.08 ms / 11 tokens
llama_perf_context_print: graphs reused = 10
[2025-09-12 00:53:32][INFO][nethena-13b@q8_0] Model generated tool calls: []
[2025-09-12 00:53:32][INFO][nethena-13b@q8_0] Generated prediction: {
"id": "chatcmpl-ahoig45pfuw2yw3kjxjc2o",
"object": "chat.completion",
"created": 1757656406,
"model": "nethena-13b@q8_0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! How can I help you today?",
"reasoning_content": "",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 10,
"total_tokens": 24
},
"stats": {},
"system_fingerprint": "nethena-13b@q8_0"
}
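
The embeddings call interleaved with the chat request above can be sketched the same way. The payload mirrors the logged request body; the response vectors themselves are not shown in the log ("Returning embeddings (not shown in logs)"), so the data[0].embedding access below assumes the usual OpenAI-style embeddings response, and the expected length of 768 is inferred from n_embd = 768 in the nomic-embed-text load output earlier in the log.

```python
# Sketch only: the /v1/embeddings request logged above.
import json
import urllib.request

payload = {
    "model": "text-embedding-nomic-embed-text-v1.5",
    "input": ["Test input"],
}

req = urllib.request.Request(
    "http://localhost:12345/v1/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# One vector per input; expected dimension 768 given n_embd = 768 in the load log.
print(len(result["data"][0]["embedding"]))
```
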
[2025-09-12 00:58:45][DEBUG] Received request: GET to /api/v1/models
[2025-09-12 00:58:45][INFO] Returning 20 models from v1 API
[2025-09-12 00:59:34][DEBUG] Received request: GET to /api/tags
[2025-09-12 00:59:34][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
[2025-09-12 00:59:37][DEBUG] Received request: GET to /api/tags
[2025-09-12 00:59:37][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
[2025-09-12 01:00:15][DEBUG] Received request: GET to /v1/completions/api/v1/models
[2025-09-12 01:00:15][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway
[2025-09-12 01:00:51][DEBUG] Received request: GET to /v1/completions/api/v1/models
[2025-09-12 01:00:51][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway
[2025-09-12 01:01:13][DEBUG] Received request: GET to /v1/completions/api/v1/models
[2025-09-12 01:01:13][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway
[2025-09-12 01:01:28][DEBUG] Received request: GET to /api/tags
[2025-09-12 01:01:28][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway
[2025-09-12 01:01:30][DEBUG] Received request: GET to /v1/completions/api/v1/models
[2025-09-12 01:01:30][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway
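
Purely as a guess at why paths like /v1/chat/completions/api/tags and /v1/completions/api/v1/models keep appearing: /api/tags and /api/v1/models are Ollama-style discovery routes, and the doubled paths look like a client appending those probes onto a base URL that already contains a full OpenAI route. The sketch below only illustrates how such a path can be formed by naive concatenation; the log does not confirm which client is responsible.

```python
# Illustrative only: one way "/v1/chat/completions/api/tags" can be constructed,
# assuming a client joins an Ollama-style probe onto an over-specified base URL.
base_url = "http://localhost:12345/v1/chat/completions"  # full route mistakenly used as base
probe = "api/tags"                                        # Ollama-style model listing path

request_path = base_url.rstrip("/") + "/" + probe
print(request_path)  # -> http://localhost:12345/v1/chat/completions/api/tags
```
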