Created September 12, 2025 06:18
| [2025-09-12 00:09:17][DEBUG] Received request: GET to /v1/chat/completions/api/tags | |
| [2025-09-12 00:09:17][ERROR] Unexpected endpoint or method. (GET /v1/chat/completions/api/tags). Returning 200 anyway | |
| [2025-09-12 00:09:30][DEBUG] Received request: GET to /v1/embeddings/api/tags | |
| [2025-09-12 00:09:30][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway | |
| [2025-09-12 00:09:38][DEBUG] Received request: GET to /v1/embeddings/api/tags | |
| [2025-09-12 00:09:38][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway | |
| [2025-09-12 00:09:46][DEBUG] Received request: GET to /v1/chat/completions/api/v1/models | |
| [2025-09-12 00:09:46][ERROR] Unexpected endpoint or method. (GET /v1/chat/completions/api/v1/models). Returning 200 anyway | |
| [2025-09-12 00:10:40][DEBUG] Received request: GET to /v1/embeddings/api/tags | |
| [2025-09-12 00:10:40][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway | |
| [2025-09-12 00:10:47][DEBUG] Received request: GET to /v1/embeddings/api/tags | |
| [2025-09-12 00:10:47][ERROR] Unexpected endpoint or method. (GET /v1/embeddings/api/tags). Returning 200 anyway | |
| [2025-09-12 00:10:56][DEBUG] Received request: GET to /v1/api/tags | |
| [2025-09-12 00:10:56][ERROR] Unexpected endpoint or method. (GET /v1/api/tags). Returning 200 anyway | |
| [2025-09-12 00:10:58][DEBUG] Received request: GET to /api/tags | |
| [2025-09-12 00:10:58][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway | |
| [2025-09-12 00:12:07][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: OFF | |
| [2025-09-12 00:12:07][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:12:07][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '8192': | |
| Model: 11.16 GB | |
| Context: 1.13 GB | |
| Total: 12.29 GB | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [2025-09-12 00:12:07][DEBUG][LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:12:07][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:12:07][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:12:07][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| [2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:12:07][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:12:07][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:12:07][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:12:07][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:12:11][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:12:19][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 8192 | |
| llama_context: n_ctx_per_seq = 8192 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:12:19][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| [2025-09-12 00:12:19][DEBUG] llama_kv_cache: CPU KV buffer size = 561.00 MiB | |
| [2025-09-12 00:12:19][DEBUG] llama_kv_cache: size = 561.00 MiB ( 8192 cells, 44 layers, 1/1 seqs), K (q8_0): 280.50 MiB, V (q8_0): 280.50 MiB | |
| [2025-09-12 00:12:19][DEBUG] llama_context: Vulkan0 compute buffer size = 1384.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 28.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 290 (with bs=512), 47 (with bs=1) | |
| [2025-09-12 00:12:19][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:12:20][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:14:05][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:14:05][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now... | |
| [2025-09-12 00:14:05][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 8192, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:14:05][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:14:05][DEBUG] Sampling params: | |
| [2025-09-12 00:14:05][DEBUG] repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| [2025-09-12 00:14:05][DEBUG] Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8 | |
| [2025-09-12 00:14:05][DEBUG] Total prompt tokens: 8 | |
| Prompt tokens to decode: 8 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:14:07][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:14:07][DEBUG] No tokens to output. Continuing generation | |
| [2025-09-12 00:14:07][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: OFF | |
| [2025-09-12 00:14:07][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771': | |
| Model: 11.16 GB | |
| Context: 2.26 GB | |
| Total: 13.42 GB | |
| [2025-09-12 00:14:07][DEBUG][LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'. | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:14:07][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:14:07][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:14:07][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:14:07][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:14:07][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:14:07][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:14:07][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:14:07][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:14:07][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| [2025-09-12 00:14:07][DEBUG] print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:14:11][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:14:20][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.) | |
| [2025-09-12 00:14:21][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 9.12 ms / 35 runs ( 0.26 ms per token, 3836.46 tokens per second) | |
| llama_perf_context_print: load time = 13243.14 ms | |
| llama_perf_context_print: prompt eval time = 1112.75 ms / 8 tokens ( 139.09 ms per token, 7.19 tokens per second) | |
| llama_perf_context_print: eval time = 14062.28 ms / 26 runs ( 540.86 ms per token, 1.85 tokens per second) | |
| llama_perf_context_print: total time = 15193.48 ms / 34 tokens | |
| llama_perf_context_print: graphs reused = 25 | |
| [2025-09-12 00:14:21][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: [] | |
| [2025-09-12 00:14:21][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: { | |
| "id": "chatcmpl-ougj4mogwoqqolfz7ayoj", | |
| "object": "chat.completion", | |
| "created": 1757654045, | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think.\n\nFirst, since they greeted", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 8, | |
| "completion_tokens": 27, | |
| "total_tokens": 35 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix" | |
| } | |
| [2025-09-12 00:14:22][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 16771 | |
| llama_context: n_ctx_per_seq = 16771 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:14:22][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| llama_kv_cache: CPU KV buffer size = 1157.06 MiB | |
| [2025-09-12 00:14:22][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB | |
| [2025-09-12 00:14:23][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 45.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 290 (with bs=512), 47 (with bs=1) | |
| [2025-09-12 00:14:23][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:14:24][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:14:36][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 8192, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:14:36][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:14:36][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:14:36][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now... | |
| [2025-09-12 00:14:36][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8 | |
| [2025-09-12 00:14:36][DEBUG] Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix | |
| Total prompt tokens: 8 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:14:36][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:14:36][DEBUG] No tokens to output. Continuing generation | |
| [2025-09-12 00:14:38][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: OFF | |
| [2025-09-12 00:14:38][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:14:38][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771': | |
| Model: 11.16 GB | |
| Context: 2.26 GB | |
| Total: 13.42 GB | |
| [LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'. | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:14:38][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:14:38][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:14:38][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:14:38][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:14:38][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:14:38][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:14:39][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:14:39][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:14:39][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| [2025-09-12 00:14:39][DEBUG] print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:14:41][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:14:51][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.) | |
| [2025-09-12 00:14:51][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 9.53 ms / 36 runs ( 0.26 ms per token, 3777.94 tokens per second) | |
| llama_perf_context_print: load time = 13243.14 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 15096.70 ms / 28 runs ( 539.17 ms per token, 1.85 tokens per second) | |
| [2025-09-12 00:14:51][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: [] | |
| [2025-09-12 00:14:51][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: { | |
| "id": "chatcmpl-28i1pkpa7owj92zk46b7s3p", | |
| "object": "chat.completion", | |
| "created": 1757654076, | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation.\n\n", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 8, | |
| "completion_tokens": 28, | |
| "total_tokens": 36 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix" | |
| } | |
| [2025-09-12 00:14:51][DEBUG] llama_perf_context_print: total time = 15111.45 ms / 29 tokens | |
| llama_perf_context_print: graphs reused = 28 | |
| [2025-09-12 00:14:52][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 16771 | |
| llama_context: n_ctx_per_seq = 16771 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:14:52][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| [2025-09-12 00:14:52][DEBUG] llama_kv_cache: CPU KV buffer size = 1157.06 MiB | |
| [2025-09-12 00:14:52][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB | |
| [2025-09-12 00:14:52][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 45.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 290 (with bs=512), 47 (with bs=1) | |
| [2025-09-12 00:14:52][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:14:53][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:15:13][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:15:13][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now... | |
| [2025-09-12 00:15:13][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 8192, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:15:13][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:15:13][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| [2025-09-12 00:15:13][DEBUG] Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8 | |
| Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix | |
| Total prompt tokens: 8 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:15:14][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:15:14][DEBUG] No tokens to output. Continuing generation | |
| [2025-09-12 00:15:15][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: OFF | |
| [2025-09-12 00:15:15][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:15:15][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771': | |
| Model: 11.16 GB | |
| Context: 2.26 GB | |
| Total: 13.42 GB | |
| [LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'. | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:15:15][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:15:15][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:15:15][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:15:15][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:15:15][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:15:15][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:15:16][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:15:16][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:15:16][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| [2025-09-12 00:15:16][DEBUG] print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:15:18][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:15:28][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.) | |
| [2025-09-12 00:15:28][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 9.51 ms / 36 runs ( 0.26 ms per token, 3786.29 tokens per second) | |
| llama_perf_context_print: load time = 13243.14 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 15154.53 ms / 28 runs ( 541.23 ms per token, 1.85 tokens per second) | |
| llama_perf_context_print: total time = 15168.26 ms / 29 tokens | |
| llama_perf_context_print: graphs reused = 28 | |
| [2025-09-12 00:15:28][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: [] | |
| [2025-09-12 00:15:28][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: { | |
| "id": "chatcmpl-7fs0j338wau9n5bvptz1gc", | |
| "object": "chat.completion", | |
| "created": 1757654113, | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about the best way to greet them", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 8, | |
| "completion_tokens": 28, | |
| "total_tokens": 36 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix" | |
| } | |
| [2025-09-12 00:15:29][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 16771 | |
| llama_context: n_ctx_per_seq = 16771 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:15:29][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| [2025-09-12 00:15:29][DEBUG] llama_kv_cache: CPU KV buffer size = 1157.06 MiB | |
| [2025-09-12 00:15:29][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB | |
| [2025-09-12 00:15:29][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 45.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 290 (with bs=512), 47 (with bs=1) | |
| [2025-09-12 00:15:29][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:15:31][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:16:01][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 8192, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:16:01][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:16:01][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:16:01][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now... | |
| [2025-09-12 00:16:01][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8 | |
| Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix | |
| Total prompt tokens: 8 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:16:02][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:16:02][DEBUG] No tokens to output. Continuing generation | |
| [2025-09-12 00:16:03][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: OFF | |
| [2025-09-12 00:16:03][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:16:03][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771': | |
| Model: 11.16 GB | |
| Context: 2.26 GB | |
| Total: 13.42 GB | |
| [LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'. | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [2025-09-12 00:16:03][DEBUG][LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:16:04][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:16:04][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:16:04][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:16:04][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:16:04][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:16:04][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:16:04][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:16:04][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:16:04][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| [2025-09-12 00:16:04][DEBUG] print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:16:07][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:16:16][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.) | |
| [2025-09-12 00:16:16][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 10.85 ms / 36 runs ( 0.30 ms per token, 3319.50 tokens per second) | |
| llama_perf_context_print: load time = 13243.14 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 15213.16 ms / 28 runs ( 543.33 ms per token, 1.84 tokens per second) | |
| llama_perf_context_print: total time = 15228.41 ms / 29 tokens | |
| llama_perf_context_print: graphs reused = 28 | |
| [2025-09-12 00:16:16][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: [] | |
| [2025-09-12 00:16:16][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: { | |
| "id": "chatcmpl-phk3xmfkolzs9eisnnkw", | |
| "object": "chat.completion", | |
| "created": 1757654161, | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about the best way to greet them", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 8, | |
| "completion_tokens": 28, | |
| "total_tokens": 36 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix" | |
| } | |
| [2025-09-12 00:16:18][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 16771 | |
| llama_context: n_ctx_per_seq = 16771 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:16:18][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| [2025-09-12 00:16:18][DEBUG] llama_kv_cache: CPU KV buffer size = 1157.06 MiB | |
| [2025-09-12 00:16:18][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB | |
| [2025-09-12 00:16:19][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 45.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 290 (with bs=512), 47 (with bs=1) | |
| [2025-09-12 00:16:19][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:16:20][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:16:54][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 8192, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:16:54][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:16:54][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:16:54][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now... | |
| [2025-09-12 00:16:54][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 8192, n_batch = 512, n_predict = 8192, n_keep = 8 | |
| Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix | |
| Total prompt tokens: 8 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:16:55][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:16:55][DEBUG] No tokens to output. Continuing generation | |
| [2025-09-12 00:16:56][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: OFF | |
| [2025-09-12 00:16:56][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771': | |
| Model: 11.16 GB | |
| Context: 2.26 GB | |
| Total: 13.42 GB | |
| [LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'. | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:16:56][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:16:56][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:16:57][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:16:57][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:16:57][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:16:57][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:16:57][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:16:57][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:16:57][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| [2025-09-12 00:16:57][DEBUG] print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:17:00][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:17:09][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.) | |
| [2025-09-12 00:17:09][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 9.19 ms / 36 runs ( 0.26 ms per token, 3917.73 tokens per second) | |
| llama_perf_context_print: load time = 13243.14 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 15172.68 ms / 28 runs ( 541.88 ms per token, 1.85 tokens per second) | |
| llama_perf_context_print: total time = 15186.49 ms / 29 tokens | |
| llama_perf_context_print: graphs reused = 28 | |
| [2025-09-12 00:17:09][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: [] | |
| [2025-09-12 00:17:09][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: { | |
| "id": "chatcmpl-cmi03v0wqdhixfj5bbn1om", | |
| "object": "chat.completion", | |
| "created": 1757654214, | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me start by acknowledging their greeting.\n\nHmm,", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 8, | |
| "completion_tokens": 28, | |
| "total_tokens": 36 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix" | |
| } | |
| [2025-09-12 00:17:10][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 16771 | |
| llama_context: n_ctx_per_seq = 16771 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:17:10][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| llama_kv_cache: CPU KV buffer size = 1157.06 MiB | |
| [2025-09-12 00:17:11][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB | |
| [2025-09-12 00:17:11][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 45.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 290 (with bs=512), 47 (with bs=1) | |
| [2025-09-12 00:17:11][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:17:12][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:17:48][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:17:48][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now... | |
| [2025-09-12 00:17:48][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 254, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:17:48][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:17:48][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| [2025-09-12 00:17:48][DEBUG] Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8 | |
| Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix | |
| Total prompt tokens: 8 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:17:49][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:17:49][DEBUG] No tokens to output. Continuing generation | |
| [2025-09-12 00:17:51][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: OFF | |
| [2025-09-12 00:17:51][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:17:51][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771': | |
| Model: 11.16 GB | |
| Context: 2.26 GB | |
| Total: 13.42 GB | |
| [LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'. | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:17:51][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:17:51][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:17:51][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:17:51][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:17:51][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:17:51][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:17:52][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:17:52][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:17:52][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| [2025-09-12 00:17:52][DEBUG] print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:17:55][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:18:03][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.) | |
| [2025-09-12 00:18:03][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 9.80 ms / 36 runs ( 0.27 ms per token, 3673.47 tokens per second) | |
| llama_perf_context_print: load time = 13243.14 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 15197.28 ms / 28 runs ( 542.76 ms per token, 1.84 tokens per second) | |
| llama_perf_context_print: total time = 15211.42 ms / 29 tokens | |
| llama_perf_context_print: graphs reused = 28 | |
| [2025-09-12 00:18:03][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: [] | |
| [2025-09-12 00:18:03][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: { | |
| "id": "chatcmpl-sa2a6b2k1jg1btir6v3m", | |
| "object": "chat.completion", | |
| "created": 1757654268, | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation.\n\n", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 8, | |
| "completion_tokens": 28, | |
| "total_tokens": 36 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix" | |
| } | |
| [2025-09-12 00:18:05][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 16771 | |
| llama_context: n_ctx_per_seq = 16771 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:18:05][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| llama_kv_cache: CPU KV buffer size = 1157.06 MiB | |
| [2025-09-12 00:18:05][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB | |
| [2025-09-12 00:18:05][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 45.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 290 (with bs=512), 47 (with bs=1) | |
| [2025-09-12 00:18:05][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:18:07][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:18:33][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:18:33][INFO] | |
| [2025-09-12 00:18:33][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models | |
| [2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions | |
| [2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions | |
| [2025-09-12 00:18:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings | |
| [2025-09-12 00:18:33][INFO] | |
| [2025-09-12 00:18:33][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:18:45][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:18:45][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now... | |
| [2025-09-12 00:18:45][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 254, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:18:45][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:18:45][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| [2025-09-12 00:18:45][DEBUG] Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8 | |
| Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix | |
| Total prompt tokens: 8 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:18:46][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| No tokens to output. Continuing generation | |
| [2025-09-12 00:18:48][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: OFF | |
| [2025-09-12 00:18:48][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:18:48][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771': | |
| Model: 11.16 GB | |
| Context: 2.26 GB | |
| Total: 13.42 GB | |
| [2025-09-12 00:18:48][DEBUG][LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'. | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:18:48][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:18:48][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:18:48][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:18:48][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:18:48][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:18:48][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| [2025-09-12 00:18:48][DEBUG] print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:18:48][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:18:48][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:18:48][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| [2025-09-12 00:18:48][DEBUG] print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:18:51][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:19:01][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.) | |
| [2025-09-12 00:19:01][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 10.06 ms / 36 runs ( 0.28 ms per token, 3579.60 tokens per second) | |
| llama_perf_context_print: load time = 13243.14 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 15165.60 ms / 28 runs ( 541.63 ms per token, 1.85 tokens per second) | |
| llama_perf_context_print: total time = 15180.15 ms / 29 tokens | |
| llama_perf_context_print: graphs reused = 28 | |
| [2025-09-12 00:19:01][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: [] | |
| [2025-09-12 00:19:01][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: { | |
| "id": "chatcmpl-zcdn1g8f3pi21sby0xvp9q", | |
| "object": "chat.completion", | |
| "created": 1757654325, | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. First, I should acknowledge their greeting. Maybe say", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 8, | |
| "completion_tokens": 28, | |
| "total_tokens": 36 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix" | |
| } | |
| [2025-09-12 00:19:02][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 16771 | |
| llama_context: n_ctx_per_seq = 16771 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:19:02][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| llama_kv_cache: CPU KV buffer size = 1157.06 MiB | |
| [2025-09-12 00:19:02][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB | |
| [2025-09-12 00:19:02][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 45.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 290 (with bs=512), 47 (with bs=1) | |
| [2025-09-12 00:19:02][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:19:04][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:19:11][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:19:11][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing! | |
| [2025-09-12 00:19:11][INFO] | |
| [2025-09-12 00:19:11][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models | |
| [2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions | |
| [2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions | |
| [2025-09-12 00:19:11][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings | |
| [2025-09-12 00:19:11][INFO] | |
| [2025-09-12 00:19:11][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:19:16][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 254, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:19:16][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:19:16][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:19:16][INFO][JIT] Requested model (reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix) is not loaded. Loading "DavidAU/Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF/Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf" now... | |
| [2025-09-12 00:19:16][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| [2025-09-12 00:19:16][DEBUG] Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8 | |
| Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 8/8 of prompt (100%), 8 prefix, 0 non-prefix | |
| Total prompt tokens: 8 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:19:17][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:19:17][DEBUG] No tokens to output. Continuing generation | |
| [2025-09-12 00:19:18][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: OFF | |
| [2025-09-12 00:19:18][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:19:18][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '16771': | |
| Model: 11.16 GB | |
| Context: 2.26 GB | |
| Total: 13.42 GB | |
| [LM Studio] Not using full context length for VRAM overflow calculations due to single GPU setup. Instead, using '8192' as context length for the calculation. Original context length: '16771'. | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:19:19][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:19:19][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:19:19][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:19:19][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:19:19][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:19:19][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:19:19][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:19:19][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:19:19][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| [2025-09-12 00:19:19][DEBUG] print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:19:22][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:19:31][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.) | |
| [2025-09-12 00:19:31][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 9.34 ms / 36 runs ( 0.26 ms per token, 3855.63 tokens per second) | |
| llama_perf_context_print: load time = 13243.14 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 15091.80 ms / 28 runs ( 538.99 ms per token, 1.86 tokens per second) | |
| llama_perf_context_print: total time = 15105.49 ms / 29 tokens | |
| [2025-09-12 00:19:31][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: [] | |
| [2025-09-12 00:19:31][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: { | |
| "id": "chatcmpl-ong1jptqvmfus4sqcmu9c", | |
| "object": "chat.completion", | |
| "created": 1757654356, | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation here", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 8, | |
| "completion_tokens": 28, | |
| "total_tokens": 36 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix" | |
| } | |
| [2025-09-12 00:19:31][DEBUG] llama_perf_context_print: graphs reused = 28 | |
| [2025-09-12 00:19:32][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 16771 | |
| llama_context: n_ctx_per_seq = 16771 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (16771) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:19:32][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| llama_kv_cache: CPU KV buffer size = 1157.06 MiB | |
| [2025-09-12 00:19:33][DEBUG] llama_kv_cache: size = 1157.06 MiB ( 16896 cells, 44 layers, 1/1 seqs), K (q8_0): 578.53 MiB, V (q8_0): 578.53 MiB | |
| [2025-09-12 00:19:33][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 45.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 290 (with bs=512), 47 (with bs=1) | |
| [2025-09-12 00:19:33][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 16896 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:19:33][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:19:33][INFO] | |
| [2025-09-12 00:19:33][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models | |
| [2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions | |
| [2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions | |
| [2025-09-12 00:19:33][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings | |
| [2025-09-12 00:19:33][INFO] | |
| [2025-09-12 00:19:33][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:19:34][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:19:35][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:19:35][INFO] | |
| [2025-09-12 00:19:35][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models | |
| [2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions | |
| [2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions | |
| [2025-09-12 00:19:35][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings | |
| [2025-09-12 00:19:35][INFO] | |
| [2025-09-12 00:19:35][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:19:57][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 254, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:19:57][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:20:03][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 254, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:20:03][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:21:22][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:21:22][INFO] | |
| [2025-09-12 00:21:22][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models | |
| [2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions | |
| [2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions | |
| [2025-09-12 00:21:22][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings | |
| [2025-09-12 00:21:22][INFO] | |
| [2025-09-12 00:21:22][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:26:09][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:26:09][INFO] | |
| [2025-09-12 00:26:09][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models | |
| [2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions | |
| [2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions | |
| [2025-09-12 00:26:09][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings | |
| [2025-09-12 00:26:09][INFO] | |
| [2025-09-12 00:26:09][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:27:53][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: ON | |
| [2025-09-12 00:27:53][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:27:53][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '8192': | |
| Model: 11.16 GB | |
| Context: 1.43 GB | |
| Total: 12.58 GB | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:27:53][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:27:53][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:27:53][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| [2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:27:53][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:27:54][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:27:54][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:27:54][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| [2025-09-12 00:27:54][DEBUG] print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
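The 8.72 BPW figure in this dump is just the file size divided by the parameter count printed a few lines above; a one-line sanity check:

# Sanity check of the "8.72 BPW" figure from the file-size line above.
params   = 20.91e9          # print_info: model params = 20.91 B
size_gib = 21.23            # print_info: file size = 21.23 GiB

bpw = size_gib * 2**30 * 8 / params
print(f"{bpw:.2f} bits per weight")   # ~8.72, consistent with mostly-q8_0 tensors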
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:27:56][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:28:05][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 8192 | |
| llama_context: n_ctx_per_seq = 8192 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:28:05][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| [2025-09-12 00:28:05][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 280.50 MiB | |
| [2025-09-12 00:28:05][DEBUG] llama_kv_cache: CPU KV buffer size = 280.50 MiB | |
| [2025-09-12 00:28:05][DEBUG] llama_kv_cache: size = 561.00 MiB ( 8192 cells, 44 layers, 1/1 seqs), K (q8_0): 280.50 MiB, V (q8_0): 280.50 MiB | |
| [2025-09-12 00:28:05][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 28.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 246 (with bs=512), 3 (with bs=1) | |
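Summing the Vulkan0 buffers reported for this load shows why only 22 of the 45 layers are offloaded; a back-of-envelope sketch (per-layer cost assumed uniform, which is only an approximation):

# Rough VRAM accounting for this load, using the Vulkan0 figures above.
vram_free_mib   = 11474       # "using device Vulkan0 ... 11474 MiB free"
gpu_weights_mib = 9967.55     # Vulkan0 model buffer (22 offloaded layers)
kv_gpu_mib      = 280.50      # Vulkan0 KV buffer at 8192 ctx
compute_mib     = 1396.00     # Vulkan0 compute buffer

total = gpu_weights_mib + kv_gpu_mib + compute_mib
per_layer = gpu_weights_mib / 22
print(f"GPU total ~{total:.0f} MiB vs {vram_free_mib} MiB free")   # ~11644 vs 11474
print(f"~{per_layer:.0f} MiB per layer; "
      f"{int((vram_free_mib - kv_gpu_mib - compute_mib) // per_layer)} layers would fit cleanly")  # 21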
| [2025-09-12 00:28:05][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:28:06][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:28:47][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client created. | |
| [2025-09-12 00:28:47][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor] Register with LM Studio | |
| [2025-09-12 00:28:47][DEBUG][Client=plugin:installed:lmstudio/rag-v1][Endpoint=setPromptPreprocessor] Registering promptPreprocessor. | |
| [2025-09-12 00:28:47][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor][Request (sKQs11)] New preprocess request received. | |
| [2025-09-12 00:28:47][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor][Request (sKQs11)] Preprocess request completed. | |
| [2025-09-12 00:28:47][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| [2025-09-12 00:28:47][DEBUG] Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
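The chain above is the order in which the samplers run; with the logged settings, typical_p = 1.000 and xtc_probability = 0.000 make those stages no-ops, and the repeat penalty is left out below. What follows is only a simplified illustration of the top-k, top-p, min-p and temperature stages with this request's values, not llama.cpp's implementation:

# Simplified sketch of the top-k -> top-p -> min-p -> temperature stages above.
import numpy as np

def sample(logits, top_k=40, top_p=0.95, min_p=0.05, temp=0.8, rng=np.random.default_rng()):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                             # token ids, most probable first

    keep = order[:top_k]                                   # top-k
    cut = int(np.searchsorted(np.cumsum(probs[keep]), top_p)) + 1
    keep = keep[:cut]                                      # top-p (nucleus)
    keep = keep[probs[keep] >= min_p * probs[keep[0]]]     # min-p, relative to the best token
    p = probs[keep] ** (1.0 / temp)                        # temperature
    p /= p.sum()
    return int(rng.choice(keep, p=p))                      # dist: draw from what survived

print(sample(np.array([2.0, 1.5, 0.2, -1.0, -3.0])))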
| Generate: n_ctx = 8192, n_batch = 512, n_predict = -1, n_keep = 9 | |
| Total prompt tokens: 9 | |
| Prompt tokens to decode: 9 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:28:48][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:28:48][DEBUG] No tokens to output. Continuing generation | |
| [2025-09-12 00:29:03][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 8.82 ms / 39 runs ( 0.23 ms per token, 4423.27 tokens per second) | |
| llama_perf_context_print: load time = 12603.59 ms | |
| llama_perf_context_print: prompt eval time = 1035.86 ms / 9 tokens ( 115.10 ms per token, 8.69 tokens per second) | |
| llama_perf_context_print: eval time = 14731.45 ms / 29 runs ( 507.98 ms per token, 1.97 tokens per second) | |
| llama_perf_context_print: total time = 15780.55 ms / 38 tokens | |
| llama_perf_context_print: graphs reused = 28 | |
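The per-token and tokens-per-second figures above are derived from the same two numbers in each line; recomputing them:

# Recomputing the throughput figures from the llama_perf stats above.
prompt_ms, prompt_tokens = 1035.86, 9
eval_ms, eval_runs       = 14731.45, 29

print(f"prompt: {prompt_ms / prompt_tokens:.2f} ms/token, "
      f"{prompt_tokens / prompt_ms * 1000:.2f} tokens/s")   # ~115.10 ms, ~8.69 t/s
print(f"eval:   {eval_ms / eval_runs:.2f} ms/token, "
      f"{eval_runs / eval_ms * 1000:.2f} tokens/s")          # ~507.98 ms, ~1.97 t/s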
| [2025-09-12 00:29:18][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:29:18][INFO] | |
| [2025-09-12 00:29:18][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models | |
| [2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions | |
| [2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions | |
| [2025-09-12 00:29:18][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings | |
| [2025-09-12 00:29:18][INFO] | |
| [2025-09-12 00:29:18][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:29:18][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client disconnected. | |
| [2025-09-12 00:29:19][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:29:19][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing! | |
| [2025-09-12 00:29:19][INFO] | |
| [2025-09-12 00:29:19][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models | |
| [2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions | |
| [2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions | |
| [2025-09-12 00:29:19][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings | |
| [2025-09-12 00:29:19][INFO] | |
| [2025-09-12 00:29:19][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
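The block above shows the same endpoints now bound to the LAN address 192.168.128.20; any machine on that network can enumerate the loaded models, assuming port 12345 is reachable through the firewall. A minimal sketch with the `requests` package:

# Listing models from another machine on the LAN (assumes the firewall allows it).
import requests

resp = requests.get("http://192.168.128.20:12345/v1/models")
for model in resp.json().get("data", []):
    print(model["id"])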
| [2025-09-12 00:29:50][INFO] Server stopped. | |
| [2025-09-12 00:30:01][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:30:01][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing! | |
| [2025-09-12 00:30:01][INFO] | |
| [2025-09-12 00:30:01][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models | |
| [2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions | |
| [2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions | |
| [2025-09-12 00:30:01][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings | |
| [2025-09-12 00:30:01][INFO] | |
| [2025-09-12 00:30:01][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:30:01][INFO] Server started. | |
| [2025-09-12 00:30:01][INFO] Just-in-time model loading active. | |
| [2025-09-12 00:30:02][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox] Client created. | |
| [2025-09-12 00:30:02][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client created. | |
| [2025-09-12 00:30:02][INFO][Plugin(lmstudio/js-code-sandbox)] stdout: [Tools Prvdr.] Register with LM Studio | |
| [2025-09-12 00:30:02][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox][Endpoint=setToolsProvider] Registering tools provider. | |
| [2025-09-12 00:30:02][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor] Register with LM Studio | |
| [2025-09-12 00:30:03][DEBUG][Client=plugin:installed:lmstudio/rag-v1][Endpoint=setPromptPreprocessor] Registering promptPreprocessor. | |
| [2025-09-12 00:30:03][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client disconnected. | |
| [2025-09-12 00:30:25][INFO] Server stopped. | |
| [2025-09-12 00:31:07][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:31:07][WARN][LM STUDIO SERVER] Server accepting connections from the local network. Only use this if you know what you are doing! | |
| [2025-09-12 00:31:07][INFO] | |
| [2025-09-12 00:31:07][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> GET http://192.168.128.20:12345/v1/models | |
| [2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/chat/completions | |
| [2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/completions | |
| [2025-09-12 00:31:07][INFO][LM STUDIO SERVER] -> POST http://192.168.128.20:12345/v1/embeddings | |
| [2025-09-12 00:31:07][INFO] | |
| [2025-09-12 00:31:07][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:31:07][INFO] Server started. | |
| [2025-09-12 00:31:07][INFO] Just-in-time model loading active. | |
| [2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox] Client created. | |
| [2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client created. | |
| [2025-09-12 00:31:08][INFO][Plugin(lmstudio/js-code-sandbox)] stdout: [Tools Prvdr.] Register with LM Studio | |
| [2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox][Endpoint=setToolsProvider] Registering tools provider. | |
| [2025-09-12 00:31:08][INFO][Plugin(lmstudio/rag-v1)] stdout: [PromptPreprocessor] Register with LM Studio | |
| [2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/rag-v1][Endpoint=setPromptPreprocessor] Registering promptPreprocessor. | |
| [2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/js-code-sandbox] Client disconnected. | |
| [2025-09-12 00:31:08][DEBUG][Client=plugin:installed:lmstudio/rag-v1] Client disconnected. | |
| [2025-09-12 00:31:43][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: ON | |
| [2025-09-12 00:31:43][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:31:43][DEBUG][LM Studio] Model load size estimate with raw num offload layers '22' and context length '8192': | |
| Model: 11.16 GB | |
| Context: 1.43 GB | |
| Total: 12.58 GB | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [2025-09-12 00:31:43][DEBUG][LM Studio] Resolved GPU config options: | |
| Num Offload Layers: 22 | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:31:44][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:31:44][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:31:44][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 399 tensors from D:\AI-Models\__LMStudio\DavidAU\Reka-Flash-3-21B-Reasoning-Uncensored-MAX-NEO-Imatrix-GGUF\Reka-Flash-3-21B-Reasoning-MAX-NEO-D_AU-Q8_0-imat.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.type str = model | |
| llama_model_loader: - kv 2: general.name str = Reka Flash 3 | |
| llama_model_loader: - kv 3: general.version str = 3 | |
| llama_model_loader: - kv 4: general.basename str = reka-flash | |
| llama_model_loader: - kv 5: general.size_label str = 21B | |
| llama_model_loader: - kv 6: general.license str = apache-2.0 | |
| llama_model_loader: - kv 7: llama.block_count u32 = 44 | |
| llama_model_loader: - kv 8: llama.context_length u32 = 32768 | |
| llama_model_loader: - kv 9: llama.embedding_length u32 = 6144 | |
| llama_model_loader: - kv 10: llama.feed_forward_length u32 = 19648 | |
| llama_model_loader: - kv 11: llama.attention.head_count u32 = 64 | |
| llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 | |
| llama_model_loader: - kv 13: llama.rope.freq_base f32 = 8000000.000000 | |
| llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 15: llama.vocab_size u32 = 100352 | |
| llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 96 | |
| llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.pre str = dbrx | |
| [2025-09-12 00:31:44][DEBUG] llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
| [2025-09-12 00:31:44][DEBUG] llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| [2025-09-12 00:31:44][DEBUG] llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
| llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 100257 | |
| llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 100257 | |
| llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 100257 | |
| llama_model_loader: - kv 25: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... | |
| llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false | |
| llama_model_loader: - kv 27: general.quantization_version u32 = 2 | |
| llama_model_loader: - kv 28: general.file_type u32 = 7 | |
| llama_model_loader: - kv 29: quantize.imatrix.file str = E:/_imx/Reka-Flash-3-21B-Reasoning-NE... | |
| llama_model_loader: - kv 30: quantize.imatrix.dataset str = f:/llamacpp/_raw_imatrix/neo1-v2.txt | |
| llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 308 | |
| llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 180 | |
| llama_model_loader: - type f32: 89 tensors | |
| llama_model_loader: - type q8_0: 309 tensors | |
| llama_model_loader: - type bf16: 1 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 21.23 GiB (8.72 BPW) | |
| [2025-09-12 00:31:44][DEBUG] load: printing all EOG tokens: | |
| load: - 100257 ('<|endoftext|>') | |
| [2025-09-12 00:31:44][DEBUG] load: special tokens cache size = 21 | |
| [2025-09-12 00:31:44][DEBUG] load: token to piece cache size = 0.6145 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 32768 | |
| print_info: n_embd = 6144 | |
| print_info: n_layer = 44 | |
| print_info: n_head = 64 | |
| print_info: n_head_kv = 8 | |
| print_info: n_rot = 96 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 96 | |
| print_info: n_embd_head_v = 96 | |
| print_info: n_gqa = 8 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 19648 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| [2025-09-12 00:31:44][DEBUG] print_info: rope scaling = linear | |
| print_info: freq_base_train = 8000000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 32768 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = ?B | |
| print_info: model params = 20.91 B | |
| print_info: general.name = Reka Flash 3 | |
| print_info: vocab type = BPE | |
| print_info: n_vocab = 100352 | |
| print_info: n_merges = 100000 | |
| print_info: BOS token = 100257 '<|endoftext|>' | |
| print_info: EOS token = 100257 '<|endoftext|>' | |
| print_info: EOT token = 100257 '<|endoftext|>' | |
| print_info: UNK token = 100257 '<|endoftext|>' | |
| print_info: LF token = 198 'Ċ' | |
| print_info: FIM PRE token = 100258 '<|fim_prefix|>' | |
| print_info: FIM SUF token = 100260 '<|fim_suffix|>' | |
| print_info: FIM MID token = 100259 '<|fim_middle|>' | |
| print_info: EOG token = 100257 '<|endoftext|>' | |
| print_info: max token length = 256 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:31:47][DEBUG] load_tensors: offloading 22 repeating layers to GPU | |
| load_tensors: offloaded 22/45 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 9967.55 MiB | |
| load_tensors: CPU_Mapped model buffer size = 11768.32 MiB | |
| [2025-09-12 00:31:55][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 8192 | |
| llama_context: n_ctx_per_seq = 8192 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 8000000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:31:55][DEBUG] llama_context: CPU output buffer size = 0.38 MiB | |
| [2025-09-12 00:31:56][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 280.50 MiB | |
| [2025-09-12 00:31:56][DEBUG] llama_kv_cache: CPU KV buffer size = 280.50 MiB | |
| [2025-09-12 00:31:56][DEBUG] llama_kv_cache: size = 561.00 MiB ( 8192 cells, 44 layers, 1/1 seqs), K (q8_0): 280.50 MiB, V (q8_0): 280.50 MiB | |
| [2025-09-12 00:31:56][DEBUG] llama_context: Vulkan0 compute buffer size = 1396.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 28.01 MiB | |
| llama_context: graph nodes = 1371 | |
| llama_context: graph splits = 246 (with bs=512), 3 (with bs=1) | |
| [2025-09-12 00:31:56][DEBUG] common_init_from_params: added <|endoftext|> logit bias = -inf | |
| common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:31:56][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:33:47][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 254, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:33:47][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:33:47][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| [2025-09-12 00:33:47][DEBUG] Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 8192, n_batch = 512, n_predict = 254, n_keep = 8 | |
| Total prompt tokens: 8 | |
| Prompt tokens to decode: 8 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:33:48][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:33:48][DEBUG] No tokens to output. Continuing generation | |
| [2025-09-12 00:34:02][INFO][LM STUDIO SERVER] Client disconnected. Stopping generation... (If the model is busy processing the prompt, it will finish first.) | |
| [2025-09-12 00:34:02][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 9.17 ms / 37 runs ( 0.25 ms per token, 4033.14 tokens per second) | |
| llama_perf_context_print: load time = 12853.96 ms | |
| llama_perf_context_print: prompt eval time = 820.59 ms / 8 tokens ( 102.57 ms per token, 9.75 tokens per second) | |
| llama_perf_context_print: eval time = 14363.18 ms / 28 runs ( 512.97 ms per token, 1.95 tokens per second) | |
| llama_perf_context_print: total time = 15197.21 ms / 36 tokens | |
| llama_perf_context_print: graphs reused = 27 | |
| [2025-09-12 00:34:02][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Model generated tool calls: [] | |
| [2025-09-12 00:34:02][INFO][reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix] Generated prediction: { | |
| "id": "chatcmpl-t5ql74xf1rih385fceeszc", | |
| "object": "chat.completion", | |
| "created": 1757655227, | |
| "model": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " <reasoning>\nThe user just said \"Hi\". I need to respond appropriately. Let me think about how to start a conversation.\n\nFirst", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 8, | |
| "completion_tokens": 29, | |
| "total_tokens": 37 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "reka-flash-3-21b-reasoning-uncensored-max-neo-imatrix" | |
| } | |
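The prediction above follows the OpenAI chat.completion shape; the content stops after 29 completion tokens because the client disconnected mid-generation (logged just before the perf stats). A minimal sketch of pulling the useful fields out of such a payload, with the JSON trimmed to what the code touches:

# Reading the chat.completion payload logged above (trimmed copy).
import json

raw = '''{"choices":[{"index":0,"message":{"role":"assistant","content":" <reasoning>..."},
"finish_reason":"stop"}],"usage":{"prompt_tokens":8,"completion_tokens":29,"total_tokens":37}}'''

payload = json.loads(raw)
choice = payload["choices"][0]
print(choice["finish_reason"])             # "stop", even though the client disconnect cut it short
print(choice["message"]["content"])        # truncated reasoning prefix, as in the log
print(payload["usage"]["total_tokens"])    # 37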
| [2025-09-12 00:34:48][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine... | |
| [2025-09-12 00:34:48][DEBUG][WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior. | |
| [INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf | |
| [2025-09-12 00:34:48][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:34:48][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = nomic-bert | |
| llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5 | |
| llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12 | |
| llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048 | |
| llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768 | |
| llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072 | |
| llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12 | |
| llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000 | |
| llama_model_loader: - kv 8: general.file_type u32 = 15 | |
| llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false | |
| llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1 | |
| llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000 | |
| llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2 | |
| llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101 | |
| llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102 | |
| llama_model_loader: - kv 15: tokenizer.ggml.model str = bert | |
| [2025-09-12 00:34:48][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... | |
| [2025-09-12 00:34:48][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... | |
| [2025-09-12 00:34:48][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100 | |
| llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102 | |
| llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0 | |
| llama_model_loader: - kv 22: general.quantization_version u32 = 2 | |
| llama_model_loader: - type f32: 51 tensors | |
| llama_model_loader: - type q4_K: 43 tensors | |
| llama_model_loader: - type q5_K: 12 tensors | |
| llama_model_loader: - type q6_K: 6 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q4_K - Medium | |
| print_info: file size = 79.49 MiB (4.88 BPW) | |
| [2025-09-12 00:34:48][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect | |
| load: printing all EOG tokens: | |
| load: - 102 ('[SEP]') | |
| load: special tokens cache size = 5 | |
| [2025-09-12 00:34:48][DEBUG] load: token to piece cache size = 0.2032 MB | |
| print_info: arch = nomic-bert | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 2048 | |
| print_info: n_embd = 768 | |
| print_info: n_layer = 12 | |
| print_info: n_head = 12 | |
| print_info: n_head_kv = 12 | |
| print_info: n_rot = 64 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 64 | |
| print_info: n_embd_head_v = 64 | |
| print_info: n_gqa = 1 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 1.0e-12 | |
| print_info: f_norm_rms_eps = 0.0e+00 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| [2025-09-12 00:34:48][DEBUG] print_info: n_ff = 3072 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 0 | |
| print_info: pooling type = 1 | |
| print_info: rope type = 2 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 1000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 2048 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = 137M | |
| print_info: model params = 136.73 M | |
| print_info: general.name = nomic-embed-text-v1.5 | |
| print_info: vocab type = WPM | |
| print_info: n_vocab = 30522 | |
| print_info: n_merges = 0 | |
| print_info: BOS token = 101 '[CLS]' | |
| print_info: EOS token = 102 '[SEP]' | |
| print_info: UNK token = 100 '[UNK]' | |
| print_info: SEP token = 102 '[SEP]' | |
| print_info: PAD token = 0 '[PAD]' | |
| print_info: MASK token = 103 '[MASK]' | |
| print_info: LF token = 0 '[PAD]' | |
| print_info: EOG token = 102 '[SEP]' | |
| print_info: max token length = 21 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:34:48][DEBUG] load_tensors: offloading 12 repeating layers to GPU | |
| load_tensors: offloading output layer to GPU | |
| load_tensors: offloaded 13/13 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 66.90 MiB | |
| load_tensors: CPU_Mapped model buffer size = 12.59 MiB | |
| [2025-09-12 00:34:48][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 2048 | |
| llama_context: n_ctx_per_seq = 2048 | |
| llama_context: n_batch = 2048 | |
| llama_context: n_ubatch = 2048 | |
| llama_context: causal_attn = 0 | |
| llama_context: flash_attn = auto | |
| llama_context: kv_unified = true | |
| llama_context: freq_base = 1000.0 | |
| llama_context: freq_scale = 1 | |
| [2025-09-12 00:34:48][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB | |
| [2025-09-12 00:34:48][DEBUG] llama_context: Flash Attention was auto, set to enabled | |
| [2025-09-12 00:34:48][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 22.03 MiB | |
| llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1) | |
| llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1) | |
| common_init_from_params: added [SEP] logit bias = -inf | |
| [2025-09-12 00:34:48][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:34:49][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete! | |
| [INFO] [PaniniRagEngine] Model loaded into embedding engine! | |
| [INFO] [PaniniRagEngine] Model loaded without an active session. | |
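With the embedding engine loaded, the same model can be exercised directly over /v1/embeddings; a small sketch computing cosine similarity between two inputs (the "search_query:"/"search_document:" prefixes follow the nomic-embed model card and are an assumption here, as is the `requests` package):

# Cosine similarity between two nomic-embed-text-v1.5 embeddings from /v1/embeddings.
import math
import requests

resp = requests.post("http://localhost:12345/v1/embeddings", json={
    "model": "text-embedding-nomic-embed-text-v1.5",
    "input": [
        "search_query: GPU memory usage",          # task prefixes per the nomic model card (assumption)
        "search_document: The RX 6700 XT reports 11474 MiB of free VRAM.",
    ],
})
a, b = [d["embedding"] for d in resp.json()["data"]]
dot = sum(x * y for x, y in zip(a, b))
cos = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
print(f"{len(a)} dims, cosine similarity {cos:.3f}")   # 768 dims per embedding_length above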
| [2025-09-12 00:35:15][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "text-embedding-nomic-embed-text-v1.5", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 2048, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:35:15][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now... | |
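The body above sends an embedding model to the chat-completions endpoint, so the JIT loader pulls nomic-embed-text-v1.5 back in on each call (the same load sequence repeats below at 00:35:16, 00:35:36 and 00:36:00). If the goal is an embedding, the same model id belongs on /v1/embeddings instead; a minimal sketch, assuming the `requests` package:

# Routing the embedding model through /v1/embeddings rather than /v1/chat/completions.
import requests

resp = requests.post("http://localhost:12345/v1/embeddings", json={
    "model": "text-embedding-nomic-embed-text-v1.5",
    "input": ["Hi"],
})
print(len(resp.json()["data"][0]["embedding"]))   # 768, per the embedding_length above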
| [2025-09-12 00:35:16][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine... | |
| [WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior. | |
| [INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf | |
| [2025-09-12 00:35:16][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:35:16][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = nomic-bert | |
| llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5 | |
| llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12 | |
| llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048 | |
| llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768 | |
| llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072 | |
| llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12 | |
| llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000 | |
| llama_model_loader: - kv 8: general.file_type u32 = 15 | |
| llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false | |
| llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1 | |
| llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000 | |
| llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2 | |
| llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101 | |
| llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102 | |
| llama_model_loader: - kv 15: tokenizer.ggml.model str = bert | |
| [2025-09-12 00:35:16][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... | |
| [2025-09-12 00:35:16][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... | |
| [2025-09-12 00:35:16][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100 | |
| llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102 | |
| llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0 | |
| llama_model_loader: - kv 22: general.quantization_version u32 = 2 | |
| llama_model_loader: - type f32: 51 tensors | |
| llama_model_loader: - type q4_K: 43 tensors | |
| llama_model_loader: - type q5_K: 12 tensors | |
| llama_model_loader: - type q6_K: 6 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q4_K - Medium | |
| print_info: file size = 79.49 MiB (4.88 BPW) | |
| [2025-09-12 00:35:16][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect | |
| load: printing all EOG tokens: | |
| load: - 102 ('[SEP]') | |
| load: special tokens cache size = 5 | |
| [2025-09-12 00:35:16][DEBUG] load: token to piece cache size = 0.2032 MB | |
| print_info: arch = nomic-bert | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 2048 | |
| print_info: n_embd = 768 | |
| print_info: n_layer = 12 | |
| print_info: n_head = 12 | |
| print_info: n_head_kv = 12 | |
| print_info: n_rot = 64 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 64 | |
| print_info: n_embd_head_v = 64 | |
| print_info: n_gqa = 1 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 1.0e-12 | |
| print_info: f_norm_rms_eps = 0.0e+00 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 3072 | |
| [2025-09-12 00:35:16][DEBUG] print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 0 | |
| print_info: pooling type = 1 | |
| print_info: rope type = 2 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 1000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 2048 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = 137M | |
| print_info: model params = 136.73 M | |
| print_info: general.name = nomic-embed-text-v1.5 | |
| print_info: vocab type = WPM | |
| print_info: n_vocab = 30522 | |
| print_info: n_merges = 0 | |
| print_info: BOS token = 101 '[CLS]' | |
| print_info: EOS token = 102 '[SEP]' | |
| print_info: UNK token = 100 '[UNK]' | |
| print_info: SEP token = 102 '[SEP]' | |
| print_info: PAD token = 0 '[PAD]' | |
| print_info: MASK token = 103 '[MASK]' | |
| print_info: LF token = 0 '[PAD]' | |
| print_info: EOG token = 102 '[SEP]' | |
| print_info: max token length = 21 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:35:17][DEBUG] load_tensors: offloading 12 repeating layers to GPU | |
| load_tensors: offloading output layer to GPU | |
| load_tensors: offloaded 13/13 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 66.90 MiB | |
| load_tensors: CPU_Mapped model buffer size = 12.59 MiB | |
| [2025-09-12 00:35:17][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 2048 | |
| llama_context: n_ctx_per_seq = 2048 | |
| llama_context: n_batch = 2048 | |
| llama_context: n_ubatch = 2048 | |
| llama_context: causal_attn = 0 | |
| llama_context: flash_attn = auto | |
| llama_context: kv_unified = true | |
| llama_context: freq_base = 1000.0 | |
| llama_context: freq_scale = 1 | |
| [2025-09-12 00:35:17][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB | |
| llama_context: Flash Attention was auto, set to enabled | |
| [2025-09-12 00:35:17][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 22.03 MiB | |
| llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1) | |
| llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1) | |
| common_init_from_params: added [SEP] logit bias = -inf | |
| [2025-09-12 00:35:17][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:35:17][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete! | |
| [INFO] [PaniniRagEngine] Model loaded into embedding engine! | |
| [INFO] [PaniniRagEngine] Model loaded without an active session. | |
| [2025-09-12 00:35:34][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "text-embedding-nomic-embed-text-v1.5", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 2048, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:35:34][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now... | |
| [2025-09-12 00:35:36][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine... | |
| [WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior. | |
| [INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf | |
| [2025-09-12 00:35:36][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:35:36][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = nomic-bert | |
| llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5 | |
| llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12 | |
| llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048 | |
| llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768 | |
| llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072 | |
| llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12 | |
| llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000 | |
| llama_model_loader: - kv 8: general.file_type u32 = 15 | |
| llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false | |
| llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1 | |
| llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000 | |
| llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2 | |
| llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101 | |
| llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102 | |
| llama_model_loader: - kv 15: tokenizer.ggml.model str = bert | |
| [2025-09-12 00:35:36][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... | |
| [2025-09-12 00:35:36][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... | |
| [2025-09-12 00:35:36][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100 | |
| llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102 | |
| llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0 | |
| llama_model_loader: - kv 22: general.quantization_version u32 = 2 | |
| llama_model_loader: - type f32: 51 tensors | |
| llama_model_loader: - type q4_K: 43 tensors | |
| llama_model_loader: - type q5_K: 12 tensors | |
| llama_model_loader: - type q6_K: 6 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q4_K - Medium | |
| print_info: file size = 79.49 MiB (4.88 BPW) | |
| [2025-09-12 00:35:36][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect | |
| load: printing all EOG tokens: | |
| load: - 102 ('[SEP]') | |
| load: special tokens cache size = 5 | |
| [2025-09-12 00:35:36][DEBUG] load: token to piece cache size = 0.2032 MB | |
| print_info: arch = nomic-bert | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 2048 | |
| print_info: n_embd = 768 | |
| print_info: n_layer = 12 | |
| print_info: n_head = 12 | |
| print_info: n_head_kv = 12 | |
| print_info: n_rot = 64 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 64 | |
| print_info: n_embd_head_v = 64 | |
| print_info: n_gqa = 1 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 1.0e-12 | |
| print_info: f_norm_rms_eps = 0.0e+00 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| [2025-09-12 00:35:36][DEBUG] print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 3072 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 0 | |
| print_info: pooling type = 1 | |
| print_info: rope type = 2 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 1000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 2048 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = 137M | |
| print_info: model params = 136.73 M | |
| print_info: general.name = nomic-embed-text-v1.5 | |
| print_info: vocab type = WPM | |
| print_info: n_vocab = 30522 | |
| print_info: n_merges = 0 | |
| print_info: BOS token = 101 '[CLS]' | |
| print_info: EOS token = 102 '[SEP]' | |
| print_info: UNK token = 100 '[UNK]' | |
| print_info: SEP token = 102 '[SEP]' | |
| print_info: PAD token = 0 '[PAD]' | |
| print_info: MASK token = 103 '[MASK]' | |
| print_info: LF token = 0 '[PAD]' | |
| print_info: EOG token = 102 '[SEP]' | |
| print_info: max token length = 21 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:35:36][DEBUG] load_tensors: offloading 12 repeating layers to GPU | |
| load_tensors: offloading output layer to GPU | |
| load_tensors: offloaded 13/13 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 66.90 MiB | |
| load_tensors: CPU_Mapped model buffer size = 12.59 MiB | |
| [2025-09-12 00:35:36][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 2048 | |
| llama_context: n_ctx_per_seq = 2048 | |
| llama_context: n_batch = 2048 | |
| llama_context: n_ubatch = 2048 | |
| llama_context: causal_attn = 0 | |
| llama_context: flash_attn = auto | |
| llama_context: kv_unified = true | |
| llama_context: freq_base = 1000.0 | |
| llama_context: freq_scale = 1 | |
| [2025-09-12 00:35:36][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB | |
| llama_context: Flash Attention was auto, set to enabled | |
| [2025-09-12 00:35:36][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 22.03 MiB | |
| llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1) | |
| llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1) | |
| common_init_from_params: added [SEP] logit bias = -inf | |
| [2025-09-12 00:35:36][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:35:36][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete! | |
| [INFO] [PaniniRagEngine] Model loaded into embedding engine! | |
| [INFO] [PaniniRagEngine] Model loaded without an active session. | |
| [2025-09-12 00:35:58][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "text-embedding-nomic-embed-text-v1.5", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 2048, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:35:58][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now... | |
| [2025-09-12 00:36:00][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine... | |
| [WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior. | |
| [INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf | |
| [2025-09-12 00:36:00][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:36:00][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = nomic-bert | |
| llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5 | |
| llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12 | |
| llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048 | |
| llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768 | |
| llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072 | |
| llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12 | |
| llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000 | |
| llama_model_loader: - kv 8: general.file_type u32 = 15 | |
| llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false | |
| llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1 | |
| llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000 | |
| llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2 | |
| llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101 | |
| llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102 | |
| llama_model_loader: - kv 15: tokenizer.ggml.model str = bert | |
| [2025-09-12 00:36:00][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... | |
| [2025-09-12 00:36:00][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... | |
| [2025-09-12 00:36:00][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100 | |
| llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102 | |
| llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0 | |
| llama_model_loader: - kv 22: general.quantization_version u32 = 2 | |
| llama_model_loader: - type f32: 51 tensors | |
| llama_model_loader: - type q4_K: 43 tensors | |
| llama_model_loader: - type q5_K: 12 tensors | |
| llama_model_loader: - type q6_K: 6 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q4_K - Medium | |
| print_info: file size = 79.49 MiB (4.88 BPW) | |
| [2025-09-12 00:36:00][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect | |
| load: printing all EOG tokens: | |
| load: - 102 ('[SEP]') | |
| load: special tokens cache size = 5 | |
| [2025-09-12 00:36:00][DEBUG] load: token to piece cache size = 0.2032 MB | |
| print_info: arch = nomic-bert | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 2048 | |
| print_info: n_embd = 768 | |
| print_info: n_layer = 12 | |
| print_info: n_head = 12 | |
| print_info: n_head_kv = 12 | |
| print_info: n_rot = 64 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 64 | |
| print_info: n_embd_head_v = 64 | |
| print_info: n_gqa = 1 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 1.0e-12 | |
| print_info: f_norm_rms_eps = 0.0e+00 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| [2025-09-12 00:36:00][DEBUG] print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 3072 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 0 | |
| print_info: pooling type = 1 | |
| print_info: rope type = 2 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 1000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 2048 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = 137M | |
| print_info: model params = 136.73 M | |
| print_info: general.name = nomic-embed-text-v1.5 | |
| print_info: vocab type = WPM | |
| print_info: n_vocab = 30522 | |
| print_info: n_merges = 0 | |
| print_info: BOS token = 101 '[CLS]' | |
| print_info: EOS token = 102 '[SEP]' | |
| print_info: UNK token = 100 '[UNK]' | |
| print_info: SEP token = 102 '[SEP]' | |
| print_info: PAD token = 0 '[PAD]' | |
| print_info: MASK token = 103 '[MASK]' | |
| print_info: LF token = 0 '[PAD]' | |
| print_info: EOG token = 102 '[SEP]' | |
| print_info: max token length = 21 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:36:00][DEBUG] load_tensors: offloading 12 repeating layers to GPU | |
| load_tensors: offloading output layer to GPU | |
| load_tensors: offloaded 13/13 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 66.90 MiB | |
| load_tensors: CPU_Mapped model buffer size = 12.59 MiB | |
| [2025-09-12 00:36:00][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 2048 | |
| llama_context: n_ctx_per_seq = 2048 | |
| llama_context: n_batch = 2048 | |
| llama_context: n_ubatch = 2048 | |
| llama_context: causal_attn = 0 | |
| llama_context: flash_attn = auto | |
| llama_context: kv_unified = true | |
| llama_context: freq_base = 1000.0 | |
| llama_context: freq_scale = 1 | |
| [2025-09-12 00:36:00][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB | |
| llama_context: Flash Attention was auto, set to enabled | |
| [2025-09-12 00:36:00][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 22.03 MiB | |
| llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1) | |
| llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1) | |
| common_init_from_params: added [SEP] logit bias = -inf | |
| [2025-09-12 00:36:00][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:36:00][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete! | |
| [INFO] [PaniniRagEngine] Model loaded into embedding engine! | |
| [INFO] [PaniniRagEngine] Model loaded without an active session. | |
| [2025-09-12 00:36:00][DEBUG] | |
| [2025-09-12 00:36:20][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "text-embedding-nomic-embed-text-v1.5", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 2048, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:36:20][INFO][JIT] Requested model (text-embedding-nomic-embed-text-v1.5) is not loaded. Loading "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf" now... | |
| [2025-09-12 00:36:22][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine... | |
| [WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior. | |
| [INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf | |
| [2025-09-12 00:36:22][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:36:22][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = nomic-bert | |
| llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5 | |
| llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12 | |
| llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048 | |
| llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768 | |
| llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072 | |
| llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12 | |
| llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000 | |
| llama_model_loader: - kv 8: general.file_type u32 = 15 | |
| llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false | |
| llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1 | |
| llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000 | |
| llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2 | |
| llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101 | |
| llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102 | |
| llama_model_loader: - kv 15: tokenizer.ggml.model str = bert | |
| [2025-09-12 00:36:22][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... | |
| [2025-09-12 00:36:23][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... | |
| [2025-09-12 00:36:23][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100 | |
| llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102 | |
| llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0 | |
| llama_model_loader: - kv 22: general.quantization_version u32 = 2 | |
| llama_model_loader: - type f32: 51 tensors | |
| llama_model_loader: - type q4_K: 43 tensors | |
| llama_model_loader: - type q5_K: 12 tensors | |
| llama_model_loader: - type q6_K: 6 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q4_K - Medium | |
| print_info: file size = 79.49 MiB (4.88 BPW) | |
| [2025-09-12 00:36:23][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect | |
| load: printing all EOG tokens: | |
| load: - 102 ('[SEP]') | |
| load: special tokens cache size = 5 | |
| [2025-09-12 00:36:23][DEBUG] load: token to piece cache size = 0.2032 MB | |
| print_info: arch = nomic-bert | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 2048 | |
| print_info: n_embd = 768 | |
| print_info: n_layer = 12 | |
| print_info: n_head = 12 | |
| print_info: n_head_kv = 12 | |
| print_info: n_rot = 64 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 64 | |
| print_info: n_embd_head_v = 64 | |
| print_info: n_gqa = 1 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 1.0e-12 | |
| print_info: f_norm_rms_eps = 0.0e+00 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 3072 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 0 | |
| [2025-09-12 00:36:23][DEBUG] print_info: pooling type = 1 | |
| print_info: rope type = 2 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 1000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 2048 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = 137M | |
| print_info: model params = 136.73 M | |
| print_info: general.name = nomic-embed-text-v1.5 | |
| print_info: vocab type = WPM | |
| print_info: n_vocab = 30522 | |
| print_info: n_merges = 0 | |
| print_info: BOS token = 101 '[CLS]' | |
| print_info: EOS token = 102 '[SEP]' | |
| print_info: UNK token = 100 '[UNK]' | |
| print_info: SEP token = 102 '[SEP]' | |
| print_info: PAD token = 0 '[PAD]' | |
| print_info: MASK token = 103 '[MASK]' | |
| print_info: LF token = 0 '[PAD]' | |
| print_info: EOG token = 102 '[SEP]' | |
| print_info: max token length = 21 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:36:23][DEBUG] load_tensors: offloading 12 repeating layers to GPU | |
| load_tensors: offloading output layer to GPU | |
| load_tensors: offloaded 13/13 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 66.90 MiB | |
| load_tensors: CPU_Mapped model buffer size = 12.59 MiB | |
| [2025-09-12 00:36:23][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 2048 | |
| llama_context: n_ctx_per_seq = 2048 | |
| llama_context: n_batch = 2048 | |
| llama_context: n_ubatch = 2048 | |
| llama_context: causal_attn = 0 | |
| llama_context: flash_attn = auto | |
| llama_context: kv_unified = true | |
| llama_context: freq_base = 1000.0 | |
| llama_context: freq_scale = 1 | |
| [2025-09-12 00:36:23][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB | |
| llama_context: Flash Attention was auto, set to enabled | |
| [2025-09-12 00:36:23][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 22.03 MiB | |
| llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1) | |
| llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1) | |
| common_init_from_params: added [SEP] logit bias = -inf | |
| [2025-09-12 00:36:23][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:36:23][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete! | |
| [INFO] [PaniniRagEngine] Model loaded into embedding engine! | |
| [INFO] [PaniniRagEngine] Model loaded without an active session. | |
| [2025-09-12 00:37:47][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:37:47][INFO] | |
| [2025-09-12 00:37:47][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models | |
| [2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions | |
| [2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions | |
| [2025-09-12 00:37:47][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings | |
| [2025-09-12 00:37:47][INFO] | |
| [2025-09-12 00:37:47][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
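Editor's note, not part of the original log: the server start above advertises four OpenAI-compatible endpoints on port 12345. A minimal sketch of querying the model list, assuming the server is still listening on localhost:12345 and returns the usual OpenAI-style `{"data": [...]}` payload:

```python
# Minimal sketch (assumption: server still up at localhost:12345).
import json
import urllib.request

BASE = "http://localhost:12345"

# GET /v1/models is one of the endpoints listed in the log above.
with urllib.request.urlopen(f"{BASE}/v1/models") as resp:
    models = json.load(resp)
    # Assumed OpenAI-compatible shape: {"object": "list", "data": [{"id": ...}, ...]}
    print([m.get("id") for m in models.get("data", [])])
```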
| [2025-09-12 00:41:47][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: ON | |
| [2025-09-12 00:41:47][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:41:47][DEBUG][LM Studio] Model load size estimate with raw num offload layers 'max' and context length '2048': | |
| Model: 13.81 GB | |
| Context: 1.09 GB | |
| Total: 14.91 GB | |
| [2025-09-12 00:41:47][DEBUG][LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [LM Studio] Resolved GPU config options: | |
| Num Offload Layers: max | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:41:48][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:41:48][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:41:48][DEBUG] llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from D:\AI-Models\__LMStudio\Random\Nethena-13B\Nethena-13B.Q8_0.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.name str = LLaMA v2 | |
| llama_model_loader: - kv 2: llama.context_length u32 = 4096 | |
| llama_model_loader: - kv 3: llama.embedding_length u32 = 5120 | |
| llama_model_loader: - kv 4: llama.block_count u32 = 40 | |
| llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824 | |
| llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 | |
| llama_model_loader: - kv 7: llama.attention.head_count u32 = 40 | |
| llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40 | |
| llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000 | |
| llama_model_loader: - kv 11: general.file_type u32 = 7 | |
| llama_model_loader: - kv 12: tokenizer.ggml.model str = llama | |
| [2025-09-12 00:41:48][DEBUG] llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<... | |
| [2025-09-12 00:41:48][DEBUG] llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... | |
| [2025-09-12 00:41:48][DEBUG] llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... | |
| llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 | |
| llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 | |
| llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 32000 | |
| llama_model_loader: - kv 20: general.quantization_version u32 = 2 | |
| llama_model_loader: - type f32: 81 tensors | |
| llama_model_loader: - type q8_0: 282 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 12.88 GiB (8.50 BPW) | |
| [2025-09-12 00:41:48][DEBUG] load: bad special token: 'tokenizer.ggml.padding_token_id' = 32000, using default id -1 | |
| [2025-09-12 00:41:48][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect | |
| load: printing all EOG tokens: | |
| load: - 2 ('</s>') | |
| load: special tokens cache size = 3 | |
| [2025-09-12 00:41:48][DEBUG] load: token to piece cache size = 0.1684 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 4096 | |
| print_info: n_embd = 5120 | |
| print_info: n_layer = 40 | |
| print_info: n_head = 40 | |
| print_info: n_head_kv = 40 | |
| print_info: n_rot = 128 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 128 | |
| [2025-09-12 00:41:48][DEBUG] print_info: n_embd_head_v = 128 | |
| print_info: n_gqa = 1 | |
| print_info: n_embd_k_gqa = 5120 | |
| print_info: n_embd_v_gqa = 5120 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 13824 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 10000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 4096 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = 13B | |
| print_info: model params = 13.02 B | |
| print_info: general.name = LLaMA v2 | |
| print_info: vocab type = SPM | |
| print_info: n_vocab = 32000 | |
| print_info: n_merges = 0 | |
| print_info: BOS token = 1 '<s>' | |
| print_info: EOS token = 2 '</s>' | |
| print_info: UNK token = 0 '<unk>' | |
| print_info: LF token = 13 '<0x0A>' | |
| print_info: EOG token = 2 '</s>' | |
| print_info: max token length = 48 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:41:59][DEBUG] load_tensors: offloading 40 repeating layers to GPU | |
| load_tensors: offloading output layer to GPU | |
| load_tensors: offloaded 41/41 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 13023.85 MiB | |
| load_tensors: CPU_Mapped model buffer size = 166.02 MiB | |
| [2025-09-12 00:43:06][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 2048 | |
| llama_context: n_ctx_per_seq = 2048 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 10000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:43:06][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB | |
| [2025-09-12 00:43:06][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 850.00 MiB | |
| [2025-09-12 00:43:06][DEBUG] llama_kv_cache: size = 850.00 MiB ( 2048 cells, 40 layers, 1/1 seqs), K (q8_0): 425.00 MiB, V (q8_0): 425.00 MiB | |
| [2025-09-12 00:43:06][DEBUG] llama_context: Vulkan0 compute buffer size = 117.01 MiB | |
| llama_context: Vulkan_Host compute buffer size = 14.01 MiB | |
| llama_context: graph nodes = 1247 | |
| llama_context: graph splits = 2 | |
| common_init_from_params: added </s> logit bias = -inf | |
| [2025-09-12 00:43:06][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:43:07][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:43:28][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine... | |
| [WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior. | |
| [INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf | |
| [2025-09-12 00:43:28][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:43:28][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = nomic-bert | |
| llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5 | |
| llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12 | |
| llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048 | |
| llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768 | |
| llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072 | |
| llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12 | |
| llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000 | |
| llama_model_loader: - kv 8: general.file_type u32 = 15 | |
| llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false | |
| llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1 | |
| llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000 | |
| llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2 | |
| llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101 | |
| llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102 | |
| llama_model_loader: - kv 15: tokenizer.ggml.model str = bert | |
| [2025-09-12 00:43:28][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... | |
| [2025-09-12 00:43:28][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... | |
| [2025-09-12 00:43:28][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100 | |
| llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102 | |
| llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0 | |
| llama_model_loader: - kv 22: general.quantization_version u32 = 2 | |
| llama_model_loader: - type f32: 51 tensors | |
| llama_model_loader: - type q4_K: 43 tensors | |
| llama_model_loader: - type q5_K: 12 tensors | |
| llama_model_loader: - type q6_K: 6 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q4_K - Medium | |
| print_info: file size = 79.49 MiB (4.88 BPW) | |
| [2025-09-12 00:43:28][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect | |
| load: printing all EOG tokens: | |
| load: - 102 ('[SEP]') | |
| load: special tokens cache size = 5 | |
| [2025-09-12 00:43:28][DEBUG] load: token to piece cache size = 0.2032 MB | |
| print_info: arch = nomic-bert | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 2048 | |
| print_info: n_embd = 768 | |
| print_info: n_layer = 12 | |
| print_info: n_head = 12 | |
| print_info: n_head_kv = 12 | |
| print_info: n_rot = 64 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 64 | |
| print_info: n_embd_head_v = 64 | |
| print_info: n_gqa = 1 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 1.0e-12 | |
| [2025-09-12 00:43:28][DEBUG] print_info: f_norm_rms_eps = 0.0e+00 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 3072 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 0 | |
| print_info: pooling type = 1 | |
| print_info: rope type = 2 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 1000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 2048 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = 137M | |
| print_info: model params = 136.73 M | |
| print_info: general.name = nomic-embed-text-v1.5 | |
| print_info: vocab type = WPM | |
| print_info: n_vocab = 30522 | |
| print_info: n_merges = 0 | |
| print_info: BOS token = 101 '[CLS]' | |
| print_info: EOS token = 102 '[SEP]' | |
| print_info: UNK token = 100 '[UNK]' | |
| print_info: SEP token = 102 '[SEP]' | |
| print_info: PAD token = 0 '[PAD]' | |
| print_info: MASK token = 103 '[MASK]' | |
| print_info: LF token = 0 '[PAD]' | |
| print_info: EOG token = 102 '[SEP]' | |
| print_info: max token length = 21 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:43:28][DEBUG] load_tensors: offloading 12 repeating layers to GPU | |
| load_tensors: offloading output layer to GPU | |
| load_tensors: offloaded 13/13 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 66.90 MiB | |
| load_tensors: CPU_Mapped model buffer size = 12.59 MiB | |
| [2025-09-12 00:43:28][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 2048 | |
| llama_context: n_ctx_per_seq = 2048 | |
| llama_context: n_batch = 2048 | |
| llama_context: n_ubatch = 2048 | |
| llama_context: causal_attn = 0 | |
| llama_context: flash_attn = auto | |
| llama_context: kv_unified = true | |
| llama_context: freq_base = 1000.0 | |
| llama_context: freq_scale = 1 | |
| [2025-09-12 00:43:28][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB | |
| llama_context: Flash Attention was auto, set to enabled | |
| [2025-09-12 00:43:28][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 22.03 MiB | |
| llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1) | |
| llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1) | |
| common_init_from_params: added [SEP] logit bias = -inf | |
| [2025-09-12 00:43:28][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:43:28][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete! | |
| [INFO] [PaniniRagEngine] Model loaded into embedding engine! | |
| [INFO] [PaniniRagEngine] Model loaded without an active session. | |
| [2025-09-12 00:43:28][DEBUG] | |
| [2025-09-12 00:44:03][INFO] Server stopped. | |
| [2025-09-12 00:44:06][INFO][LM STUDIO SERVER] Success! HTTP server listening on port 12345 | |
| [2025-09-12 00:44:06][INFO] | |
| [2025-09-12 00:44:06][INFO][LM STUDIO SERVER] Supported endpoints: | |
| [2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> GET http://localhost:12345/v1/models | |
| [2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/chat/completions | |
| [2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/completions | |
| [2025-09-12 00:44:06][INFO][LM STUDIO SERVER] -> POST http://localhost:12345/v1/embeddings | |
| [2025-09-12 00:44:06][INFO] | |
| [2025-09-12 00:44:06][INFO][LM STUDIO SERVER] Logs are saved into C:\Users\razra\.cache\lm-studio\server-logs | |
| [2025-09-12 00:44:06][INFO] Server started. | |
| [2025-09-12 00:44:06][INFO] Just-in-time model loading active. | |
| [2025-09-12 00:45:26][DEBUG][INFO] [PaniniRagEngine] Loading model into embedding engine... | |
| [WARNING] Batch size (512) is < context length (2048). Resetting batch size to context length to avoid unexpected behavior. | |
| [INFO] [LlamaEmbeddingEngine] Loading model from path: C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf | |
| [2025-09-12 00:45:26][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:45:26][DEBUG] llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\razra\AppData\Local\Programs\lm-studio\LM Studio\resources\app\.webpack\bin\bundled-models\nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = nomic-bert | |
| llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5 | |
| llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12 | |
| llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048 | |
| llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768 | |
| llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072 | |
| llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12 | |
| llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000 | |
| llama_model_loader: - kv 8: general.file_type u32 = 15 | |
| llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false | |
| llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1 | |
| llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000 | |
| llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2 | |
| llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101 | |
| llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102 | |
| llama_model_loader: - kv 15: tokenizer.ggml.model str = bert | |
| [2025-09-12 00:45:26][DEBUG] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... | |
| [2025-09-12 00:45:26][DEBUG] llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... | |
| [2025-09-12 00:45:26][DEBUG] llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
| llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100 | |
| llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102 | |
| llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0 | |
| llama_model_loader: - kv 22: general.quantization_version u32 = 2 | |
| llama_model_loader: - type f32: 51 tensors | |
| llama_model_loader: - type q4_K: 43 tensors | |
| llama_model_loader: - type q5_K: 12 tensors | |
| llama_model_loader: - type q6_K: 6 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q4_K - Medium | |
| print_info: file size = 79.49 MiB (4.88 BPW) | |
| [2025-09-12 00:45:26][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect | |
| load: printing all EOG tokens: | |
| load: - 102 ('[SEP]') | |
| load: special tokens cache size = 5 | |
| [2025-09-12 00:45:26][DEBUG] load: token to piece cache size = 0.2032 MB | |
| print_info: arch = nomic-bert | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 2048 | |
| print_info: n_embd = 768 | |
| print_info: n_layer = 12 | |
| print_info: n_head = 12 | |
| print_info: n_head_kv = 12 | |
| print_info: n_rot = 64 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| [2025-09-12 00:45:26][DEBUG] print_info: n_embd_head_k = 64 | |
| print_info: n_embd_head_v = 64 | |
| print_info: n_gqa = 1 | |
| print_info: n_embd_k_gqa = 768 | |
| print_info: n_embd_v_gqa = 768 | |
| print_info: f_norm_eps = 1.0e-12 | |
| print_info: f_norm_rms_eps = 0.0e+00 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 3072 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 0 | |
| print_info: pooling type = 1 | |
| print_info: rope type = 2 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 1000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 2048 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = 137M | |
| print_info: model params = 136.73 M | |
| print_info: general.name = nomic-embed-text-v1.5 | |
| print_info: vocab type = WPM | |
| print_info: n_vocab = 30522 | |
| print_info: n_merges = 0 | |
| print_info: BOS token = 101 '[CLS]' | |
| print_info: EOS token = 102 '[SEP]' | |
| print_info: UNK token = 100 '[UNK]' | |
| print_info: SEP token = 102 '[SEP]' | |
| print_info: PAD token = 0 '[PAD]' | |
| print_info: MASK token = 103 '[MASK]' | |
| print_info: LF token = 0 '[PAD]' | |
| print_info: EOG token = 102 '[SEP]' | |
| print_info: max token length = 21 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:45:26][DEBUG] load_tensors: offloading 12 repeating layers to GPU | |
| load_tensors: offloading output layer to GPU | |
| load_tensors: offloaded 13/13 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 66.90 MiB | |
| load_tensors: CPU_Mapped model buffer size = 12.59 MiB | |
| [2025-09-12 00:45:26][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 2048 | |
| llama_context: n_ctx_per_seq = 2048 | |
| llama_context: n_batch = 2048 | |
| llama_context: n_ubatch = 2048 | |
| llama_context: causal_attn = 0 | |
| llama_context: flash_attn = auto | |
| llama_context: kv_unified = true | |
| llama_context: freq_base = 1000.0 | |
| llama_context: freq_scale = 1 | |
| [2025-09-12 00:45:26][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB | |
| llama_context: Flash Attention was auto, set to enabled | |
| [2025-09-12 00:45:26][DEBUG] llama_context: Vulkan0 compute buffer size = 108.00 MiB | |
| llama_context: Vulkan_Host compute buffer size = 22.03 MiB | |
| llama_context: graph nodes = 372 (with bs=2048), 408 (with bs=1) | |
| llama_context: graph splits = 4 (with bs=2048), 2 (with bs=1) | |
| common_init_from_params: added [SEP] logit bias = -inf | |
| [2025-09-12 00:45:26][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:45:26][DEBUG][INFO] [LlamaEmbeddingEngine] Model load complete! | |
| [INFO] [PaniniRagEngine] Model loaded into embedding engine! | |
| [2025-09-12 00:45:26][DEBUG] [INFO] [PaniniRagEngine] Model loaded without an active session. | |
| [2025-09-12 00:45:45][DEBUG][LM Studio] GPU Configuration: | |
| Strategy: evenly | |
| Priority: [] | |
| Disabled GPUs: [] | |
| Limit weight offload to dedicated GPU Memory: OFF | |
| Offload KV Cache to GPU: ON | |
| [2025-09-12 00:45:45][DEBUG][LM Studio] Live GPU memory info: | |
| No live GPU info available | |
| [2025-09-12 00:45:45][DEBUG][LM Studio] Model load size estimate with raw num offload layers 'max' and context length '2048': | |
| Model: 13.81 GB | |
| Context: 1.09 GB | |
| Total: 14.91 GB | |
| [LM Studio] Strict GPU VRAM cap is OFF: GPU offload layers will not be checked for adjustment | |
| [2025-09-12 00:45:45][DEBUG][LM Studio] Resolved GPU config options: | |
| Num Offload Layers: max | |
| Num CPU Expert Layers: 0 | |
| Main GPU: 0 | |
| Tensor Split: [0] | |
| Disabled GPUs: [] | |
| [2025-09-12 00:45:45][DEBUG] CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | | |
| [2025-09-12 00:45:45][DEBUG] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6700 XT) - 11474 MiB free | |
| [2025-09-12 00:45:45][DEBUG] llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from D:\AI-Models\__LMStudio\Random\Nethena-13B\Nethena-13B.Q8_0.gguf (version GGUF V3 (latest)) | |
| llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
| llama_model_loader: - kv 0: general.architecture str = llama | |
| llama_model_loader: - kv 1: general.name str = LLaMA v2 | |
| llama_model_loader: - kv 2: llama.context_length u32 = 4096 | |
| llama_model_loader: - kv 3: llama.embedding_length u32 = 5120 | |
| llama_model_loader: - kv 4: llama.block_count u32 = 40 | |
| llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824 | |
| llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 | |
| llama_model_loader: - kv 7: llama.attention.head_count u32 = 40 | |
| llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40 | |
| llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 | |
| llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000 | |
| llama_model_loader: - kv 11: general.file_type u32 = 7 | |
| llama_model_loader: - kv 12: tokenizer.ggml.model str = llama | |
| [2025-09-12 00:45:45][DEBUG] llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<... | |
| [2025-09-12 00:45:45][DEBUG] llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... | |
| [2025-09-12 00:45:45][DEBUG] llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... | |
| llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 | |
| llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 | |
| llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 | |
| llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 32000 | |
| llama_model_loader: - kv 20: general.quantization_version u32 = 2 | |
| llama_model_loader: - type f32: 81 tensors | |
| llama_model_loader: - type q8_0: 282 tensors | |
| print_info: file format = GGUF V3 (latest) | |
| print_info: file type = Q8_0 | |
| print_info: file size = 12.88 GiB (8.50 BPW) | |
| [2025-09-12 00:45:45][DEBUG] load: bad special token: 'tokenizer.ggml.padding_token_id' = 32000, using default id -1 | |
| [2025-09-12 00:45:45][DEBUG] load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect | |
| load: printing all EOG tokens: | |
| load: - 2 ('</s>') | |
| load: special tokens cache size = 3 | |
| [2025-09-12 00:45:45][DEBUG] load: token to piece cache size = 0.1684 MB | |
| print_info: arch = llama | |
| print_info: vocab_only = 0 | |
| print_info: n_ctx_train = 4096 | |
| print_info: n_embd = 5120 | |
| print_info: n_layer = 40 | |
| print_info: n_head = 40 | |
| print_info: n_head_kv = 40 | |
| print_info: n_rot = 128 | |
| print_info: n_swa = 0 | |
| print_info: is_swa_any = 0 | |
| print_info: n_embd_head_k = 128 | |
| print_info: n_embd_head_v = 128 | |
| print_info: n_gqa = 1 | |
| print_info: n_embd_k_gqa = 5120 | |
| print_info: n_embd_v_gqa = 5120 | |
| print_info: f_norm_eps = 0.0e+00 | |
| print_info: f_norm_rms_eps = 1.0e-05 | |
| print_info: f_clamp_kqv = 0.0e+00 | |
| print_info: f_max_alibi_bias = 0.0e+00 | |
| print_info: f_logit_scale = 0.0e+00 | |
| print_info: f_attn_scale = 0.0e+00 | |
| print_info: n_ff = 13824 | |
| print_info: n_expert = 0 | |
| print_info: n_expert_used = 0 | |
| print_info: causal attn = 1 | |
| print_info: pooling type = 0 | |
| print_info: rope type = 0 | |
| print_info: rope scaling = linear | |
| print_info: freq_base_train = 10000.0 | |
| print_info: freq_scale_train = 1 | |
| print_info: n_ctx_orig_yarn = 4096 | |
| print_info: rope_finetuned = unknown | |
| print_info: model type = 13B | |
| print_info: model params = 13.02 B | |
| print_info: general.name = LLaMA v2 | |
| print_info: vocab type = SPM | |
| print_info: n_vocab = 32000 | |
| print_info: n_merges = 0 | |
| print_info: BOS token = 1 '<s>' | |
| print_info: EOS token = 2 '</s>' | |
| print_info: UNK token = 0 '<unk>' | |
| print_info: LF token = 13 '<0x0A>' | |
| print_info: EOG token = 2 '</s>' | |
| print_info: max token length = 48 | |
| load_tensors: loading model tensors, this can take a while... (mmap = true) | |
| [2025-09-12 00:45:48][DEBUG] load_tensors: offloading 40 repeating layers to GPU | |
| load_tensors: offloading output layer to GPU | |
| load_tensors: offloaded 41/41 layers to GPU | |
| load_tensors: Vulkan0 model buffer size = 13023.85 MiB | |
| load_tensors: CPU_Mapped model buffer size = 166.02 MiB | |
| [2025-09-12 00:45:53][DEBUG] llama_context: constructing llama_context | |
| llama_context: n_seq_max = 1 | |
| llama_context: n_ctx = 2048 | |
| llama_context: n_ctx_per_seq = 2048 | |
| llama_context: n_batch = 512 | |
| llama_context: n_ubatch = 512 | |
| llama_context: causal_attn = 1 | |
| llama_context: flash_attn = enabled | |
| llama_context: kv_unified = false | |
| llama_context: freq_base = 10000.0 | |
| llama_context: freq_scale = 1 | |
| llama_context: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized | |
| [2025-09-12 00:45:53][DEBUG] llama_context: Vulkan_Host output buffer size = 0.12 MiB | |
| [2025-09-12 00:45:53][DEBUG] llama_kv_cache: Vulkan0 KV buffer size = 850.00 MiB | |
| [2025-09-12 00:45:54][DEBUG] llama_kv_cache: size = 850.00 MiB ( 2048 cells, 40 layers, 1/1 seqs), K (q8_0): 425.00 MiB, V (q8_0): 425.00 MiB | |
| [2025-09-12 00:45:54][DEBUG] llama_context: Vulkan0 compute buffer size = 117.01 MiB | |
| llama_context: Vulkan_Host compute buffer size = 14.01 MiB | |
| llama_context: graph nodes = 1247 | |
| llama_context: graph splits = 2 | |
| common_init_from_params: added </s> logit bias = -inf | |
| [2025-09-12 00:45:54][DEBUG] common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048 | |
| common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) | |
| [2025-09-12 00:45:54][DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 9 | |
| [2025-09-12 00:46:40][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "nethena-13b@q8_0", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 2048, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:46:40][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:46:40][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| [2025-09-12 00:46:40][DEBUG] Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14 | |
| Total prompt tokens: 14 | |
| Prompt tokens to decode: 14 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:46:41][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:46:46][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 0.91 ms / 23 runs ( 0.04 ms per token, 25358.32 tokens per second) | |
| llama_perf_context_print: load time = 8894.17 ms | |
| llama_perf_context_print: prompt eval time = 919.45 ms / 14 tokens ( 65.68 ms per token, 15.23 tokens per second) | |
| llama_perf_context_print: eval time = 4928.79 ms / 8 runs ( 616.10 ms per token, 1.62 tokens per second) | |
| llama_perf_context_print: total time = 5850.57 ms / 22 tokens | |
| llama_perf_context_print: graphs reused = 7 | |
| [2025-09-12 00:46:46][INFO][nethena-13b@q8_0] Model generated tool calls: [] | |
| [2025-09-12 00:46:46][INFO][nethena-13b@q8_0] Generated prediction: { | |
| "id": "chatcmpl-rdmujbhvp5pargg7qaws7", | |
| "object": "chat.completion", | |
| "created": 1757656000, | |
| "model": "nethena-13b@q8_0", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " Hello! How can I help you?", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 14, | |
| "completion_tokens": 9, | |
| "total_tokens": 23 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "nethena-13b@q8_0" | |
| } | |
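Editor's note, not part of the original log: the request/response pair above shows the full chat-completion round trip. A minimal sketch reproducing the logged request body (same model name, temperature, and message), assuming the server is still on localhost:12345; the field access mirrors the `choices[0].message.content` structure shown in the logged prediction:

```python
# Minimal sketch of the chat-completion request captured in the log above.
import json
import urllib.request

payload = {
    "model": "nethena-13b@q8_0",       # model name taken from the logged request
    "temperature": 0.7,
    "max_tokens": 2048,
    "messages": [{"role": "user", "content": "Hi"}],
}
req = urllib.request.Request(
    "http://localhost:12345/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
    # The logged response places the assistant text at choices[0].message.content.
    print(reply["choices"][0]["message"]["content"])
```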
| [2025-09-12 00:47:35][DEBUG] Received request: GET to /api/tags | |
| [2025-09-12 00:47:35][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway | |
| [2025-09-12 00:48:08][DEBUG] Received request: GET to /v1/completions/api/tags | |
| [2025-09-12 00:48:08][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway | |
| [2025-09-12 00:48:34][DEBUG] Received request: GET to /v1/completions/api/tags | |
| [2025-09-12 00:48:34][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway | |
| [2025-09-12 00:48:40][DEBUG] Received request: GET to /v1/completions/api/tags | |
| [2025-09-12 00:48:40][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway | |
| [2025-09-12 00:49:04][DEBUG] Received request: GET to /v1/completions/api/tags | |
| [2025-09-12 00:49:04][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway | |
| [2025-09-12 00:49:08][DEBUG] Received request: GET to /v1/completions/api/tags | |
| [2025-09-12 00:49:08][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/tags). Returning 200 anyway | |
| [2025-09-12 00:49:43][DEBUG] Received request: GET to /api/v1/models | |
| [2025-09-12 00:49:43][INFO] Returning 20 models from v1 API | |
| [2025-09-12 00:49:58][DEBUG] Received request: GET to /api/v1/models | |
| [2025-09-12 00:49:58][INFO] Returning 20 models from v1 API | |
| [2025-09-12 00:52:37][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "nethena-13b@q8_0", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 2048, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:52:37][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:52:37][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| [2025-09-12 00:52:37][DEBUG] Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14 | |
| Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 14/14 of prompt (100%), 14 prefix, 0 non-prefix | |
| Total prompt tokens: 14 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:52:38][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:52:43][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 1.01 ms / 24 runs ( 0.04 ms per token, 23692.00 tokens per second) | |
| llama_perf_context_print: load time = 8894.17 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 6072.72 ms / 10 runs ( 607.27 ms per token, 1.65 tokens per second) | |
| llama_perf_context_print: total time = 6077.33 ms / 11 tokens | |
| llama_perf_context_print: graphs reused = 10 | |
| [2025-09-12 00:52:43][INFO][nethena-13b@q8_0] Model generated tool calls: [] | |
| [2025-09-12 00:52:43][INFO][nethena-13b@q8_0] Generated prediction: { | |
| "id": "chatcmpl-r2s94r9fkeqa5nt1tvgna5", | |
| "object": "chat.completion", | |
| "created": 1757656357, | |
| "model": "nethena-13b@q8_0", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " Hello! How can I help you today?", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 14, | |
| "completion_tokens": 10, | |
| "total_tokens": 24 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "nethena-13b@q8_0" | |
| } | |
| [2025-09-12 00:52:55][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "nethena-13b@q8_0", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 2048, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:52:55][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:52:55][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14 | |
| Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 14/14 of prompt (100%), 14 prefix, 0 non-prefix | |
| Total prompt tokens: 14 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:52:56][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:53:01][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 0.93 ms / 23 runs ( 0.04 ms per token, 24651.66 tokens per second) | |
| llama_perf_context_print: load time = 8894.17 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 5397.13 ms / 9 runs ( 599.68 ms per token, 1.67 tokens per second) | |
| llama_perf_context_print: total time = 5398.86 ms / 10 tokens | |
| llama_perf_context_print: graphs reused = 9 | |
| [2025-09-12 00:53:01][INFO][nethena-13b@q8_0] Model generated tool calls: [] | |
| [2025-09-12 00:53:01][INFO][nethena-13b@q8_0] Generated prediction: { | |
| "id": "chatcmpl-fe5qlymqnimr3izqhwkyqq", | |
| "object": "chat.completion", | |
| "created": 1757656375, | |
| "model": "nethena-13b@q8_0", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " Hello! How can I help you?", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 14, | |
| "completion_tokens": 9, | |
| "total_tokens": 23 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "nethena-13b@q8_0" | |
| } | |
| [2025-09-12 00:53:26][DEBUG] Received request: POST to /v1/chat/completions with body { | |
| "model": "nethena-13b@q8_0", | |
| "temperature": 0.7, | |
| "top_p": 1, | |
| "typical_p": 1, | |
| "max_tokens": 2048, | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": "Hi" | |
| } | |
| ] | |
| } | |
| [2025-09-12 00:53:26][INFO][LM STUDIO SERVER] Running chat completion on conversation with 1 messages. | |
| [2025-09-12 00:53:26][DEBUG] Received request: POST to /v1/embeddings with body { | |
| "model": "text-embedding-nomic-embed-text-v1.5", | |
| "input": [ | |
| "Test input" | |
| ] | |
| } | |
| [2025-09-12 00:53:26][INFO] Received request to embed multiple: [ | |
| "Test input" | |
| ] | |
| [2025-09-12 00:53:26][DEBUG] Sampling params: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 | |
| dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1 | |
| top_k = 40, top_p = 1.000, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700 | |
| mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 | |
| Sampling: | |
| logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist | |
| Generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 14 | |
| Looking for non-prefix contiguous prompt sequences of size >= 256 to reuse from cache | |
| Cache reuse summary: 14/14 of prompt (100%), 14 prefix, 0 non-prefix | |
| Total prompt tokens: 14 | |
| Prompt tokens to decode: 1 | |
| BeginProcessingPrompt | |
| [2025-09-12 00:53:27][INFO] Returning embeddings (not shown in logs) | |
| [2025-09-12 00:53:27][DEBUG] FinishedProcessingPrompt. Progress: 100 | |
| [2025-09-12 00:53:32][DEBUG] Target model llama_perf stats: | |
| llama_perf_sampler_print: sampling time = 1.01 ms / 24 runs ( 0.04 ms per token, 23762.38 tokens per second) | |
| llama_perf_context_print: load time = 8894.17 ms | |
| llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second) | |
| llama_perf_context_print: eval time = 5993.23 ms / 10 runs ( 599.32 ms per token, 1.67 tokens per second) | |
| llama_perf_context_print: total time = 5995.08 ms / 11 tokens | |
| llama_perf_context_print: graphs reused = 10 | |
| [2025-09-12 00:53:32][INFO][nethena-13b@q8_0] Model generated tool calls: [] | |
| [2025-09-12 00:53:32][INFO][nethena-13b@q8_0] Generated prediction: { | |
| "id": "chatcmpl-ahoig45pfuw2yw3kjxjc2o", | |
| "object": "chat.completion", | |
| "created": 1757656406, | |
| "model": "nethena-13b@q8_0", | |
| "choices": [ | |
| { | |
| "index": 0, | |
| "message": { | |
| "role": "assistant", | |
| "content": " Hello! How can I help you today?", | |
| "reasoning_content": "", | |
| "tool_calls": [] | |
| }, | |
| "logprobs": null, | |
| "finish_reason": "stop" | |
| } | |
| ], | |
| "usage": { | |
| "prompt_tokens": 14, | |
| "completion_tokens": 10, | |
| "total_tokens": 24 | |
| }, | |
| "stats": {}, | |
| "system_fingerprint": "nethena-13b@q8_0" | |
| } | |
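Editor's note, not part of the original log: the cycle above interleaves a chat completion with an embeddings request ("Returning embeddings (not shown in logs)"). A minimal sketch of the embeddings call, mirroring the logged body; the `data[0].embedding` shape is an assumption based on the OpenAI-compatible API, since the log does not print the response:

```python
# Minimal sketch of the /v1/embeddings request captured in the log above.
import json
import urllib.request

payload = {
    "model": "text-embedding-nomic-embed-text-v1.5",  # name from the logged request
    "input": ["Test input"],
}
req = urllib.request.Request(
    "http://localhost:12345/v1/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
    # Assumed OpenAI-compatible shape: {"data": [{"embedding": [...], ...}]}
    print(len(result["data"][0]["embedding"]))
```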
| [2025-09-12 00:58:45][DEBUG] Received request: GET to /api/v1/models | |
| [2025-09-12 00:58:45][INFO] Returning 20 models from v1 API | |
| [2025-09-12 00:59:34][DEBUG] Received request: GET to /api/tags | |
| [2025-09-12 00:59:34][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway | |
| [2025-09-12 00:59:37][DEBUG] Received request: GET to /api/tags | |
| [2025-09-12 00:59:37][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway | |
| [2025-09-12 01:00:15][DEBUG] Received request: GET to /v1/completions/api/v1/models | |
| [2025-09-12 01:00:15][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway | |
| [2025-09-12 01:00:51][DEBUG] Received request: GET to /v1/completions/api/v1/models | |
| [2025-09-12 01:00:51][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway | |
| [2025-09-12 01:01:13][DEBUG] Received request: GET to /v1/completions/api/v1/models | |
| [2025-09-12 01:01:13][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway | |
| [2025-09-12 01:01:28][DEBUG] Received request: GET to /api/tags | |
| [2025-09-12 01:01:28][ERROR] Unexpected endpoint or method. (GET /api/tags). Returning 200 anyway | |
| [2025-09-12 01:01:30][DEBUG] Received request: GET to /v1/completions/api/v1/models | |
| [2025-09-12 01:01:30][ERROR] Unexpected endpoint or method. (GET /v1/completions/api/v1/models). Returning 200 anyway |