ubuntu@t120h-p100-ubuntu20:~/llama.cpp$ ./build/bin/llama-cli -hf unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q4_K_M -no-cnv --prompt "自己紹介してください" -ngl 65 --split-mode row
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: Tesla K80, compute capability 3.7, VMM: yes
  Device 1: Tesla K80, compute capability 3.7, VMM: yes
  Device 2: Tesla K80, compute capability 3.7, VMM: yes
  Device 3: Tesla K80, compute capability 3.7, VMM: yes
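All four devices report compute capability 3.7: a Tesla K80 is two GK210 GPUs on one board, so two K80s enumerate as four CUDA devices. Kepler sm_37 support was dropped after CUDA 11.x, and llama.cpp's fastest CUDA paths target newer architectures, so this hardware runs on fallback kernels. To confirm what the driver sees, independent of llama.cpp, a plain device listing suffices:

  nvidia-smi -L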
build: 4997 (d3f1f0ac) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
common_download_file: previous metadata file found /home/ubuntu/.cache/llama.cpp/unsloth_DeepSeek-R1-Distill-Qwen-32B-GGUF_DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf.json: {"etag":"\"1c2e5c0c8ee0dca2a6ed2fd6944d714a-1000\"","lastModified":"Mon, 20 Jan 2025 17:58:25 GMT","url":"https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf"}
curl_perform_with_retry: Trying to download from https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf (attempt 1 of 3)...
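The -hf flag resolves the GGUF from Hugging Face and caches it under ~/.cache/llama.cpp/, revalidating against the stored etag (the metadata hit above) before re-downloading. If you would rather manage the file yourself, a roughly equivalent manual flow (local path is illustrative) would be:

  huggingface-cli download unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf --local-dir ./models
  ./build/bin/llama-cli -m ./models/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf -no-cnv --prompt "自己紹介してください" -ngl 65 --split-mode row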
llama_model_load_from_file_impl: using device CUDA0 (Tesla K80) - 11357 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Tesla K80) - 11357 MiB free
llama_model_load_from_file_impl: using device CUDA2 (Tesla K80) - 11357 MiB free
llama_model_load_from_file_impl: using device CUDA3 (Tesla K80) - 12123 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 771 tensors from /home/ubuntu/.cache/llama.cpp/unsloth_DeepSeek-R1-Distill-Qwen-32B-GGUF_DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv 3: general.organization str = Deepseek Ai
llama_model_loader: - kv 4: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 5: general.size_label str = 32B
llama_model_loader: - kv 6: qwen2.block_count u32 = 64
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 15
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q6_K: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 18.48 GiB (4.85 BPW)
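The reported density is self-consistent: 32.76e9 parameters × 4.85 bits ÷ 8 ≈ 19.86e9 bytes, and 19.86e9 ÷ 1024³ ≈ 18.5 GiB, matching the 18.48 GiB file size. Q4_K_M reaches that average by holding most weights at q4_K (385 tensors here) while promoting quantization-sensitive ones (65 tensors) to q6_K, with the small f32 tensors (norms and biases) left unquantized.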
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 64
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
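The grouped-query-attention figures follow directly from the head counts: n_gqa = n_head / n_head_kv = 40 / 8 = 5 query heads per shared KV head, and n_embd_k_gqa = n_head_kv × n_embd_head_k = 8 × 128 = 1024, the per-layer K width (likewise V) that the KV cache must store.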
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 32B
print_info: model params = 32.76 B
print_info: general.name = DeepSeek R1 Distill Qwen 32B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors: CUDA0_Split model buffer size = 4545.94 MiB
load_tensors: CUDA1_Split model buffer size = 4401.56 MiB
load_tensors: CUDA2_Split model buffer size = 4365.47 MiB
load_tensors: CUDA3_Split model buffer size = 5191.11 MiB
load_tensors: CUDA0 model buffer size = 1.06 MiB
load_tensors: CUDA1 model buffer size = 1.06 MiB
load_tensors: CUDA2 model buffer size = 1.06 MiB
load_tensors: CUDA3 model buffer size = 1.08 MiB
load_tensors: CPU_Mapped model buffer size = 417.66 MiB
................................................................................................
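With --split-mode row, each weight matrix is sliced row-wise across all four devices (the CUDA*_Split buffers above, roughly 4.3 to 5.1 GiB per card) instead of assigning whole layers to each GPU as the default layer split does. Row split trades extra cross-device synchronization per matrix multiply for more even load, and it tends to hurt on systems without fast peer-to-peer transfers. The layer-split equivalent, optionally with an explicit per-device ratio via -ts, would be:

  ./build/bin/llama-cli -hf unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q4_K_M -no-cnv --prompt "自己紹介してください" -ngl 65 --split-mode layer -ts 1,1,1,1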
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
init: CUDA0 KV buffer size = 256.00 MiB
init: CUDA1 KV buffer size = 256.00 MiB
init: CUDA2 KV buffer size = 256.00 MiB
init: CUDA3 KV buffer size = 256.00 MiB
llama_context: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
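The KV cache size checks out against the printed dimensions: n_layer × n_ctx × (n_embd_k_gqa + n_embd_v_gqa) × 2 bytes (f16) = 64 × 4096 × 2048 × 2 = 1,073,741,824 bytes, exactly 1024 MiB, placed as 256 MiB on each of the four GPUs. It grows linearly with context, so the full n_ctx_train of 131072 (-c 131072) would need 32 GiB of KV cache alone, which these ~12 GiB cards cannot hold on top of the weights.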
llama_context: CUDA0 compute buffer size = 368.00 MiB
llama_context: CUDA1 compute buffer size = 368.00 MiB
llama_context: CUDA2 compute buffer size = 368.00 MiB
llama_context: CUDA3 compute buffer size = 368.00 MiB
llama_context: CUDA_Host compute buffer size = 18.01 MiB
llama_context: graph nodes = 2374
llama_context: graph splits = 5
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 40
system_info: n_threads = 40 (n_threads_batch = 40) / 80 | CUDA : USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 3920427018
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
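These are llama-cli's stock sampling defaults (temp 0.8, top-k 40, top-p 0.95, min-p 0.05), not values tuned for this model; DeepSeek's R1 documentation suggests a lower temperature, around 0.6, for the distills. Overriding them uses the standard flags appended to the same command:

  ./build/bin/llama-cli ... --temp 0.6 --top-p 0.95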
自己紹介してください is1462115borborborborborborborborborTIMERTIMER稀缺elibGMEMTIMERTIMERborborborborborTIMERCHANTمديريةTIMERمديريةCHANT_sensitiveTIMERTIMERTIMERGMEMمديريةمديريةTIMERTIMERborCHANTCHANT
llama_perf_sampler_print: sampling time = 7.38 ms / 53 runs ( 0.14 ms per token, 7179.63 tokens per second)
llama_perf_context_print: load time = 18924.89 ms
llama_perf_context_print: prompt eval time = 11728.00 ms / 6 tokens ( 1954.67 ms per token, 0.51 tokens per second)
llama_perf_context_print: eval time = 518277.90 ms / 46 runs (11266.91 ms per token, 0.09 tokens per second)
llama_perf_context_print: total time = 530430.09 ms / 52 tokens
Interrupted by user
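The run itself failed qualitatively: the prompt 自己紹介してください (Japanese for "Please introduce yourself") should produce an ordinary self-introduction, but after echoing the prompt the model degenerates into junk tokens (borbor..., TIMERTIMER..., and stray tokens from other scripts) at about 0.09 tokens per second until interrupted. Garbage output without a crash is the classic signature of a numerically broken CUDA path, and garbled generations on compute-3.7 Kepler cards, particularly with row split, have been reported against llama.cpp before, though this log alone cannot prove the cause. Two cheap isolation steps, changing nothing else: rerun with the default layer split, and rerun fully on CPU as a correctness baseline:

  ./build/bin/llama-cli -hf unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q4_K_M -no-cnv --prompt "自己紹介してください" -ngl 65 --split-mode layer
  ./build/bin/llama-cli -hf unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q4_K_M -no-cnv --prompt "自己紹介してください" -ngl 0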