@saforem2
Created January 27, 2025 22:08

Megatron-DeepSpeed on Aurora

Sam Foreman
2025-01-27

Following the instructions from:

https://docs.alcf.anl.gov/aurora/data-science/frameworks/megatron-deepspeed/

  • Log in to a compute node and create an isolated working directory:

    #[03:18:17 PM][aurora-uan-0012][~][⏱️ 1h58m35s]
    $ ssh x4309c4s1b0n0
    
    #[03:21:14 PM][x4309c4s1b0n0][~]
    $ cd /flare/Aurora_deployment/foremans/
    
    #[03:21:28 PM][x4309c4s1b0n0][/flare/Aurora_deployment/foremans]
    $ cd tmp               
    
    #[03:21:30 PM][x4309c4s1b0n0][/flare/Aurora_deployment/foremans/tmp]
    $ NOW=$(tstamp) && mkdir $NOW && cd $NOW
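  The `tstamp` helper used above is not a standard command; a minimal stand-in (an assumption, not the actual function) that yields sortable directory names like the `2025-01-27-152131` seen in later prompts:

  ```shell
  # Hypothetical stand-in for the custom `tstamp` helper used above:
  # emit a sortable timestamp suitable for a scratch-directory name.
  tstamp() { date '+%Y-%m-%d-%H%M%S'; }

  NOW=$(tstamp)
  echo "${NOW}"   # e.g. 2025-01-27-152131
  ```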
  • Clone argonne-lcf/Megatron-DeepSpeed:

    #[03:21:32 PM][x4309c4s1b0n0][/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131]
    $ git clone https://github.com/argonne-lcf/Megatron-DeepSpeed              
    Cloning into 'Megatron-DeepSpeed'...
    remote: Enumerating objects: 16435, done.
    remote: Counting objects: 100% (19/19), done.
    remote: Compressing objects: 100% (10/10), done.
    remote: Total 16435 (delta 12), reused 9 (delta 9), pack-reused 16416 (from 3)
    Receiving objects: 100% (16435/16435), 7.68 MiB | 21.85 MiB/s, done.
    Resolving deltas: 100% (12113/12113), done.
    Updating files: 100% (621/621), done.
    took: 0h:00m:04s
    
    #[03:22:00 PM][x4309c4s1b0n0][/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131][⏱️ 4s]
    $ cd Megatron-DeepSpeed     
    #[03:22:02 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main]
    $ export PBS_O_WORKDIR=$(pwd)
    
    #[03:22:19 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main]
    $ source <(curl -s https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh)
    Using WORKING_DIR: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed
    
    #[03:22:24 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main]
    $ ezpz_setup_env             
    No conda_prefix OR virtual_env found in environment...
    Setting up conda...
    
    Due to MODULEPATH changes, the following have been reloaded:
      1) hwloc/master-git.1793e43-level-zero     2) mpich/opt/4.3.0rc3
    
    The following have been reloaded with a version change:
      1) oneapi/eng-compiler/2024.07.30.002 => oneapi/release/2024.2.1     2) yaksa/0.3-aw2kkvy => yaksa/0.3-euoqglg
    
    Lmod has detected the following error: The following module(s) are unknown: "mpich"
    
    Please check the spelling or version number. Also try "module spider ..."
    It is also possible your cache file is out-of-date; it may help to try:
      $ module --ignore_cache load "mpich"
    
    Also make sure that all modulefiles written in TCL start with the string #%Module
    
    
    
    Found conda at: /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1
    No VIRTUAL_ENV found in environment!
        - Trying to setup from /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1
        - Using VENV_DIR=/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1
    
        - Creating a new virtual env on top of aurora_nre_models_frameworks-2024.2.1_u1 in /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1
    [python] Using /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3
    
    [🍋 ezpz/bin/utils.sh]
        • USER=foremans
        • MACHINE=aurora
        • HOST=x4309c4s1b0n0
        • TSTAMP=2025-01-27-152242
    
    [ezpz_setup_host_pbs]
        • Using hostfile: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
        • Found in environment:
            • HOSTFILE: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
            • Writing PBS vars to: /home/foremans/.pbsenv
    
    [ezpz_save_pbs_env]
        • Setting:
            • HOSTFILE: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
            • JOBENV_FILE: /home/foremans/.pbsenv
    
    [HOSTS]
        • [host:0] - x4102c5s2b0n0.hostmgmt2102.cm.aurora.alcf.anl.gov
        • [host:1] - x4309c3s7b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov
        • [host:2] - x4309c4s0b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov
        • [host:3] - x4309c4s1b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov
    
    [DIST INFO]
        • NGPUS=48
        • NHOSTS=4
        • NGPU_PER_HOST=12
        • HOSTFILE=/var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
        • DIST_LAUNCH=mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni
    
    [LAUNCH]:
        • To launch across all available GPUs, use: launch
    
          launch = mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni
    
    took: 0h:00m:17s
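  The `[DIST INFO]` values above follow directly from the PBS hostfile: one line per node, and 12 GPU tiles per Aurora node. A sketch of the arithmetic (the hostfile contents here are made up for illustration):

  ```shell
  # Sketch: derive the DIST INFO values from a PBS-style hostfile.
  # 12 GPU tiles per node is carried over from the log above.
  HOSTFILE=$(mktemp)
  printf '%s\n' node0 node1 node2 node3 > "${HOSTFILE}"

  NHOSTS=$(grep -c . "${HOSTFILE}")      # one non-empty line per host
  NGPU_PER_HOST=12
  NGPUS=$(( NHOSTS * NGPU_PER_HOST ))

  echo "NHOSTS=${NHOSTS} NGPU_PER_HOST=${NGPU_PER_HOST} NGPUS=${NGPUS}"
  # -> NHOSTS=4 NGPU_PER_HOST=12 NGPUS=48, matching the 4-node job above
  rm -f "${HOSTFILE}"
  ```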
  • Install dependencies:

    #[🐍 aurora_nre_models_frameworks-2024.2.1_u1](👻 aurora_nre_models_frameworks-2024.2.1_u
    #[03:22:42 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main][⏱️ 17s]
    $ python3 -m pip install -e "git+https://github.com/saforem2/ezpz#egg=ezpz" --require-virtualenv
    Obtaining ezpz from git+https://github.com/saforem2/ezpz#egg=ezpz
    Cloning https://github.com/saforem2/ezpz to /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/src/ezpz
    Running command git clone --filter=blob:none --quiet https://github.com/saforem2/ezpz /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/src/ezpz
    Resolved https://github.com/saforem2/ezpz to commit 29138b89ddfc6119c7fd593a12e498d0aee0c8ea
    Installing build dependencies ... done
    Checking if build backend supports build_editable ... done
    Getting requirements to build editable ... done
    Installing backend dependencies ... done
    Preparing editable metadata (pyproject.toml) ... done
    Collecting ambivalent@ git+https://github.com/saforem2/ambivalent
    Cloning https://github.com/saforem2/ambivalent to /tmp/pip-install-rqu4a__2/ambivalent_4bcdc457047c40fc8abdc2207eb76795
    Running command git clone --filter=blob:none --quiet https://github.com/saforem2/ambivalent /tmp/pip-install-rqu4a__2/ambivalent_4bcdc457047c40fc8abdc2207eb76795
    Resolved https://github.com/saforem2/ambivalent to commit eac43ada80b6d4b2f71bf45cee9329993f622e87
    Installing build dependencies ... done
    Getting requirements to build wheel ... done
    Preparing metadata (pyproject.toml) ... done
    Requirement already satisfied: tensorboard in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (2.15.2)
    Requirement already satisfied: mpi4py in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (3.1.6)
    Collecting jaxlib
    Using cached jaxlib-0.5.0-cp310-cp310-manylinux2014_x86_64.whl (102.0 MB)
    Collecting wandb
    Using cached wandb-0.19.4-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.3 MB)
    Requirement already satisfied: joblib in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (1.4.2)
    Collecting pyinstrument
    Using cached pyinstrument-5.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (145 kB)
    Requirement already satisfied: seaborn in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (0.13.2)
    Requirement already satisfied: hydra-core in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (1.3.2)
    Collecting xarray
    Using cached xarray-2025.1.1-py3-none-any.whl (1.2 MB)
    Collecting plotext
    Using cached plotext-5.3.2-py3-none-any.whl (64 kB)
    Collecting jax
    Using cached jax-0.5.0-py3-none-any.whl (2.3 MB)
    Requirement already satisfied: torch in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (2.3.1+cxx11.abi)
    Collecting sentencepiece
    Using cached sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
    Requirement already satisfied: omegaconf in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (2.3.0)
    Requirement already satisfied: tqdm in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (4.67.1)
    Collecting jaxtyping
    Downloading jaxtyping-0.2.37-py3-none-any.whl (56 kB)
        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 kB 2.4 MB/s eta 0:00:00
    Requirement already satisfied: h5py in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (3.12.1)
    Collecting hydra-colorlog
    Using cached hydra_colorlog-1.2.0-py3-none-any.whl (3.6 kB)
    Collecting rich
    Using cached rich-13.9.4-py3-none-any.whl (242 kB)
    Collecting sh
    Using cached sh-2.2.1-py3-none-any.whl (38 kB)
    Requirement already satisfied: ipython in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (8.31.0)
    Requirement already satisfied: ml-dtypes in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ezpz) (0.3.2)
    Requirement already satisfied: requests in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (2.32.3)
    Requirement already satisfied: matplotlib in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (3.5.3)
    Collecting colormaps
    Using cached colormaps-0.4.2-py3-none-any.whl (727 kB)
    Requirement already satisfied: numpy>=1.19.3 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from h5py->ezpz) (1.26.4)
    Collecting colorlog
    Using cached colorlog-6.9.0-py3-none-any.whl (11 kB)
    Requirement already satisfied: packaging in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from hydra-core->ezpz) (24.0)
    Requirement already satisfied: antlr4-python3-runtime==4.9.* in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from hydra-core->ezpz) (4.9.3)
    Requirement already satisfied: PyYAML>=5.1.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from omegaconf->ezpz) (6.0.2)
    Requirement already satisfied: pexpect>4.3 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (4.9.0)
    Requirement already satisfied: decorator in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (5.1.1)
    Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (3.0.50)
    Requirement already satisfied: traitlets>=5.13.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (5.14.3)
    Requirement already satisfied: jedi>=0.16 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (0.19.2)
    Requirement already satisfied: stack_data in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (0.6.3)
    Requirement already satisfied: matplotlib-inline in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (0.1.7)
    Requirement already satisfied: exceptiongroup in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (1.2.2)
    Requirement already satisfied: typing_extensions>=4.6 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (4.12.2)
    Requirement already satisfied: pygments>=2.4.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from ipython->ezpz) (2.19.1)
    Requirement already satisfied: opt_einsum in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from jax->ezpz) (3.4.0)
    Requirement already satisfied: scipy>=1.11.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from jax->ezpz) (1.12.0)
    Collecting ml-dtypes
    Using cached ml_dtypes-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB)
    Collecting wadler-lindig>=0.1.3
    Downloading wadler_lindig-0.1.3-py3-none-any.whl (20 kB)
    Collecting markdown-it-py>=2.2.0
    Using cached markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
    Requirement already satisfied: pandas>=1.2 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from seaborn->ezpz) (1.5.0)
    Requirement already satisfied: werkzeug>=1.0.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (3.1.3)
    Requirement already satisfied: six>1.9 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (1.16.0)
    Requirement already satisfied: google-auth<3,>=1.6.3 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (2.37.0)
    Requirement already satisfied: setuptools>=41.0.0 in ./venvs/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (65.5.0)
    Requirement already satisfied: protobuf!=4.24.0,>=3.19.6 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (4.25.5)
    Requirement already satisfied: grpcio>=1.48.2 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (1.69.0)
    Requirement already satisfied: markdown>=2.6.8 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (3.7)
    Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (0.7.2)
    Requirement already satisfied: absl-py>=0.4 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (2.1.0)
    Requirement already satisfied: google-auth-oauthlib<2,>=0.5 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from tensorboard->ezpz) (1.2.1)
    Requirement already satisfied: networkx in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->ezpz) (3.4.2)
    Requirement already satisfied: jinja2 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->ezpz) (3.1.5)
    Requirement already satisfied: fsspec in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->ezpz) (2024.12.0)
    Requirement already satisfied: sympy in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->ezpz) (1.13.3)
    Requirement already satisfied: filelock in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->ezpz) (3.16.1)
    Requirement already satisfied: click!=8.0.0,>=7.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from wandb->ezpz) (8.1.8)
    Requirement already satisfied: pydantic<3,>=2.6 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from wandb->ezpz) (2.10.5)
    Collecting sentry-sdk>=2.0.0
    Using cached sentry_sdk-2.20.0-py2.py3-none-any.whl (322 kB)
    Collecting setproctitle
    Using cached setproctitle-1.3.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)
    Requirement already satisfied: platformdirs in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from wandb->ezpz) (4.2.2)
    Requirement already satisfied: gitpython!=3.1.29,>=1.0.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from wandb->ezpz) (3.1.44)
    Collecting docker-pycreds>=0.4.0
    Using cached docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
    Requirement already satisfied: psutil>=5.0.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from wandb->ezpz) (6.1.1)
    Collecting pandas>=1.2
    Using cached pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
    Requirement already satisfied: gitdb<5,>=4.0.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from gitpython!=3.1.29,>=1.0.0->wandb->ezpz) (4.0.12)
    Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard->ezpz) (0.4.1)
    Requirement already satisfied: rsa<5,>=3.1.4 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard->ezpz) (4.9)
    Requirement already satisfied: cachetools<6.0,>=2.0.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard->ezpz) (5.5.0)
    Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from google-auth-oauthlib<2,>=0.5->tensorboard->ezpz) (2.0.0)
    Requirement already satisfied: parso<0.9.0,>=0.8.4 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from jedi>=0.16->ipython->ezpz) (0.8.4)
    Collecting mdurl~=0.1
    Using cached mdurl-0.1.2-py3-none-any.whl (10.0 kB)
    Requirement already satisfied: pyparsing>=2.2.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from matplotlib->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (3.2.1)
    Requirement already satisfied: python-dateutil>=2.7 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from matplotlib->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (2.9.0)
    Requirement already satisfied: kiwisolver>=1.0.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from matplotlib->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (1.4.8)
    Requirement already satisfied: pillow>=6.2.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from matplotlib->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (11.1.0)
    Requirement already satisfied: fonttools>=4.22.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from matplotlib->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (4.55.4)
    Requirement already satisfied: cycler>=0.10 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from matplotlib->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (0.12.1)
    Requirement already satisfied: pytz>=2020.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from pandas>=1.2->seaborn->ezpz) (2024.1)
    Collecting tzdata>=2022.7
    Using cached tzdata-2025.1-py2.py3-none-any.whl (346 kB)
    Requirement already satisfied: ptyprocess>=0.5 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from pexpect>4.3->ipython->ezpz) (0.7.0)
    Requirement already satisfied: wcwidth in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython->ezpz) (0.2.13)
    Requirement already satisfied: annotated-types>=0.6.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from pydantic<3,>=2.6->wandb->ezpz) (0.7.0)
    Requirement already satisfied: pydantic-core==2.27.2 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from pydantic<3,>=2.6->wandb->ezpz) (2.27.2)
    Requirement already satisfied: charset-normalizer<4,>=2 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from requests->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (3.3.2)
    Requirement already satisfied: certifi>=2017.4.17 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from requests->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (2024.12.14)
    Requirement already satisfied: idna<4,>=2.5 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from requests->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (3.7)
    Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from requests->ambivalent@ git+https://github.com/saforem2/ambivalent->ezpz) (2.2.1)
    Requirement already satisfied: MarkupSafe>=2.1.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from werkzeug>=1.0.1->tensorboard->ezpz) (3.0.2)
    Requirement already satisfied: asttokens>=2.1.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from stack_data->ipython->ezpz) (3.0.0)
    Requirement already satisfied: executing>=1.2.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from stack_data->ipython->ezpz) (2.1.0)
    Requirement already satisfied: pure-eval in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from stack_data->ipython->ezpz) (0.2.3)
    Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from sympy->torch->ezpz) (1.3.0)
    Requirement already satisfied: smmap<6,>=3.0.1 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from gitdb<5,>=4.0.1->gitpython!=3.1.29,>=1.0.0->wandb->ezpz) (5.0.2)
    Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard->ezpz) (0.6.1)
    Requirement already satisfied: oauthlib>=3.0.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<2,>=0.5->tensorboard->ezpz) (3.2.2)
    Building wheels for collected packages: ezpz, ambivalent
    Building editable for ezpz (pyproject.toml) ... done
    Created wheel for ezpz: filename=ezpz-0.2-py3-none-any.whl size=8397 sha256=90fafced23daa139888b1d14bfb9b3658239e342374f4a95d114b9dd29f0bf94
    Stored in directory: /tmp/pip-ephem-wheel-cache-8bsj59up/wheels/eb/89/66/9ab50e62a2bd66fcc997952b73eb94cdb41a99455b21f42909
    Building wheel for ambivalent (pyproject.toml) ... done
    Created wheel for ambivalent: filename=ambivalent-0.2.0-py3-none-any.whl size=13235 sha256=2e3833397c9f871f02a9bde8c53f363e7123d271eb47ac53945072d64b54f6a8
    Stored in directory: /tmp/pip-ephem-wheel-cache-8bsj59up/wheels/7b/e6/96/887dca4e5d3c307c41d4cf84d23f97791a334efab8f1163d30
    Successfully built ezpz ambivalent
    Installing collected packages: sentencepiece, wadler-lindig, tzdata, sh, setproctitle, sentry-sdk, pyinstrument, plotext, ml-dtypes, mdurl, docker-pycreds, colormaps, colorlog, pandas, markdown-it-py, jaxtyping, jaxlib, xarray, wandb, rich, jax, hydra-colorlog, ambivalent, ezpz
    Attempting uninstall: ml-dtypes
      Found existing installation: ml-dtypes 0.3.2
      Not uninstalling ml-dtypes at /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages, outside environment /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1
      Can't uninstall 'ml-dtypes'. No files were found to uninstall.
    Attempting uninstall: pandas
      Found existing installation: pandas 1.5.0
      Not uninstalling pandas at /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages, outside environment /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1
      Can't uninstall 'pandas'. No files were found to uninstall.
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    tensorflow 2.15.1 requires ml-dtypes~=0.3.1, but you have ml-dtypes 0.5.1 which is incompatible.
    intel-extension-for-tensorflow 2.15.0.1 requires absl-py==1.4.0, but you have absl-py 2.1.0 which is incompatible.
    intel-extension-for-tensorflow 2.15.0.1 requires protobuf<4.24, but you have protobuf 4.25.5 which is incompatible.
    Successfully installed ambivalent-0.2.0 colorlog-6.9.0 colormaps-0.4.2 docker-pycreds-0.4.0 ezpz-0.2 hydra-colorlog-1.2.0 jax-0.5.0 jaxlib-0.5.0 jaxtyping-0.2.37 markdown-it-py-3.0.0 mdurl-0.1.2 ml-dtypes-0.5.1 pandas-2.2.3 plotext-5.3.2 pyinstrument-5.0.1 rich-13.9.4 sentencepiece-0.2.0 sentry-sdk-2.20.0 setproctitle-1.3.4 sh-2.2.1 tzdata-2025.1 wadler-lindig-0.1.3 wandb-0.19.4 xarray-2025.1.1
    
    [notice] A new release of pip is available: 23.0.1 -> 25.0
    [notice] To update, run: pip install --upgrade pip
    took: 0h:01m:42s
    
    #[🐍 aurora_nre_models_frameworks-2024.2.1_u1](👻 aurora_nre_models_frameworks-2024.2.1_u
    #[03:25:31 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main][⏱️ 23s]
    $ python3 -m pip install deepspeed
    Collecting deepspeed
      Downloading deepspeed-0.16.3.tar.gz (1.4 MB)
          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 36.9 MB/s eta 0:00:00
      Preparing metadata (setup.py) ... done
    Requirement already satisfied: einops in ./venvs/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (0.8.0)
    Requirement already satisfied: hjson in ./venvs/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (3.1.0)
    Collecting msgpack
      Using cached msgpack-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (378 kB)
    Requirement already satisfied: ninja in ./venvs/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (1.11.1.3)
    Requirement already satisfied: numpy in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (1.26.4)
    Requirement already satisfied: packaging>=20.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (24.0)
    Requirement already satisfied: psutil in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (6.1.1)
    Requirement already satisfied: py-cpuinfo in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (9.0.0)
    Requirement already satisfied: pydantic>=2.0.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (2.10.5)
    Requirement already satisfied: torch in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (2.3.1+cxx11.abi)
    Requirement already satisfied: tqdm in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from deepspeed) (4.67.1)
    Requirement already satisfied: pydantic-core==2.27.2 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from pydantic>=2.0.0->deepspeed) (2.27.2)
    Requirement already satisfied: annotated-types>=0.6.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from pydantic>=2.0.0->deepspeed) (0.7.0)
    Requirement already satisfied: typing-extensions>=4.12.2 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from pydantic>=2.0.0->deepspeed) (4.12.2)
    Requirement already satisfied: filelock in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->deepspeed) (3.16.1)
    Requirement already satisfied: networkx in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->deepspeed) (3.4.2)
    Requirement already satisfied: sympy in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->deepspeed) (1.13.3)
    Requirement already satisfied: jinja2 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->deepspeed) (3.1.5)
    Requirement already satisfied: fsspec in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from torch->deepspeed) (2024.12.0)
    Requirement already satisfied: MarkupSafe>=2.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from jinja2->torch->deepspeed) (3.0.2)
    Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (from sympy->torch->deepspeed) (1.3.0)
    Building wheels for collected packages: deepspeed
      Building wheel for deepspeed (setup.py) ... done
      Created wheel for deepspeed: filename=deepspeed-0.16.3-py3-none-any.whl size=1549949 sha256=7b6d0b9906f21c6cc74da2ae82a1ab28a93e5acb99b42fea7189a2deabd360f7
      Stored in directory: /home/foremans/.cache/pip/wheels/ca/e2/8f/3a91068b57481b104c9c450a20239ec874f6141f8b3769e0dd
    Successfully built deepspeed
    Installing collected packages: msgpack, deepspeed
    Successfully installed deepspeed-0.16.3 msgpack-1.1.0
    
    [notice] A new release of pip is available: 23.0.1 -> 25.0
    [notice] To update, run: pip install --upgrade pip
    took: 0h:00m:29s
  • Launch training:

    #[🐍 aurora_nre_models_frameworks-2024.2.1_u1](👻 aurora_nre_models_frameworks-2024.2.1_u
    #[03:26:04 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main][⏱️ 29s
    $ TP=2 NLAYERS=10 DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_aGPT_7B.sh
    Using WORKING_DIR: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed
    Running on: aurora
    Found ezpz in /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/deps/ezpz
    Using WORKING_DIR: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed
    Using virtual_env: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1 on top of conda from: /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1
    [python] Using /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3
    
    [🍋 ezpz/bin/utils.sh
        • USER=foremans
        • MACHINE=aurora
        • HOST=x4309c4s1b0n0
        • TSTAMP=2025-01-27-152607
    
    [ezpz_setup_host_pbs]
        • Using hostfile: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
        • Found in environment:
            • HOSTFILE: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
            • Writing PBS vars to: /home/foremans/.pbsenv
    
    [ezpz_save_pbs_env]
        • Setting:
            • HOSTFILE: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
            • JOBENV_FILE: /home/foremans/.pbsenv
    
    [HOSTS]
        • [host:0] - x4102c5s2b0n0.hostmgmt2102.cm.aurora.alcf.anl.gov
        • [host:1] - x4309c3s7b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov
        • [host:2] - x4309c4s0b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov
        • [host:3] - x4309c4s1b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov
    
    [DIST INFO]
        • NGPUS=48
        • NHOSTS=4
        • NGPU_PER_HOST=12
        • HOSTFILE=/var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
        • DIST_LAUNCH=mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni
    
    [LAUNCH]:
        • To launch across all available GPUs, use: launch
    
          launch = mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni
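The `[DIST INFO]` block above is internally consistent: the `-n` value that `launch` passes to `mpiexec` is just hosts × GPUs per host, with `-ppn` set to the per-host count. A quick back-of-the-envelope check (not part of the original session output):

```python
# Sanity-check the ezpz-reported topology: NGPUS = NHOSTS * NGPU_PER_HOST,
# which is what mpiexec's -n / -ppn flags encode in DIST_LAUNCH.
nhosts = 4          # NHOSTS reported above
ngpu_per_host = 12  # NGPU_PER_HOST (12 XPU tiles per Aurora node)
ngpus = nhosts * ngpu_per_host
print(ngpus)  # 48, the -n value in DIST_LAUNCH
```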
    
    
    [ezpz_install] Found ezpz @ /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/src/ezpz
    [install_dependencies] Ensuring all dependencies from /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ALCF/requirements/requirements.txt installed...
    
    [setParams] Using GRAD_ACC_STEPS: 16
    TRAIN_TOKENS=2000000000000 (=2000B tokens)
    TRAIN_ITERS=1271565
    DS_CONFIG: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ds-configs/ds_stage1_mb1_gb384_pp1_bf16.json
    ZS=1, MB=1, GB=384, PP=1, DTYPE=bf16
    {
      "train_batch_size": 384,
      "train_micro_batch_size_per_gpu": 1,
      "gradient_clipping": 1,
      "steps_per_print": 1,
      "gradient_accumulation_steps": 16,
      "zero_force_ds_cpu_optimizer": false,
      "zero_allow_untested_optimizer": true,
      "wall_clock_breakdown": false,
      "zero_optimization": {
        "stage": 1
      },
      "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
      },
      "bfloat16": {
        "enabled": true,
        "loss_scale": 1
      },
      "comms_logger": {
        "enabled": false,
        "verbose": false,
        "debug": false
      },
      "flops_profiler": {
        "enabled": true,
        "profile_step": 2,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true,
        "output_file": null
      }
    }
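The `GRAD_ACC_STEPS: 16` chosen by `setParams` (and echoed as `gradient_accumulation_steps` in the config above) follows from the batch and parallelism settings: with TP=2 and PP=1 on 48 ranks, the data-parallel degree is 24, and a global batch of 384 with micro batch 1 leaves 16 accumulation steps. A hypothetical recomputation, with values copied from this run:

```python
# Derive gradient_accumulation_steps from this run's settings:
#   dp  = world_size / (tp * pp)
#   gas = global_batch / (micro_batch * dp)
world_size = 48      # 4 nodes x 12 XPU tiles
tp, pp = 2, 1        # --tensor-model-parallel-size=2, no pipeline parallelism
micro_batch = 1      # train_micro_batch_size_per_gpu
global_batch = 384   # train_batch_size
dp = world_size // (tp * pp)
gas = global_batch // (micro_batch * dp)
print(dp, gas)  # 24 16
```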
    Checkpoints will be saved to: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash
    
      Please see logs at: logs/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/20250127-152609_48_x4309c4s1b0n0
    Setting up tokenizer with Llama2Tokenizer
    Using data_file_list: ALCF/data-lists/aurora/books.txt
    Using tokenizer: Llama2Tokenizer. Setting up data with ALCF/data-lists/aurora/books.txt
    Calling:  setData() with ALCF/data-lists/aurora/books.txt
    --------------------
    Updated environment:
    DATA_FILE_LIST: ALCF/data-lists/aurora/books.txt
    NUM_DOCS: 3
      WEIGHT_SUM: 0.0072042092147565125
    DFL_STEM: books
    DATA_CACHE_PATH: .cache/books/index-cache
    DATA_FLAGS: 
    --------------------
    [setData] DATA_FLAGS: 
    [setData] TOKENIZER_FLAGS: --tokenizer-type Llama2Tokenizer --tokenizer-model /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ALCF/tokenizer.model
    Requirement already satisfied: pybind11 in ./venvs/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (2.13.6)
    
    make: Nothing to be done for 'default'.
    /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed
    ++++++++++++++++++++++++++++++++++++++++++++++++++
    - MPICH_DIR=/opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.2.1/mpich-4.3.0rc3-hipyfz6
    - Using /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3
    - WORLD_SIZE:48
    - BACKEND: ccl
    - MODEL_TYPE: llama-gb384-seq4096-pp1-tp2-10layers-32heads-4096hidden
    - Using DATA_FILE_LIST: ALCF/data-lists/aurora/books.txt
    ++++++++++++++++++++++++++++++++++++++++++++++++++
    
    Currently Loaded Modules:
      1) gcc-runtime/12.2.0-267awrk   3) mpfr/4.2.1-fhgnwe7   5) gcc/12.2.0         7) cray-pals/1.4.0      9) oneapi/release/2024.2.1  11) frameworks/2024.2.1_u1               13) yaksa/0.3-euoqglg
      2) gmp/6.2.1-yctcuid            4) mpc/1.3.1-ygprpb4    6) libfabric/1.20.1   8) cray-libpals/1.4.0  10) pti-gpu/d3639de          12) hwloc/master-git.1793e43-level-zero  14) mpich/opt/4.3.0rc3
    
    
    
    Saving environment to checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.env
    Not currently running. Continuing!
    Launching with: MPICH
      mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni --pmi=pmix --genvall /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3 -Wignore /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/pretrain_gpt_alcf.py
    Using data_cache_path: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache
    Training Arguments: 
    
    --accumulate-allreduce-grads-in-fp32
    --adam-beta1=0.9
    --adam-beta2=0.95
    --adam-eps=0.00001
    --attention-dropout 0
    --bf16
    --blend-sample-in-corpus
    --clip-grad=1.0
    --data-cache-path=checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache
    --data-file-list=ALCF/data-lists/aurora/books.txt
    --deepspeed
    --deepspeed_config=/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ds-configs/ds_stage1_mb1_gb384_pp1_bf16.json
    --disable-bias-linear
    --distributed-backend=ccl
    --ds-sequence-parallel-size=1
    --eval-interval=100
    --eval-iters=20
    --ffn-hidden-size 11008
    --global-batch-size=384
    --hidden-dropout 0
    --hidden-size=4096
    --load=checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash
    --log-interval=1
    --log-optimizer-states-to-tensorboard
    --log-timers-to-tensorboard
    --lr 0.0002
    --lr-decay-style cosine
    --lr-warmup-fraction 0.05
    --max-position-embeddings=4096
    --micro-batch-size=1
    --no-bias-dropout-fusion
    --no-bias-gelu-fusion
    --no-gradient-accumulation-fusion
    --no-masked-softmax-fusion
    --no-pipeline-parallel
    --no-query-key-layer-scaling
    --normalization rmsnorm
    --num-attention-heads=32
    --num-key-value-heads 8
    --num-layers=10
    --optimizer=adamw
    --pipeline-model-parallel-size=1
    --save=checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash
    --save-interval=50
    --seq-length=4096
    --shuffle-sample-in-corpus
    --split=990,10,0
    --swiglu
    --tensorboard-dir checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/tensorboard
    --tensor-model-parallel-size=2
    --timing-log-level=1
    --tokenizer-type Llama2Tokenizer --tokenizer-model /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ALCF/tokenizer.model
    --train-iters=1271565
    --untie-embeddings-and-output-weights
    --use-checkpoint-opt_param-scheduler
    --use-flash-attn-builder
    --use-rotary-position-embeddings
    --weight-decay=0.1
    --zero-stage=1
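The `--train-iters=1271565` value above is derived from the 2T-token budget shown earlier (`TRAIN_TOKENS=2000000000000`): each iteration consumes global-batch × sequence-length tokens. A quick check (not part of the original output):

```python
# TRAIN_ITERS is the token budget divided by tokens consumed per step.
train_tokens = 2_000_000_000_000         # TRAIN_TOKENS (2000B)
global_batch = 384                       # --global-batch-size
seq_len = 4096                           # --seq-length
tokens_per_iter = global_batch * seq_len # 1,572,864 tokens / iteration
iters = train_tokens // tokens_per_iter
print(iters)  # 1271565
```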
    mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni --pmi=pmix --genvall /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3 -Wignore /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/pretrain_gpt_alcf.py --use-checkpoint-opt_param-scheduler --lr 0.0002 --lr-decay-style cosine --lr-warmup-fraction 0.05 --swiglu --hidden-dropout 0 --attention-dropout 0 --normalization rmsnorm --disable-bias-linear --no-query-key-layer-scaling --use-rotary-position-embeddings --untie-embeddings-and-output-weights --num-key-value-heads 8 --ffn-hidden-size 11008 --use-flash-attn-builder --tokenizer-type Llama2Tokenizer --tokenizer-model /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ALCF/tokenizer.model --log-timers-to-tensorboard --log-optimizer-states-to-tensorboard --tensorboard-dir checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/tensorboard --deepspeed --no-pipeline-parallel --deepspeed_config=/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ds-configs/ds_stage1_mb1_gb384_pp1_bf16.json --zero-stage=1 --bf16 --shuffle-sample-in-corpus --blend-sample-in-corpus --accumulate-allreduce-grads-in-fp32 --no-bias-gelu-fusion --no-bias-dropout-fusion --no-masked-softmax-fusion --no-gradient-accumulation-fusion --optimizer=adamw --tensor-model-parallel-size=2 --pipeline-model-parallel-size=1 --max-position-embeddings=4096 --micro-batch-size=1 --ds-sequence-parallel-size=1 --global-batch-size=384 --split=990,10,0 --timing-log-level=1 --eval-interval=100 --eval-iters=20 --save-interval=50 --log-interval=1 --save=checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash --load=checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash --seq-length=4096 --num-layers=10 --hidden-size=4096 --train-iters=1271565 --distributed-backend=ccl --weight-decay=0.1 --adam-beta1=0.9 --adam-beta2=0.95 --adam-eps=0.00001 --clip-grad=1.0 --num-attention-heads=32 --data-cache-path=checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache --data-file-list=ALCF/data-lists/aurora/books.txt
    [!! NOTE] View output at:
      logs/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/20250127-152609_48_x4309c4s1b0n0/output.log
    Disabling local launch: multi-node application
    Connected to tcp://x4102c5s2b0n0.hostmgmt2102.cm.aurora.alcf.anl.gov:7919
    Launching application 968d94e2-bb10-4ae0-9ecf-6e537424c448
    [2025-01-27 15:26:19,638] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to xpu (auto detect)
    [...identical line repeated once per rank for the remaining 47 ranks (timestamps 15:26:19–15:26:27)...]
    [2025-01-27 15:26:29,360] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,360] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,360] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,363] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,363] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,363] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,365] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,365] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,365] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,371] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,372] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,372] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,374] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,374] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,374] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,376] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,377] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,377] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,379] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,379] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,379] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,381] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,381] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,382] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,384] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,384] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,384] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,386] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,387] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,387] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,389] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,390] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,390] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:29,392] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:29,393] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:29,393] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,565] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,565] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,565] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,565] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2025-01-27 15:26:56,564] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
    [2025-01-27 15:26:56,564] [INFO] [comm.py:652:init_distributed] cdb=None
    [2025-01-27 15:26:56,564] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [... identical init messages repeated for the remaining ranks ...]
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=12, local_rank=0, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=28, local_rank=4, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=14, local_rank=2, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=2, local_rank=2, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=16, local_rank=4, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=18, local_rank=6, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=20, local_rank=8, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=22, local_rank=10, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=4, local_rank=4, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=24, local_rank=0, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=6, local_rank=6, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=26, local_rank=2, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=30, local_rank=6, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=8, local_rank=8, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=32, local_rank=8, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=10, local_rank=10, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=34, local_rank=10, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend ccl
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=36, local_rank=0, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=38, local_rank=2, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=40, local_rank=4, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=42, local_rank=6, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=5, local_rank=5, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=44, local_rank=8, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=46, local_rank=10, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=7, local_rank=7, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=1, local_rank=1, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=3, local_rank=3, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=9, local_rank=9, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=11, local_rank=11, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=17, local_rank=5, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=25, local_rank=1, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=21, local_rank=9, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=31, local_rank=7, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=23, local_rank=11, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=33, local_rank=9, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=13, local_rank=1, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=15, local_rank=3, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=27, local_rank=3, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=37, local_rank=1, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=19, local_rank=7, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=39, local_rank=3, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=29, local_rank=5, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=41, local_rank=5, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=43, local_rank=7, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=35, local_rank=11, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=45, local_rank=9, world_size=48, master_addr=10.115.11.185, master_port=29500
    [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=47, local_rank=11, world_size=48, master_addr=10.115.11.185, master_port=29500
    --------------------------------------------------
    DeepSpeed C++/CUDA extension op report
    --------------------------------------------------
    NOTE: Ops not installed will be just-in-time (JIT) compiled at
          runtime if needed. Op compatibility means that your system
          meet the required dependencies to JIT install the op.
    --------------------------------------------------
    JIT compiled ops requires ninja
    ninja .................. [OKAY]
    --------------------------------------------------
    op name ................ installed .. compatible
    --------------------------------------------------
    deepspeed_not_implemented  [NO] ....... [OKAY]
      [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
      [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
    async_io ............... [NO] ....... [NO]
    cpu_adagrad ............ [NO] ....... [OKAY]
    cpu_adam ............... [NO] ....... [OKAY]
    flash_attn ............. [NO] ....... [OKAY]
    fused_adam ............. [NO] ....... [OKAY]
    transformer_inference .. [NO] ....... [OKAY]
    pack_bits .............. [NO] ....... [OKAY]
    --------------------------------------------------
    DeepSpeed general environment info:
    torch install path ............... ['/opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages/torch']
    torch version .................... 2.3.1+cxx11.abi
    deepspeed install path ........... ['/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages/deepspeed']
    deepspeed info ................... 0.16.3, unknown, unknown
    deepspeed wheel compiled w. ...... torch 2.3 
    shared memory (/dev/shm) size .... 503.18 GB
    [2025-01-27 15:26:57.010367][INFO][ezpz/configs:287] - **** Git info for DeepSpeed: git_hash=3af7eb4b git_branch=main ****
    2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_KVS_MODE changed to be mpi (default:pmi)
    2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_KVS_CONNECTION_TIMEOUT changed to be 3600 (default:120)
    2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_BCAST changed to be double_tree (default:)
    2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_ENABLE_SYCL_KERNELS changed to be 1 (default:0)
    2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_SYCL_ESIMD changed to be 1 (default:0)
    2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_PROCESS_LAUNCHER changed to be pmix (default:hydra)
    2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD changed to be 32768 (default:1000)
    2025:01:27-15:26:57:(206687) |CCL_WARN| CCL_ALLGATHERV_MEDIUM_SIZE_THRESHOLD=0 is unknown to and unused by oneCCL code but is present in the environment, check if it is not mistyped.
    2025:01:27-15:26:57:(206687) |CCL_WARN| CCL_SKIP_SCHEDULER=1 is unknown to and unused by oneCCL code but is present in the environment, check if it is not mistyped.
    [2025-01-27 15:26:57.371214][INFO][ezpz/dist:812] - Using device='xpu' with backend='deepspeed' + 'ccl' for distributed training.
    [2025-01-27 15:26:57.372119][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 0/47] 
    [2025-01-27 15:26:57.371090][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][12/47] 
    [2025-01-27 15:26:57.371067][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][13/47] 
    [2025-01-27 15:26:57.371099][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][36/47] 
    [2025-01-27 15:26:57.371062][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][15/47] 
    [2025-01-27 15:26:57.371083][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][16/47] 
    [2025-01-27 15:26:57.371066][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][17/47] 
    [2025-01-27 15:26:57.371065][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][19/47] 
    [2025-01-27 15:26:57.371081][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][37/47] 
    [2025-01-27 15:26:57.371200][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 1/47] 
    [2025-01-27 15:26:57.371091][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][20/47] 
    [2025-01-27 15:26:57.371113][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][38/47] 
    [2025-01-27 15:26:57.371192][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 7/47] 
    [2025-01-27 15:26:57.371065][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][21/47] 
    [2025-01-27 15:26:57.371069][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][24/47] 
    [2025-01-27 15:26:57.371080][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][39/47] 
    [2025-01-27 15:26:57.371202][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 8/47] 
    [2025-01-27 15:26:57.371093][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][22/47] 
    [2025-01-27 15:26:57.371052][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][25/47] 
    [2025-01-27 15:26:57.371094][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][40/47] 
    [2025-01-27 15:26:57.371192][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 9/47] 
    [2025-01-27 15:26:57.371066][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][23/47] 
    [2025-01-27 15:26:57.371054][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][27/47] 
    [2025-01-27 15:26:57.371082][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][41/47] 
    [2025-01-27 15:26:57.371190][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][11/47] 
    [2025-01-27 15:26:57.371544][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][14/47] 
    [2025-01-27 15:26:57.371079][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][28/47] 
    [2025-01-27 15:26:57.371124][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][44/47] 
    [2025-01-27 15:26:57.371200][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 3/47] 
    [2025-01-27 15:26:57.371171][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][18/47] 
    [2025-01-27 15:26:57.371053][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][29/47] 
    [2025-01-27 15:26:57.371089][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][45/47] 
    [2025-01-27 15:26:57.371234][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 4/47] 
    [2025-01-27 15:26:57.371053][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][31/47] 
    [2025-01-27 15:26:57.371081][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][47/47] 
    [2025-01-27 15:26:57.371205][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 5/47] 
    [2025-01-27 15:26:57.371070][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][32/47] 
    [2025-01-27 15:26:57.371104][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][43/47] 
    [2025-01-27 15:26:57.371268][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 6/47] 
    [2025-01-27 15:26:57.371074][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][33/47] 
    [2025-01-27 15:26:57.371559][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][46/47] 
    [2025-01-27 15:26:57.371496][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][10/47] 
    [2025-01-27 15:26:57.371063][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][34/47] 
    [2025-01-27 15:26:57.371482][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][42/47] 
    [2025-01-27 15:26:57.371710][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 2/47] 
    [2025-01-27 15:26:57.371053][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][35/47] 
    [2025-01-27 15:26:57.371361][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][26/47] 
    [2025-01-27 15:26:57.371618][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][30/47] 
    [2025-01-27 15:26:57.375272][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:69] - Import python modules in 40.977957248687744 seconds
    [2025-01-27 15:26:57.375780][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:70] - ez.setup_torch time: 0.8100743293762207 seconds
    [2025-01-27 15:26:57.376177][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:80] - Setting up W&B from: 0 with AuroraGPT
    [2025-01-27 15:26:57.376572][INFO][ezpz/dist:1065] - Setting up wandb from rank: 0
    [2025-01-27 15:26:57.376937][INFO][ezpz/dist:1066] - Using: WB PROJECT: AuroraGPT
    wandb: Currently logged in as: foremans (aurora_gpt). Use `wandb login --relogin` to force relogin
    wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
    2025-01-27 15:26:57.769668: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
    2025-01-27 15:26:57.769690: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
    2025-01-27 15:26:57.771206: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2025-01-27 15:26:58.541647: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
    2025-01-27 15:26:59.500297: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
    2025-01-27 15:26:59.500450: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /xla/service/gpu/compiled_programs_count. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
    2025-01-27 15:26:59.501493: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_executions. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
    2025-01-27 15:26:59.501503: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_execution_time_usecs. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
    2025-01-27 15:26:59.717453: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
    2025-01-27 15:26:59.747528: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) Level-Zero
    2025-01-27 15:26:59.747878: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
    [... same message repeated for each of the remaining devices ...]
    > setting tensorboard ...
    WARNING: WANDB writing requested but no legit wandb project or experiment name provided, therefore no WANDB logs will be written according to random generated project or experiment name.
    wandb: Tracking run with wandb version 0.19.4
    wandb: Run data is saved locally in /lus/flare/projects/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/wandb/run-20250127_152657-knsggy9p
    wandb: Run `wandb offline` to turn off syncing.
    wandb: Syncing run easy-valley-1380
    wandb: ⭐️ View project at https://wandb.ai/aurora_gpt/AuroraGPT
    wandb: 🚀 View run at https://wandb.ai/aurora_gpt/AuroraGPT/runs/knsggy9
    [2025-01-27 15:27:01.618844][INFO][ezpz/dist:1093] - W&B RUN: [easy-valley-1380](https://wandb.ai/aurora_gpt/AuroraGPT/runs/knsggy9p)
    [2025-01-27 15:27:01.629484][INFO][ezpz/dist:297] - Updating wandb.run: easy-valley-1380 config with "DIST_INFO"
    [2025-01-27 15:27:01.634114][INFO][ezpz/dist:1125] - Running on machine='Aurora'
    using world size: 48, data-parallel-size: 24, sequence-parallel size: 1, tensor-model-parallel size: 2, pipeline-model-parallel size: 1 
    using torch.bfloat16 for parameters ...
    ------------------------ arguments ------------------------
      accumulate_allreduce_grads_in_fp32 .............. True
      adam_beta1 ...................................... 0.9
      adam_beta2 ...................................... 0.95
      adam_eps ........................................ 1e-05
      add_bias_linear ................................. False
      add_position_embedding .......................... False
      adlr_autoresume ................................. False
      adlr_autoresume_interval ........................ 1000
      aml_data_download_path .......................... None
      apply_layernorm_1p .............................. False
      apply_query_key_layer_scaling ................... False
      apply_residual_connection_post_layernorm ........ False
      async_tensor_model_parallel_allreduce ........... False
      attention_dropout ............................... 0.0
      attention_softmax_in_fp32 ....................... False
      barrier_with_L1_time ............................ True
      bert_binary_head ................................ True
      bert_embedder_type .............................. megatron
      bert_load ....................................... None
      bf16 ............................................ True
      bias_dropout_fusion ............................. False
      bias_gelu_fusion ................................ False
      biencoder_projection_dim ........................ 0
      biencoder_shared_query_context_model ............ False
      blend_sample_in_corpus .......................... True
      block_data_path ................................. None
      checkpoint_activations .......................... False
      checkpoint_in_cpu ............................... False
      checkpoint_num_layers ........................... 1
      classes_fraction ................................ 1.0
      clip_grad ....................................... 1.0
      compression_training ............................ False
      consumed_train_samples .......................... 0
      consumed_train_tokens ........................... 0
      consumed_valid_samples .......................... 0
      contigious_checkpointing ........................ False
      cpu_optimizer ................................... False
      cpu_torch_adam .................................. False
      create_moe_param_group .......................... False
      curriculum_learning_legacy ...................... False
      data_cache_path ................................. checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache
      data_efficiency_curriculum_learning ............. False
      data_file_list .................................. ALCF/data-lists/aurora/books.txt
      data_impl ....................................... infer
      data_parallel_random_init ....................... False
      data_parallel_size .............................. 24
      data_path ....................................... None
      data_per_class_fraction ......................... 1.0
      data_sharding ................................... True
      dataloader_type ................................. single
      DDP_impl ........................................ local
      decoder_num_layers .............................. None
      decoder_seq_length .............................. None
      deepscale ....................................... False
      deepscale_config ................................ None
      deepspeed ....................................... True
      deepspeed_activation_checkpointing .............. False
      deepspeed_config ................................ /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ds-configs/ds_stage1_mb1_gb384_pp1_bf16.json
      dino_bottleneck_size ............................ 256
      dino_freeze_last_layer .......................... 1
      dino_head_hidden_size ........................... 2048
      dino_local_crops_number ......................... 10
      dino_local_img_size ............................. 96
      dino_norm_last_layer ............................ False
      dino_teacher_temp ............................... 0.07
      dino_warmup_teacher_temp ........................ 0.04
      dino_warmup_teacher_temp_epochs ................. 30
      distribute_checkpointed_activations ............. False
      distribute_saved_activations .................... False
      distributed_backend ............................. ccl
      distributed_timeout_minutes ..................... 10
      ds_fused_adam ................................... False
      ds_inference .................................... False
      ds_pipeline_enabled ............................. False
      ds_sequence_parallel_size ....................... 1
      embedding_path .................................. None
      embedding_weights_in_fp32 ....................... False
      empty_unused_memory_level ....................... 0
      enable_expert_tensor_parallelism ................ False
      enable_zbh1_exact_semantics ..................... False
      enable_zbh1_pipeline ............................ False
      encoder_num_layers .............................. 10
      encoder_seq_length .............................. 4096
      end_weight_decay ................................ 0.1
      eod_mask_loss ................................... False
      eval_interval ................................... 100
      eval_iters ...................................... 20
      evidence_data_path .............................. None
      exit_duration_in_mins ........................... None
      exit_interval ................................... None
      exit_on_missing_checkpoint ...................... False
      exit_signal_handler ............................. False
      expert_interval ................................. 2
      ffn_hidden_size ................................. 11008
      finetune ........................................ False
      force_ds_sequence_parallel ...................... False
      fp16 ............................................ False
      fp16_lm_cross_entropy ........................... False
      fp32_residual_connection ........................ False
      fp8_amax_compute_algo ........................... most_recent
      fp8_amax_history_len ............................ 1
      fp8_e4m3 ........................................ False
      fp8_hybrid ...................................... False
      fp8_interval .................................... 1
      fp8_margin ...................................... 0
      fp8_wgrad ....................................... True
      global_batch_size ............................... 384
      gradient_accumulation_fusion .................... False
      head_lr_mult .................................... 1.0
      hidden_dropout .................................. 0.0
      hidden_size ..................................... 4096
      hidden_size_teacher ............................. None
      hysteresis ...................................... 2
      ict_head_size ................................... None
      ict_load ........................................ None
      img_h ........................................... 224
      img_w ........................................... 224
      indexer_batch_size .............................. 128
      indexer_log_interval ............................ 1000
      inference ....................................... False
      inference_batch_times_seqlen_threshold .......... 512
      init_method_std ................................. 0.02
      init_method_xavier_uniform ...................... False
      initial_loss_scale .............................. 4294967296
      iter_per_epoch .................................. 1250
      kd .............................................. False
      kd_alpha_ce ..................................... 1
      kd_beta_ce ...................................... 1
      kd_temp ......................................... 1.0
      kill_switch_file ................................ None
      kv_channels ..................................... 128
      layernorm_epsilon ............................... 1e-05
      lazy_mpu_init ................................... None
      load ............................................ checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash
      load_tag ........................................ None
      load_teacher .................................... None
      local_rank ...................................... None
      log_batch_size_to_tensorboard ................... False
      log_interval .................................... 1
      log_learning_rate_to_tensorboard ................ True
      log_loss_scale_to_tensorboard ................... True
      log_memory_to_tensorboard ....................... False
      log_num_zeros_in_grad ........................... False
      log_optimizer_states_to_tensorboard ............. True
      log_params_norm ................................. False
      log_timers_to_tensorboard ....................... True
      log_validation_ppl_to_tensorboard ............... False
      log_world_size_to_tensorboard ................... False
      loss_scale ...................................... None
      loss_scale_window ............................... 1000
      lr .............................................. 0.0002
      lr_decay_iters .................................. None
      lr_decay_samples ................................ None
      lr_decay_style .................................. cosine
      lr_decay_tokens ................................. None
      lr_warmup_fraction .............................. 0.05
      lr_warmup_iters ................................. 0
      lr_warmup_samples ............................... 0
      lr_warmup_tokens ................................ None
      make_vocab_size_divisible_by .................... 128
      mask_factor ..................................... 1.0
      mask_prob ....................................... 0.15
      mask_type ....................................... random
      masked_softmax_fusion ........................... False
      max_position_embeddings ......................... 4096
      max_tokens_to_oom ............................... 12000
      mem_efficient_ln ................................ True
      memory_centric_tiled_linear ..................... False
      merge_file ...................................... None
      micro_batch_size ................................ 1
      min_loss_scale .................................. 1.0
      min_lr .......................................... 0.0
      mlp_type ........................................ standard
      mmap_warmup ..................................... False
      moe_eval_capacity_factor ........................ 1.0
      moe_expert_parallel_size ........................ 1
      moe_loss_coeff .................................. 0.1
      moe_min_capacity ................................ 4
      moe_token_dropping .............................. True
      moe_top2_2nd_expert_sampling .................... True
      moe_train_capacity_factor ....................... 1.0
      mos ............................................. False
      multiprocessing_context ......................... fork
      no_load_lr_state ................................ False
      no_load_optim ................................... None
      no_load_rng ..................................... None
      no_persist_layer_norm ........................... False
      no_pipeline_parallel ............................ True
      no_save_optim ................................... None
      no_save_rng ..................................... None
      normalization ................................... rmsnorm
      num_attention_heads ............................. 32
      num_attention_heads_teacher ..................... None
      num_channels .................................... 3
      num_classes ..................................... 1000
      num_experts ..................................... [1]
      num_experts_switch .............................. None
      num_experts_teacher ............................. [1]
      num_key_value_heads ............................. 8
      num_layers ...................................... 10
      num_layers_per_virtual_pipeline_stage ........... None
      num_layers_teacher .............................. None
      num_workers ..................................... 2
      onnx_safe ....................................... None
      openai_gelu ..................................... False
      optimizer ....................................... adamw
      output_bert_embeddings .......................... False
      overlap_p2p_comm ................................ False
      override_opt_param_scheduler .................... False
      params_dtype .................................... torch.bfloat16
      partition_activations ........................... False
      patch_dim ....................................... 16
      perform_initialization .......................... True
      pipeline_model_parallel_size .................... 1
      pipeline_model_parallel_split_rank .............. None
      profile ......................................... None
      profile_backward ................................ False
      profile_ranks ................................... None
      profile_steps ................................... 2,3
      query_in_block_prob ............................. 0.1
      rampup_batch_size ............................... None
      random_ltd ...................................... False
      rank ............................................ 0
      recompute_granularity ........................... None
      recompute_method ................................ None
      recompute_num_layers ............................ 1
      remote_device ................................... none
      repeated_dataloader ............................. False
      reset_attention_mask ............................ False
      reset_iteration ................................. False
      reset_position_ids .............................. False
      retriever_report_topk_accuracies ................ []
      retriever_score_scaling ......................... False
      retriever_seq_length ............................ 256
      retro_add_retriever ............................. False
      retro_cyclic_train_iters ........................ None
      retro_encoder_attention_dropout ................. 0.1
      retro_encoder_hidden_dropout .................... 0.1
      retro_encoder_layers ............................ 2
      retro_num_neighbors ............................. 2
      retro_num_retrieved_chunks ...................... 2
      retro_return_doc_ids ............................ False
      retro_workdir ................................... None
      return_data_index ............................... False
      rope_theta ...................................... 10000
      rotary_percent .................................. 1.0
      sample_rate ..................................... 1.0
      save ............................................ checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash
      save_interval ................................... 50
      scatter_gather_tensors_in_pipeline .............. True
      scattered_embeddings ............................ False
      schedulefree_for_each ........................... False
      seed ............................................ 1234
      seq_length ...................................... 4096
      sequence_parallel ............................... False
      sgd_momentum .................................... 0.9
      short_seq_prob .................................. 0.1
      shuffle_sample_in_corpus ........................ True
      skip_train ...................................... False
      sophiag_beta1 ................................... 0.9
      sophiag_beta2 ................................... 0.95
      sophiag_rho ..................................... 0.01
      split ........................................... 990,10,0
      split_transformers .............................. False
      squared_relu .................................... False
      standalone_embedding_stage ...................... False
      start_weight_decay .............................. 0.1
      swiglu .......................................... True
      swin_backbone_type .............................. tiny
      synchronize_each_layer .......................... False
      tensor_model_parallel_size ...................... 2
      tensorboard_dir ................................. checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/tensorboard
      tensorboard_log_interval ........................ 1
      tensorboard_queue_size .......................... 1000
      test_data_path .................................. None
      tile_factor ..................................... 1
      timing_log_level ................................ 1
      timing_log_option ............................... minmax
      titles_data_path ................................ None
      tokenizer_model ................................. /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ALCF/tokenizer.model
      tokenizer_type .................................. Llama2Tokenizer
      topk ............................................ 1
      trace_dir ....................................... ./trace/
      train_data_exact_num_epochs ..................... None
      train_data_path ................................. None
      train_desc_path ................................. None
      train_doc_idx_path .............................. None
      train_idx_path .................................. None
      train_iters ..................................... 1271565
      train_iters_to_skip ............................. None
      train_range_to_skip ............................. None
      train_sample_idx_path ........................... None
      train_samples ................................... None
      train_shuffle_idx_path .......................... None
      train_tokens .................................... None
      transformer_impl ................................ local
      transformer_pipeline_model_parallel_size ........ 1
      trust_remote_code ............................... False
      universal_checkpoint ............................ False
      untie_embeddings_and_output_weights ............. True
      use_checkpoint_args ............................. False
      use_checkpoint_opt_param_scheduler .............. True
      use_contiguous_buffers_in_local_ddp ............. True
      use_cpu_initialization .......................... None
      use_dataset_only ................................ False
      use_distributed_optimizer ....................... False
      use_flash_attn .................................. True
      use_flash_attn_builder .......................... True
      use_flash_attn_triton ........................... False
      use_flash_attn_v1 ............................... False
      use_flash_attn_v2 ............................... False
      use_mics ........................................ False
      use_one_sent_docs ............................... False
      use_pin_memory .................................. False
      use_ring_exchange_p2p ........................... False
      use_rotary_position_embeddings .................. True
      use_tutel ....................................... False
      valid_data_path ................................. None
      variable_seq_lengths ............................ False
      virtual_pipeline_model_parallel_size ............ None
      vision_backbone_type ............................ vit
      vision_pretraining .............................. False
      vision_pretraining_type ......................... classify
      vocab_extra_ids ................................. 0
      vocab_file ...................................... None
      vocab_size ...................................... None
      wandb_exp_name .................................. 
      wandb_project ................................... 
      wandb_save_dir .................................. 
      weight_decay .................................... 0.1
      weight_decay_incr_style ......................... constant
      world_size ...................................... 48
      zero_allgather_bucket_size ...................... 0.0
      zero_contigious_gradients ....................... False
      zero_reduce_bucket_size ......................... 0.0
      zero_reduce_scatter ............................. False
      zero_stage ...................................... 1
    -------------------- end of arguments ---------------------
    setting number of micro-batches to constant 16
    > building Llama2Tokenizer tokenizer ...
      > padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
    torch distributed is already initialized, skipping initialization ...
    > initialized tensor model parallel with size 2
    > initialized pipeline model parallel with size 1
    > setting random seeds to 1234 ...
    > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    make: Entering directory '/lus/flare/projects/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/megatron/data'
    make: Nothing to be done for 'default'.
    make: Leaving directory '/lus/flare/projects/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/megatron/data'
    > compiling dataset index builder ...
    >>> done with dataset index builder. Compilation time: 0.181 seconds
    >fused kernel is only supported in cuda, skip loading fused kernel
    [2025-01-27 15:27:01.870188][INFO][megatron/training:185] - time to finish initialize_megatron: 5.465544700622559 seconds
    [2025-01-27 15:27:11.352855][INFO][megatron/training:193] - allreduce call time: 9.482640743255615 seconds
    [2025-01-27 15:27:11.484061][INFO][megatron/training:195] - time to initialize megatron (seconds)=42.589
    [2025-01-27 15:27:11.485145][INFO][megatron/training:96] - [after megatron is initialized] datetime=2025-01-27 15:27:11 
    [2025-01-27 15:27:11.501247][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:87] - building GPT model ...
    [2025-01-27 15:27:11,671] [INFO] [utils.py:781:see_memory_usage] Before Building Model
    [2025-01-27 15:27:11,672] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
    [2025-01-27 15:27:11,672] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 38.39 GB, percent = 3.4%
    >fused kernel is only supported in cuda, skip loading fused kernel
    [2025-01-27 15:27:11,812] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 24
      > number of parameters on (tensor, pipeline) model parallel rank (1, 0)=1017204736
    [2025-01-27 15:27:11.834706][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:147] - --------------------------------------------------------------------------------
    [2025-01-27 15:27:11.835369][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:148] - Number of parameters in model: 1017204736
    [2025-01-27 15:27:11.835818][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:149] - --------------------------------------------------------------------------------
    [2025-01-27 15:27:11,963] [INFO] [utils.py:781:see_memory_usage] After Building Model
    [2025-01-27 15:27:11,970] [INFO] [utils.py:782:see_memory_usage] MA 1.91 GB         Max_MA 1.91 GB         CA 1.91 GB         Max_CA 2 GB 
    [2025-01-27 15:27:11,971] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 39.25 GB, percent = 3.5%
    [2025-01-27 15:27:11.972857][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:157] - Patching tensorboard from checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/tensorboard
    2025-01-27 15:27:12.255633: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
    2025-01-27 15:27:12.255663: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
    2025-01-27 15:27:12.256979: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2025-01-27 15:27:12.817632: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
    2025-01-27 15:27:14.237006: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
    2025-01-27 15:27:14.237225: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /xla/service/gpu/compiled_programs_count. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
    2025-01-27 15:27:14.238601: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_executions. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
    2025-01-27 15:27:14.238612: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_execution_time_usecs. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
    2025-01-27 15:27:14.501613: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
    2025-01-27 15:27:14.532454: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) Level-Zero
    2025-01-27 15:27:14.532802: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
    [2025-01-27 15:27:14.849463][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:164] - Updating WandB run.config: [easy-valley-1380](https://wandb.ai/aurora_gpt/AuroraGPT/runs/knsggy9p)
    [2025-01-27 15:27:14.852175][INFO][ezpz/dist:123] - `model_provider`, {'pre_process': True, 'post_process': True}) took: dt=3.3509s
      > number of parameters on (tensor, pipeline) model parallel rank (0, 0)=1017204736
    [2025-01-27 15:27:14.853564][INFO][ezpz/dist:123] - `get_model`((<function model_provider at 0x1469a39ea8c0>, <ModelType.encoder_or_decoder: 1>)) took: dt=3.3524s
    [2025-01-27 15:27:14.854867][INFO][megatron/utils:368] - > learning rate decay style: cosine
    [2025-01-27 15:27:14.855387][INFO][ezpz/dist:123] - `get_optimizer_param_scheduler`((AdamW (
    Parameter Group 0
        amsgrad: False
        betas: (0.9, 0.95)
        capturable: False
        differentiable: False
        eps: 1e-05
        foreach: None
        fused: None
        lr: 0.0
        lr_mult: 1.0
        maximize: False
        name: wd_no_scale_lr
        wd_mult: 1.0
        weight_decay: 0.1
    
    Parameter Group 1
        amsgrad: False
        betas: (0.9, 0.95)
        capturable: False
        differentiable: False
        eps: 1e-05
        foreach: None
        fused: None
        lr: 0.0
        lr_mult: 1.0
        maximize: False
        name: no_wd_no_scale_lr
        wd_mult: 0.0
        weight_decay: 0.0
    ),)) took: dt=0.0005s
    [2025-01-27 15:27:14.857339][INFO][megatron/training:692] - DeepSpeed is enabled.
    [2025-01-27 15:27:14.857770][INFO][megatron/training:747] - Did NOT catch: ('args.data_efficiency_curriculum_learning' and 'build_train_valid_test_datasets_provider is not None')
    [2025-01-27 15:27:14.858278][INFO][megatron/training:756] - Calling 'deepspeed.initialize'...
    [2025-01-27 15:27:14.858687][INFO][megatron/training:757] - Wrapped with: profiler=<megatron.utils.Profile object at 0x1469a39d4250>
    [2025-01-27 15:27:14,859] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
    [2025-01-27 15:27:14,859] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 24
    [2025-01-27 15:27:16,862] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: True
    [2025-01-27 15:27:16,863] [INFO] [logging.py:128:log_dist] [Rank 0] Using client Optimizer as basic optimizer
    [2025-01-27 15:27:16,863] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
    [2025-01-27 15:27:16,864] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
    [2025-01-27 15:27:16,864] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
    [2025-01-27 15:27:16,864] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
    [2025-01-27 15:27:16,864] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
    [2025-01-27 15:27:16,864] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
    [2025-01-27 15:27:16,864] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
    [2025-01-27 15:27:16,864] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
    [2025-01-27 15:27:17,770] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
    [2025-01-27 15:27:17,770] [INFO] [utils.py:782:see_memory_usage] MA 2.05 GB         Max_MA 2.05 GB         CA 2.06 GB         Max_CA 2 GB 
    [2025-01-27 15:27:17,771] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 53.14 GB, percent = 4.7%
    [2025-01-27 15:27:17,952] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
    [2025-01-27 15:27:17,952] [INFO] [utils.py:782:see_memory_usage] MA 2.05 GB         Max_MA 2.21 GB         CA 2.21 GB         Max_CA 2 GB 
    [2025-01-27 15:27:17,953] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 46.81 GB, percent = 4.1%
    [2025-01-27 15:27:17,953] [INFO] [stage_1_and_2.py:545:__init__] optimizer state initialized
    [2025-01-27 15:27:18,126] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
    [2025-01-27 15:27:18,127] [INFO] [utils.py:782:see_memory_usage] MA 2.05 GB         Max_MA 2.05 GB         CA 2.21 GB         Max_CA 2 GB 
    [2025-01-27 15:27:18,127] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 46.81 GB, percent = 4.1%
    [2025-01-27 15:27:18,128] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
    [2025-01-27 15:27:18,128] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client LR scheduler
    [2025-01-27 15:27:18,128] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.optimizer_param_scheduler.OptimizerParamScheduler object at 0x1469a39d68c0>
    [2025-01-27 15:27:18,128] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)]
    [2025-01-27 15:27:18,129] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
    [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print]   activation_checkpointing_config  {
        "partition_activations": false, 
        "contiguous_memory_optimization": false, 
        "cpu_checkpointing": false, 
        "number_checkpoints": null, 
        "synchronize_checkpoint_boundary": false, 
        "profile": false
    }
    [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
    [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print]   amp_enabled .................. False
    [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print]   amp_params ................... False
    [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print]   autotuning_config ............ {
        "enabled": false, 
        "start_step": null, 
        "end_step": null, 
        "metric_path": null, 
        "arg_mappings": null, 
        "metric": "throughput", 
        "model_info": null, 
        "results_dir": "autotuning_results", 
        "exps_dir": "autotuning_exps", 
        "overwrite": true, 
        "fast": true, 
        "start_profile_step": 3, 
        "end_profile_step": 5, 
        "tuner_type": "gridsearch", 
        "tuner_early_stopping": 5, 
        "tuner_num_trials": 50, 
        "model_info_path": null, 
        "mp_size": 1, 
        "max_train_batch_size": null, 
        "min_train_batch_size": 1, 
        "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
        "min_train_micro_batch_size_per_gpu": 1, 
        "num_tuning_micro_batch_sizes": 3
    }
    [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print]   bfloat16_enabled ............. True
    [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print]   bfloat16_immediate_grad_update  False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   checkpoint_parallel_write_pipeline  False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   checkpoint_tag_validation_enabled  True
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   checkpoint_tag_validation_fail  False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x14684ffb3df0>
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   communication_data_type ...... None
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   curriculum_enabled_legacy .... False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   curriculum_params_legacy ..... False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   data_efficiency_enabled ...... False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   dataloader_drop_last ......... False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   disable_allgather ............ False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   dump_state ................... False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   dynamic_loss_scale_args ...... None
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   eigenvalue_enabled ........... False
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   eigenvalue_gas_boundary_resolution  1
    [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print]   eigenvalue_layer_name ........ bert.encoder.layer
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   eigenvalue_layer_num ......... 0
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   eigenvalue_max_iter .......... 100
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   eigenvalue_stability ......... 1e-06
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   eigenvalue_tol ............... 0.01
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   eigenvalue_verbose ........... False
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   elasticity_enabled ........... False
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   flops_profiler_config ........ {
        "enabled": true, 
        "recompute_fwd_factor": 0.0, 
        "profile_step": 2, 
        "module_depth": -1, 
        "top_modules": 1, 
        "detailed": true, 
        "output_file": null
    }
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   fp16_auto_cast ............... None
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   fp16_enabled ................. False
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   fp16_master_weights_and_gradients  False
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   global_rank .................. 0
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   grad_accum_dtype ............. None
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   gradient_accumulation_steps .. 16
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   gradient_clipping ............ 1.0
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   gradient_predivide_factor .... 1.0
    [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print]   graph_harvesting ............. False
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   initial_dynamic_scale ........ 1
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   load_universal_checkpoint .... False
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   loss_scale ................... 1.0
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   memory_breakdown ............. False
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   mics_hierarchial_params_gather  False
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   mics_shard_size .............. -1
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   nebula_config ................ {
        "enabled": false, 
        "persistent_storage_path": null, 
        "persistent_time_interval": 100, 
        "num_of_version_in_retention": 2, 
        "enable_nebula_load": true, 
        "load_path": null
    }
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   optimizer_legacy_fusion ...... False
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   optimizer_name ............... None
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   optimizer_params ............. None
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   pld_enabled .................. False
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   pld_params ................... False
    [2025-01-27 15:27:18,132] [INFO] [config.py:1003:print]   prescale_gradients ........... False
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   scheduler_name ............... None
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   scheduler_params ............. None
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   seq_parallel_communication_data_type  torch.float32
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   sparse_attention ............. None
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   sparse_gradients_enabled ..... False
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   steps_per_print .............. 1
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   timers_config ................ enabled=True synchronized=True
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   train_batch_size ............. 384
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   train_micro_batch_size_per_gpu  1
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   use_data_before_expert_parallel_  False
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   use_node_local_storage ....... False
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   wall_clock_breakdown ......... True
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   weight_quantization_config ... None
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   world_size ................... 24
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   zero_allow_untested_optimizer  True
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
    [2025-01-27 15:27:18,133] [INFO] [config.py:1003:print]   zero_enabled ................. True
    [2025-01-27 15:27:18,134] [INFO] [config.py:1003:print]   zero_force_ds_cpu_optimizer .. False
    [2025-01-27 15:27:18,134] [INFO] [config.py:1003:print]   zero_optimization_stage ...... 1
    [2025-01-27 15:27:18,134] [INFO] [config.py:989:print_user_config]   json = {
        "train_batch_size": 384, 
        "train_micro_batch_size_per_gpu": 1, 
        "gradient_clipping": 1.0, 
        "steps_per_print": 1, 
        "gradient_accumulation_steps": 16, 
        "zero_force_ds_cpu_optimizer": false, 
        "zero_allow_untested_optimizer": true, 
        "wall_clock_breakdown": false, 
        "zero_optimization": {
            "stage": 1
        }, 
        "fp16": {
            "enabled": false, 
            "loss_scale": 0, 
            "loss_scale_window": 1000, 
            "hysteresis": 2, 
            "min_loss_scale": 1
        }, 
        "bfloat16": {
            "enabled": true, 
            "loss_scale": 1.0
        }, 
        "comms_logger": {
            "enabled": false, 
            "verbose": false, 
            "debug": false
        }, 
        "flops_profiler": {
            "enabled": true, 
            "profile_step": 2, 
            "module_depth": -1, 
            "top_modules": 1, 
            "detailed": true, 
            "output_file": null
        }
    }
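    # --- annotation (not part of the original log): the effective batch size in the
    # config above is consistent: train_batch_size = train_micro_batch_size_per_gpu
    # x gradient_accumulation_steps x data-parallel size = 1 x 16 x 24 = 384, where
    # the data-parallel size 24 follows from 48 ranks / (TP=2 x PP=1). ---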
    [2025-01-27 15:27:18.134311][INFO][megatron/training:767] - 'deepspeed.initialize' took: 3.27604s
    [2025-01-27 15:27:18.138694][INFO][megatron/checkpointing:568] - Unable to load lr_state_dict from lr_state_dict_fp=PosixPath('checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/lr_state_dict_0_of_48.yaml'), but strict=False. Returning empty dictionary: lr_state_dict={}
    [2025-01-27 15:27:18,141] [WARNING] [engine.py:2841:load_checkpoint] Unable to find latest file at checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
    [... identical warning repeated by the remaining ranks ...]
    [2025-01-27 15:27:18.141424][INFO][megatron/utils:368] - WARNING: could not find the metadata file checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash 
    [2025-01-27 15:27:18.142199][INFO][megatron/utils:368] -     will not load any checkpoints and will start from random
    (min, max) time across ranks (ms):
        load-checkpoint ................................: (14.95, 15.08)
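The `load_checkpoint` warnings above are harmless on a fresh run: DeepSpeed looks for a `latest` file inside the checkpoint directory to resolve which step to resume from, and falls back to training from scratch when it is absent. A loose sketch of that lookup (hypothetical helper, not DeepSpeed's actual implementation):

```python
import os


def find_latest_tag(ckpt_dir: str):
    """Return the tag stored in <ckpt_dir>/latest, or None if absent.

    Loosely mimics how DeepSpeed resolves the most recent checkpoint:
    the `latest` file holds the directory name of the last saved step.
    """
    latest_path = os.path.join(ckpt_dir, "latest")
    if not os.path.isfile(latest_path):
        return None  # fresh run: nothing saved yet, as in the log above
    with open(latest_path) as f:
        return f.read().strip()
```

On the first run of this config no `latest` file exists yet, so every rank logs the warning and training starts from step 1.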
    [2025-01-27 15:27:27.718035][INFO][ezpz/dist:123] - `setup_model_and_optimizer`((<function model_provider at 0x1469a39ea8c0>, <ModelType.encoder_or_decoder: 1>), {'teacher': False, 'data_post_process': <function data_post_process at 0x1469a39eab90>, 'build_train_valid_test_datasets_provider': <function train_valid_test_datasets_provider at 0x1469a39eb7f0>}) took: dt=16.2168s
    [2025-01-27 15:27:27.725286][INFO][megatron/training:96] - [after model, optimizer, and learning rate scheduler are built] datetime=2025-01-27 15:27:27 
    [2025-01-27 15:27:27.726306][INFO][megatron/training:1510] - > building train, validation, and test datasets ...
    [2025-01-27 15:27:27.726859][INFO][megatron/training:1493] -  > datasets target sizes (minimum size):
    [2025-01-27 15:27:27.727356][INFO][megatron/training:1494] -     train:      488280960
    [2025-01-27 15:27:27.727827][INFO][megatron/training:1495] -     validation: 97658880
    [2025-01-27 15:27:27.728241][INFO][megatron/training:1496] -     test:       7680
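These target sizes follow from the global batch size and the planned iteration counts. A quick sanity check using numbers from this log (the `1271565` train-iteration count appears in the first training-log line further down; `eval_iters = 20` is an assumption backed out of the test target, and the validation target additionally folds in the eval interval, which is not shown in this chunk):

```python
# Numbers from the log; eval_iters is inferred, not logged directly.
global_batch_size = 384
train_iters = 1_271_565   # "iteration= 1/ 1271565" in the training log
eval_iters = 20           # assumed: test target / global batch size

train_samples = global_batch_size * train_iters
test_samples = global_batch_size * eval_iters
print(train_samples, test_samples)  # 488280960 7680
```

Both products match the "datasets target sizes" reported above.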
    [2025-01-27 15:27:27.728652][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:465] - > building train, validation, and test datasets for GPT ...
    [2025-01-27 15:27:27.729098][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:468] - Reading datasets from ALCF/data-lists/aurora/books.txt
    [2025-01-27 15:27:27.792129][WARNING][utils/_logger.megatron.data.gpt_dataset:68] -  > WARNING: could not find index map files, building on rank 0
        using:
          number of documents:       24114
          number of epochs:          353
          sequence length:           4096
          total number of samples:   211911109
    > building indices for blendable datasets ...
      > sample ratios:
        dataset 0, input: 0.430653, achieved: 0.430653
        dataset 1, input: 0.430584, achieved: 0.430584
        dataset 2, input: 0.138763, achieved: 0.138763
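The blendable dataset draws from each corpus in proportion to its weight, which is why the achieved ratios match the inputs exactly. A rough sketch of how those weights translate into per-corpus sample counts (weights copied from the log; the real implementation assigns samples index-by-index via a shuffled blend rather than in bulk):

```python
# The three "input" ratios printed above; they already sum to 1.0.
weights = [0.430653, 0.430584, 0.138763]
total_samples = 490_722_366  # blended train-set size reported in this log

# Naive apportioning: each corpus contributes weight * total samples.
per_dataset = [round(w / sum(weights) * total_samples) for w in weights]
```

Summing `per_dataset` recovers the blended total (up to rounding), matching the "achieved" ratios in the log.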
    [2025-01-27 15:28:02.702761][INFO][data/gpt_dataset.megatron.data.gpt_dataset:191] - [BuildConcatDataset] Caught args.shuffle_sample_in_corpus=True across 490722366 samples
    [2025-01-27 15:28:02.707002][WARNING][utils/_logger.megatron.data.gpt_dataset:68] -  > WARNING: could not find index map files, building on rank 0
        using:
          number of documents:       244
          number of epochs:          6889
          sequence length:           4096
          total number of samples:   42269595
    > building indices for blendable datasets ...
      > sample ratios:
        dataset 0, input: 0.430653, achieved: 0.430653
        dataset 1, input: 0.430584, achieved: 0.430584
        dataset 2, input: 0.138763, achieved: 0.138763
    [2025-01-27 15:28:09.319430][INFO][data/gpt_dataset.megatron.data.gpt_dataset:191] - [BuildConcatDataset] Caught args.shuffle_sample_in_corpus=True across 98147176 samples
      > WARNING: could not find index map files for blendable dataset, building indices on rank 0 ...
    > building indices for blendable datasets ...
      > sample ratios:
        dataset 0, input: 1, achieved: 1
    [2025-01-27 15:28:12.134453][INFO][data/blendable_dataset.megatron.data.blendable_dataset:52] - > elapsed time for building blendable dataset indices: 2.80 (sec)
    [2025-01-27 15:28:17.021709][INFO][data/blendable_dataset.megatron.data.blendable_dataset:87] -  > finished saving index map files in 4.886175542000274 seconds
    [2025-01-27 15:28:17.023655][INFO][data/blendable_dataset.megatron.data.blendable_dataset:112] - > loading blendable dataset index: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache/82b02ab7f8cd8f2eb97205bd2481c4df_index.npy
    [2025-01-27 15:28:17.050282][INFO][data/blendable_dataset.megatron.data.blendable_dataset:115] - > loading blendable dataset sample index: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache/82b02ab7f8cd8f2eb97205bd2481c4df_sample_index.npy
    [2025-01-27 15:28:17.053629][INFO][data/blendable_dataset.megatron.data.blendable_dataset:118] - > finished loading in 0.02997312500156113 seconds
    [2025-01-27 15:28:17.099847][INFO][data/blendable_dataset.megatron.data.blendable_dataset:130] - > size of blendable dataset: 490722366 samples
      > WARNING: could not find index map files for blendable dataset, building indices on rank 0 ...
    > building indices for blendable datasets ...
      > sample ratios:
        dataset 0, input: 1, achieved: 1
    [2025-01-27 15:28:17.646764][INFO][data/blendable_dataset.megatron.data.blendable_dataset:52] - > elapsed time for building blendable dataset indices: 0.52 (sec)
    [2025-01-27 15:28:18.721111][INFO][data/blendable_dataset.megatron.data.blendable_dataset:87] -  > finished saving index map files in 1.0732649540004786 seconds
    [2025-01-27 15:28:18.723038][INFO][data/blendable_dataset.megatron.data.blendable_dataset:112] - > loading blendable dataset index: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache/3be1e743e088817f147da28b5384ce8d_index.npy
    [2025-01-27 15:28:18.738154][INFO][data/blendable_dataset.megatron.data.blendable_dataset:115] - > loading blendable dataset sample index: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache/3be1e743e088817f147da28b5384ce8d_sample_index.npy
    [2025-01-27 15:28:18.741530][INFO][data/blendable_dataset.megatron.data.blendable_dataset:118] - > finished loading in 0.018488385001546703 seconds
    [2025-01-27 15:28:18.768816][INFO][data/blendable_dataset.megatron.data.blendable_dataset:130] - > size of blendable dataset: 98147176 samples
    [2025-01-27 15:28:18.772962][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:515] - > finished creating GPT datasets. Took: 17090845417855.19922s
    [2025-01-27 15:28:18.773597][INFO][ezpz/dist:123] - `train_valid_test_datasets_provider`(([488280960, 97658880, 7680],)) took: dt=51.0449s
    [2025-01-27 15:28:18.774290][INFO][ezpz/dist:123] - `build_train_valid_test_datasets`((<function train_valid_test_datasets_provider at 0x1469a39eb7f0>,)) took: dt=51.0474s
    [2025-01-27 15:28:18.943567][INFO][ezpz/dist:123] - `build_train_valid_test_data_loaders`((<function train_valid_test_datasets_provider at 0x1469a39eb7f0>,)) took: dt=51.2172s
    [2025-01-27 15:28:21.477616][INFO][ezpz/dist:123] - `build_train_valid_test_data_iterators`((<function train_valid_test_datasets_provider at 0x1469a39eb7f0>,)) took: dt=53.7512s
    [2025-01-27 15:28:23.271739][INFO][megatron/training:96] - [after dataloaders are built] datetime=2025-01-27 15:28:23 
    [2025-01-27 15:28:23.272761][INFO][megatron/training:287] - done with setup ...
    (min, max) time across ranks (ms):
        model-and-optimizer-setup ......................: (16163.47, 16223.38)
        train/valid/test-data-iterators-setup ..........: (51051.70, 55543.92)
    [2025-01-27 15:28:23.278723][INFO][megatron/training:293] - training ...
    [2025-01-27 15:28:23.297907][INFO][megatron/training:96] - [before the start of training step] datetime=2025-01-27 15:28:23 
    [2025-01-27 15:28:33,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 547.48 | optimizer_gradients: 16.65 | optimizer_step: 57.34
    [2025-01-27 15:28:33,108] [INFO] [logging.py:128:log_dist] [Rank 0] step=1, skipped=0, lr=[3.1457298683118837e-09, 3.1457298683118837e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
    [2025-01-27 15:28:33,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2842.71 | bwd_microstep: 4363.09 | bwd_inner_microstep: 3903.83 | bwd_allreduce_microstep: 459.02 | step_microstep: 2259.08
    [2025-01-27 15:28:33,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2842.72 | bwd: 4363.08 | bwd_inner: 3903.88 | bwd_allreduce: 459.02 | step: 2259.08
    [2025-01-27 15:28:33.129475][INFO][megatron/training_log:661] -  iteration=       1/ 1271565 | consumed_samples=         384 | consumed_tokens=     1572864 | elapsed_time_per_iteration_ms=9845.0 | learning_rate=3.14573e-09 | global_batch_size=  384 | lm loss=11.208542 | loss_scale=1.0 | grad_norm=16.175 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=39.005 | tokens_per_gpu_per_second_tgs=3328.401 | [LM]TFLOPs=44.71 | [DS]TFLOPs=43.03 |
    [2025-01-27 15:28:33.131878][INFO][megatron/utils:249] - [Rank 0] (after 1 iterations) memory (MB) | allocated: 2427.544921875 | max allocated: 9358.35107421875 | reserved: 10778.0 | max reserved: 10778.0
    (min, max) time across ranks (ms):
        forward-backward ...............................: (7524.37, 7549.68)
        optimizer ......................................: (2256.28, 2259.51)
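The throughput figures in the first training-log line can be reproduced from the other logged quantities (the world size of 48 comes from the `ws48` prefix in the checkpoint tag; step 1 is slower than steady state because of one-time warmup costs, so these numbers will improve in later iterations):

```python
# Numbers from the first training step above.
global_batch_size = 384
seq_len = 4096
world_size = 48            # "ws48" in the checkpoint tag
elapsed_s = 9.845          # elapsed_time_per_iteration_ms=9845.0

tokens_per_iter = global_batch_size * seq_len      # consumed_tokens per step
tgs = tokens_per_iter / elapsed_s / world_size     # tokens / GPU / second
sps = global_batch_size / elapsed_s                # samples / second

print(round(tgs, 1))  # 3328.4 (log reports 3328.401 from the unrounded time)
print(round(sps, 3))  # 39.005
```

Both values match `tokens_per_gpu_per_second_tgs` and `samples_per_second` in the log to within rounding of the elapsed time.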
    [2025-01-27 15:28:38,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 146.39 | optimizer_gradients: 0.43 | optimizer_step: 1.08
    [2025-01-27 15:28:38,851] [INFO] [logging.py:128:log_dist] [Rank 0] step=2, skipped=0, lr=[6.291459736623767e-09, 6.291459736623767e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
    [2025-01-27 15:28:38,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 1417.73 | bwd_microstep: 3879.56 | bwd_inner_microstep: 3510.52 | bwd_allreduce_microstep: 368.84 | step_microstep: 153.12
    [2025-01-27 15:28:38,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 1417.75 | bwd: 3879.56 | bwd_inner: 3510.57 | bwd_allreduce: 368.84 | step: 153.12
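Note that the learning rate exactly doubles from step 1 to step 2, which is what linear warmup predicts at the very start of training: `lr(step) = peak_lr * step / warmup_steps`, so consecutive early steps scale with the step count. The peak LR and warmup length are not shown in this chunk, so only the ratio can be checked:

```python
# Learning rates logged at step 1 and step 2 above.
lr_step1 = 3.1457298683118837e-09
lr_step2 = 6.291459736623767e-09

# Under linear warmup, lr at step 2 is exactly twice lr at step 1.
assert abs(lr_step2 / lr_step1 - 2.0) < 1e-12
```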