Sam Foreman
2025-01-27
Following the instructions from:
https://docs.alcf.anl.gov/aurora/data-science/frameworks/megatron-deepspeed/
-
Log in to a compute node and create an isolated working directory:
#[03:18:17 PM][aurora-uan-0012][~][⏱️ 1h58m35s]
$ ssh x4309c4s1b0n0
#[03:21:14 PM][x4309c4s1b0n0][~]
$ cd /flare/Aurora_deployment/foremans/
#[03:21:28 PM][x4309c4s1b0n0][/flare/Aurora_deployment/foremans]
$ cd tmp
#[03:21:30 PM][x4309c4s1b0n0][/flare/Aurora_deployment/foremans/tmp]
$ NOW=$(tstamp) && mkdir $NOW && cd $NOW
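The scratch-directory step can be sketched in plain shell; `tstamp` in the transcript is assumed to behave like the `date` call below, and the base path here is a stand-in for /flare/Aurora_deployment/$USER/tmp:

```shell
# Create a uniquely named scratch directory and work inside it.
# `date +%Y-%m-%d-%H%M%S` stands in for the `tstamp` helper above.
NOW=$(date +%Y-%m-%d-%H%M%S)
SCRATCH="${TMPDIR:-/tmp}/scratch-${NOW}"
mkdir -p "${SCRATCH}" && cd "${SCRATCH}"
pwd
```

Each run lands in a fresh timestamped directory, so repeated experiments never collide.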
-
Clone argonne-lcf/Megatron-DeepSpeed:
#[03:21:32 PM][x4309c4s1b0n0][/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131]
$ git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
Cloning into 'Megatron-DeepSpeed'...
remote: Enumerating objects: 16435, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 16435 (delta 12), reused 9 (delta 9), pack-reused 16416 (from 3)
Receiving objects: 100% (16435/16435), 7.68 MiB | 21.85 MiB/s, done.
Resolving deltas: 100% (12113/12113), done.
Updating files: 100% (621/621), done.
took: 0h:00m:04s
#[03:22:00 PM][x4309c4s1b0n0][/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131][⏱️ 4s]
$ cd Megatron-DeepSpeed
#[03:22:02 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main]
$ export PBS_O_WORKDIR=$(pwd)
#[03:22:19 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main]
$ source <(curl -s https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh)
Using WORKING_DIR: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed
#[03:22:24 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main]
$ ezpz_setup_env
No conda_prefix OR virtual_env found in environment...
Setting up conda...

Due to MODULEPATH changes, the following have been reloaded:
  1) hwloc/master-git.1793e43-level-zero  2) mpich/opt/4.3.0rc3

The following have been reloaded with a version change:
  1) oneapi/eng-compiler/2024.07.30.002 => oneapi/release/2024.2.1
  2) yaksa/0.3-aw2kkvy => yaksa/0.3-euoqglg

Lmod has detected the following error: The following module(s) are unknown: "mpich"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "mpich"
Also make sure that all modulefiles written in TCL start with the string #%Module

Found conda at: /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1
No VIRTUAL_ENV found in environment!
- Trying to setup from /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1
- Using VENV_DIR=/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1
- Creating a new virtual env on top of aurora_nre_models_frameworks-2024.2.1_u1 in /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1
[python] Using /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3
[🍋 ezpz/bin/utils.sh]
• USER=foremans • MACHINE=aurora • HOST=x4309c4s1b0n0 • TSTAMP=2025-01-27-152242
[ezpz_setup_host_pbs]
• Using hostfile: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
• Found in environment:
  • HOSTFILE: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
• Writing PBS vars to: /home/foremans/.pbsenv
[ezpz_save_pbs_env]
• Setting:
  • HOSTFILE: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
  • JOBENV_FILE: /home/foremans/.pbsenv
[HOSTS]
• [host:0] - x4102c5s2b0n0.hostmgmt2102.cm.aurora.alcf.anl.gov
• [host:1] - x4309c3s7b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov
• [host:2] - x4309c4s0b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov
• [host:3] - x4309c4s1b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov
[DIST INFO]
• NGPUS=48
• NHOSTS=4
• NGPU_PER_HOST=12
• HOSTFILE=/var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
• DIST_LAUNCH=mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni
[LAUNCH]:
• To launch across all available GPUs, use: launch
  launch = mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni
took: 0h:00m:17s
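The [DIST INFO] values follow mechanically from the hostfile. A minimal sketch of that arithmetic, assuming 12 GPUs (XPU tiles) per Aurora node and using a throwaway 4-line hostfile as a stand-in for the scheduler's $PBS_NODEFILE:

```shell
# Re-derive NHOSTS / NGPUS / the mpiexec launch line from a hostfile.
HOSTFILE="${HOSTFILE:-hostfile.txt}"
printf 'host0\nhost1\nhost2\nhost3\n' > "${HOSTFILE}"   # 4 nodes, as in this job
NGPU_PER_HOST=12                                        # per Aurora node
NHOSTS=$(wc -l < "${HOSTFILE}")
NGPUS=$(( NHOSTS * NGPU_PER_HOST ))
echo "launch = mpiexec --verbose --envall -n ${NGPUS} -ppn ${NGPU_PER_HOST} --hostfile ${HOSTFILE} --cpu-bind depth -d 8"
```

With 4 hosts this reproduces `-n 48 -ppn 12`, matching the `launch` alias ezpz defines.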
-
Install dependencies:
#[🐍 aurora_nre_models_frameworks-2024.2.1_u1](👻 aurora_nre_models_frameworks-2024.2.1_u1)
#[03:22:42 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main][⏱️ 17s]
$ python3 -m pip install -e "git+https://github.com/saforem2/ezpz#egg=ezpz" --require-virtualenv
(pip output condensed; key lines below)
Obtaining ezpz from git+https://github.com/saforem2/ezpz#egg=ezpz
  Cloning https://github.com/saforem2/ezpz to /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/src/ezpz
  Resolved https://github.com/saforem2/ezpz to commit 29138b89ddfc6119c7fd593a12e498d0aee0c8ea
  Preparing editable metadata (pyproject.toml) ... done
Collecting ambivalent@ git+https://github.com/saforem2/ambivalent
  Resolved https://github.com/saforem2/ambivalent to commit eac43ada80b6d4b2f71bf45cee9329993f622e87
Collecting (new in venv): jaxlib, jax, jaxtyping, wandb, pyinstrument, xarray, plotext, sentencepiece, hydra-colorlog, rich, sh, colormaps, colorlog, ml-dtypes, wadler-lindig, markdown-it-py, mdurl, sentry-sdk, setproctitle, docker-pycreds, pandas, tzdata
Requirement already satisfied (from the base environment at /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1): tensorboard, mpi4py, joblib, seaborn, hydra-core, torch, omegaconf, tqdm, h5py, ipython, ml-dtypes 0.3.2, requests, matplotlib, numpy, and their transitive dependencies
Building wheels for collected packages: ezpz, ambivalent
  Building editable for ezpz (pyproject.toml) ... done
  Building wheel for ambivalent (pyproject.toml) ... done
Successfully built ezpz ambivalent
Installing collected packages: sentencepiece, wadler-lindig, tzdata, sh, setproctitle, sentry-sdk, pyinstrument, plotext, ml-dtypes, mdurl, docker-pycreds, colormaps, colorlog, pandas, markdown-it-py, jaxtyping, jaxlib, xarray, wandb, rich, jax, hydra-colorlog, ambivalent, ezpz
  Attempting uninstall: ml-dtypes 0.3.2, pandas 1.5.0 — both live in the read-only base environment, so pip leaves them in place and shadows them from the venv
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
  tensorflow 2.15.1 requires ml-dtypes~=0.3.1, but you have ml-dtypes 0.5.1 which is incompatible.
  intel-extension-for-tensorflow 2.15.0.1 requires absl-py==1.4.0, but you have absl-py 2.1.0 which is incompatible.
  intel-extension-for-tensorflow 2.15.0.1 requires protobuf<4.24, but you have protobuf 4.25.5 which is incompatible.
Successfully installed ambivalent-0.2.0 colorlog-6.9.0 colormaps-0.4.2 docker-pycreds-0.4.0 ezpz-0.2 hydra-colorlog-1.2.0 jax-0.5.0 jaxlib-0.5.0 jaxtyping-0.2.37 markdown-it-py-3.0.0 mdurl-0.1.2 ml-dtypes-0.5.1 pandas-2.2.3 plotext-5.3.2 pyinstrument-5.0.1 rich-13.9.4 sentencepiece-0.2.0 sentry-sdk-2.20.0 setproctitle-1.3.4 sh-2.2.1 tzdata-2025.1 wadler-lindig-0.1.3 wandb-0.19.4 xarray-2025.1.1
[notice] A new release of pip is available: 23.0.1 -> 25.0
took: 0h:01m:42s
#[🐍 aurora_nre_models_frameworks-2024.2.1_u1](👻 aurora_nre_models_frameworks-2024.2.1_u1)
#[03:25:31 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main][⏱️ 23s]
$ python3 -m pip install deepspeed==0.16.2
Collecting deepspeed
  Downloading deepspeed-0.16.3.tar.gz (1.4 MB)
  Preparing metadata (setup.py) ... done
Collecting msgpack
(remaining requirements — einops, hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, torch, tqdm — already satisfied from the venv or base environment)
Building wheels for collected packages: deepspeed
  Building wheel for deepspeed (setup.py) ... done
Successfully built deepspeed
Installing collected packages: msgpack, deepspeed
Successfully installed deepspeed-0.16.3 msgpack-1.1.0
took: 0h:00m:29s
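Worth a quick sanity check on what actually landed in the venv, since the log above shows pip resolving deepspeed 0.16.3 despite the `==0.16.2` pin. This is just a standard `pip show` invocation, not part of the documented procedure; `pkg` defaults to pip so the snippet runs anywhere, but on Aurora you would query deepspeed or ezpz:

```shell
# Report the installed name/version of a package in the active environment.
pkg="${pkg:-pip}"   # e.g. pkg=deepspeed on Aurora
python3 -m pip show "${pkg}" | grep -E '^(Name|Version):'
```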
-
Launch training:
#[🐍 aurora_nre_models_frameworks-2024.2.1_u1](👻 aurora_nre_models_frameworks-2024.2.1_u #[03:26:04 PM][x4309c4s1b0n0][/f/A/f/t/2/Megatron-DeepSpeed][🌱 main][⏱️ 29s $ TP=2 NLAYERS=10 DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_aGPT_7B.sh Using WORKING_DIR: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed Running on: aurora Found ezpz in /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/deps/ezpz Using WORKING_DIR: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed Using virtual_env: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1 on top of conda from: /opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1 [python] Using /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3 [🍋 ezpz/bin/utils.sh • USER=foremans • MACHINE=aurora • HOST=x4309c4s1b0n0 • TSTAMP=2025-01-27-152607 [ezpz_setup_host_pbs] • Using hostfile: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov • Found in environment: • HOSTFILE: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov • Writing PBS vars to: /home/foremans/.pbsenv [ezpz_save_pbs_env] • Setting: • HOSTFILE: /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov • JOBENV_FILE: /home/foremans/.pbsenv [HOSTS] • [host:0] - x4102c5s2b0n0.hostmgmt2102.cm.aurora.alcf.anl.gov • [host:1] - x4309c3s7b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov • [host:2] - x4309c4s0b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov • [host:3] - x4309c4s1b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov [DIST INFO] • NGPUS=48 • NHOSTS=4 • NGPU_PER_HOST=12 • HOSTFILE=/var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov • DIST_LAUNCH=mpiexec --verbose --envall -n 48 -ppn 12 --hostfile 
/var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni
[LAUNCH]:
• To launch across all available GPUs, use: launch
  launch = mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni
[notice] A new release of pip is available: 23.0.1 -> 25.0
[notice] To update, run: pip install --upgrade pip
[ezpz_install] Found ezpz @ /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/src/ezpz
[install_dependencies] Ensuring all dependencies from /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ALCF/requirements/requirements.txt installed...
[notice] A new release of pip is available: 23.0.1 -> 25.0
[notice] To update, run: pip install --upgrade pip
[setParams] Using GRAD_ACC_STEPS: 16
TRAIN_TOKENS=2000000000000 (=2000B tokens)
TRAIN_ITERS=1271565
DS_CONFIG: /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ds-configs/ds_stage1_mb1_gb384_pp1_bf16.json
ZS=1, MB=1, GB=384, PP=1, DTYPE=bf16
{
  "train_batch_size": 384,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1,
  "steps_per_print": 1,
  "gradient_accumulation_steps": 16,
  "zero_force_ds_cpu_optimizer": false,
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "zero_optimization": { "stage": 1 },
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bfloat16": { "enabled": true, "loss_scale": 1 },
  "comms_logger": { "enabled": false, "verbose": false, "debug": false },
  "flops_profiler": {
    "enabled": true,
    "profile_step": 2,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
Checkpoints will be saved to: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash
Please see logs at:
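The numbers printed by setParams are internally consistent; a quick sanity check (all input values taken from the log above), noting that DeepSpeed requires train_batch_size = micro_batch × gradient_accumulation_steps × data-parallel degree:

```python
# Values from the log above.
nhosts, ngpu_per_host = 4, 12
tp, pp = 2, 1                              # TP=2, PP=1
micro_batch, grad_acc_steps = 1, 16
seq_length = 4096
train_tokens = 2_000_000_000_000           # TRAIN_TOKENS (2000B)

world_size = nhosts * ngpu_per_host        # NGPUS = 48
dp = world_size // (tp * pp)               # data-parallel replicas = 24
global_batch = micro_batch * grad_acc_steps * dp   # train_batch_size = 384

# Each optimizer step consumes global_batch * seq_length tokens,
# so the token budget fixes the iteration count.
train_iters = train_tokens // (global_batch * seq_length)
print(world_size, global_batch, train_iters)   # 48 384 1271565
```

This reproduces GB=384 and TRAIN_ITERS=1271565 exactly.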
logs/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/20250127-152609_48_x4309c4s1b0n0
Setting up tokenizer with Llama2Tokenizer
Using data_file_list: ALCF/data-lists/aurora/books.txt
Using tokenizer: Llama2Tokenizer. Setting up data with ALCF/data-lists/aurora/books.txt
Calling: setData() with ALCF/data-lists/aurora/books.txt
--------------------
Updated environment:
DATA_FILE_LIST: ALCF/data-lists/aurora/books.txt
NUM_DOCS: 3
WEIGHT_SUM: 0.0072042092147565125
DFL_STEM: books
DATA_CACHE_PATH: .cache/books/index-cache
DATA_FLAGS:
--------------------
[setData] DATA_FLAGS:
[setData] TOKENIZER_FLAGS: --tokenizer-type Llama2Tokenizer --tokenizer-model /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ALCF/tokenizer.model
Requirement already satisfied: pybind11 in ./venvs/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages (2.13.6)
[notice] A new release of pip is available: 23.0.1 -> 25.0
[notice] To update, run: pip install --upgrade pip
make: Nothing to be done for 'default'.
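NUM_DOCS and WEIGHT_SUM come from scanning the data-file list. A minimal sketch of that accounting, under the ASSUMPTION that each non-comment line starts with a numeric weight followed by a dataset path (the exact format used by ALCF/data-lists may differ; check the repo before relying on this, and the paths below are made up for illustration):

```python
# Hedged sketch of setData()-style weight accounting over a file list.
def summarize_file_list(lines):
    """Return (num_docs, weight_sum) for '<weight> <path> ...' lines."""
    docs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        weight, path = line.split()[:2]   # assumed: weight first, then path
        docs.append((float(weight), path))
    return len(docs), sum(w for w, _ in docs)

# Synthetic three-entry list, mirroring NUM_DOCS=3 above:
num_docs, weight_sum = summarize_file_list([
    "0.003 /data/books/part0_text_document",
    "0.002 /data/books/part1_text_document",
    "0.001 /data/books/part2_text_document",
])
print(num_docs, round(weight_sum, 3))   # 3 0.006
```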
/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed
++++++++++++++++++++++++++++++++++++++++++++++++++
- MPICH_DIR=/opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.2.1/mpich-4.3.0rc3-hipyfz6
- Using /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3
- WORLD_SIZE: 48
- BACKEND: ccl
- MODEL_TYPE: llama-gb384-seq4096-pp1-tp2-10layers-32heads-4096hidden
- Using DATA_FILE_LIST: ALCF/data-lists/aurora/books.txt
++++++++++++++++++++++++++++++++++++++++++++++++++
Currently Loaded Modules:
  1) gcc-runtime/12.2.0-267awrk
  2) gmp/6.2.1-yctcuid
  3) mpfr/4.2.1-fhgnwe7
  4) mpc/1.3.1-ygprpb4
  5) gcc/12.2.0
  6) libfabric/1.20.1
  7) cray-pals/1.4.0
  8) cray-libpals/1.4.0
  9) oneapi/release/2024.2.1
 10) pti-gpu/d3639de
 11) frameworks/2024.2.1_u1
 12) hwloc/master-git.1793e43-level-zero
 13) yaksa/0.3-euoqglg
 14) mpich/opt/4.3.0rc3
Saving environment to checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.env
Not currently running. Continuing!
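The MODEL_TYPE tag is just the run settings glued together; the naming scheme below is inferred from this single example rather than taken from the script, so treat it as an illustration:

```python
# Reconstruct the MODEL_TYPE string from the logged settings.
gb, seq, pp, tp = 384, 4096, 1, 2          # global batch, seq len, PP, TP
nlayers, nheads, hidden = 10, 32, 4096     # NLAYERS=10 override, heads, hidden size
model_type = (
    f"llama-gb{gb}-seq{seq}-pp{pp}-tp{tp}"
    f"-{nlayers}layers-{nheads}heads-{hidden}hidden"
)
print(model_type)   # llama-gb384-seq4096-pp1-tp2-10layers-32heads-4096hidden
```

The checkpoint/log directory name (ws48_ds_stage1_nl10_hs4096_...) follows a similar settings-concatenation pattern.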
Launching with: MPICH
mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni --pmi=pmix --genvall /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3 -Wignore /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/pretrain_gpt_alcf.py
Using data_cache_path: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache
Training Arguments:
--accumulate-allreduce-grads-in-fp32
--adam-beta1=0.9
--adam-beta2=0.95
--adam-eps=0.00001
--attention-dropout 0
--bf16
--blend-sample-in-corpus
--clip-grad=1.0
--data-cache-path=checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache
--data-file-list=ALCF/data-lists/aurora/books.txt
--deepspeed
--deepspeed_config=/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ds-configs/ds_stage1_mb1_gb384_pp1_bf16.json
--disable-bias-linear
--distributed-backend=ccl
--ds-sequence-parallel-size=1
--eval-interval=100
--eval-iters=20
--ffn-hidden-size 11008
--global-batch-size=384
--hidden-dropout 0
--hidden-size=4096
--load=checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash
--log-interval=1
--log-optimizer-states-to-tensorboard
--log-timers-to-tensorboard
--lr 0.0002
--lr-decay-style cosine
--lr-warmup-fraction 0.05
--max-position-embeddings=4096
--micro-batch-size=1
--no-bias-dropout-fusion
--no-bias-gelu-fusion
--no-gradient-accumulation-fusion
--no-masked-softmax-fusion
--no-pipeline-parallel
--no-query-key-layer-scaling
--normalization rmsnorm
--num-attention-heads=32
--num-key-value-heads 8
--num-layers=10
--optimizer=adamw
--pipeline-model-parallel-size=1
--save=checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash
--save-interval=50
--seq-length=4096
--shuffle-sample-in-corpus
--split=990,10,0
--swiglu
--tensorboard-dir checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/tensorboard
--tensor-model-parallel-size=2
--timing-log-level=1
--tokenizer-type Llama2Tokenizer
--tokenizer-model /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ALCF/tokenizer.model
--train-iters=1271565
--untie-embeddings-and-output-weights
--use-checkpoint-opt_param-scheduler
--use-flash-attn-builder
--use-rotary-position-embeddings
--weight-decay=0.1
--zero-stage=1
mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/1289812.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --cpu-bind depth -d 8 --no-vni --pmi=pmix --genvall /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3 -Wignore /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/pretrain_gpt_alcf.py (followed by the same training arguments listed above)
[!!
NOTE] View output at:
logs/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/20250127-152609_48_x4309c4s1b0n0/output.log
Disabling local launch: multi-node application
Connected to tcp://x4102c5s2b0n0.hostmgmt2102.cm.aurora.alcf.anl.gov:7919
Launching application 968d94e2-bb10-4ae0-9ecf-6e537424c448
[2025-01-27 15:26:19,638] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to xpu (auto detect)
(the line above repeats once per rank, 48 times total, between 15:26:19 and 15:26:27)
[2025-01-27 15:26:29,360] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
[2025-01-27 15:26:29,360] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-01-27 15:26:29,360] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
(the three lines above repeat once per rank, 48 times total, between 15:26:29 and 15:26:56)
[2025-01-27 15:26:56,590] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=48, master_addr=10.115.11.185, master_port=29500
[2025-01-27 15:26:56,590] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend ccl
(mpi_discovery repeats once per rank, covering world_rank=0..47 with local_rank=0..11 on each of the 4 hosts)
[2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=29,
local_rank=5, world_size=48, master_addr=10.115.11.185, master_port=29500 [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=41, local_rank=5, world_size=48, master_addr=10.115.11.185, master_port=29500 [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=43, local_rank=7, world_size=48, master_addr=10.115.11.185, master_port=29500 [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=35, local_rank=11, world_size=48, master_addr=10.115.11.185, master_port=29500 [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=45, local_rank=9, world_size=48, master_addr=10.115.11.185, master_port=29500 [2025-01-27 15:26:56,591] [INFO] [comm.py:718:mpi_discovery] Discovered MPI settings of world_rank=47, local_rank=11, world_size=48, master_addr=10.115.11.185, master_port=29500 -------------------------------------------------- DeepSpeed C++/CUDA extension op report -------------------------------------------------- NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op. -------------------------------------------------- JIT compiled ops requires ninja ninja .................. [OKAY] -------------------------------------------------- op name ................ installed .. compatible -------------------------------------------------- deepspeed_not_implemented [NO] ....... [OKAY] [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. async_io ............... [NO] ....... [NO] cpu_adagrad ............ [NO] ....... [OKAY] cpu_adam ............... [NO] ....... [OKAY] flash_attn ............. [NO] ....... 
[OKAY] fused_adam ............. [NO] ....... [OKAY] transformer_inference .. [NO] ....... [OKAY] pack_bits .............. [NO] ....... [OKAY] -------------------------------------------------- DeepSpeed general environment info: torch install path ............... ['/opt/aurora/24.180.3/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages/torch'] torch version .................... 2.3.1+cxx11.abi deepspeed install path ........... ['/flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/venvs/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages/deepspeed'] deepspeed info ................... 0.16.3, unknown, unknown deepspeed wheel compiled w. ...... torch 2.3 shared memory (/dev/shm) size .... 503.18 GB [2025-01-27 15:26:57.010367][INFO][ezpz/configs:287] - **** Git info for DeepSpeed: git_hash=3af7eb4b git_branch=main **** 2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_KVS_MODE changed to be mpi (default:pmi) 2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_KVS_CONNECTION_TIMEOUT changed to be 3600 (default:120) 2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_BCAST changed to be double_tree (default:) 2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_ENABLE_SYCL_KERNELS changed to be 1 (default:0) 2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_SYCL_ESIMD changed to be 1 (default:0) 2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_PROCESS_LAUNCHER changed to be pmix (default:hydra) 2025:01:27-15:26:57:(206687) |CCL_WARN| value of CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD changed to be 32768 (default:1000) 2025:01:27-15:26:57:(206687) |CCL_WARN| CCL_ALLGATHERV_MEDIUM_SIZE_THRESHOLD=0 is unknown to and unused by oneCCL code but is present in the environment, check if it is not mistyped. 2025:01:27-15:26:57:(206687) |CCL_WARN| CCL_SKIP_SCHEDULER=1 is unknown to and unused by oneCCL code but is present in the environment, check if it is not mistyped. 
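Since no launcher was used, DeepSpeed fell back to MPI discovery to derive each rank's coordinates (the `mpi_discovery` lines above). A minimal sketch of that derivation, assuming a simple host-per-rank list in place of the real MPI allgather of `(hostname, rank)` pairs (the `discover` helper and `hosts` list here are illustrative, not DeepSpeed's API):

```python
# Sketch of MPI-style rank discovery, in the spirit of DeepSpeed's
# mpi_discovery(). `hosts` holds one hostname per rank, in rank order;
# real code gathers these over MPI rather than receiving them as a list.

def discover(rank: int, hosts: list, master_port: int = 29500) -> dict:
    """Derive local_rank and master_addr for a single rank."""
    world_size = len(hosts)
    # local_rank = how many lower-numbered ranks share this rank's host
    local_rank = sum(1 for r in range(rank) if hosts[r] == hosts[rank])
    master_addr = hosts[0]  # rank 0's host becomes the rendezvous master
    return dict(world_rank=rank, local_rank=local_rank,
                world_size=world_size,
                master_addr=master_addr, master_port=master_port)

# Two hypothetical nodes with 12 ranks each, as on Aurora:
hosts = ["nodeA"] * 12 + ["nodeB"] * 12
print(discover(13, hosts))  # rank 13 is the second process on nodeB
```

This reproduces the pattern in the log, e.g. `world_rank=12, local_rank=0` and `world_rank=13, local_rank=1` with 12 ranks per node.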
[2025-01-27 15:26:57.371214][INFO][ezpz/dist:812] - Using device='xpu' with backend='deepspeed' + 'ccl' for distributed training.
[2025-01-27 15:26:57.372119][INFO][ezpz/dist:854] - ['x4102c5s2b0n0'][ 0/47]
[2025-01-27 15:26:57.371090][INFO][ezpz/dist:854] - ['x4309c3s7b0n0'][12/47]
[2025-01-27 15:26:57.371069][INFO][ezpz/dist:854] - ['x4309c4s0b0n0'][24/47]
[2025-01-27 15:26:57.371099][INFO][ezpz/dist:854] - ['x4309c4s1b0n0'][36/47]
[... one ezpz/dist:854 line per remaining rank: ranks 0-11 on x4102c5s2b0n0, 12-23 on x4309c3s7b0n0, 24-35 on x4309c4s0b0n0, 36-47 on x4309c4s1b0n0 ...]
[2025-01-27 15:26:57.375272][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:69] - Import python modules in 40.977957248687744 seconds
[2025-01-27 15:26:57.375780][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:70] - ez.setup_torch time: 0.8100743293762207 seconds
[2025-01-27 15:26:57.376177][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:80] - Setting up W&B from: 0 with AuroraGPT
[2025-01-27 15:26:57.376572][INFO][ezpz/dist:1065] - Setting up wandb from rank: 0
[2025-01-27 15:26:57.376937][INFO][ezpz/dist:1066] - Using: WB PROJECT: AuroraGPT
wandb: Currently logged in as: foremans (aurora_gpt). Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
2025-01-27 15:26:57.769668: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-27 15:26:57.769690: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-27 15:26:57.771206: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-27 15:26:58.541647: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[... four local_tsl collection_registry warnings about duplicate metric registration (/tensorflow/core/bfc_allocator_delay, /xla/service/gpu/compiled_programs_count, /jax/pjrt/pjrt_executable_executions, /jax/pjrt/pjrt_executable_execution_time_usecs) ...]
2025-01-27 15:26:59.717453: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2025-01-27 15:26:59.747528: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) Level-Zero
2025-01-27 15:26:59.747878: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
[... the "number of sub-devices is zero, expose root device" line repeats once per visible device (12 total) ...]
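The rank listing above shows a block layout: 48 ranks spread over 4 Aurora nodes, 12 consecutive ranks per node. A quick sketch of that mapping, assuming the block rank-to-node assignment implied by the log (the `host_of` helper is illustrative, not part of ezpz):

```python
# Reconstruct the rank -> host layout printed by ezpz above.
# Assumes 12 ranks per Aurora node (6 PVC GPUs x 2 tiles), assigned
# in consecutive blocks, as the [rank/47] listing shows.
RANKS_PER_NODE = 12
NODES = ["x4102c5s2b0n0", "x4309c3s7b0n0", "x4309c4s0b0n0", "x4309c4s1b0n0"]

def host_of(rank: int) -> str:
    """Return the hostname that a given global rank lands on."""
    return NODES[rank // RANKS_PER_NODE]

print(host_of(0), host_of(12), host_of(36))
```

For example, rank 0 lands on x4102c5s2b0n0 and rank 36 on x4309c4s1b0n0, matching the `ezpz/dist:854` lines.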
2025-01-27 15:26:59.747892: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device. 2025-01-27 15:26:59.747893: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device. 2025-01-27 15:26:59.747895: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device. 2025-01-27 15:26:59.747897: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device. 2025-01-27 15:26:59.747898: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device. > setting tensorboard ... WARNING: WANDB writing requested but no legit wandb project or experiment name provided, therefore no WANDB logs will be written according to random generated project or experiment name. wandb: Tracking run with wandb version 0.19.4 wandb: Run data is saved locally in /lus/flare/projects/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/wandb/run-20250127_152657-knsggy9p wandb: Run `wandb offline` to turn off syncing. wandb: Syncing run easy-valley-1380 wandb: ⭐️ View project at https://wandb.ai/aurora_gpt/AuroraGPT wandb: 🚀 View run at https://wandb.ai/aurora_gpt/AuroraGPT/runs/knsggy9 [2025-01-27 15:27:01.618844][INFO][ezpz/dist:1093] - W&B RUN: [easy-valley-1380](https://wandb.ai/aurora_gpt/AuroraGPT/runs/knsggy9p) [2025-01-27 15:27:01.629484][INFO][ezpz/dist:297] - Updating wandb.run: easy-valley-1380 config with "DIST_INFO" [2025-01-27 15:27:01.634114][INFO][ezpz/dist:1125] - Running on machine='Aurora' using world size: 48, data-parallel-size: 24, sequence-parallel size: 1, tensor-model-parallel size: 2, pipeline-model-parallel size: 1 using torch.bfloat16 for parameters ... ------------------------ arguments ------------------------ accumulate_allreduce_grads_in_fp32 .............. True adam_beta1 ...................................... 
0.9 adam_beta2 ...................................... 0.95 adam_eps ........................................ 1e-05 add_bias_linear ................................. False add_position_embedding .......................... False adlr_autoresume ................................. False adlr_autoresume_interval ........................ 1000 aml_data_download_path .......................... None apply_layernorm_1p .............................. False apply_query_key_layer_scaling ................... False apply_residual_connection_post_layernorm ........ False async_tensor_model_parallel_allreduce ........... False attention_dropout ............................... 0.0 attention_softmax_in_fp32 ....................... False barrier_with_L1_time ............................ True bert_binary_head ................................ True bert_embedder_type .............................. megatron bert_load ....................................... None bf16 ............................................ True bias_dropout_fusion ............................. False bias_gelu_fusion ................................ False biencoder_projection_dim ........................ 0 biencoder_shared_query_context_model ............ False blend_sample_in_corpus .......................... True block_data_path ................................. None checkpoint_activations .......................... False checkpoint_in_cpu ............................... False checkpoint_num_layers ........................... 1 classes_fraction ................................ 1.0 clip_grad ....................................... 1.0 compression_training ............................ False consumed_train_samples .......................... 0 consumed_train_tokens ........................... 0 consumed_valid_samples .......................... 0 contigious_checkpointing ........................ False cpu_optimizer ................................... False cpu_torch_adam .................................. 
False create_moe_param_group .......................... False curriculum_learning_legacy ...................... False data_cache_path ................................. checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache data_efficiency_curriculum_learning ............. False data_file_list .................................. ALCF/data-lists/aurora/books.txt data_impl ....................................... infer data_parallel_random_init ....................... False data_parallel_size .............................. 24 data_path ....................................... None data_per_class_fraction ......................... 1.0 data_sharding ................................... True dataloader_type ................................. single DDP_impl ........................................ local decoder_num_layers .............................. None decoder_seq_length .............................. None deepscale ....................................... False deepscale_config ................................ None deepspeed ....................................... True deepspeed_activation_checkpointing .............. False deepspeed_config ................................ /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ds-configs/ds_stage1_mb1_gb384_pp1_bf16.json dino_bottleneck_size ............................ 256 dino_freeze_last_layer .......................... 1 dino_head_hidden_size ........................... 2048 dino_local_crops_number ......................... 10 dino_local_img_size ............................. 96 dino_norm_last_layer ............................ False dino_teacher_temp ............................... 0.07 dino_warmup_teacher_temp ........................ 0.04 dino_warmup_teacher_temp_epochs ................. 30 distribute_checkpointed_activations ............. False distribute_saved_activations .................... 
False distributed_backend ............................. ccl distributed_timeout_minutes ..................... 10 ds_fused_adam ................................... False ds_inference .................................... False ds_pipeline_enabled ............................. False ds_sequence_parallel_size ....................... 1 embedding_path .................................. None embedding_weights_in_fp32 ....................... False empty_unused_memory_level ....................... 0 enable_expert_tensor_parallelism ................ False enable_zbh1_exact_semantics ..................... False enable_zbh1_pipeline ............................ False encoder_num_layers .............................. 10 encoder_seq_length .............................. 4096 end_weight_decay ................................ 0.1 eod_mask_loss ................................... False eval_interval ................................... 100 eval_iters ...................................... 20 evidence_data_path .............................. None exit_duration_in_mins ........................... None exit_interval ................................... None exit_on_missing_checkpoint ...................... False exit_signal_handler ............................. False expert_interval ................................. 2 ffn_hidden_size ................................. 11008 finetune ........................................ False force_ds_sequence_parallel ...................... False fp16 ............................................ False fp16_lm_cross_entropy ........................... False fp32_residual_connection ........................ False fp8_amax_compute_algo ........................... most_recent fp8_amax_history_len ............................ 1 fp8_e4m3 ........................................ False fp8_hybrid ...................................... False fp8_interval .................................... 1 fp8_margin ...................................... 
0 fp8_wgrad ....................................... True global_batch_size ............................... 384 gradient_accumulation_fusion .................... False head_lr_mult .................................... 1.0 hidden_dropout .................................. 0.0 hidden_size ..................................... 4096 hidden_size_teacher ............................. None hysteresis ...................................... 2 ict_head_size ................................... None ict_load ........................................ None img_h ........................................... 224 img_w ........................................... 224 indexer_batch_size .............................. 128 indexer_log_interval ............................ 1000 inference ....................................... False inference_batch_times_seqlen_threshold .......... 512 init_method_std ................................. 0.02 init_method_xavier_uniform ...................... False initial_loss_scale .............................. 4294967296 iter_per_epoch .................................. 1250 kd .............................................. False kd_alpha_ce ..................................... 1 kd_beta_ce ...................................... 1 kd_temp ......................................... 1.0 kill_switch_file ................................ None kv_channels ..................................... 128 layernorm_epsilon ............................... 1e-05 lazy_mpu_init ................................... None load ............................................ checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash load_tag ........................................ None load_teacher .................................... None local_rank ...................................... None log_batch_size_to_tensorboard ................... False log_interval .................................... 
1 log_learning_rate_to_tensorboard ................ True log_loss_scale_to_tensorboard ................... True log_memory_to_tensorboard ....................... False log_num_zeros_in_grad ........................... False log_optimizer_states_to_tensorboard ............. True log_params_norm ................................. False log_timers_to_tensorboard ....................... True log_validation_ppl_to_tensorboard ............... False log_world_size_to_tensorboard ................... False loss_scale ...................................... None loss_scale_window ............................... 1000 lr .............................................. 0.0002 lr_decay_iters .................................. None lr_decay_samples ................................ None lr_decay_style .................................. cosine lr_decay_tokens ................................. None lr_warmup_fraction .............................. 0.05 lr_warmup_iters ................................. 0 lr_warmup_samples ............................... 0 lr_warmup_tokens ................................ None make_vocab_size_divisible_by .................... 128 mask_factor ..................................... 1.0 mask_prob ....................................... 0.15 mask_type ....................................... random masked_softmax_fusion ........................... False max_position_embeddings ......................... 4096 max_tokens_to_oom ............................... 12000 mem_efficient_ln ................................ True memory_centric_tiled_linear ..................... False merge_file ...................................... None micro_batch_size ................................ 1 min_loss_scale .................................. 1.0 min_lr .......................................... 0.0 mlp_type ........................................ standard mmap_warmup ..................................... False moe_eval_capacity_factor ........................ 
1.0 moe_expert_parallel_size ........................ 1 moe_loss_coeff .................................. 0.1 moe_min_capacity ................................ 4 moe_token_dropping .............................. True moe_top2_2nd_expert_sampling .................... True moe_train_capacity_factor ....................... 1.0 mos ............................................. False multiprocessing_context ......................... fork no_load_lr_state ................................ False no_load_optim ................................... None no_load_rng ..................................... None no_persist_layer_norm ........................... False no_pipeline_parallel ............................ True no_save_optim ................................... None no_save_rng ..................................... None normalization ................................... rmsnorm num_attention_heads ............................. 32 num_attention_heads_teacher ..................... None num_channels .................................... 3 num_classes ..................................... 1000 num_experts ..................................... [1] num_experts_switch .............................. None num_experts_teacher ............................. [1] num_key_value_heads ............................. 8 num_layers ...................................... 10 num_layers_per_virtual_pipeline_stage ........... None num_layers_teacher .............................. None num_workers ..................................... 2 onnx_safe ....................................... None openai_gelu ..................................... False optimizer ....................................... adamw output_bert_embeddings .......................... False overlap_p2p_comm ................................ False override_opt_param_scheduler .................... False params_dtype .................................... torch.bfloat16 partition_activations ........................... 
False patch_dim ....................................... 16 perform_initialization .......................... True pipeline_model_parallel_size .................... 1 pipeline_model_parallel_split_rank .............. None profile ......................................... None profile_backward ................................ False profile_ranks ................................... None profile_steps ................................... 2,3 query_in_block_prob ............................. 0.1 rampup_batch_size ............................... None random_ltd ...................................... False rank ............................................ 0 recompute_granularity ........................... None recompute_method ................................ None recompute_num_layers ............................ 1 remote_device ................................... none repeated_dataloader ............................. False reset_attention_mask ............................ False reset_iteration ................................. False reset_position_ids .............................. False retriever_report_topk_accuracies ................ [] retriever_score_scaling ......................... False retriever_seq_length ............................ 256 retro_add_retriever ............................. False retro_cyclic_train_iters ........................ None retro_encoder_attention_dropout ................. 0.1 retro_encoder_hidden_dropout .................... 0.1 retro_encoder_layers ............................ 2 retro_num_neighbors ............................. 2 retro_num_retrieved_chunks ...................... 2 retro_return_doc_ids ............................ False retro_workdir ................................... None return_data_index ............................... False rope_theta ...................................... 10000 rotary_percent .................................. 1.0 sample_rate ..................................... 
1.0 save ............................................ checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash save_interval ................................... 50 scatter_gather_tensors_in_pipeline .............. True scattered_embeddings ............................ False schedulefree_for_each ........................... False seed ............................................ 1234 seq_length ...................................... 4096 sequence_parallel ............................... False sgd_momentum .................................... 0.9 short_seq_prob .................................. 0.1 shuffle_sample_in_corpus ........................ True skip_train ...................................... False sophiag_beta1 ................................... 0.9 sophiag_beta2 ................................... 0.95 sophiag_rho ..................................... 0.01 split ........................................... 990,10,0 split_transformers .............................. False squared_relu .................................... False standalone_embedding_stage ...................... False start_weight_decay .............................. 0.1 swiglu .......................................... True swin_backbone_type .............................. tiny synchronize_each_layer .......................... False tensor_model_parallel_size ...................... 2 tensorboard_dir ................................. checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/tensorboard tensorboard_log_interval ........................ 1 tensorboard_queue_size .......................... 1000 test_data_path .................................. None tile_factor ..................................... 1 timing_log_level ................................ 1 timing_log_option ............................... minmax titles_data_path ................................ 
None tokenizer_model ................................. /flare/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/ALCF/tokenizer.model tokenizer_type .................................. Llama2Tokenizer topk ............................................ 1 trace_dir ....................................... ./trace/ train_data_exact_num_epochs ..................... None train_data_path ................................. None train_desc_path ................................. None train_doc_idx_path .............................. None train_idx_path .................................. None train_iters ..................................... 1271565 train_iters_to_skip ............................. None train_range_to_skip ............................. None train_sample_idx_path ........................... None train_samples ................................... None train_shuffle_idx_path .......................... None train_tokens .................................... None transformer_impl ................................ local transformer_pipeline_model_parallel_size ........ 1 trust_remote_code ............................... False universal_checkpoint ............................ False untie_embeddings_and_output_weights ............. True use_checkpoint_args ............................. False use_checkpoint_opt_param_scheduler .............. True use_contiguous_buffers_in_local_ddp ............. True use_cpu_initialization .......................... None use_dataset_only ................................ False use_distributed_optimizer ....................... False use_flash_attn .................................. True use_flash_attn_builder .......................... True use_flash_attn_triton ........................... False use_flash_attn_v1 ............................... False use_flash_attn_v2 ............................... False use_mics ........................................ False use_one_sent_docs ............................... 
False use_pin_memory .................................. False use_ring_exchange_p2p ........................... False use_rotary_position_embeddings .................. True use_tutel ....................................... False valid_data_path ................................. None variable_seq_lengths ............................ False virtual_pipeline_model_parallel_size ............ None vision_backbone_type ............................ vit vision_pretraining .............................. False vision_pretraining_type ......................... classify vocab_extra_ids ................................. 0 vocab_file ...................................... None vocab_size ...................................... None wandb_exp_name .................................. wandb_project ................................... wandb_save_dir .................................. weight_decay .................................... 0.1 weight_decay_incr_style ......................... constant world_size ...................................... 48 zero_allgather_bucket_size ...................... 0.0 zero_contigious_gradients ....................... False zero_reduce_bucket_size ......................... 0.0 zero_reduce_scatter ............................. False zero_stage ...................................... 1 -------------------- end of arguments --------------------- setting number of micro-batches to constant 16 > building Llama2Tokenizer tokenizer ... > padded vocab (size: 32000) with 0 dummy tokens (new size: 32000) torch distributed is already initialized, skipping initialization ... > initialized tensor model parallel with size 2 > initialized pipeline model parallel with size 1 > setting random seeds to 1234 ... 
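The "setting number of micro-batches to constant 16" line above follows directly from the parallelism arguments in the dump. A minimal sketch of the arithmetic (values taken from the argument dump; the variable names are mine, not Megatron's):

```python
# Values from the argument dump above.
world_size = 48        # total ranks
tensor_parallel = 2    # tensor_model_parallel_size
pipeline_parallel = 1  # pipeline_model_parallel_size
global_batch = 384     # global batch size (gb384 in the checkpoint name)
micro_batch = 1        # micro_batch_size (mb1)

# Data-parallel group size is what's left after tensor/pipeline splits.
data_parallel = world_size // (tensor_parallel * pipeline_parallel)

# Each data-parallel replica processes micro_batch samples per step, so
# the global batch is reached by accumulating this many micro-batches.
num_micro_batches = global_batch // (micro_batch * data_parallel)

print(data_parallel, num_micro_batches)  # → 24 16
```

This also matches `gradient_accumulation_steps .. 16` in the DeepSpeed config printed further down.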
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 make: Entering directory '/lus/flare/projects/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/megatron/data' make: Nothing to be done for 'default'. make: Leaving directory '/lus/flare/projects/Aurora_deployment/foremans/tmp/2025-01-27-152131/Megatron-DeepSpeed/megatron/data' > compiling dataset index builder ... >>> done with dataset index builder. Compilation time: 0.181 seconds >fused kernel is only supported in cuda, skip loading fused kernel [2025-01-27 15:27:01.870188][INFO][megatron/training:185] - time to finish initialize_megatron: 5.465544700622559 seconds [2025-01-27 15:27:11.352855][INFO][megatron/training:193] - allreduce call time: 9.482640743255615 seconds [2025-01-27 15:27:11.484061][INFO][megatron/training:195] - time to initialize megatron (seconds)=42.589 [2025-01-27 15:27:11.485145][INFO][megatron/training:96] - [after megatron is initialized] datetime=2025-01-27 15:27:11 [2025-01-27 15:27:11.501247][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:87] - building GPT model ... 
[2025-01-27 15:27:11,671] [INFO] [utils.py:781:see_memory_usage] Before Building Model
[2025-01-27 15:27:11,672] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-01-27 15:27:11,672] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 38.39 GB, percent = 3.4%
>fused kernel is only supported in cuda, skip loading fused kernel
[2025-01-27 15:27:11,812] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 24
[... the two lines above repeat once per rank; duplicates trimmed ...]
> number of parameters on (tensor, pipeline) model parallel rank (1, 0)=1017204736
[2025-01-27 15:27:11.834706][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:147] - --------------------------------------------------------------------------------
[2025-01-27 15:27:11.835369][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:148] - Number of parameters in model: 1017204736
[2025-01-27 15:27:11.835818][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:149] - --------------------------------------------------------------------------------
[2025-01-27 15:27:11,963] [INFO] [utils.py:781:see_memory_usage] After Building Model
[2025-01-27 15:27:11,970] [INFO] [utils.py:782:see_memory_usage] MA 1.91 GB Max_MA 1.91 GB CA 1.91 GB Max_CA 2 GB
[2025-01-27 15:27:11,971] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 39.25 GB, percent = 3.5%
[2025-01-27 15:27:11.972857][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:157] - Patching tensorboard from checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/tensorboard
2025-01-27 15:27:12.255633: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-27 15:27:12.255663: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-27 15:27:12.256979: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-27 15:27:12.817632: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning:
Could not find TensorRT
2025-01-27 15:27:14.237006: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-01-27 15:27:14.237225: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /xla/service/gpu/compiled_programs_count. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-01-27 15:27:14.238601: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_executions. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-01-27 15:27:14.238612: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_execution_time_usecs. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-01-27 15:27:14.501613: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2025-01-27 15:27:14.532454: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) Level-Zero
2025-01-27 15:27:14.532802: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
[... the line above repeats once per local device; duplicates trimmed ...]
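As a quick sanity check on the memory figures above: the logs report 1017204736 parameters per (tensor, pipeline) model-parallel rank, and "MA 1.91 GB" allocated after building the model. A back-of-the-envelope sketch (mine, assuming the bf16 model weights at 2 bytes/parameter dominate the allocation):

```python
# Per-rank parameter count from the "> number of parameters ..." log line.
params_per_rank = 1_017_204_736
bytes_per_param = 2  # bf16 weights

# Expected resident weight memory in GiB.
gib = params_per_rank * bytes_per_param / 2**30
print(f"{gib:.2f} GiB")  # → 1.89 GiB
```

~1.89 GiB is consistent with the logged "MA 1.91 GB", with the small remainder plausibly buffers and embeddings bookkeeping.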
[2025-01-27 15:27:14.849463][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:164] - Updating WandB run.config: [easy-valley-1380](https://wandb.ai/aurora_gpt/AuroraGPT/runs/knsggy9p)
[2025-01-27 15:27:14.852175][INFO][ezpz/dist:123] - `model_provider`, {'pre_process': True, 'post_process': True}) took: dt=3.3509s
> number of parameters on (tensor, pipeline) model parallel rank (0, 0)=1017204736
[2025-01-27 15:27:14.853564][INFO][ezpz/dist:123] - `get_model`((<function model_provider at 0x1469a39ea8c0>, <ModelType.encoder_or_decoder: 1>)) took: dt=3.3524s
[2025-01-27 15:27:14.854867][INFO][megatron/utils:368] - > learning rate decay style: cosine
[2025-01-27 15:27:14.855387][INFO][ezpz/dist:123] - `get_optimizer_param_scheduler`((AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    differentiable: False
    eps: 1e-05
    foreach: None
    fused: None
    lr: 0.0
    lr_mult: 1.0
    maximize: False
    name: wd_no_scale_lr
    wd_mult: 1.0
    weight_decay: 0.1

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    differentiable: False
    eps: 1e-05
    foreach: None
    fused: None
    lr: 0.0
    lr_mult: 1.0
    maximize: False
    name: no_wd_no_scale_lr
    wd_mult: 0.0
    weight_decay: 0.0
),)) took: dt=0.0005s
[2025-01-27 15:27:14.857339][INFO][megatron/training:692] - DeepSpeed is enabled.
[2025-01-27 15:27:14.857770][INFO][megatron/training:747] - Did NOT catch: ('args.data_efficiency_curriculum_learning' and 'build_train_valid_test_datasets_provider is not None')
[2025-01-27 15:27:14.858278][INFO][megatron/training:756] - Calling 'deepspeed.initialize'...
[2025-01-27 15:27:14.858687][INFO][megatron/training:757] - Wrapped with: profiler=<megatron.utils.Profile object at 0x1469a39d4250> [2025-01-27 15:27:14,859] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown [2025-01-27 15:27:14,859] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 24 [2025-01-27 15:27:16,862] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: True [2025-01-27 15:27:16,863] [INFO] [logging.py:128:log_dist] [Rank 0] Using client Optimizer as basic optimizer [2025-01-27 15:27:16,863] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2025-01-27 15:27:16,864] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW [2025-01-27 15:27:16,864] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'> [2025-01-27 15:27:16,864] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2025-01-27 15:27:16,864] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000 [2025-01-27 15:27:16,864] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000 [2025-01-27 15:27:16,864] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2025-01-27 15:27:16,864] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False [2025-01-27 15:27:17,770] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2025-01-27 15:27:17,770] [INFO] [utils.py:782:see_memory_usage] MA 2.05 GB Max_MA 2.05 GB CA 2.06 GB Max_CA 2 GB [2025-01-27 15:27:17,771] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 53.14 GB, percent = 4.7% [2025-01-27 15:27:17,952] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-01-27 15:27:17,952] [INFO] [utils.py:782:see_memory_usage] MA 2.05 GB Max_MA 2.21 GB CA 
2.21 GB Max_CA 2 GB [2025-01-27 15:27:17,953] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 46.81 GB, percent = 4.1% [2025-01-27 15:27:17,953] [INFO] [stage_1_and_2.py:545:__init__] optimizer state initialized [2025-01-27 15:27:18,126] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-01-27 15:27:18,127] [INFO] [utils.py:782:see_memory_usage] MA 2.05 GB Max_MA 2.05 GB CA 2.21 GB Max_CA 2 GB [2025-01-27 15:27:18,127] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 46.81 GB, percent = 4.1% [2025-01-27 15:27:18,128] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2025-01-27 15:27:18,128] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2025-01-27 15:27:18,128] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.optimizer_param_scheduler.OptimizerParamScheduler object at 0x1469a39d68c0> [2025-01-27 15:27:18,128] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)] [2025-01-27 15:27:18,129] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print] amp_enabled .................. False [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print] amp_params ................... False [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-01-27 15:27:18,129] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x14684ffb3df0> [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symm etric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset' : 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_group s': {}}, 'layer_reduction': {'enabled': False}} [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'ra ndom_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] dump_state ................... 
False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-01-27 15:27:18,130] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": true, "recompute_fwd_factor": 0.0, "profile_step": 2, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 16 [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-01-27 15:27:18,131] [INFO] [config.py:1003:print] graph_harvesting ............. 
False
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] load_universal_checkpoint .... False
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] loss_scale ................... 1.0
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] memory_breakdown ............. False
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] mics_shard_size .............. -1
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] optimizer_name ............... None
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] optimizer_params ............. None
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] pld_enabled .................. False
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] pld_params ................... False
[2025-01-27 15:27:18,132] [INFO] [config.py:1003:print] prescale_gradients ........... False
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] scheduler_name ............... None
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] scheduler_params ............. None
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] sparse_attention ............. None
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] steps_per_print .............. 1
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] train_batch_size ............. 384
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] use_node_local_storage ....... False
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] weight_quantization_config ... None
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] world_size ................... 24
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] zero_allow_untested_optimizer True
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-01-27 15:27:18,133] [INFO] [config.py:1003:print] zero_enabled ................. True
[2025-01-27 15:27:18,134] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. False
[2025-01-27 15:27:18,134] [INFO] [config.py:1003:print] zero_optimization_stage ...... 1
[2025-01-27 15:27:18,134] [INFO] [config.py:989:print_user_config] json = {
    "train_batch_size": 384,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_clipping": 1.0,
    "steps_per_print": 1,
    "gradient_accumulation_steps": 16,
    "zero_force_ds_cpu_optimizer": false,
    "zero_allow_untested_optimizer": true,
    "wall_clock_breakdown": false,
    "zero_optimization": {
        "stage": 1
    },
    "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bfloat16": {
        "enabled": true,
        "loss_scale": 1.0
    },
    "comms_logger": {
        "enabled": false,
        "verbose": false,
        "debug": false
    },
    "flops_profiler": {
        "enabled": true,
        "profile_step": 2,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true,
        "output_file": null
    }
}
[2025-01-27 15:27:18.134311][INFO][megatron/training:767] - 'deepspeed.initialize' took: 3.27604s
[2025-01-27 15:27:18.138694][INFO][megatron/checkpointing:568] - Unable to load lr_state_dict from lr_state_dict_fp=PosixPath('checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/lr_state_dict_0_of_48.yaml'), but strict=False. Returning empty dictionary: lr_state_dict={}
[2025-01-27 15:27:18,141] [WARNING] [engine.py:2841:load_checkpoint] Unable to find latest file at checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
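A quick arithmetic check of the DeepSpeed config above: `train_batch_size` must equal `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × the data-parallel size, and the `world_size .... 24` DeepSpeed prints is the data-parallel group (48 total ranks / TP=2 / PP=1), not the full job. A minimal sketch, with all values copied from the log:

```python
# Sanity-check DeepSpeed's batch-size invariant for this run.
# All numbers are read off the config dump above; nothing queries DeepSpeed.
world_size = 48   # total ranks ("ws48" in the checkpoint directory name)
tp, pp = 2, 1     # tensor / pipeline parallel degrees ("tp2", "pp1")
micro_batch = 1   # train_micro_batch_size_per_gpu
grad_accum = 16   # gradient_accumulation_steps

# Data-parallel size: ranks not consumed by model parallelism.
dp = world_size // (tp * pp)
train_batch_size = micro_batch * grad_accum * dp

print(dp)                # 24  (the "world_size" DeepSpeed reports)
print(train_batch_size)  # 384 (matches "train_batch_size" in the JSON)
```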
[... the same 'Unable to find latest file' warning repeated once per remaining rank, elided ...]
[2025-01-27 15:27:18.141424][INFO][megatron/utils:368] - WARNING: could not find the metadata file checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash
[2025-01-27 15:28:18.142199][INFO][megatron/utils:368] - will not load any checkpoints and will start from random
[... further identical per-rank 'Unable to find latest file' warnings elided ...]
(min, max) time across ranks (ms):
    load-checkpoint ................................: (14.95, 15.08)
[2025-01-27 15:27:27.718035][INFO][ezpz/dist:123] - `setup_model_and_optimizer`((<function model_provider at 0x1469a39ea8c0>, <ModelType.encoder_or_decoder: 1>), {'teacher': False, 'data_post_process': <function data_post_process at 0x1469a39eab90>, 'build_train_valid_test_datasets_provider': <function train_valid_test_datasets_provider at 0x1469a39eb7f0>}) took: dt=16.2168s
[2025-01-27 15:27:27.725286][INFO][megatron/training:96] - [after model, optimizer, and learning rate scheduler are built] datetime=2025-01-27 15:27:27
[2025-01-27 15:27:27.726306][INFO][megatron/training:1510] - > building train, validation, and test datasets ...
[2025-01-27 15:27:27.726859][INFO][megatron/training:1493] - > datasets target sizes (minimum size):
[2025-01-27 15:27:27.727356][INFO][megatron/training:1494] -     train:      488280960
[2025-01-27 15:27:27.727827][INFO][megatron/training:1495] -     validation: 97658880
[2025-01-27 15:27:27.728241][INFO][megatron/training:1496] -     test:       7680
[2025-01-27 15:27:27.728652][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:465] - > building train, validation, and test datasets for GPT ...
[2025-01-27 15:27:27.729098][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:468] - Reading datasets from ALCF/data-lists/aurora/books.txt
[2025-01-27 15:27:27.792129][WARNING][utils/_logger.megatron.data.gpt_dataset:68] - > WARNING: could not find index map files, building on rank 0 using:
    number of documents: 24114
    number of epochs: 353
    sequence length: 4096
    total number of samples: 211911109
 > building indices for blendable datasets ...
 > sample ratios:
    dataset 0, input: 0.430653, achieved: 0.430653
    dataset 1, input: 0.430584, achieved: 0.430584
    dataset 2, input: 0.138763, achieved: 0.138763
[2025-01-27 15:28:02.702761][INFO][data/gpt_dataset.megatron.data.gpt_dataset:191] - [BuildConcatDataset] Caught args.shuffle_sample_in_corpus=True across 490722366 samples
[2025-01-27 15:28:02.707002][WARNING][utils/_logger.megatron.data.gpt_dataset:68] - > WARNING: could not find index map files, building on rank 0 using:
    number of documents: 244
    number of epochs: 6889
    sequence length: 4096
    total number of samples: 42269595
 > building indices for blendable datasets ...
 > sample ratios:
    dataset 0, input: 0.430653, achieved: 0.430653
    dataset 1, input: 0.430584, achieved: 0.430584
    dataset 2, input: 0.138763, achieved: 0.138763
[2025-01-27 15:28:09.319430][INFO][data/gpt_dataset.megatron.data.gpt_dataset:191] - [BuildConcatDataset] Caught args.shuffle_sample_in_corpus=True across 98147176 samples
 > WARNING: could not find index map files for blendable dataset, building indices on rank 0 ...
 > building indices for blendable datasets ...
 > sample ratios:
    dataset 0, input: 1, achieved: 1
[2025-01-27 15:28:12.134453][INFO][data/blendable_dataset.megatron.data.blendable_dataset:52] - > elapsed time for building blendable dataset indices: 2.80 (sec)
[2025-01-27 15:28:17.021709][INFO][data/blendable_dataset.megatron.data.blendable_dataset:87] - > finished saving index map files in 4.886175542000274 seconds
[2025-01-27 15:28:17.023655][INFO][data/blendable_dataset.megatron.data.blendable_dataset:112] - > loading blendable dataset index: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache/82b02ab7f8cd8f2eb97205bd2481c4df_index.npy
[2025-01-27 15:28:17.050282][INFO][data/blendable_dataset.megatron.data.blendable_dataset:115] - > loading blendable dataset sample index: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache/82b02ab7f8cd8f2eb97205bd2481c4df_sample_index.npy
[2025-01-27 15:28:17.053629][INFO][data/blendable_dataset.megatron.data.blendable_dataset:118] - > finished loading in 0.02997312500156113 seconds
[2025-01-27 15:28:17.099847][INFO][data/blendable_dataset.megatron.data.blendable_dataset:130] - > size of blendable dataset: 490722366 samples
 > WARNING: could not find index map files for blendable dataset, building indices on rank 0 ...
 > building indices for blendable datasets ...
 > sample ratios:
    dataset 0, input: 1, achieved: 1
[2025-01-27 15:28:17.646764][INFO][data/blendable_dataset.megatron.data.blendable_dataset:52] - > elapsed time for building blendable dataset indices: 0.52 (sec)
[2025-01-27 15:28:18.721111][INFO][data/blendable_dataset.megatron.data.blendable_dataset:87] - > finished saving index map files in 1.0732649540004786 seconds
[2025-01-27 15:28:18.723038][INFO][data/blendable_dataset.megatron.data.blendable_dataset:112] - > loading blendable dataset index: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache/3be1e743e088817f147da28b5384ce8d_index.npy
[2025-01-27 15:28:18.738154][INFO][data/blendable_dataset.megatron.data.blendable_dataset:115] - > loading blendable dataset sample index: checkpoints/ws48_ds_stage1_nl10_hs4096_mb1_seq4096_gb384_sp1_pp1_tp2_bf16_optadamw_lr_lwf_flash/.cache/books/index-cache/3be1e743e088817f147da28b5384ce8d_sample_index.npy
[2025-01-27 15:28:18.741530][INFO][data/blendable_dataset.megatron.data.blendable_dataset:118] - > finished loading in 0.018488385001546703 seconds
[2025-01-27 15:28:18.768816][INFO][data/blendable_dataset.megatron.data.blendable_dataset:130] - > size of blendable dataset: 98147176 samples
[2025-01-27 15:28:18.772962][INFO][Megatron-DeepSpeed/pretrain_gpt_alcf.__main__:515] - > finished creating GPT datasets.
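The blendable dataset mixes the three corpora with the weights shown in the "sample ratios" lines above. A quick check (values copied from the log) that the achieved ratios form a proper distribution, and roughly how many of the 490722366 blended training samples each corpus contributes:

```python
# Achieved sample ratios and total blendable-dataset size, from the log above.
ratios = [0.430653, 0.430584, 0.138763]
total_samples = 490_722_366

# The ratios should sum to 1 (each sample is drawn from exactly one corpus).
print(sum(ratios))

# Approximate per-corpus contribution to the blended train set.
for i, r in enumerate(ratios):
    print(f"dataset {i}: ~{int(r * total_samples):,} samples")
```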
Took: 17090845417855.19922s
[2025-01-27 15:28:18.773597][INFO][ezpz/dist:123] - `train_valid_test_datasets_provider`(([488280960, 97658880, 7680],)) took: dt=51.0449s
[2025-01-27 15:28:18.774290][INFO][ezpz/dist:123] - `build_train_valid_test_datasets`((<function train_valid_test_datasets_provider at 0x1469a39eb7f0>,)) took: dt=51.0474s
[2025-01-27 15:28:18.943567][INFO][ezpz/dist:123] - `build_train_valid_test_data_loaders`((<function train_valid_test_datasets_provider at 0x1469a39eb7f0>,)) took: dt=51.2172s
[2025-01-27 15:28:21.477616][INFO][ezpz/dist:123] - `build_train_valid_test_data_iterators`((<function train_valid_test_datasets_provider at 0x1469a39eb7f0>,)) took: dt=53.7512s
[2025-01-27 15:28:23.271739][INFO][megatron/training:96] - [after dataloaders are built] datetime=2025-01-27 15:28:23
[2025-01-27 15:28:23.272761][INFO][megatron/training:287] - done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (16163.47, 16223.38)
    train/valid/test-data-iterators-setup ..........: (51051.70, 55543.92)
[2025-01-27 15:28:23.278723][INFO][megatron/training:293] - training ...
[2025-01-27 15:28:23.297907][INFO][megatron/training:96] - [before the start of training step] datetime=2025-01-27 15:28:23
[2025-01-27 15:28:33,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 547.48 | optimizer_gradients: 16.65 | optimizer_step: 57.34
[2025-01-27 15:28:33,108] [INFO] [logging.py:128:log_dist] [Rank 0] step=1, skipped=0, lr=[3.1457298683118837e-09, 3.1457298683118837e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2025-01-27 15:28:33,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2842.71 | bwd_microstep: 4363.09 | bwd_inner_microstep: 3903.83 | bwd_allreduce_microstep: 459.02 | step_microstep: 2259.08
[2025-01-27 15:28:33,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2842.72 | bwd: 4363.08 | bwd_inner: 3903.88 | bwd_allreduce: 459.02 | step: 2259.08
[2025-01-27 15:28:33.129475][INFO][megatron/training_log:661] - iteration= 1/ 1271565 | consumed_samples= 384 | consumed_tokens= 1572864 | elapsed_time_per_iteration_ms=9845.0 | learning_rate=3.14573e-09 | global_batch_size= 384 | lm loss=11.208542 | loss_scale=1.0 | grad_norm=16.175 | actual_seqlen= 4096 | number_of_skipped_iterations= 0 | number_of_nan_iterations= 0 | samples_per_second=39.005 | tokens_per_gpu_per_second_tgs=3328.401 | [LM]TFLOPs=44.71 | [DS]TFLOPs=43.03 |
[2025-01-27 15:28:33.131878][INFO][megatron/utils:249] - [Rank 0] (after 1 iterations) memory (MB) | allocated: 2427.544921875 | max allocated: 9358.35107421875 | reserved: 10778.0 | max reserved: 10778.0
(min, max) time across ranks (ms):
    forward-backward ...............................: (7524.37, 7549.68)
    optimizer ......................................: (2256.28, 2259.51)
[2025-01-27 15:28:38,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 146.39 | optimizer_gradients: 0.43 | optimizer_step: 1.08
[2025-01-27 15:28:38,851] [INFO] [logging.py:128:log_dist] [Rank 0] step=2, skipped=0, lr=[6.291459736623767e-09, 6.291459736623767e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2025-01-27 15:28:38,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 1417.73 | bwd_microstep: 3879.56 | bwd_inner_microstep: 3510.52 | bwd_allreduce_microstep: 368.84 | step_microstep: 153.12
[2025-01-27 15:28:38,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 1417.75 | bwd: 3879.56 | bwd_inner: 3510.57 | bwd_allreduce: 368.84 | step: 153.12
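The throughput numbers in the iteration-1 line above hang together: `consumed_tokens` is the global batch times the sequence length, and `tokens_per_gpu_per_second_tgs` is samples/s times sequence length divided by the GPU count. A quick cross-check with the logged values (agreement is to within the rounding of `samples_per_second`):

```python
# Cross-check iteration-1 throughput, all inputs copied from the log.
global_batch_size = 384
seq_len = 4096
n_gpus = 48
samples_per_second = 39.005

# consumed_tokens after one step: one global batch of full sequences.
consumed_tokens = global_batch_size * seq_len
print(consumed_tokens)   # 1572864

# tokens per GPU per second:
tgs = samples_per_second * seq_len / n_gpus
print(round(tgs, 1))     # ~3328.4, vs. logged tgs=3328.401
```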