LLM for PyTorch

This directory provides scripts to train the GPT-based models in the Megatron-LM repository on Intel® Gaudi® 2 & Gaudi® 3 AI accelerators. Before you get started, make sure to review the Supported Configurations.

Table of Contents

  • Megatron Overview
  • How to Use
  • Setup
  • Supported Configurations
  • Changelog
  • Script Modifications
  • Known Issues

Megatron Overview

This implementation is based on https://github.com/NVIDIA/Megatron-LM at core_r0.8.0.

This repository comprises two essential components: Megatron-LM and Megatron-Core. Megatron-LM serves as a research-oriented framework that leverages Megatron-Core for large language model (LLM) training. Megatron-Core, on the other hand, is a library of optimized training techniques that ships with versioned APIs and regular releases. Alternatively, you can integrate Megatron-Core's building blocks into your preferred training framework.

Megatron-LM

First introduced in 2019, Megatron (1, 2, and 3) sparked a wave of innovation in the AI community, enabling researchers and developers to utilize the underpinnings of this library to further LLM advancements.

Megatron-Core

Megatron-Core is an open-source PyTorch-based library that contains optimized techniques and cutting-edge system-level optimizations. It abstracts them into composable and modular APIs, allowing full flexibility for developers and model researchers to train custom transformers at scale on accelerated computing infrastructure.

Megatron-Core offers core building blocks such as attention mechanisms, transformer blocks and layers, normalization layers, and embedding techniques. Additional functionality, such as activation recomputation and distributed checkpointing, is also built natively into the library. The building blocks and functionality are all optimized and can be combined with advanced parallelization strategies for optimal training speed and stability on accelerated computing infrastructure. The library also includes advanced model parallelism techniques (tensor, sequence, pipeline, context, and MoE expert parallelism).

How to Use

Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.

  • Third-Party Models
    • In the course of using Megatron-LM, users may choose to download models created and distributed by third parties after reviewing background information about the models and agreeing to the license governing those models.
    • Notice: Intel does not create the content and does not warrant its accuracy or quality. By accessing the third-party content, or using materials trained on or with such content, you are indicating your acceptance of the terms associated with that content and warranting that your use complies with the applicable license.
    • Intel expressly disclaims the accuracy, adequacy, or completeness of any such third-party content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. You agree Intel is not liable for any liability or damages relating to your use of third-party content.
    • Intel’s identification of these resources does not expand or otherwise alter Intel’s applicable published warranties or warranty disclaimers for Intel products or solutions, and you agree that no additional obligations, indemnifications, or liabilities arise from Intel identifying such resources. Intel reserves the right, without notice, to make corrections, enhancements, improvements, and other changes to its materials.

Setup

Please follow the instructions provided in the Intel Gaudi Installation Guide to set up the environment, including the $PYTHON environment variable. To achieve the best performance, follow the methods outlined in the Optimizing Training Platform guide. These guides walk you through setting up your system to run the model on Gaudi 2 and Gaudi 3.

Prerequisites

  • When creating the Docker container, set the shared memory size to 10 GB via the docker run command:
    --shm-size=10g
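
For reference, a full container launch might look like the following. This is an illustrative sketch only: the registry path and image tag follow the standard Intel Gaudi Docker naming convention and are assumptions here, so substitute the image that matches your Intel Gaudi software and PyTorch versions.

    # Illustrative launch; the image tag below is an assumption, not a verified reference
    docker run -it --runtime=habana \
        -e HABANA_VISIBLE_DEVICES=all \
        --cap-add=sys_nice \
        --net=host --ipc=host \
        --shm-size=10g \
        vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest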

Clone Intel Gaudi Megatron-LM

In the Docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the hl-smi utility to determine the Intel Gaudi software version.

git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/Megatron-LM
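
For example, a minimal sketch of this step, assuming release branches are named after the software version and that hl-smi reports the version in its output (the 1.19.0 value below is illustrative):

    # Check the installed Intel Gaudi software version
    hl-smi | grep -i version
    # Clone the matching release branch (version shown is an example)
    git clone -b 1.19.0 https://github.com/HabanaAI/Megatron-LM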

Set the required environment variables as shown below:

export MEGATRON_LM_ROOT=/path/to/Megatron-LM
export PYTHONPATH=$MEGATRON_LM_ROOT:$PYTHONPATH
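
These exports last only for the current shell. To persist them across sessions, one option is to append them to your shell profile (illustrative; replace the placeholder path with your actual clone location):

    # Optional: persist the variables in ~/.bashrc for future sessions
    echo 'export MEGATRON_LM_ROOT=/path/to/Megatron-LM' >> ~/.bashrc
    echo 'export PYTHONPATH=$MEGATRON_LM_ROOT:$PYTHONPATH' >> ~/.bashrc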

Install Megatron-LM Requirements

  • In the Docker container, go to the Megatron-LM directory:

    cd $MEGATRON_LM_ROOT
  • Install the required packages using pip:

    pip install -r megatron/core/requirements.txt
  • To run training on more than 128 cards, apply the configuration changes below (a quick sanity check follows this list):

    echo '*    soft nofile  unlimited' >> /etc/security/limits.conf
    echo '*    hard nofile  unlimited' >> /etc/security/limits.conf
    echo 'root soft nofile  unlimited' >> /etc/security/limits.conf
    echo 'root hard nofile  unlimited' >> /etc/security/limits.conf
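
After completing these steps, a quick sanity check can confirm the setup. This is a minimal sketch that assumes $PYTHON and $PYTHONPATH are set as described above:

    # Verify that Megatron-Core is importable
    $PYTHON -c "import megatron.core; print('megatron.core import OK')"
    # Verify the open-file limit; limits.conf changes apply to new login sessions
    ulimit -n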

Supported Configurations

Model          Mode          Intel Gaudi Software Version   PyTorch Version   Validated on Gaudi 2   Validated on Gaudi 3
LLaMA 3.1      Pretraining   1.19.0                         2.5.1             ✔️                     ✔️*
Mixtral 8x7B   Pretraining   1.19.0                         2.5.1             ✔️**                   —

*Sporadic numerical instability can occur when training with fp8 precision.

**Only BF16 configurations are currently enabled.

Changelog

1.19.0

  • Added support for Gaudi 3.
  • Added LLaMA 3.1 support and set as default.
  • Added Megatron-LM to Hugging Face LLaMA checkpoint conversion support. Usage example is available here.
  • Added Hugging Face to Megatron-LM LLaMA checkpoint conversion support. Usage example is available here.
  • Added Mixtral 8x7B BF16 support (preview version) here.

1.18.0

  • Initial release.

Script Modifications

Major changes made to the original code from the NVIDIA/Megatron-LM repository:

  • Changed README file content.
  • Added HPU support.
  • Added local RMSNorm support.
  • Added support for HPU fused ops.
  • Added checkpoint verification.
  • Added kill-switch mechanism to gracefully stop training.

Known Issues

  • Only recipes mentioned in this README are supported and verified.
