
Does the platform support multi-node parallel training with Megatron-LM or DeepSpeed? #8

Open
Lzl20092009 opened this issue Dec 21, 2023 · 0 comments

Comments


Lzl20092009 commented Dec 21, 2023

Hi, does the platform currently support running large training frameworks such as Megatron-LM or DeepSpeed on a local cluster?
We ran into two problems while setting this up:
1. Megatron's multi-node launch uses a bash script on each node, and some of its parameters differ from node to node (e.g. NODE_RANK). How can nodes assigned to the same job be given different configuration files?
2. The multi-node bash script also needs the master node's IP. Since nodes are assigned by the scheduler, we don't know in advance which node will actually act as the master. How should this be configured?
Is there an example of multi-node training with Megatron-LM that we could use as a reference?
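For context, a typical Megatron-LM multi-node launch follows the pattern below (a minimal sketch modeled on the `torchrun`-based scripts in Megatron-LM's examples, such as `examples/pretrain_gpt_distributed.sh`; the node counts, IP, and model arguments are illustrative placeholders). `NODE_RANK` and `MASTER_ADDR` are exactly the per-node values the two questions above are about:

```bash
#!/bin/bash
# Hypothetical per-node launch script for Megatron-LM multi-node training
# (illustrative values only; follows the general pattern of Megatron-LM's
# distributed example scripts).

GPUS_PER_NODE=8
NNODES=2                                 # total number of nodes in the job
NODE_RANK=${NODE_RANK:-0}                # differs per node: 0 on the master, 1..NNODES-1 elsewhere
MASTER_ADDR=${MASTER_ADDR:-"10.0.0.1"}   # IP of the rank-0 node (placeholder)
MASTER_PORT=6000

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2
    # ...remaining model/data arguments omitted for brevity
```

Cluster schedulers commonly handle the master-IP question by exporting `MASTER_ADDR` and `NODE_RANK` into each node's environment at dispatch time (for example, deriving the master address from the first host in the allocation), which is why the sketch reads both values from the environment rather than hard-coding them.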
