
What is a good local workflow #1794

Closed
msaroufim opened this issue Aug 18, 2023 · 13 comments
Labels: competition (Support for the NeurIPS Large Language Model Efficiency Challenge), documentation (Improvements or additions to documentation), user question

Comments

@msaroufim
Collaborator

msaroufim commented Aug 18, 2023

I'm working with a few folks on non-accuracy-preserving ML optimization techniques, so their workflow looks like: make a model update, check accuracy, make another model update, check accuracy again, and so on.

The only two ways I see for them to use HELM are:

  1. The new HTTP client: people have found this too high-overhead for local development, and on some remote HPC clusters you won't have the option of opening up a server.
  2. The HF Hub route: assuming you can fit your model into transformers, you then need to upload it to a remote store every time you want to run an eval.

So I'm curious what folks think would be the simplest, lowest-overhead approach to running HELM frequently and locally, while making as few changes as possible to a PyTorch nn.Module.

@msaroufim added the documentation and competition labels Aug 18, 2023
@yifanmai
Collaborator

Some options here:

@msaroufim
Collaborator Author

So the audience I have in mind is PyTorch devs who may or may not be using some underlying framework. In particular, I was wondering about the generate() path: do they only need to implement a custom Client, and if so, could we have an opinionated raw PyTorch client? Do people also need to implement the service, window, schema, etc.?

@yifanmai
Collaborator

yifanmai commented Aug 21, 2023

> can we have an opinionated raw PyTorch client?

I like the idea of having an opinionated raw PyTorch client. Would you have an example of what this would look like? I imagine one way to do it is for the client to read in configuration specifying the nn.Module name (which it could auto-import), various parameters for the module, and a path to a tokenizer; the module would have to conform to a particular spec.

I'm very close to having model configuration files working, but in the meantime, I think the client can read from environment variables for this configuration.
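Purely as a sketch (not actual HELM code), such a client might look like the following. The environment variable names (RAW_PYTORCH_MODULE, RAW_PYTORCH_CHECKPOINT, RAW_PYTORCH_TOKENIZER), the class name, and the assumption that the module exposes a Hugging Face-style generate() are all illustrative placeholders:

```python
# Hypothetical sketch only -- not actual HELM code. The environment variable
# names, class name, and method signature are illustrative placeholders.
import importlib
import os

import torch
from transformers import AutoTokenizer


def _load_module_from_env() -> torch.nn.Module:
    # e.g. RAW_PYTORCH_MODULE="my_package.my_model.MyModel" (hypothetical)
    module_path, _, class_name = os.environ["RAW_PYTORCH_MODULE"].rpartition(".")
    model_cls = getattr(importlib.import_module(module_path), class_name)
    model = model_cls()  # the module would have to conform to a particular spec
    checkpoint = os.environ.get("RAW_PYTORCH_CHECKPOINT")
    if checkpoint:
        model.load_state_dict(torch.load(checkpoint, map_location="cpu"))
    return model.eval()


class RawPyTorchClient:
    """Opinionated raw PyTorch client: reads its configuration from env vars."""

    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(os.environ["RAW_PYTORCH_TOKENIZER"])
        self.model = _load_module_from_env()

    @torch.no_grad()
    def generate(self, prompt: str, max_new_tokens: int = 64) -> str:
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        # Assumes the module exposes an HF-style .generate(); a bare nn.Module
        # would instead need a small greedy decoding loop here.
        output_ids = self.model.generate(input_ids, max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```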

> Do people also need to implement the service, window, schema, etc.?

In general, I'm very close (about a couple of weeks away) to getting configuration files for the models and tokenizers working end-to-end. After that's done, users won't need to modify the HELM code; they can just specify the configuration file, and (1) the schema will be auto-generated, (2) the default window service will be used, with (3) the tokenizer they specify.

This does assume that users are using a "standard-ish" tokenizer like the Hugging Face ones.

@HDCharles

I'm one of the users @msaroufim is talking about; specifically, I do quantized inference. The flow is generally: 1) load a model from a checkpoint, 2) perform module swaps on the model to place quantized modules where necessary, 3) run the model over a small piece of the dataset to calibrate (usually calibrating on validation and testing on test, or something like that), 4) apply final transformations, and 5) run the eval.

What would be the easiest way to do that? We're constantly tweaking steps 2 and 4, so it sounds like editing the generate function would be easiest, but I'm not sure whether you have access to the necessary data at that point.
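For concreteness, here is a stripped-down illustration of that five-step flow. This is not our actual code: QuantizedLinear, swap_linears, and the eval hook are placeholder names, and the quantization shown is a trivial int8 weight-only scheme.

```python
# Illustrative sketch of the five-step flow described above; not actual code.
# QuantizedLinear, swap_linears, and eval_fn are placeholders.
import copy

import torch
import torch.nn as nn


class QuantizedLinear(nn.Module):
    """Placeholder int8 weight-only linear used in the module swap (step 2)."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        scale = linear.weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        self.register_buffer("weight_int8", torch.round(linear.weight / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.weight_int8.to(x.dtype) * self.scale
        return torch.nn.functional.linear(x, weight, self.bias)


def swap_linears(model: nn.Module) -> nn.Module:
    # Step 2: replace eligible nn.Linear modules with quantized versions.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, QuantizedLinear(child))
        else:
            swap_linears(child)
    return model


def pipeline(checkpoint_path: str, calibration_batches, eval_fn):
    model = torch.load(checkpoint_path)          # 1) load from checkpoint
    model = swap_linears(copy.deepcopy(model))   # 2) module swaps
    model.eval()
    with torch.no_grad():                        # 3) calibrate on a small split
        for batch in calibration_batches:
            model(batch)
    # 4) apply final transformations (e.g. freeze scales / fuse) would go here.
    return eval_fn(model)                        # 5) run eval
```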

@msaroufim
Collaborator Author

Thanks @yifanmai. If you have a WIP branch that you'd like us to try out and give feedback on, please let us know.

@msaroufim
Collaborator Author

msaroufim commented Sep 23, 2023

Hi @yifanmai, any update on this? I'm really eager to try something out; with an efficient HELM we could start running evals in PyTorch CI at a reasonable cost.

@yifanmai
Collaborator

Hi @msaroufim, sorry for the delay. I opened a PR #1861 that should improve the user workflow. Would you have some time to try it out in the next couple of days?

@msaroufim
Collaborator Author

Oh interesting, I was under the impression that was a PR specific to the NeurIPS competition; will review ASAP.

Just to be clear, there are two different scenarios I'm interested in for HELM:

  1. The NeurIPS competition
  2. PyTorch CI for quantization/sparsity work

@yifanmai
Collaborator

Sorry, I think I put this under the wrong issue. #1861 is more relevant to the NeurIPS competition. I need to think more about the PyTorch CI use case.

@HDCharles and @msaroufim, I was also curious whether there is any code in a git branch, gist, or Python notebook that demonstrates this quantization pipeline outside of HELM. If you could share it with me, that would help me understand the intended workflow better.

@yifanmai
Collaborator

Also wondering: is this using the GPTQ algorithm or something else? I'm planning to add quantization to the HuggingFaceClient soon (HF quantization API doc), so maybe the PyTorch integration would be similar.
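For reference, loading a quantized model through the Hugging Face transformers quantization API looks roughly like this (a sketch using the bitsandbytes 8-bit path; the checkpoint name is just an example, and 8-bit loading generally requires a CUDA GPU):

```python
# Sketch of loading a quantized model via the transformers quantization API.
# bitsandbytes 8-bit shown; GPTQConfig works similarly.
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-350m"  # example checkpoint only

quant_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # places the quantized weights on the available GPU(s)
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```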

@msaroufim
Collaborator Author

msaroufim commented Sep 28, 2023

I'm going to be out tomorrow for a conference; let me write up a more comprehensive repro and ask, and get back to you in the next couple of days.

@yifanmai
Collaborator

Sounds good, no rush. Enjoy the conference!

@yifanmai
Collaborator

yifanmai commented Aug 6, 2024

Closing due to staleness, but feel free to reopen if there are further questions.

We now support quantization: see #1912 for details.

HELM has changed quite substantially since this issue was opened. Currently the recommended routes for running a local model are:

  1. Running local inference from a Hugging Face checkpoint on disk, or
  2. Running remote inference using a vLLM server

See #2463 for an explanation of how these methods work. Both methods only require modifying a configuration file, and do not require adding any Python code.

yifanmai closed this as completed Aug 6, 2024