Memory Constrained environments #343

Open
hms opened this issue Sep 12, 2024 · 25 comments

Comments

@hms
Contributor

hms commented Sep 12, 2024

@rosa

At the risk of your crafting a Voodoo doll of me, and using it every time I reach out... Without knowing / understanding your design criteria and objectives, I'm at risk of asking poor questions or making bad suggestions, but here it goes anyway.... I'm going to apologize in advance for being "That Guy".

With the new V0.9 release (no tasks, nothing run), a fresh startup of SolidQueue, I see the following memory footprint (OSX):

  • Supervisor: 160 MB
  • Dispatcher: 108 MB
  • Scheduler: 105 MB
  • Worker: 110 MB
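
As a rough illustration of what these idle figures mean in aggregate, the sketch below sums them. The per-process numbers are the ones listed above; the 512 MB limit is an assumed instance size used only for illustration, and the method names are hypothetical. Summing per-process figures can overstate real usage, since forked processes may share some pages through copy-on-write.

```ruby
# Sum of the idle memory figures, in MB.
def total_idle_mb(footprints_mb)
  footprints_mb.values.sum
end

# Share of an assumed memory limit, as a percentage rounded to one decimal.
def percent_of_limit(used_mb, limit_mb)
  (used_mb * 100.0 / limit_mb).round(1)
end

# Idle memory per process as reported above (macOS, nothing running).
idle_mb = { supervisor: 160, dispatcher: 108, scheduler: 105, worker: 110 }
limit_mb = 512 # assumed instance size, for illustration only

total = total_idle_mb(idle_mb)
puts "All four processes: #{total} MB (#{percent_of_limit(total, limit_mb)}% of #{limit_mb} MB)"
puts "Without a scheduler: #{total - idle_mb[:scheduler]} MB"
```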

Once the jobs actually do something of value, the worker reliably grows to 200 MB plus (I'm looking at you, ActiveRecord...). For those of us running on cloud services and a shoestring budget, that's already tight. In my case, I run a second Worker to isolate high-memory jobs so I can "recycle on OOM" while still servicing everything else via the other worker.

I can purchase my way into additional memory resources at a cost of 10x (literally) what I'm paying now. And it only goes up from there. So this issue is real and painful for me, and I would guess a bunch of other folks running on shoestring budgets.

I'm sure there are use cases where larger deployments would want a Dispatcher without a Supervisor, so I think I understand the rationale for the current design. But it would be nice if there was a way, via configuration, to have a SuperDispatcherVisor... have the supervisor take on the dispatcher's responsibilities and allow us to reclaim 110 MB+.

@rosa
Member

rosa commented Sep 12, 2024

Yes, I understand... I had an async mode where all processes ran as threads of the supervisor, so there was a single process, but it was decided that we wanted to have a single and only way to run this.

Do you have recurring tasks configured at all? You could skip the scheduler if not. Another question: are you starting Solid Queue via bin/jobs or the Rake task?

I'll see if I get to bring back the async mode.

@hms
Contributor Author

hms commented Sep 12, 2024

@rosa

I do have recurring tasks.

> I had an async mode where all processes ran as threads of the supervisor, so there was a single process, but it was decided that we wanted to have a single and only way to run this.

For what it's worth, this was a very good call. That being said, per the readme, the dispatcher really isn't doing that much anymore. Unless you've got big plans for the dispatcher, it seems the Supervisor has the maintenance task thread (or a second thread) that could be doing what's left of the dispatcher's job.

@rosa
Member

rosa commented Sep 12, 2024

I think it depends on your setup 🤔 If you use delayed jobs, then the dispatcher will be making sure they get dispatched. This also applies to jobs automatically retried via Active Job with delay (the default). We do use them heavily and run several of them separate from workers, but perhaps you don't? In that case, would it help to just not run the dispatcher? You can achieve that by not configuring it at all, but I imagine you do need it in some cases 🤔 Although I imagine you need the concurrency maintenance task 😬 I think the async mode was a good idea for this case, TBH.
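
A minimal sketch of the "not configuring it at all" option, assuming the `config/queue.yml` layout described in the Solid Queue README (per-environment `workers` and `dispatchers` sections). The YAML is embedded in a Ruby heredoc only so that it can be parsed and checked here. The method names and the thread and polling values are illustrative, not recommendations. As noted above, without a dispatcher, delayed jobs and delayed retries would not be dispatched and the concurrency maintenance task would not run.

```ruby
require "yaml"

# Hypothetical contents of config/queue.yml: only a workers section,
# so no dispatcher process is configured. Values are illustrative.
def sample_queue_yml
  <<~YAML
    production:
      workers:
        - queues: "*"
          threads: 1
          polling_interval: 0.1
  YAML
end

# Lists which kinds of processes a given environment configures.
def configured_process_kinds(yaml, environment)
  YAML.safe_load(yaml).fetch(environment).keys
end

kinds = configured_process_kinds(sample_queue_yml, "production")
puts "Configured process kinds: #{kinds.join(', ')}"
```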

@hms
Contributor Author

hms commented Sep 12, 2024

I'm starting to worry this last exchange is falling further and further into the "I didn't fully understand" side of things, again :-(

I have to admit to being a bit flummoxed between the complexity required for a competent Async Job subsystem and the desire to fit things into the itty bitty tiny box of small scale and affordable cloud based deployments.

Given where SolidQueue currently sits with memory utilization, I'm going to have a good think on the trade-offs between running just 1 worker and assuming it's going to recycle (on OOM) for almost every execution vs. just facing that I have to push that memory dial to the right and eat the bill 😢

@rosa
Member

rosa commented Sep 12, 2024

> I have to admit to being a bit flummoxed between the complexity required for a competent Async Job subsystem

So sorry, this is my fault. I call it async mode but it's actually the same thing as now, just that instead of forking processes, the supervisor would just spawn threads for each of its supervised processes. I probably confused you with that name.
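
The difference between the two modes can be sketched in plain Ruby. This is not Solid Queue's code: the role names and method names are hypothetical, and each "role" does nothing except report the pid it runs in. It only shows that threads live inside the supervisor's process while forks each get their own process, with their own memory footprint. `fork` is not available on every platform (for example Windows).

```ruby
# Thread-based supervision: each role runs as a thread inside the
# supervisor's own process, so every role reports the same pid.
def run_roles_as_threads(roles)
  roles.to_h { |role| [role, Thread.new { Process.pid }.value] }
end

# Fork-based supervision: each role runs in a child process, so every
# role reports a different pid. The child sends its pid back over a pipe.
def run_roles_as_forks(roles)
  roles.to_h do |role|
    reader, writer = IO.pipe
    child = fork do
      reader.close
      writer.write(Process.pid.to_s)
      writer.close
      exit!(0) # leave the child immediately, skipping at_exit handlers
    end
    writer.close
    child_pid = reader.read.to_i
    reader.close
    Process.wait(child)
    [role, child_pid]
  end
end

roles = %i[dispatcher scheduler worker]
puts "supervisor pid: #{Process.pid}"
puts "as threads: #{run_roles_as_threads(roles)}"
puts "as forks:   #{run_roles_as_forks(roles)}"
```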

Are you starting Solid Queue via bin/jobs or the Rake task?

@hms
Contributor Author

hms commented Sep 12, 2024

> So sorry, this is my fault. I call it async mode but it's actually the same thing as now, just that instead of forking processes, the supervisor would just spawn threads for each of its supervised processes. I probably confused you with that name.

That one I actually understood.

Adding complexity to the code so it can be configured to run both ways seems like a big lift. I know you had it before, but every line of code in SolidQueue is a line that has to be supported, tested, and will eventually be used in an unexpected way. I would guess Threads are OK for very light / IO-intensive workloads, but given the GVL I simply don't understand where the tradeoffs are on Threads vs. Processes. I shouldn't have started this conversation without a better understanding.

> Are you starting Solid Queue via bin/jobs or the Rake task?

I've switched to bin/jobs.

I can't thank you enough for being willing to engage, and tolerate / put up with my learning curve on some of these issues.

@rosa
Member

rosa commented Sep 12, 2024

Oh, no, no please, it's me who should thank you for your patience and help to make Solid Queue better! 🙏 ❤️

The reason I asked about bin/jobs is that I learnt not that long ago that Rake tasks don't eager load code by default even if you set config.eager_load to true (the default in production). For that to happen, you need to set rake_eager_load to true, as it's false by default. bin/jobs loads Rails's environment so it'll use eager_load and will load the app before forking. I think this might help with memory because forks will share some of that already loaded stuff, but I think the savings are quite modest in general.
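
For anyone starting Solid Queue through the Rake task, a minimal sketch of the setting mentioned above might look like this, assuming a standard Rails production environment file. It is a configuration fragment and only runs inside a Rails app.

```ruby
# config/environments/production.rb
Rails.application.configure do
  # Eager load the app on boot (already the default in production).
  config.eager_load = true

  # Also eager load when the process is started through a Rake task;
  # this is false by default.
  config.rake_eager_load = true
end
```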

@hms
Contributor Author

hms commented Sep 12, 2024

I'll look into the Eager Vs. Lazy tradeoffs. Thanks for that.

Once I get worker recycling working / finished, I'll have more suggestions to share that have helped. For example, SolidQueue.on_start { GC.auto_compact = true } helps and is shared between forks.
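
Written out as an initializer, the hook above could look like the following sketch. The file name is hypothetical, and the `respond_to?` guard is an added precaution for Ruby builds that do not support compaction. It only runs inside an app that has Solid Queue loaded.

```ruby
# config/initializers/solid_queue.rb (hypothetical file name)
SolidQueue.on_start do
  # Per the comment above, the setting is shared with the forked processes.
  GC.auto_compact = true if GC.respond_to?(:auto_compact=)
end
```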

@rosa
Member

rosa commented Sep 12, 2024

There's also Process.warmup in Ruby 3.3, which might help too, but I've never used it in production.
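
A minimal, untested-in-production sketch of calling it defensively, so the same code still loads on Rubies older than 3.3. The method name is hypothetical.

```ruby
# Calls Process.warmup when the running Ruby provides it (3.3+).
# Returns true if warmup ran, false otherwise.
def warm_up_if_supported
  return false unless Process.respond_to?(:warmup)

  Process.warmup
  true
end

puts "Process.warmup ran: #{warm_up_if_supported}"
```

One place to try this might be a `SolidQueue.on_start` hook, once the app has loaded and before any forking, though that placement is an assumption rather than something confirmed in this thread.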

@hms
Contributor Author

hms commented Sep 12, 2024

Oh that looks interesting! Thank you.

@majkelcc

> Do you have recurring tasks configured at all? You could skip the scheduler if not.

What's the easiest way to do this? Would an empty recurring.yml already disable the scheduler?

@dhh
Member

dhh commented Sep 16, 2024

@hms If you're running in a super constrained environment, you could just use the puma plugin that we use in development. Then everything is running off that single Rails process. Just make sure you keep WEB_CONCURRENCY = 1.
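
A minimal sketch of that setup, assuming the plugin is enabled from the app's Puma config file and that `WEB_CONCURRENCY=1` is set in the environment rather than in this file. It is a configuration fragment and only makes sense inside a Rails app that uses Puma and Solid Queue.

```ruby
# config/puma.rb
# Start Solid Queue together with Puma, instead of running a separate
# bin/jobs process.
plugin :solid_queue
```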

Another option is to stop getting fleeced by cloud providers charging ridiculous prices for tiny hosts 😄. Rails 8 is actually about answering that question in the broad sense.

@rosa
Member

rosa commented Sep 17, 2024

> What's the easiest way to do this? Would an empty recurring.yml already disable the scheduler?

@majkelcc yes! If you have no recurring tasks defined at all (the default), then the scheduler will be automatically disabled. Alternatively, if you're running jobs in more than one place and want to disable it in one of them (this is what we do in HEY), then you can pass --skip_recurring to bin/jobs.

@hms
Contributor Author

hms commented Sep 17, 2024

@dhh

Oh, how I hate Heroku and the games they play with radically inappropriate "starter" resource sizing (you think Apple is bad), all in an effort to prop up their already insanely high prices and force me into upgrades (does that make it "insanely high² prices"?). And yes, I'm very jealous of your new monster Dells and the fact you got off the treadmill (want to rent me a small slice for something I can afford...).

But as a solo developer, who is extremely grateful for the technical compression the Rails community and you have delivered over the years, I cannot put a price on the value of A) not having to worry about anything DevOps; B) the comfort of 10+ years of using a system and feeling like you know all of its corners.

I'm very much crossing my fingers that Rails 8 reduces the moving parts enough that the learning curve of a new deployment strategy comes within reach.

@dhh
Member

dhh commented Sep 17, 2024

@hms You're the ideal target for the progress we're bringing to the deployment story in Rails 8. Stay tuned for Rails World!

But in the meantime, I'd try with the puma plugin approach.

@hms
Contributor Author

hms commented Sep 17, 2024

@dhh In my case, I have at least split the web server from the SolidQueue environments, so I'm living large with 512MB x2.

I'm just bristling at the fact that I have to go from $9 to $50 a month to double that memory.

@dhh
Member

dhh commented Sep 17, 2024

Highway robbery. Selling 512MB instances in 2024 is something.

@dylanfisher

@hms have you had any success with SolidQueue.on_start { GC.auto_compact = true } and/or other config for running solid queue on Heroku? This doesn't seem to be making a difference for me, nor does worker: bin/jobs vs worker: bundle exec rake solid_queue:start.

When I deploy using either a separate worker or with the Puma plugin, I immediately exceed the 512 MB quota. (Highway robbery... I know).

2024-12-09T18:58:21.218153+00:00 app[worker.1]: W, [2024-12-09T18:58:21.218045 #2]  WARN -- : SolidQueue-1.1.0 Fail claimed jobs (9.6ms)  job_ids: [], process_ids: []
2024-12-09T18:58:21.218281+00:00 app[worker.1]: I, [2024-12-09T18:58:21.218257 #2]  INFO -- : SolidQueue-1.1.0 Started Supervisor (118.0ms)  pid: 2, hostname: "6a2748a5-8d7a-40f7-b750-a3cad8a3f8c0", process_id: 113, name: "supervisor-498a7acdf5319581010e"
2024-12-09T18:58:21.288512+00:00 app[worker.1]: I, [2024-12-09T18:58:21.288420 #33]  INFO -- : SolidQueue-1.1.0 Started Worker (51.3ms)  pid: 33, hostname: "6a2748a5-8d7a-40f7-b750-a3cad8a3f8c0", process_id: 115, name: "worker-71a8748dff54b8aa83da", polling_interval: 0.1, queues: "*", thread_pool_size: 1
2024-12-09T18:58:21.288513+00:00 app[worker.1]: I, [2024-12-09T18:58:21.288435 #30]  INFO -- : SolidQueue-1.1.0 Started Dispatcher (59.4ms)  pid: 30, hostname: "6a2748a5-8d7a-40f7-b750-a3cad8a3f8c0", process_id: 114, name: "dispatcher-97390c7ceb2f7a4af238", polling_interval: 5, batch_size: 500, concurrency_maintenance_interval: 600
2024-12-09T18:59:21.004396+00:00 heroku[worker.1]: Process running mem=613M(119.8%)
2024-12-09T18:59:21.006149+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)

@hms
Contributor Author

hms commented Dec 9, 2024

@dylanfisher

I've been focusing on other SQ PRs. But I have a "recycle Workers" PR in my back pocket. Only question is if the SQ team will accept one of my other PRs since the implementation has to change a little if they do.

I tried all sorts of ways to try to constrain the memory footprint of Workers using the AWS S3 gem without any luck. It generates an unrecoverable memory footprint. I'm guessing a degenerate memory fragmentation rather than leak, but either way, this means I'll require one of: an unlimited Heroku budget or getting off my arse and pushing my recycling PR ASAP.
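
The recycling idea can be sketched as a simple memory check. This is a hypothetical illustration, not the PR mentioned above and not part of Solid Queue: the method names are made up, and reading RSS through `ps` is an assumption that holds on Linux and macOS.

```ruby
# Resident set size of a process in KB, read via `ps` (Linux and macOS).
def current_rss_kb(pid = Process.pid)
  `ps -o rss= -p #{pid}`.to_i
end

# True when the given (or current) RSS exceeds the limit in MB.
# The rss_kb argument can be injected, which keeps the check easy to test.
def over_memory_limit?(limit_mb, rss_kb: current_rss_kb)
  rss_kb > limit_mb * 1024
end

# Example with fixed numbers: 613 MB used against a 512 MB limit.
puts over_memory_limit?(512, rss_kb: 613 * 1024)
```

A worker could run such a check between jobs and exit cleanly once it is over the limit, relying on whatever supervises it to start a replacement. How that fits into Solid Queue's own process lifecycle is left open here.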

GC.auto_compact = true:
I haven't had a chance to re-test since SQ upgraded to Ruby 3.3.5, but with 3.3.1 neither SQ nor GoodJob would have any part of running with GC.auto_compact = true (interestingly enough, GoodJob must be doing something similar, which should help narrow down the issue). It would crash the process with a GC error. Now I'm wondering if it was an ARM-only issue: because of the crash during testing, I never pushed to Heroku and its Intel chips......

@mattgraham

mattgraham commented Dec 25, 2024

This was a good thread to find. I enjoy solid_queue for sure, but I noticed that while running the puma plugin, the RAM goes far too far over the normal dyno memory threshold within a matter of time. This is a super small app, so just tossing more money/RAM at it doesn't seem to be the win. For now I'll run a separate dyno for solid_queue and avoid all the R-errors from the platform.

This is the solid_queue dyno itself.
image

@exterm

exterm commented Jan 9, 2025

As an added data point, we're running a (pretty much vanilla) Rails 8 app on the smallest Digital Ocean droplet (1GB of memory) and solid_queue 1.1.2 is taking up 48% of that memory, matching the values reported by @hms in the original post.

Our app is currently very barebones but we do have one recurring job. It doesn't make sense to me that the job framework would use almost 500MB of memory.

I'll look into puma plugin mode, but I'd love to see solid_queue use less memory by default.

@Jell

Jell commented Jan 10, 2025

Adding another data point: we just introduced ActiveJob + SolidQueue to an old and large codebase, which has a memory footprint of about 500 MB when running as a web server using Puma.

When starting a SolidQueue worker with bin/jobs in production (using eager loading), 4 processes that each use close to 500 MB are started, and we've seen memory usage reach 1.6 GB at some point. There's probably something in our codebase that makes it hard for processes to share memory efficiently (breaking Ruby's copy-on-write mechanism when forking processes). At this point it's not something we'll be able to address, since our codebase is so large. We ended up using the rake task instead of bin/jobs to bypass eager loading, so that the worker is the only process that ends up with the entire app in memory, and our SolidQueue server appears to stabilize at 650 MB instead.

For comparison, we are trying to migrate from Sidekiq, which in our case only has a memory footprint of 400 MB for the same application, so SolidQueue appears to have a 4x memory footprint when using bin/jobs and a 1.6x footprint when using the rake task in our case.
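
The ratios quoted above follow directly from the reported figures, as this small sketch shows. It takes 1.6 GB as 1600 MB, which is an assumption about how the figure was rounded, and the method name is hypothetical.

```ruby
# Ratio of a memory footprint to a baseline, rounded to one decimal.
def footprint_ratio(footprint_mb, baseline_mb)
  (footprint_mb.to_f / baseline_mb).round(1)
end

sidekiq_mb = 400    # reported Sidekiq footprint
bin_jobs_mb = 1600  # reported peak with bin/jobs, 1.6 GB taken as 1600 MB
rake_task_mb = 650  # reported steady state with the rake task

puts "bin/jobs vs Sidekiq:  #{footprint_ratio(bin_jobs_mb, sidekiq_mb)}x"
puts "rake task vs Sidekiq: #{footprint_ratio(rake_task_mb, sidekiq_mb)}x"
```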

Would it make sense to only have the worker load the entire rails app, and the supervisor / dispatcher / scheduler be super lightweight processes instead? I can imagine that there's some value in trying to fork the workers from the same process to reduce memory footprint when having multiple worker processes, but at the same time I imagine that the 90% use case is running a single multi-threaded worker process anyway?

@rosa
Member

rosa commented Jan 10, 2025

@Jell, yeah, I still think the async mode I had was a good idea because of this... but it wasn't my call in the end. I'll see if I can do something about it.

I'd recommend not to migrate from Sidekiq in your case since it seems it was working well, right?

@Jell

Jell commented Jan 10, 2025

> I'd recommend not to migrate from Sidekiq in your case since it seems it was working well, right?

@rosa actually we do want to migrate away from Sidekiq because we'd like a background job system that we can use as a transactional outbox (the sharp tool described in https://github.com/rails/solid_queue?tab=readme-ov-file#jobs-and-transactional-integrity), and we also want to have better guarantees of "at least once processing" (which would require Sidekiq Pro).

So we're looking at migrating to either SolidQueue or GoodJob. The higher memory footprint is not an issue for us per se, just doesn't feel good to have to motivate the cost increase.

And I didn't mean to compare to Sidekiq to say "Sidekiq is better", just to give what I think is a fair benchmark? We much prefer ActiveJob + SolidQueue over Sidekiq overall.

@Jell

Jell commented Jan 10, 2025

> I still think the async mode I had was a good idea because of this... but it wasn't my call in the end. I'll see if I can do something about it.

From my understanding of your explanation in the other threads, I believe I agree with you @rosa: an async mode with a single process would be perfect for our use case, and I believe it should give us a memory footprint on par with the industry standard.

Thanks a lot for your patience and your effort 🙏


8 participants