ERROR: scoreboard: failed to lock (already locked) #13437

Closed · jkabat opened this issue Feb 20, 2024 · 29 comments

jkabat commented Feb 20, 2024

Description

After the recent upgrade of our application containers to PHP 8.2.14, I see plenty of error messages like these in the log:

ERROR: scoreboard: failed to lock (already locked)
ERROR: status: unable to find or access status shared memory

Is there something I can do to prevent it from happening? End users do not seem to be affected as far as I can tell. These log messages appear at random and only for the PHP-FPM containers. Long-running PHP CLI commands operating under Supervisor do not have these issues.

opcache: enabled
jit: disabled

Edit: worth mentioning that the errors occur at random even on a container that is barely used while it sits idle.

PHP Version

PHP 8.2.14

Operating System

Debian GNU/Linux 12 (bookworm)

bukka (Member) commented Feb 22, 2024

I just checked, and this happens when the scoreboard is already in use by another process. That could happen due to the lock being held longer (there was potentially one such change, though I think it was a bit earlier). What was the previous version you upgraded from?

Another thing that could play a role is a high number of configured children. Could you provide your FPM config?

jkabat (Author) commented Feb 22, 2024

I upgraded from 8.1.18-fpm-bullseye to 8.2.14-fpm-bookworm.

FPM config:

[global]
daemonize = no

log_level = notice

emergency_restart_threshold = 3
emergency_restart_interval = 1m
process_control_timeout = 10s

[www]
access.log = /dev/null

listen = 9000

request_terminate_timeout = 180s

pm = static
pm.max_children = 10
pm.max_requests = 200
pm.process_idle_timeout = 10s

pm.status_path = /status

bukka (Member) commented Mar 8, 2024

Could you try to increase (significantly) or remove pm.max_requests and see if that helps?
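
For illustration, a minimal sketch of that change against the pool config quoted above (the raised value is an arbitrary example, not a recommendation):

[www]
; either raise the limit significantly ...
pm.max_requests = 10000
; ... or comment it out entirely so children are never recycled
;pm.max_requests = 200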

bukka (Member) commented Mar 8, 2024

You should generally not need it, and it seems it didn't work correctly before 8.2 (we had a couple of reports about that): children did not restart correctly, which ended up being fixed in 8.2.

jkabat (Author) commented Mar 18, 2024

I have removed the pm.max_requests option, but the error is still present even on the newest 8.2.17-fpm-bookworm.
EDIT: could it have something to do with OPcache being enabled? Still, it only happens now and then.

Used memory | 83587760
Free memory | 84184400

bukka (Member) commented Mar 21, 2024

I have been thinking about this and looking through the code and changes. I don't see any specific FPM change that should impact locking; the last fix in that area was also merged to 8.1.

What most likely has the biggest impact on scoreboard locking is status collection. Do you have some external collector for the status page? If so, how often does it run? If it runs often, try to decrease the frequency or temporarily disable it, just to see if that helps.

Otherwise there are some other factors that could be the cause. Could you also try it with a minimal config just to see if that helps, and then potentially increase pm.max_children (up to 100) if it is still happening:

[global]
daemonize = no
log_level = notice

[www]
listen = 9000

pm = static
pm.max_children = 10

pm.status_path = /status

If the error disappears, add the removed options back one by one and see when it starts happening again.

jkabat (Author) commented Mar 26, 2024

@bukka thank you for looking into the issue.

I use a healthcheck that periodically calls the /status endpoint:
https://raw.githubusercontent.com/renatomefi/php-fpm-healthcheck/master/php-fpm-healthcheck

It runs every 10s. Let me test your settings and I will let you know if that helps.
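
Under the hood that script queries the FPM status page over FastCGI, so every check takes a scoreboard read. Roughly equivalent to (a simplification of what the script does, assuming FPM on localhost:9000 and pm.status_path = /status):

$ SCRIPT_NAME=/status SCRIPT_FILENAME=/status REQUEST_METHOD=GET \
    cgi-fcgi -bind -connect localhost:9000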

ToshY commented Apr 18, 2024

@jkabat I'm also using the same healthcheck with the interval set to 10s, and I'm getting the same messages in the log. Have you been able to resolve this?

@realcnbs

If you're running this in a k8s pod, it is probably caused by your liveness and readiness probes being executed at the exact same moment.

jkabat (Author) commented Jul 11, 2024

@realcnbs yes, I run it in k8s pods. What would be the solution? Do you have one?

@realcnbs

Play with the probes' delays and periods so there are a couple of seconds between them.
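
A minimal sketch of such staggering, reusing the php-fpm-healthcheck command from this thread; the numbers are arbitrary examples, and k8s probe timing has some jitter, so occasional overlap is still possible:

          # same period, offset start: the probes fire roughly 5s apart
          livenessProbe:
            exec:
              command:
                - php-fpm-healthcheck
            initialDelaySeconds: 0
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - php-fpm-healthcheck
            initialDelaySeconds: 5
            periodSeconds: 10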

pimjansen commented Jul 11, 2024

But still, it should be able to handle both at once, right? My logs are being spammed with this (300 times in the last 24 hours).

@realcnbs

It should, but apparently it doesn't.

@adamkral12

I tried 5 seconds between readiness and liveness and I am still hitting the same issue. I also tried to reproduce it locally without any luck :/

bukka (Member) commented Jul 21, 2024

It would be useful to get more info about the setup (including the k8s deployment spec and possibly a bit of info about the cluster) and the traffic where this happens. I'm building a new testing tool, https://github.com/wstool/wst/, that will allow me to better test high-traffic issues, including k8s setups, so the more info I get, the better chance I have of recreating it...

@pimjansen

> It would be useful to get more info about the setup (including the k8s deployment spec and possibly a bit of info about the cluster) [...]

I will send our Dockerfile tomorrow. In general it is not really related to high traffic, if I'm correct.

jkabat (Author) commented Jul 22, 2024

I can provide the k8s setup too. I agree with @pimjansen that it is not related to high traffic.

pimjansen commented Jul 22, 2024

@bukka here is a gist with the Dockerfile and related config. If I'm correct, there is not much "special" in this use case except maybe the healthcheck:

https://gist.github.com/pimjansen/5bc6b3222cf26f505b608207cf350e31

EDIT: and the deployment probe settings:

          livenessProbe:
            exec:
              command:
                - php-fpm-healthcheck
                - --listen-queue=10
            initialDelaySeconds: 0
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - php-fpm-healthcheck
            initialDelaySeconds: 1
            periodSeconds: 5

realcnbs commented Jul 22, 2024

> I tried 5 seconds between readiness and liveness and I am still hitting the same issue. I also tried to reproduce it locally without any luck :/

Try disabling one of the probes and see if that fixes it. If it does, then 5 seconds is not enough to guarantee the time gap. As an option, you can try using a startupProbe instead of the readiness probe.
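
A minimal sketch of that swap, under the same assumptions as the probe config above; a startupProbe runs only until it first succeeds and then never again, so it cannot keep colliding with the liveness probe once the pod is up:

          startupProbe:
            exec:
              command:
                - php-fpm-healthcheck
            periodSeconds: 5
            failureThreshold: 12   # allow up to ~60s for FPM to come up
          livenessProbe:
            exec:
              command:
                - php-fpm-healthcheck
            periodSeconds: 10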

jkabat (Author) commented Jul 22, 2024

> Try disabling one of the probes and see if that fixes it. If it does, then 5 seconds is not enough to guarantee the time gap. As an option, you can try using a startupProbe instead of the readiness probe.

When I commented out the readiness probe, it really helped...

EDIT: as per @realcnbs's suggestion I replaced the readiness probe with a startupProbe, and I have seen only one occurrence so far. I'm not sure it's safe to get rid of the readinessProbe entirely, so I will keep this as a temporary solution until the bug is fixed. Thank you, guys.

@karabanov

After upgrading from PHP 7.4.33 to PHP 8.0.30, I also started noticing messages like these in the logs:

[11-Aug-2024 17:13:42] WARNING: [pool big_app_prod] child 667 said into stderr: "ERROR: scoreboard: failed to lock (already locked)"
[11-Aug-2024 17:13:42] WARNING: [pool big_app_prod] child 667 said into stderr: "ERROR: status: unable to find or access status shared memory"
[11-Aug-2024 18:12:01] NOTICE: error log file re-opened
[11-Aug-2024 18:57:01] WARNING: [pool big_app_prod] child 461 said into stderr: "ERROR: scoreboard: failed to lock (already locked)"
[11-Aug-2024 18:57:01] WARNING: [pool big_app_prod] child 461 said into stderr: "ERROR: status: unable to find or access status shared memory"

PHP is installed from the repository https://packages.sury.org/php/ and runs in a container.

The config file looks something like this:

[big_app_prod]
; Standard
listen = /www/php_sockets/php-fpm-docker-big_app_prod.sock
listen.backlog = 4096
listen.group = www-data
listen.mode = 0660
rlimit_files = 10240
chdir = /
catch_workers_output = yes
security.limit_extensions = .php
 
; Deployment-specific
pm = static
pm.max_children = 1500
pm.status_path = /fpm-status

php_admin_flag[log_errors] = on
php_admin_flag[display_errors] = off
php_admin_value[error_log] = /var/log/php/$pool/$pool.log
php_admin_value[sendmail_path] = /usr/bin/msmtp -C /etc/msmtprc -t --read-envelope-from

; Project-specific
php_admin_value[precision] = 12
php_admin_flag[ignore_user_abort] = on
php_admin_flag[expose_php] = off
php_admin_value[max_input_nesting_level] = 8
php_admin_value[max_execution_time] = 600
php_admin_value[memory_limit] = 300M
php_admin_value[error_reporting] = E_ALL & ~E_NOTICE & ~E_DEPRECATED & ~E_STRICT
php_admin_flag[html_errors] = off
php_admin_value[post_max_size] = 32M
php_admin_value[upload_max_filesize] = 32M
php_admin_flag[file_uploads] = True
php_admin_value[cgi.fix_pathinfo] = 0
php_admin_value[session.name] = big_app_prod
php_admin_value[session.save_handler] = redis
php_admin_value[session.save_path] = "tcp://127.0.0.1:6379?timeout=10&prefix=big_app_prod_SESS_"
php_admin_value[session.cookie_httponly] = On
php_admin_value[apc.shm_size] = 2048M
php_admin_value[max_input_vars] = 1000
php_admin_value[session.gc_maxlifetime] = 1440

Metrics are collected using:

# https://github.com/Lusitaniae/phpfpm_exporter
prometheus-php-fpm-exporter \
    --web.listen-address=127.0.0.1:5584 \
    --phpfpm.socket-paths=/www/php_sockets/php-fpm-docker-big_app_prod.sock \
    --phpfpm.status-path=/fpm-status \
    --phpfpm.script-collector-paths=/var/lib/prometheus/php_opcache_exporter.php
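
Whatever scrape interval Prometheus uses for this exporter sets how often /fpm-status (and therefore the scoreboard lock) is hit. A hypothetical scrape config for illustration; the interval shown is an assumption, not taken from the setup above:

scrape_configs:
  - job_name: 'php-fpm'
    scrape_interval: 15s   # each scrape triggers one /fpm-status request
    static_configs:
      - targets: ['127.0.0.1:5584']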

PHP Version

8.0.30

Operating System

Debian 11 (bullseye)

@verfriemelt-dot-org

I think the proper solution here is to fix the locking issue. This is also reproducible by spamming the FPM process with multiple concurrent

$ SCRIPT_NAME=/status SCRIPT_FILENAME=/status QUERY_STRING=full REQUEST_METHOD=GET cgi-fcgi -bind -connect 127.0.0.1:9000

requests. You'll eventually end up with the error described here:

127.0.0.1 -  21/Aug/2024:08:36:16 +0000 "GET /status" 200
127.0.0.1 -  21/Aug/2024:08:36:16 +0000 "GET /status" 500
ERROR: scoreboard: failed to lock (already locked)
ERROR: status: unable to find or access status shared memory

Also, I think the readiness probe should just use /ping instead of the full /status. The script mentioned here (https://raw.githubusercontent.com/renatomefi/php-fpm-healthcheck/master/php-fpm-healthcheck) should be adapted to use /ping for its ping mode, i.e. when called without arguments, to avoid this altogether. That would be a viable workaround for the time being.
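
A rough reproduction sketch based on the command above, assuming cgi-fcgi (from libfcgi) is installed and FPM listens on 127.0.0.1:9000; the /ping check additionally assumes ping.path = /ping is set in the pool config:

# fire 20 concurrent /status requests; with enough parallel readers some
# return 500 and FPM logs "scoreboard: failed to lock (already locked)"
for i in $(seq 1 20); do
  SCRIPT_NAME=/status SCRIPT_FILENAME=/status QUERY_STRING=full \
    REQUEST_METHOD=GET cgi-fcgi -bind -connect 127.0.0.1:9000 >/dev/null &
done
wait

# a /ping check does not read the scoreboard, so it avoids the lock
SCRIPT_NAME=/ping SCRIPT_FILENAME=/ping REQUEST_METHOD=GET \
  cgi-fcgi -bind -connect 127.0.0.1:9000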

@pimjansen

> I think the proper solution here is to fix the locking issue. [...] The script mentioned here should be adapted to use /ping for its ping mode [...] a viable workaround for the time being.

Seems fair, but wouldn't it be good for the official PHP-FPM Docker images to provide this out of the box? It would reduce these errors and give better control and a better user experience, since it is already embedded.

bukka added a commit to bukka/php-src that referenced this issue Sep 8, 2024
This changes locking for scoreboard to reduce contention between readers
and adds retries for acquiring scoreboard for read.
bukka (Member) commented Sep 8, 2024

So I finally managed to reliably recreate the issue with my new tool called WST, funded by the STF (Sovereign Tech Fund), which allows me to spin up multiple services and test various things in its pipeline, specifically running a load test and checking the logs. The actual test for this can be seen here.

After that I looked properly into the scoreboard locking and found a couple of issues. I have an initial fix in #15805, after which I no longer see the locking error. It might be changing a bit too much, though, so I'm not sure whether I should go for a lighter version for PHP 8.2 and do the rest in master, as it might be a bit risky.
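
A schematic sketch of the "retries for acquiring scoreboard for read" idea from the commit message, not the actual php-src implementation; try_lock() stands in for FPM's non-blocking scoreboard lock primitive:

/* hypothetical illustration of retry-on-read, not php-src code */
#include <stdbool.h>
#include <unistd.h>

#define SCOREBOARD_LOCK_RETRIES 10

extern bool try_lock(void *scoreboard);  /* assumed non-blocking primitive */

int scoreboard_acquire_for_read(void *scoreboard)
{
    for (int i = 0; i < SCOREBOARD_LOCK_RETRIES; i++) {
        if (try_lock(scoreboard)) {
            return 0;       /* lock acquired; caller may read the scoreboard */
        }
        usleep(1000);       /* brief back-off before the next attempt */
    }
    return -1;  /* still locked; caller logs "failed to lock (already locked)" */
}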

@goffyara

The same happens with PHP 8.3.10, using php-fpm-healthcheck on k8s.

bukka added a commit to bukka/php-src that referenced this issue Nov 23, 2024
This changes locking for scoreboard to reduce contention between readers
and adds retries for acquiring scoreboard for read.

Closes phpGH-15805
bukka added a commit to bukka/php-src that referenced this issue Dec 7, 2024
This changes locking for scoreboard to reduce contention between readers
and adds retries for acquiring scoreboard for read.

Closes phpGH-15805
bukka closed this as completed in 3490ac0 on Dec 14, 2024
@pimjansen

@bukka I see this is patched, which is great to start with. Do you know how far back this will be backported?

bukka (Member) commented Dec 14, 2024

@pimjansen It will be part of 8.3.16 and 8.4.3. The 8.2 branch is already closed for normal bug fixes and only accepts security fixes.

@pimjansen

> @pimjansen It will be part of 8.3.16 and 8.4.3. The 8.2 branch is already closed for normal bug fixes and only accepts security fixes.

Fair! Thanks for the quick reply.

jkabat (Author) commented Dec 19, 2024

@bukka thanks for investing so much time in resolving the issue!!!
