Dovecot/upcp multiple deadlocks
On Friday, December 17 a deadlock was introduced into a Dovecot package update that utilizes ApisCP's Bootstrapper build system to template configuration upon install. Systems affected began reporting spurious IMAP/POP3 login rejections with first reports occurring shortly after 12 AM EST (-0500 GMT) during routine platform updates. The issue largely affected CentOS 8/Alma/Rocky Linux machines. CentOS 7 was unaffected in most timezones.
Automatic remediation is blocked by two separate locking mechanisms both deadlocked, one in RPM and another in upcp. A manual intervention is required to resolve.
To determine if you are affected, verify that Dovecot's administration doveadm auth cache flush
completes successfully.
OK
12 cache entries flushed
FAILED
doveadm: Fatal: Couldn't load required plugin /usr/lib64/dovecot/lib20_expire_plugin.so: dlopen() failed: /usr/lib64/dovecot/lib20_expire_plugin.so: undefined symbol: t_pop_check
If your output is similar to the FAILED response above, two remediation pathways exist:
- Reboot the server. This will terminate both
rpm
andansible-playbook
processes contending for two locks. Runupcp
after reboot to confirm system is regularly receiving updates. - Terminate the nested RPM process spawned within an
upcp
session to release the lock. Letupcp
migrations complete, then runupcp
again to fetch new code that corrects this.
The following code will perform those tasks:
[[ -f /var/lib/rpm/.rpm.lock ]] && kill -9 "$(lsof -t /var/lib/rpm/.rpm.lock)" 2> /dev/null && sleep 15 && upcp
If any of these commands fail, then reach out to help@apiscp.com.
What happened
ApisCP provides 2 update channels, code from GitLab and package through the OS with yum/dnf. These channels operate independently to ensure redundancy if one update channel is interrupted such as a package update that can be remediated with a FLARE code update.
Platform management is built around Ansible that allows configuration to be applied an infinite number of times but only changing what needs to be changed. Keeping an idempotent design reduces cognitive overload by filtering noise: focus on the big picture, not details. If a service changed, report the change rather than the individual steps that went into evaluating the change. A lock is utilized to avoid having multiple processes run concurrently that may change one state only to undo another configuration's state change, a singleton design pattern. Any changes made within the UI places the request in a queue to ensure successive invocations.
The first deadlock arises in upcp
, a shell script responsible for code updates, doesn't have the same logic pathway built in. It will wait until the Bootstrapper queue is empty before running. Sometimes these queues take seconds and sometimes minutes to complete. upcp will wait until Ansible completes.
ApisCP relies heavily on templating to ensure package versions work with the platform. For this scenario, Dovecot uses a single template to preserve Bootstrapper settings while reducing the risk of having changes steamrolled in a platform scrub ("upcp -sb" command). An update under internal testing was moved into production on Friday to correct spurious cgroup errors reported in maillog from optional cpuset, freezer support added in 3.2.28. This update also upgraded Dovecot from 2.2 to 2.3 released nearly 2 years ago.
A second deadlock occurred juggling systems that may have Dovecot 2.2 installed if a migration bundled in a panel update occurred before a routine package update processed. Likewise if cpcmd scope:get system.update-policy
is disabled this was an inevitable calamity.
CentOS 7 calls yum hourly in /etc/cron.hourly/0yum-hourly.cron
. In CentOS 8/Alma/Rocky this is moved to a timer, dnf-automatic
. Clients unaffected had the dovecot package update apply before the panel update thus making the deadlock OS and timezone-dependent. Ordering can be confirmed by looking at both dnf-automatic.timer
and upcp.timer
from systemctl list-timers
:
[root@paixhans cron.d]# systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
Mon 2021-12-20 20:56:46 GMT 10min left Mon 2021-12-20 20:28:38 GMT 17min ago apnscp-flare-check.timer apnscp-flare-check.service
Mon 2021-12-20 21:00:00 GMT 14min left Mon 2021-12-20 20:30:01 GMT 15min ago apnscp-storage-log.timer apnscp-storage-log.service
Mon 2021-12-20 21:57:23 GMT 1h 11min left Mon 2021-12-20 20:41:16 GMT 4min 30s ago dnf-makecache.timer dnf-makecache.service
Tue 2021-12-21 00:00:00 GMT 3h 14min left Mon 2021-12-20 00:00:01 GMT 20h ago bwcron.timer bwcron.service
Tue 2021-12-21 00:00:00 GMT 3h 14min left Mon 2021-12-20 00:00:01 GMT 20h ago metricscron.timer metricscron.service
Tue 2021-12-21 00:00:00 GMT 3h 14min left Mon 2021-12-20 00:00:01 GMT 20h ago sa-update.timer sa-update.service
Tue 2021-12-21 00:00:00 GMT 3h 14min left Mon 2021-12-20 00:00:01 GMT 20h ago unbound-anchor.timer unbound-anchor.service
Tue 2021-12-21 00:09:23 GMT 3h 23min left Mon 2021-12-20 00:05:39 GMT 20h ago upcp.timer upcp.service
Tue 2021-12-21 03:41:52 GMT 6h left Mon 2021-12-20 03:41:52 GMT 17h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Tue 2021-12-21 06:43:18 GMT 9h left Mon 2021-12-20 06:00:27 GMT 14h ago dnf-automatic.timer dnf-automatic.service
Sat 2022-01-01 01:14:05 GMT 1 weeks 4 days left Wed 2021-12-01 01:35:51 GMT 2 weeks 5 days ago platform-scrub.timer platform-scrub.service
As seen from above there was a ~5 1/2 hour time between upcp running first before dnf. If dnf ran first then no deadlock. If upcp ran first, then the deadlock was encountered.
Dovecot 2.3 introduced a few important directive changes that are reconciled on December 10th with 3.2.30.1. Dovecot 2.3 was released through the package update channel 1 week after 3.2.30.1 that adds an update hook to recompile /etc/dovecot/conf.d/apnscp.conf
. This update hook in mail/configure-dovecot also calls yum/dnf package management to ensure packages are up to date.
For the actual lock, a migration released in 3.2.30.2 calls yum to update dovecot that in turn updates the dovecot package if not already updated that calls Bootstrapper again to recompile /etc/dovecot/conf.d/apnscp.conf
. POSIX standards disallow lock inheritance, thus the rpm database lock is held by the parent process spawned by line 17. In turn, the %post script invoked yum again as part of mail/configure-dovecot whose .rpm.lock advisory lock file was held by the parent process resulting in a deadlock of that yum task (PID 931107 below) in turn blocking successive upcp calls.
root 929305 0.0 0.0 17072 3484 ? Ss Dec17 0:00 /bin/bash /usr/local/apnscp/build/upcp.sh
root 929344 0.0 0.0 16940 304 ? S Dec17 0:00 \_ /bin/bash /usr/local/apnscp/build/upcp.sh
root 929345 0.0 0.0 7312 800 ? S Dec17 0:00 | \_ /bin/tee -a /usr/local/apnscp/storage/.upcp.failure
apnscp 929494 0.0 0.1 238580 54384 ? S Dec17 0:01 \_ apnscp_php ./artisan migrate --force
root 929738 17.5 0.2 309900 67164 ? Rl Dec17 193:17 \_ /usr/bin/python3.6 /bin/ansible-playbook -c local migrator.yml --extra-vars=migrations='2021_12_18_002612_update-yum-hook' --tags=up
root 931105 0.0 0.1 310932 59436 ? S Dec17 0:00 \_ /usr/bin/python3.6 /bin/ansible-playbook -c local migrator.yml --extra-vars=migrations='2021_12_18_002612_update-yum-hook' --tags=
root 931106 0.0 0.0 14372 3044 ? S Dec17 0:00 \_ /bin/sh -c /usr/bin/python3.6 && sleep 0
root 931107 0.0 0.4 720520 160084 ? S Dec17 0:09 \_ /usr/bin/python3.6
root 931472 0.0 0.0 14372 3184 ? S Dec17 0:00 \_ /bin/sh /var/tmp/rpm-tmp.kBSE68 2
root 931481 17.7 0.2 320560 77284 ? Rl Dec17 194:34 \_ /usr/bin/python3.6 /bin/ansible-playbook bootstrap.yml --tags=mail/configure-dovecot --extra-vars=dovecot_version=
root 932017 0.0 0.2 323864 73008 ? S Dec17 0:00 \_ /usr/bin/python3.6 /bin/ansible-playbook bootstrap.yml --tags=mail/configure-dovecot --extra-vars=dovecot_vers
root 932020 0.0 0.0 14372 2992 ? S Dec17 0:00 \_ /bin/sh -c /usr/bin/python3.6 && sleep 0
root 932021 0.0 0.3 466804 128360 ? S Dec17 0:17 \_ /usr/bin/python3.6
COMMAND PID TYPE SIZE MODE M START END PATH
python3.6 931107 POSIX 1.5M READ 0 1073741826 1073742335 /var/lib/dnf/history.sqlite
python3.6 931107 POSIX 32K READ 0 128 128 /var/lib/dnf/history.sqlite-shm
python3.6 931107 POSIX 0B WRITE 0 0 0 /var/lib/rpm/.rpm.lock
python3.6 931107 FLOCK 344K READ 0 0 0 /var/lib/rpm/__db.001
/var/lib/dnf/history.sqlite
python3.6 932021 POSIX 32K READ 0 128 128 /var/lib/dnf/history.sqlite-shm
dnf-automatic 1001216 POSIX 1.5M READ 0 1073741826 1073742335 /var/lib/dnf/history.sqlite
dnf-automatic 1001216 POSIX 32K READ 0 128 128 /var/lib/dnf/history.sqlite-shm
With both update channels blocked there are no other means to reasonably ameliorate the situation.
Changes going forward
RPM updates now solicit a flag to indicate how it was invoked if through yum/dnf update channels. A new RPM macro is present in 3.2.30.2 that will be bundled with future ApisCP packages to call the proper roles while passing yum_transaction_hook
. Any third-party playbooks that utilize yum
are advised to the following when
condition as in the following sample:
- name: "{{ mail_enabled | ternary('Install','Remove') }} Dovecot RPMs"
yum:
name:
- dovecot
- dovecot-pigeonhole
- dovecot*-apnscp
state: '{{ mail_enabled | ternary("installed","absent") }}'
when: yum_transaction_hook | default('') == ""
yum_transaction_hook
must be set to update
whenever a package is updated. This is accomplished with the RPM macro %run_apnscp_update
which effectively bypasses nested rpm database access.
Second, FLARE updates now include a second non-null space-delimited parameter to clear the Bootstrapper lock file. This would allow an amended release to run if an existing ansible-playbook process is running.
Failure always creates learning opportunities and learning yields growth. Going forward, I believe these changes introduce the most compatible means of avoiding such a situation going forward.
As always if you have any other questions feel free to reach out to help@apiscp.com or join us on our community Discord.
Keeping in touch
This issue was initially reported within the community Discord approximately 12 AM EST (-0500 GMT) on Friday, December 18th. Likewise once a resolution was determined, a tweet sent out from @apisnetworks later that evening.
This blog post was delayed 24 hours to December 20th as two additional concerns were analyzed (login case changes in Dovecot's ABI + explicitly naming login_trusted_networks). An email was sent out later that same day.
👉 @apisnetworks is the best secondary line of communication for issues with Discord being primary. 👈