Dovecot/upcp multiple deadlocks

On Friday, December 17 a deadlock was introduced into a Dovecot package update that utilizes ApisCP's Bootstrapper build system to template configuration upon install. Systems affected began reporting spurious IMAP/POP3 login rejections with first reports occurring shortly after 12 AM EST (-0500 GMT) during routine platform updates. The issue largely affected CentOS 8/Alma/Rocky Linux machines. CentOS 7 was unaffected in most timezones.

Automatic remediation is blocked by two separate locking mechanisms both deadlocked, one in RPM and another in upcp. A manual intervention is required to resolve.

To determine if you are affected, verify that Dovecot's administration doveadm auth cache flush completes successfully.

OK
12 cache entries flushed

FAILED
doveadm: Fatal: Couldn't load required plugin /usr/lib64/dovecot/lib20_expire_plugin.so: dlopen() failed: /usr/lib64/dovecot/lib20_expire_plugin.so: undefined symbol: t_pop_check

If your output is similar to the FAILED response above, two remediation pathways exist:

  1. Reboot the server. This will terminate both rpm and ansible-playbook processes contending for two locks. Run upcp after reboot to confirm system is regularly receiving updates.
  2. Terminate the nested RPM process spawned within an upcp session to release the lock. Let upcp migrations complete, then run upcp again to fetch new code that corrects this.

    The following code will perform those tasks:
[[ -f /var/lib/rpm/.rpm.lock ]] && kill -9 "$(lsof -t /var/lib/rpm/.rpm.lock)" 2> /dev/null && sleep 15 && upcp

If any of these commands fail, then reach out to help@apiscp.com.

What happened

ApisCP provides 2 update channels, code from GitLab and package through the OS with yum/dnf. These channels operate independently to ensure redundancy if one update channel is interrupted such as a package update that can be remediated with a FLARE code update.

Platform management is built around Ansible that allows configuration to be applied an infinite number of times but only changing what needs to be changed. Keeping an idempotent design reduces cognitive overload by filtering noise: focus on the big picture, not details. If a service changed, report the change rather than the individual steps that went into evaluating the change. A lock is utilized to avoid having multiple processes run concurrently that may change one state only to undo another configuration's state change, a singleton design pattern. Any changes made within the UI places the request in a queue to ensure successive invocations.

Complex situation in which four processes (blue lines) compete for one resource (grey circle), following a right-before-left policy. A deadlock occurs when all processes lock the resource simultaneously (black lines). Courtesy Wikipedia.

The first deadlock arises in upcp, a shell script responsible for code updates, doesn't have the same logic pathway built in. It will wait until the Bootstrapper queue is empty before running. Sometimes these queues take seconds and sometimes minutes to complete. upcp will wait until Ansible completes.

ApisCP relies heavily on templating to ensure package versions work with the platform. For this scenario, Dovecot uses a single template to preserve Bootstrapper settings while reducing the risk of having changes steamrolled in a platform scrub ("upcp -sb" command). An update under internal testing was moved into production on Friday to correct spurious cgroup errors reported in maillog from optional cpuset, freezer support added in 3.2.28. This update also upgraded Dovecot from 2.2 to 2.3 released nearly 2 years ago.

A path less traveled

A second deadlock occurred juggling systems that may have Dovecot 2.2 installed if a migration bundled in a panel update occurred before a routine package update processed. Likewise if cpcmd scope:get system.update-policy is disabled this was an inevitable calamity.

CentOS 7 calls yum hourly in /etc/cron.hourly/0yum-hourly.cron. In CentOS 8/Alma/Rocky this is moved to a timer, dnf-automatic. Clients unaffected had the dovecot package update apply before the panel update thus making the deadlock OS and timezone-dependent. Ordering can be confirmed by looking at both dnf-automatic.timer and upcp.timer from systemctl list-timers:

[root@paixhans cron.d]# systemctl list-timers
NEXT                         LEFT                LAST                         PASSED             UNIT                         ACTIVATES
Mon 2021-12-20 20:56:46 GMT  10min left          Mon 2021-12-20 20:28:38 GMT  17min ago          apnscp-flare-check.timer     apnscp-flare-check.service
Mon 2021-12-20 21:00:00 GMT  14min left          Mon 2021-12-20 20:30:01 GMT  15min ago          apnscp-storage-log.timer     apnscp-storage-log.service
Mon 2021-12-20 21:57:23 GMT  1h 11min left       Mon 2021-12-20 20:41:16 GMT  4min 30s ago       dnf-makecache.timer          dnf-makecache.service
Tue 2021-12-21 00:00:00 GMT  3h 14min left       Mon 2021-12-20 00:00:01 GMT  20h ago            bwcron.timer                 bwcron.service
Tue 2021-12-21 00:00:00 GMT  3h 14min left       Mon 2021-12-20 00:00:01 GMT  20h ago            metricscron.timer            metricscron.service
Tue 2021-12-21 00:00:00 GMT  3h 14min left       Mon 2021-12-20 00:00:01 GMT  20h ago            sa-update.timer              sa-update.service
Tue 2021-12-21 00:00:00 GMT  3h 14min left       Mon 2021-12-20 00:00:01 GMT  20h ago            unbound-anchor.timer         unbound-anchor.service
Tue 2021-12-21 00:09:23 GMT  3h 23min left       Mon 2021-12-20 00:05:39 GMT  20h ago            upcp.timer                   upcp.service
Tue 2021-12-21 03:41:52 GMT  6h left             Mon 2021-12-20 03:41:52 GMT  17h ago            systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Tue 2021-12-21 06:43:18 GMT  9h left             Mon 2021-12-20 06:00:27 GMT  14h ago            dnf-automatic.timer          dnf-automatic.service
Sat 2022-01-01 01:14:05 GMT  1 weeks 4 days left Wed 2021-12-01 01:35:51 GMT  2 weeks 5 days ago platform-scrub.timer         platform-scrub.service
As seen from above there was a ~5 1/2 hour time between upcp running first before dnf. If dnf ran first then no deadlock. If upcp ran first, then the deadlock was encountered.

Dovecot 2.3 introduced a few important directive changes that are reconciled on December 10th with 3.2.30.1. Dovecot 2.3 was released through the package update channel 1 week after 3.2.30.1 that adds an update hook to recompile /etc/dovecot/conf.d/apnscp.conf. This update hook in mail/configure-dovecot also calls yum/dnf package management to ensure packages are up to date.

For the actual lock, a migration released in 3.2.30.2 calls yum to update dovecot that in turn updates the dovecot package if not already updated that calls Bootstrapper again to recompile /etc/dovecot/conf.d/apnscp.conf. POSIX standards disallow lock inheritance, thus the rpm database lock is held by the parent process spawned by line 17. In turn, the %post script invoked yum again as part of mail/configure-dovecot whose .rpm.lock advisory lock file was held by the parent process resulting in a deadlock of that yum task (PID 931107 below) in turn blocking successive upcp calls.

root      929305  0.0  0.0  17072  3484 ?        Ss   Dec17   0:00 /bin/bash /usr/local/apnscp/build/upcp.sh
root      929344  0.0  0.0  16940   304 ?        S    Dec17   0:00  \_ /bin/bash /usr/local/apnscp/build/upcp.sh
root      929345  0.0  0.0   7312   800 ?        S    Dec17   0:00  |   \_ /bin/tee -a /usr/local/apnscp/storage/.upcp.failure
apnscp    929494  0.0  0.1 238580 54384 ?        S    Dec17   0:01  \_ apnscp_php ./artisan migrate --force
root      929738 17.5  0.2 309900 67164 ?        Rl   Dec17 193:17      \_ /usr/bin/python3.6 /bin/ansible-playbook -c local migrator.yml --extra-vars=migrations='2021_12_18_002612_update-yum-hook' --tags=up
root      931105  0.0  0.1 310932 59436 ?        S    Dec17   0:00          \_ /usr/bin/python3.6 /bin/ansible-playbook -c local migrator.yml --extra-vars=migrations='2021_12_18_002612_update-yum-hook' --tags=
root      931106  0.0  0.0  14372  3044 ?        S    Dec17   0:00              \_ /bin/sh -c /usr/bin/python3.6 && sleep 0
root      931107  0.0  0.4 720520 160084 ?       S    Dec17   0:09                  \_ /usr/bin/python3.6
root      931472  0.0  0.0  14372  3184 ?        S    Dec17   0:00                      \_ /bin/sh /var/tmp/rpm-tmp.kBSE68 2
root      931481 17.7  0.2 320560 77284 ?        Rl   Dec17 194:34                          \_ /usr/bin/python3.6 /bin/ansible-playbook bootstrap.yml --tags=mail/configure-dovecot --extra-vars=dovecot_version=
root      932017  0.0  0.2 323864 73008 ?        S    Dec17   0:00                              \_ /usr/bin/python3.6 /bin/ansible-playbook bootstrap.yml --tags=mail/configure-dovecot --extra-vars=dovecot_vers
root      932020  0.0  0.0  14372  2992 ?        S    Dec17   0:00                                  \_ /bin/sh -c /usr/bin/python3.6 && sleep 0
root      932021  0.0  0.3 466804 128360 ?       S    Dec17   0:17                                      \_ /usr/bin/python3.6
COMMAND             PID  TYPE SIZE MODE  M      START        END PATH
python3.6        931107 POSIX 1.5M READ  0 1073741826 1073742335 /var/lib/dnf/history.sqlite
python3.6        931107 POSIX  32K READ  0        128        128 /var/lib/dnf/history.sqlite-shm
python3.6        931107 POSIX   0B WRITE 0          0          0 /var/lib/rpm/.rpm.lock
python3.6        931107 FLOCK 344K READ  0          0          0 /var/lib/rpm/__db.001
/var/lib/dnf/history.sqlite
python3.6        932021 POSIX  32K READ  0        128        128 /var/lib/dnf/history.sqlite-shm
dnf-automatic   1001216 POSIX 1.5M READ  0 1073741826 1073742335 /var/lib/dnf/history.sqlite
dnf-automatic   1001216 POSIX  32K READ  0        128        128 /var/lib/dnf/history.sqlite-shm

With both update channels blocked there are no other means to reasonably ameliorate the situation.

Rough model of the deadlock. Locks are freed once the holding process exits.

Changes going forward

RPM updates now solicit a flag to indicate how it was invoked if through yum/dnf update channels. A new RPM macro is present in 3.2.30.2 that will be bundled with future ApisCP packages to call the proper roles while passing yum_transaction_hook. Any third-party playbooks that utilize yum are advised to the following when condition as in the following sample:

- name: "{{ mail_enabled | ternary('Install','Remove') }} Dovecot RPMs"
  yum:
    name:
      - dovecot
      - dovecot-pigeonhole
      - dovecot*-apnscp
    state: '{{ mail_enabled | ternary("installed","absent") }}'
  when: yum_transaction_hook | default('') == ""

yum_transaction_hook must be set to update whenever a package is updated. This is accomplished with the RPM macro %run_apnscp_update which effectively bypasses nested rpm database access.

Second, FLARE updates now include a second non-null space-delimited parameter to clear the Bootstrapper lock file. This would allow an amended release to run if an existing ansible-playbook process is running.


Failure always creates learning opportunities and learning yields growth. Going forward, I believe these changes introduce the most compatible means of avoiding such a situation going forward.

As always if you have any other questions feel free to reach out to help@apiscp.com or join us on our community Discord.

Keeping in touch

This issue was initially reported within the community Discord approximately 12 AM EST (-0500 GMT) on Friday, December 18th. Likewise once a resolution was determined, a tweet sent out from @apisnetworks later that evening.

This blog post was delayed 24 hours to December 20th as two additional concerns were analyzed (login case changes in Dovecot's ABI + explicitly naming login_trusted_networks). An email was sent out later that same day.

👉 @apisnetworks is the best secondary line of communication for issues with Discord being primary. 👈