Puppet Recovery Playbook: Fix Code Deploy Failures, File Sync Issues, and PCP Connection Errors

A UID/GID mismatch and misconfigured file permissions disrupted several Puppet services, including pe-puppetserver on the primary and file-sync on a replica node. This post documents each issue we hit during recovery, the exact error, and the steps that fixed it.


Issues & Errors Encountered

1. Puppet Server Restart Failure

Error:

Execution error (FileAlreadyExistsException): /opt/puppetlabs/server/data/analytics/analytics

Fix:

mv /opt/puppetlabs/server/data/analytics/analytics /opt/puppetlabs/server/data/analytics/analytics.bak_wrong
mkdir -p /opt/puppetlabs/server/data/analytics/analytics
chown pe-puppet:pe-puppet /opt/puppetlabs/server/data/analytics/analytics
chmod 750 /opt/puppetlabs/server/data/analytics/analytics
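
The same fix can be wrapped so it is safe to re-run. This is a minimal sketch, assuming the exception was raised because something other than a directory occupied the path. The helper name ensure_dir is hypothetical, and the .bak_wrong suffix simply mirrors the manual step above.

```shell
# Hypothetical helper: make sure "$1" is a directory, moving aside anything
# else (regular file, dangling symlink) that currently occupies the path.
ensure_dir() {
    target=$1
    if { [ -e "$target" ] || [ -L "$target" ]; } && [ ! -d "$target" ]; then
        mv -- "$target" "$target.bak_wrong"
    fi
    mkdir -p -- "$target"
}

# Usage (as root), followed by the same ownership and mode as above:
#   d=/opt/puppetlabs/server/data/analytics/analytics
#   ensure_dir "$d" && chown pe-puppet:pe-puppet "$d" && chmod 750 "$d"
```

Because the helper leaves an existing directory alone, running it a second time changes nothing.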

2. puppet code deploy Failed

Error:

invalid or unknown remote ssh hostkey

Fix:

sudo -u pe-puppet ssh-keyscan -p 443 ssh.github.com >> /var/opt/lib/pe-puppet/.ssh/known_hosts
chmod 600 /var/opt/lib/pe-puppet/.ssh/known_hosts
chown pe-puppet:pe-puppet /var/opt/lib/pe-puppet/.ssh/known_hosts
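
Re-running the keyscan appends duplicate entries to known_hosts each time. The sketch below is one way to avoid that. It assumes a POSIX shell with grep, and the helper name append_unique is hypothetical.

```shell
# Hypothetical helper: append lines from stdin to file "$1", skipping blank
# lines, comment lines, and any line already present verbatim.
append_unique() {
    file=$1
    touch -- "$file"
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;
        esac
        grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
    done
}

# Usage (as root; host, port, and path are the ones from the fix above):
#   ssh-keyscan -p 443 ssh.github.com 2>/dev/null |
#       append_unique /var/opt/lib/pe-puppet/.ssh/known_hosts
```

The chmod and chown steps above still apply afterwards, since a file created by root will not be owned by pe-puppet.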

3. UID/GID Mismatch

Observation:

id -a pe-puppet  # Showed incorrect UID/GID

Fix:

find / -xdev -uid <wrong_uid> -exec chown pe-puppet:pe-puppet {} +
find / -xdev -gid <wrong_gid> -exec chown :pe-puppet {} +
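
Before changing ownership in bulk, it can help to list what would be touched. The sketch below is a read-only preview, and the helper name list_by_uid is hypothetical. Two things to keep in mind: -xdev keeps find on a single filesystem, so separately mounted paths (for example /opt or /var, if they are their own mounts) need their own pass. And chown follows symlinks unless given -h, so symlinks in the list deserve a second look.

```shell
# Hypothetical helper: print every path under "$1" owned by numeric UID "$2",
# staying on one filesystem and changing nothing.
list_by_uid() {
    find "$1" -xdev -uid "$2" -print
}

# Usage (substitute the stale numeric UID for the placeholder):
#   list_by_uid / WRONG_UID > /root/wrong_uid_paths.txt
#   wc -l /root/wrong_uid_paths.txt
```

Keeping the saved list also gives a record of what was changed if something needs to be reverted later.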

4. pe-puppetserver Failed to Start

Error:

Unable to create/set permissions for rundir: /run/puppetlabs/puppetserver

Fix:

chown -R pe-puppet:pe-puppet /run/puppetlabs/puppetserver
chmod 755 /run/puppetlabs/puppetserver

5. Restart Counter File Error

Error:

restartcounter is not readable/writable

Fix:

chown pe-puppet:pe-puppet /opt/puppetlabs/server/data/puppetserver/restartcounter

6. Serial File .tmp Permission Errors

Error:

Execution error (FileSystemException): /etc/puppetlabs/puppetserver/ca/infra_serials<rand>tmp: Operation not permitted

Fix:

chown pe-puppet:pe-puppet /etc/puppetlabs/puppetserver/ca/*.tmp

Validation:

id -a pe-puppet  # Verified GID was correct, not inherited from pe-host-action-collector
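
The ownership fixes in issues 4 through 6 can be spot-checked with a small helper. This is a sketch that assumes GNU stat (coreutils). The name check_owner is hypothetical.

```shell
# Hypothetical helper: succeed only if "$1" is owned by user:group "$2".
# Assumes GNU stat; returns 2 if the path cannot be examined at all.
check_owner() {
    actual=$(stat -c '%U:%G' -- "$1") || return 2
    if [ "$actual" != "$2" ]; then
        printf 'MISMATCH %s: %s (expected %s)\n' "$1" "$actual" "$2" >&2
        return 1
    fi
}

# Usage, with paths taken from the fixes above:
#   for p in /run/puppetlabs/puppetserver \
#            /opt/puppetlabs/server/data/puppetserver/restartcounter; do
#       check_owner "$p" pe-puppet:pe-puppet
#   done
```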

7. Token Expired During Replica Provisioning

Error:

The provided token has expired

Fix:

puppet access login

8. Replica Not Connected to PCP

Error:

The node <replica-server> is not connected to the primary via PCP

Fix:

  • Added entries in /etc/hosts on both nodes
  • Started the pxp-agent service:
      systemctl start pxp-agent
  • Initiated CSR from replica:
      puppet agent -t
  • Signed it on the primary:
      puppetserver ca sign --certname <replica-server>
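
To confirm the /etc/hosts step took effect on each node, a quick check like the following can be used. It is only a sketch: the helper name hosts_has is hypothetical, and it inspects the hosts file directly rather than going through the system resolver.

```shell
# Hypothetical helper: succeed if hostname "$1" appears as a name or alias
# on a non-comment line of hosts file "$2" (default /etc/hosts).
hosts_has() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        { sub(/#.*/, "") }                                    # strip comments
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }  # skip the IP field
        END { exit !found }
    ' "$file"
}

# Usage (run on both nodes, once per peer hostname):
#   hosts_has "<replica-server>" || echo "missing hosts entry" >&2
```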

9. File Sync Not Running on Replica

Error:

Couldn't connect to server (https://puppet-master:8140/file-sync/v1/latest-commits): (Connection refused)

Fix:

  • Ensured correct UID/GID:
      id -a pe-puppet
  • Corrected ownership on critical files/directories:
      find / -uid <wrong_uid> -exec chown pe-puppet:pe-puppet {} +

Final Validation

puppet infrastructure status

Expected Output:

All services are fully operational on both primary and replica


Summary

  • Incorrect ownership and permissions were the root cause of most of the cascading service failures.
  • Correcting the UID/GID mismatch, fixing ownership of the CA .tmp files, renewing the expired token, and signing the replica's certificate resolved the issues.
  • Documenting each error and its corresponding fix was essential for recovery and future-proofing.
