Puppet Recovery Playbook: Fix Code Deploy Failures, File Sync Issues, and PCP Connection Errors
After a UID/GID mismatch and misconfigured file permissions disrupted multiple Puppet services, including pe-puppetserver and file-sync on a replica node, we performed a full recovery. This post documents each issue, the exact error, and the steps we took to fix it.
Issues & Errors Encountered
1. Puppet Server Restart Failure
Error:
Execution error (FileAlreadyExistsException): /opt/puppetlabs/server/data/analytics/analytics
Fix:
mv /opt/puppetlabs/server/data/analytics/analytics /opt/puppetlabs/server/data/analytics/analytics.bak_wrong
mkdir -p /opt/puppetlabs/server/data/analytics/analytics
chown pe-puppet:pe-puppet /opt/puppetlabs/server/data/analytics/analytics
chmod 750 /opt/puppetlabs/server/data/analytics/analytics
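The fix above follows a move-aside-and-recreate pattern. A minimal sketch of it as a reusable function is shown here; `recreate_dir` is a hypothetical helper name, not part of Puppet, and the timestamped backup suffix is an assumption chosen so that repeated runs do not overwrite an earlier backup.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Puppet command): move whatever occupies
# a path aside, then recreate the path as an empty directory.
recreate_dir() {
  local path="$1" mode="$2" owner="${3:-}"

  # Keep the old entry under a timestamped name instead of deleting it.
  if [ -e "$path" ] || [ -L "$path" ]; then
    mv -- "$path" "${path}.bak_$(date +%Y%m%d%H%M%S)" || return 1
  fi

  mkdir -p -- "$path" || return 1
  chmod "$mode" -- "$path" || return 1

  # Changing ownership requires root; skipped when no owner is passed.
  if [ -n "$owner" ]; then
    chown "$owner" -- "$path" || return 1
  fi
}
```

For the case above, this would be called as root with `recreate_dir /opt/puppetlabs/server/data/analytics/analytics 750 pe-puppet:pe-puppet`.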
2. puppet code deploy Failed
Error:
invalid or unknown remote ssh hostkey
Fix:
sudo -u pe-puppet ssh-keyscan -p 443 ssh.github.com >> /var/opt/lib/pe-puppet/.ssh/known_hosts
chmod 600 /var/opt/lib/pe-puppet/.ssh/known_hosts
chown pe-puppet:pe-puppet /var/opt/lib/pe-puppet/.ssh/known_hosts
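Because `>>` appends unconditionally, rerunning the fix adds duplicate entries to known_hosts. A small pre-check can avoid that. In this sketch, `has_host_entry` is a hypothetical helper, and it assumes unhashed entries in the `[host]:port` form that ssh-keyscan writes for non-default ports. `ssh-keygen -F` performs the same lookup and also handles hashed entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a known_hosts file already has an
# unhashed entry for host:port. Port 22 entries are stored as a bare
# host name; other ports are stored as "[host]:port".
has_host_entry() {
  local file="$1" host="$2" port="$3" marker
  if [ "$port" = "22" ]; then
    marker="$host"
  else
    marker="[$host]:$port"
  fi
  [ -f "$file" ] || return 1
  # The first field may hold several comma-separated host patterns.
  awk -v m="$marker" '
    { n = split($1, hosts, ","); for (i = 1; i <= n; i++) if (hosts[i] == m) found = 1 }
    END { exit !found }
  ' "$file"
}
```

Guarding the ssh-keyscan line with `has_host_entry /var/opt/lib/pe-puppet/.ssh/known_hosts ssh.github.com 443 ||` makes the step safe to repeat.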
3. UID/GID Mismatch
Observation:
id -a pe-puppet # Showed incorrect UID/GID
Fix:
find / -xdev -uid <wrong_uid> -exec chown -h pe-puppet:pe-puppet {} +
find / -xdev -gid <wrong_gid> -exec chown -h :pe-puppet {} +
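Before running a filesystem-wide chown, it can help to preview what would change. The sketch below wraps the same find predicates in a list-only function. `list_by_id` is a hypothetical name, and nothing in it modifies files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list (do not modify) entries under a root
# directory whose numeric owner or group matches the given ID.
#   kind: "uid" or "gid"
list_by_id() {
  local root="$1" kind="$2" num="$3"
  case "$kind" in
    uid|gid) ;;
    *) echo "kind must be uid or gid" >&2; return 2 ;;
  esac
  # -xdev keeps find on a single filesystem, as in the fix above.
  find "$root" -xdev "-$kind" "$num" -print
}
```

Redirecting the output of `list_by_id / uid <wrong_uid>` to a file before the chown leaves a record of what was touched, which is useful if the change needs to be audited later.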
4. pe-puppetserver Failed to Start
Error:
Unable to create/set permissions for rundir: /run/puppetlabs/puppetserver
Fix:
chown -R pe-puppet:pe-puppet /run/puppetlabs/puppetserver
chmod 755 /run/puppetlabs/puppetserver
5. Restart Counter File Error
Error:
restartcounter is not readable/writable
Fix:
chown pe-puppet:pe-puppet /opt/puppetlabs/server/data/puppetserver/restartcounter
6. Serial File .tmp Permission Errors
Error:
Execution error (FileSystemException): /etc/puppetlabs/puppetserver/ca/infra_serials<rand>tmp: Operation not permitted
Fix:
chown pe-puppet:pe-puppet /etc/puppetlabs/puppetserver/ca/*.tmp
Validation:
id -a pe-puppet # Verified GID was correct, not inherited from pe-host-action-collector
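The id check can be scripted so the result is a pass or fail rather than something to read by eye. This is a sketch only: `check_ids` is a hypothetical helper, and the expected UID and GID are assumed to be known from elsewhere, for example from the primary node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a user's numeric UID and primary GID
# against expected values. Returns 0 on match, 1 on mismatch,
# 2 if the user cannot be looked up.
check_ids() {
  local user="$1" want_uid="$2" want_gid="$3" got_uid got_gid
  got_uid=$(id -u "$user" 2>/dev/null) || return 2
  got_gid=$(id -g "$user" 2>/dev/null) || return 2
  if [ "$got_uid" = "$want_uid" ] && [ "$got_gid" = "$want_gid" ]; then
    echo "OK: $user uid=$got_uid gid=$got_gid"
  else
    echo "MISMATCH: $user uid=$got_uid gid=$got_gid (expected $want_uid/$want_gid)"
    return 1
  fi
}
```

Running it with the same expected values on both primary and replica shows quickly whether the two nodes agree.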
7. Token Expired During Replica Provisioning
Error:
The provided token has expired
Fix:
puppet access login
8. Replica Not Connected to PCP
Error:
The node <replica-server> is not connected to the primary via PCP
Fix:
- Added entries in /etc/hosts on both nodes
- Started the pxp-agent service:
systemctl start pxp-agent
- Initiated CSR from replica:
puppet agent -t
- Signed it on the primary:
puppetserver ca sign --certname <replica-server>
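If the replica still reports as disconnected after these steps, a quick reachability test can help separate network problems from certificate problems. The sketch below assumes the PCP broker listens on the primary on TCP port 8142 (the PE default) and relies on bash's /dev/tcp redirection plus the coreutils `timeout` command. `port_open` and the host placeholder are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within 3 seconds. Relies on bash's /dev/tcp pseudo-device.
port_open() {
  local host="$1" port="$2"
  # Run in a child bash so the opened descriptor closes when it exits.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (run on the replica; replace the placeholder host name):
#   port_open <primary-server> 8142 && echo "PCP port reachable"
```

A failure here points at name resolution, routing, or a firewall rather than at the certificate, so it is worth checking before re-issuing a CSR.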
9. File Sync Not Running on Replica
Error:
Couldn't connect to server (https://puppet-master:8140/file-sync/v1/latest-commits): (Connection refused)
Fix:
- Ensured correct UID/GID:
id -a pe-puppet
- Corrected ownership on critical files/directories:
find / -xdev -uid <wrong_uid> -exec chown -h pe-puppet:pe-puppet {} +
Final Validation
puppet infrastructure status
Expected Output:
All services are fully operational on both primary and replica
Summary
- Ownership and permissions were the root cause of cascading service failures.
- Fixing UID/GID mismatches, cleaning up .tmp files, and ensuring valid certificates resolved all issues.
- Documenting each error and its corresponding fix was essential for recovery and future-proofing.