Automation in Regulated & Air-Gapped Firewall Environments

Why Automation Matters in Regulated & Air-Gapped Firewall Environments

It’s 2:00 AM on a Saturday. You’re in a secure, windowless data center or a remote utility substation, and the air conditioning is humming just a little too loud. Your team has a four-hour maintenance window to apply a critical PAN-OS hotfix to a fleet of Palo Alto Networks firewall high-availability (HA) pairs.

Because this is a regulated, air-gapped environment, there’s no Internet access. The process is a “sneaker-net” nightmare: you’ve spent days getting the images vetted, transferred via a secure jump host, and staged on an internal file share. Now, you’re manually logging into each device, one by one, carefully following a 12-step text-file runbook to suspend the passive node, upload the image, install, reboot, and pray. You then suspend the active node, trigger the failover, and repeat the process, all while staring at a clock, knowing every manual click is a chance for a typo that could cause a production outage and a long night.

This scenario is the all-too-common reality for network and security teams in critical infrastructure, finance, healthcare, and government.

The central problem is a painful paradox:

  • Regulatory mandates (like NERC CIP, PCI DSS, and HIPAA) demand strict network isolation and air-gaps to protect critical assets.  
  • These same regulations also demand timely, consistent, and documented patch and change management to protect against vulnerabilities.  

These two requirements are in direct operational conflict. The manual, high-friction process of updating air-gapped devices makes timely patching so difficult that it actively discourages the very compliance it’s meant to achieve.

This is where automation becomes the bridge between regulatory constraints and operational reality. In this series, we will explore how to use Ansible to safely update and upgrade Palo Alto Networks firewalls in offline environments. But first, we must establish why automation isn’t just “nice to have.” It is the only practical way to remain secure, compliant, and sane.

Why Are Environments Air-Gapped in the First Place?

These “offline-by-design” networks don’t exist by accident. They are a deliberate, fundamental security control mandated by risk assessments and regulatory frameworks.

  • NERC CIP (North American Electric Reliability Corporation Critical Infrastructure Protection): For the energy sector, the primary goal is protecting the Bulk Electric System (BES). Standards like CIP-007 (Systems Security Management) and CIP-010 (Configuration Change Management) enforce strict controls on how “BES Cyber Assets” are managed, patched, and changed, often requiring them to be isolated from the Internet to prevent a remote attack on the power grid.
  • PCI DSS (Payment Card Industry Data Security Standard): In finance, the goal is to protect the Cardholder Data Environment (CDE). The easiest way to reduce the scope of a PCI audit is to use firewalls to create a heavily segmented, “air-gapped” CDE with no (or strictly controlled) connectivity to the rest of the corporate network or the Internet.  
  • HIPAA (Health Insurance Portability and Accountability Act): The HIPAA Security Rule requires “technical safeguards” to protect the confidentiality, integrity, and availability of electronic protected health information (ePHI). Isolating critical systems holding ePHI is a common and effective architectural safeguard.  
  • ISO 27001 / Annex A: This is a formal risk management framework. For many organizations, a risk assessment (A.12.6, A.14) identifies that the risk of Internet-based threats to a critical system is too high, and the appropriate mitigating control is network isolation.  

Field Notes: A Quick Glossary
  • CDE (Cardholder Data Environment): Per PCI DSS, this is the part of the network that stores, processes, or transmits cardholder data or sensitive authentication data. These are the “crown jewels” that must be segmented and protected.
  • BES Cyber Asset (BCA): Per NERC CIP, this is a “Cyber Asset that if rendered unavailable, degraded, or misused would, within 15 minutes of its required operation, misoperation, or non-operation, adversely impact the reliable operation of the Bulk Electric System (BES).”
  • HA Pre-emption: A Palo Alto Networks setting that determines if a firewall in a High-Availability pair should automatically reclaim its “active” role (e.g., after a reboot) if it has a higher priority than the other node. Disabling it is critical during upgrades to prevent the firewall you just upgraded from immediately failing back over before you’re ready.  

The critical paradox, however, is that these regulations also mandate timely patching. “Offline” does not mean “patch whenever.”

  • PCI DSS 6.2 is explicit: critical security patches must be installed within one month (30 days) of release.  
  • NERC CIP-007 R2 requires a documented patch management process to identify and assess patches within 35 calendar days of release.  
  • HIPAA 45 C.F.R. § 164.308(a)(1) requires a formal “Risk Management” process. Knowingly leaving a firewall unpatched with a critical, known vulnerability—even if it’s “offline”—is a failure of risk management, as it ignores insider threats, lateral movement, or misconfiguration.  
  • ISO 27001 A.12.6.1 demands a process to manage technical vulnerabilities “in a timely manner” and take “appropriate measures”.  

This is the central bind: regulations demand isolation, which makes patching hard, but they also demand timely patching. Your manual process is the weak link that breaks under this pressure.

The Pain of Manual Updates in Air-Gapped Palo Alto Environments

Let’s detail the “before automation” picture. The process is a mix of documented procedures, tribal knowledge, and sheer hope.

A “simple” PAN-OS upgrade on a standalone firewall involves a tedious sequence of manual steps. For a High-Availability (HA) pair, a procedure field engineers often call a “PITA” (Pain In The A**), the complexity explodes.

Task        | Manual Standalone Upgrade                       | Manual HA Pair Upgrade
------------|-------------------------------------------------|------------------------------------------------------
Preparation | 1. Download image in DMZ.                       | 1. Disable pre-emption & commit.
            | 2. Vet and hash-check image.                    | 2. Download/sync image to both nodes.
            | 3. Securely transfer image to air-gapped zone.  | 3. Securely transfer image to air-gapped zone.
Execution   | 4. Log into firewall GUI.                       | 4. Log into passive node.
            | 5. Manually upload image.                       | 5. Suspend the passive node.
            | 6. Manually click “Install”.                    | 6. Install image on (suspended) passive node.
            | 7. Manually reboot device.                      | 7. Reboot passive node. Wait. Verify.
            |                                                 | 8. Log into active node.
            |                                                 | 9. Suspend the active node (triggers failover).
            |                                                 | 10. Install image on new passive node.
            |                                                 | 11. Reboot new passive node. Wait. Verify.
Post-Checks | 8. Log back in, verify version and health.      | 12. Re-enable pre-emption & commit. Verify HA state.


This 12-step dance is performed manually, per-pair, often in the middle of the night. Even a “supported” offline method for content updates, which involves setting up an Internet-facing Panorama to push updates to an SCP server that your air-gapped Panorama then pulls from, is a brittle, complex, multi-system dependency.   

The risks of this manual approach are obvious and severe:

  • Human Error: A typo, a mis-click, upgrading the active node first, or forgetting to suspend a device can lead to a full-blown, network-down outage.
  • Inconsistency: Engineer A follows the 12-step plan. Engineer B, in a hurry, follows 10. This “configuration drift” creates snowflake devices that are impossible to manage or audit at scale.
  • Missed Patches: When the process is this painful, patching gets delayed. Devices are forgotten. The spreadsheet (the “source of truth”) gets out of date. This is how you fail to meet your 30-day window.
  • Failed Audits: The evidence you provide to an auditor is a scattered mess of change tickets, screenshots, and spreadsheet cells. It’s not a credible, immutable record.

Audit Reality: An auditor (for NERC CIP, PCI, or ISO 27001) will not ask, “Did you patch your firewalls?” They will select a sample of assets and say, “Show me the evidence that firewall PA-SUBSTATION-01a was patched for CVE-2024-1234, which was released on March 1st. Show me the documented change, the authorization for that change, the proof of execution, and the post-change verification, all mapping to your 35-day policy.” A spreadsheet with a “Done” cell and a screenshot of a login prompt will not satisfy this request. Manual processes struggle to provide this “golden thread” of evidence.

Signs You Need Automation in Your Air-Gapped Environment

  • You use a spreadsheet as your primary source of truth for firewall versions.
  • The phrase “HA pair upgrade” causes universal groaning in your team meetings.
  • Your “rollback plan” is a paragraph in a Word document that says “revert to last known good config.”
  • It takes you more than two weeks to fully roll out a “critical” patch.
  • You dread audits because you know you’ll spend days gathering screenshots and logs.
  • You have “snowflake” devices that everyone is afraid to touch because their configuration is “special.”
  • Your Mean Time to Recovery (MTTR) for a misconfiguration is longer than your Recovery Time Objective (RTO).
  • You have multiple engineers with different ‘muscle memory’ or personal checklists for the same procedure.

Why Automation Changes the Game (Especially in Regulated Environments)

In this context, automation means one thing: codifying your operational workflows. It’s about taking that 12-step manual runbook, with all its “tribal knowledge” and hidden “gotchas,” and turning it into a repeatable, tested, and evidence-rich piece of code.

This approach fundamentally changes the game by delivering five key benefits:

  1. Repeatability & Consistency: An Ansible playbook executes the exact same 12 steps, in the exact same order, on every single HA pair, every single time. This is the ultimate weapon against configuration drift.
  2. Reduced Human Error: The automation handles the logic. It never “forgets” to suspend the passive node. It never has a typo in a command. It deterministically moves from step to step, eliminating the single greatest cause of outages: 3:00 AM human fatigue.  
  3. Faster Time-to-Patch: What takes a team three weeks of coordinated manual effort and late-night windows can be compressed into a single, automated run. This makes the 30-day (PCI) and 35-day (NERC) windows not just achievable, but routine.  
  4. Standardized Rollback Plans: A manual rollback is often a panic-driven scramble. An automated rollback is just another playbook (a version-controlled .yml file). (Note: PAN-OS version downgrades are generally not supported via this process. Rollback playbooks should focus on restoring configuration state, using snapshots and panos_op commands to revert to a previous config, not on OS version reversions.)
  5. Built-in, Audit-Ready Evidence: This is the most powerful benefit for regulated environments. When the upgrade is run, the automation platform (like Ansible) generates a detailed, time-stamped log of every command executed, every check performed, and every device touched. This log, combined with your version-controlled playbook, is the audit trail.
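
As an illustrative example of that last point, execution logging can be enabled globally in ansible.cfg on the control node so that every run leaves a persistent, time-stamped record (the log path here is an assumption, not a required location):

    # ansible.cfg on the air-gapped control node
    [defaults]
    # Every playbook run is appended here with timestamps,
    # giving auditors a persistent execution record.
    log_path = /var/log/ansible/panos-upgrades.log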

This shift directly supports your regulatory outcomes. You can now prove you patched on time (NERC CIP-007, PCI 6.2). You can demonstrate a controlled, documented, and authorized change process (NERC CIP-010, PCI 6.4, ISO 27001 A.14). And you can show you are actively managing risk (HIPAA 164.308).  

Field Note: Securing the Automation Itself

Automating your firewall management introduces a new, powerful “super admin” tool. This tool must be secured with the same rigor as your firewalls.

  • Credentials: All API keys and passwords must be encrypted using ansible-vault and never stored in plaintext Git repositories.
  • Least Privilege: The Ansible service account (API key) on Panorama and the firewalls should use a custom Admin Role, granting only the specific API permissions required for configuration, import, and operational commands (like HA suspend/reboot). Avoid using superuser for automated tasks.
  • Audit Logging: Ensure the API service account’s activity is logged, and playbook execution itself is logged on the Ansible control node, providing a clear record of who (or what) ran which playbook and when.  
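
To make the credentials guidance concrete, here is a minimal sketch of the vault pattern; the file names, group names, and variable names are illustrative:

    # group_vars/panos/vault.yml -- encrypted at rest with:
    #   ansible-vault encrypt group_vars/panos/vault.yml
    # Contents (before encryption):
    vault_panos_api_key: "REPLACE_WITH_REAL_KEY"

    # Playbooks reference the variable, never the literal secret:
    # vars:
    #   provider:
    #     ip_address: "{{ inventory_hostname }}"
    #     api_key: "{{ vault_panos_api_key }}"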

Where Ansible Fits for Palo Alto Networks Firewalls in Air-Gapped Environments

This series focuses on Ansible, and for good reason. It is uniquely suited for automating Palo Alto Networks devices in high-security environments.

  • It’s Agentless: This is the number-one reason. You don’t have to install any special software or “agent” on your firewalls, which is a non-starter in a regulated environment. Ansible communicates with PAN-OS using the built-in, secure XML API.
  • It’s Inventory-Driven: Ansible can read a simple text file (or a dynamic script) to understand your entire infrastructure, letting you group devices by site, function, or HA pair (e.g., [substation_firewalls], [pci_cde_firewalls]); a minimal inventory sketch follows this list.
  • It’s Idempotent: Ansible playbooks are designed to enforce a desired state. You can run a playbook to “ensure PAN-OS 10.1.10 is installed” a hundred times. If the firewall is already on 10.1.10, Ansible will simply report “OK” and change nothing.
  • It’s Mature: The paloaltonetworks.panos collection for Ansible is robust, officially supported, and contains modules for nearly every task you can imagine.
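
As a quick illustration of that inventory-driven model, a static INI inventory might look like this; all hostnames are placeholders:

    # inventory.ini -- all hostnames are illustrative placeholders
    [substation_firewalls]
    pa-substation-01a
    pa-substation-01b

    [pci_cde_firewalls]
    pa-cde-fw-01a
    pa-cde-fw-01b

    # An HA pair can also be addressed as its own group for orchestration:
    [substation_01_pair]
    pa-substation-01a
    pa-substation-01b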

How Ansible Works Offline

Ansible doesn’t need the Internet; it just needs network access to the devices it manages. The architecture is straightforward:

  1. Ansible Control Node: A hardened Linux server (e.g., RHEL) is deployed inside your air-gapped management zone. This node holds the playbooks and initiates all commands.
  2. Internal Artifact Repository: Your “golden,” vetted PAN-OS images and content files are staged on an internal web server, file share, or a full-fledged artifact manager like Nexus or Artifactory.  
  3. The Playbook: The Ansible playbook runs from the control node and orchestrates the entire process, replacing the manual human steps:  
  • Pre-checks: Uses the panos_op module to run show chassis-ready, show high-availability state, and show system disk-space to validate the firewall is ready.  
  • Upload/Install: Uses the panos_import module to upload the image file from the internal repository directly to the firewall before installing. It then uses panos_op to execute the request system software install version...
  • Orchestration: Uses panos_op to run request high-availability state suspend to control HA failover, just as a human would.  
  • Post-checks: Loops, using panos_op to check show chassis-ready until the device is back online and healthy after a reboot.  
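
Pulled together, those orchestrated steps translate into playbook tasks along these lines. This is a minimal sketch under stated assumptions, not a production playbook: the group name, image path, target version, and vaulted API key variable are all illustrative, and module options should be confirmed against your installed version of the paloaltonetworks.panos collection:

    - name: PAN-OS upgrade sketch (run against one node of an HA pair)
      hosts: substation_firewalls
      connection: local
      gather_facts: false
      vars:
        target_version: "10.1.10"                # illustrative version
        image_file: "/repo/panos/PanOS_10.1.10"  # vetted "golden" artifact
        provider:
          ip_address: "{{ inventory_hostname }}"
          api_key: "{{ vault_panos_api_key }}"   # stored in ansible-vault
      tasks:
        - name: Pre-check that the chassis is ready
          paloaltonetworks.panos.panos_op:
            provider: "{{ provider }}"
            cmd: "show chassis-ready"
          register: chassis
          failed_when: "'yes' not in chassis.stdout"

        - name: Suspend this node's HA state before installing
          paloaltonetworks.panos.panos_op:
            provider: "{{ provider }}"
            cmd: "request high-availability state suspend"

        - name: Upload the vetted image from the internal repository
          paloaltonetworks.panos.panos_import:
            provider: "{{ provider }}"
            category: software
            file: "{{ image_file }}"

        - name: Install the uploaded version
          paloaltonetworks.panos.panos_op:
            provider: "{{ provider }}"
            cmd: "request system software install version {{ target_version }}"

        - name: Reboot the node
          paloaltonetworks.panos.panos_op:
            provider: "{{ provider }}"
            cmd: "request restart system"
          ignore_errors: true  # the API connection drops as the device reboots

        - name: Post-check by polling until the device is back and healthy
          paloaltonetworks.panos.panos_op:
            provider: "{{ provider }}"
            cmd: "show chassis-ready"
          register: ready
          until: "'yes' in ready.stdout"
          retries: 30
          delay: 60

In a real HA workflow you would run this against the passive node first, verify, fail over, and repeat, exactly as the 12-step table describes; the difference is that the ordering is now enforced by code rather than by memory.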


Design Principles for Air-Gapped Automation

When you build this automation pipeline, you’re not just moving scripts around. You’re building a compliance framework. The following principles are essential.

  • Golden Artifacts Only: No file enters the secure zone without being vetted. All PAN-OS images, content updates, and even Ansible collection updates are downloaded, scanned, and have their hashes verified in a staging zone before being moved to the internal artifact repository.
  • Cryptographic Verification at Every Stage: All PAN-OS images and content files must have their SHA256 hashes verified at every step: upon download from the vendor, upon ingestion into the internal repository, and again by the playbook (using a module like builtin.stat) before being imported to the firewall. The playbook must fail if a hash mismatch is detected; a sketch of this gate follows this list.
  • Separation of Duties: The team that writes the automation (Automation Engineering) should not be the same team that approves its use (Security/Compliance) or executes it in production (Network Operations). This is a core tenet of PCI 6.4 and strong change management. An ideal workflow has Compliance approve the Git pull request, while Operations executes the merged-and-approved playbook from a locked-down bastion host.
  • Test → Stage → Production (All Inside the Air-Gap): Your secure zone must have its own non-production environment. This could be a lab-based HA pair or a single non-critical device. All automation is tested here, inside the air-gap, before it is ever pointed at production assets.
  • No Snowflake Devices: The goal of automation is to enforce a consistent baseline. If a device requires a “special” one-off configuration, that exception should be documented in the automation (e.g., using an Ansible group variable), not applied manually on the side.
  • Everything as Code (GitOps): Your playbooks, inventories, and runbooks must live in an internal, air-gapped Git server (like GitLab or GitHub Enterprise). The git commit log is the documented history of your changes. A “Pull Request” is your documented approval process. This is the heart of your audit-ready evidence.
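
For the cryptographic-verification principle above, here is a minimal sketch of the in-playbook hash gate, assuming the expected digest (expected_sha256) is tracked as a version-controlled variable and image_file is the illustrative path from the earlier playbook sketch:

    - name: Compute the SHA256 of the staged image on the control node
      ansible.builtin.stat:
        path: "{{ image_file }}"
        checksum_algorithm: sha256
      register: image_stat
      delegate_to: localhost

    - name: Refuse to continue on a hash mismatch
      ansible.builtin.assert:
        that:
          - image_stat.stat.exists
          - image_stat.stat.checksum == expected_sha256
        fail_msg: "SHA256 mismatch for {{ image_file }} -- aborting before import."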


Mini Case Study: Before vs. After Automation

Scenario 1: A NERC CIP-Regulated Utility

  • Before: A critical PAN-OS vulnerability is announced. The 35-day assessment clock starts ticking. The security team spends four weeks in “hero mode,” coordinating with multiple site operators to schedule manual, late-night change windows to patch 60 firewalls across remote substations. The process is stressful, error-prone, and the final evidence package for the audit is a 100-page PDF of screenshots and change tickets.  
  • After: A critical vulnerability is announced. The pre-existing, pre-tested Ansible upgrade playbook is pulled from the internal Git server. After a single test-run in the lab, it’s executed during the next standard maintenance window. The playbook runs, upgrading all 60 firewalls in a consistent, orchestrated fashion, including all HA failover logic. The entire rollout is complete in four hours. The audit evidence is a link to the Git commit (the approval) and the Ansible log file (the execution proof).

Scenario 2: A PCI-Regulated Bank

  • Before: The bank’s CDE firewalls are air-gapped. The 30-day window for a critical patch is closing. The team schedules a full-day weekend window, manually upgrading each HA pair one by one. A mis-click on one pair causes a 30-minute outage, triggering alarms and a tense call from the application team.  
  • After: A critical patch is announced. The security team creates a new branch in Git, updates the target version variable in the upgrade playbook, and opens a Pull Request. The change is approved by the compliance lead. During the next nightly window, the playbook is run. It automatically validates the pre-check conditions, safely upgrades the passive node, fails over, upgrades the active node, and runs post-checks, all while logging every step. The QSA (auditor) later reviews the Git log and Ansible output and marks PCI 6.2 and 6.4 “Compliant” in 10 minutes.

Conclusion 

In regulated, air-gapped Palo Alto Networks environments, “staying offline” is a critical security posture, but it is not an excuse for slow or inconsistent patching. The regulations themselves demand both isolation and timely vulnerability management.

Relying on manual, “sneaker-net” updates is no longer a viable strategy. It is slow, prone to human error, and fails to produce the credible, immutable evidence that auditors require. This manual friction is the single greatest threat to both your security and compliance posture.

Automation with Ansible transforms this. It converts high-risk, ad-hoc, manual heroics into a controlled, repeatable, and evidence-rich engineering process. It is the essential bridge that allows you to be both securely isolated and verifiably compliant.



Paul Amman is a Senior Cyber Security Engineer at ADS with decades of hands-on experience in network engineering, security architecture, and automation. He specializes in simplifying complex infrastructure and streamlining security operations across large, distributed environments.