Cloud Security Best Practices - Cloud Security Office Hours


Last updated 2026-04-30 · By Shawn Nunley · Vendor-neutral

The honest version: Most "cloud security best practices" lists are 100 items long and unranked, which is the same as not having a list. The bullets below are ordered by what actually appears as the root cause in our breach kill chains. Master the first three sections and you'll prevent the majority of cloud incidents seen in industry. The later sections are where mature programs differentiate.

If a practice isn't here, that's not because it doesn't matter — it's because we're calling out the high-leverage ones. Pair this page with the cloud security overview for foundational concepts and the certifications guide if you want to formalize.

The 8 practice areas

Ranked roughly by how often the absence of each shows up in real cloud incidents:

1. Identity & access

Most important. MFA, least privilege, no long-lived keys.

2. Configuration & posture

CIS benchmarks, public-block-by-default, encryption-by-default.

3. Network controls

Segmentation, private endpoints, egress filtering.

4. Data protection & secrets

KMS, secrets managers, classification, exfil prevention.

5. Logging & detection

CloudTrail/Activity/Audit, GuardDuty/Defender, SIEM pipeline.

6. Supply chain & CI/CD

SBOM, signed images, OIDC for cloud auth, locked-down runners.

7. Workloads & containers

Image scanning, runtime protection, K8s policy, no-root.

8. AI & LLM workloads

Prompt-injection guardrails, agent permissions, training-data hygiene.

The practices below are roughly ordered by where they show up in this maturity ladder — start with the Stage 2 essentials, layer Stage 3 tooling once those are in place.

Identity & access (most important)

If you only fix one category, fix this one. Identity is the perimeter — every cloud breach we've documented involves an identity failure somewhere in the chain.

The deeper a permission lives, the more painful misconfiguration is at the top. SCP / org-policy at the root is the cheapest control by orders of magnitude.

Phishing-resistant MFA on every human identity. Push prompts and SMS codes are bypassable (see Uber 2022 and Scattered Spider/MGM). FIDO2 security keys, passkeys, and platform authenticators are the bar. Enforce via Conditional Access / IAM policies, not just policy documents.
Least privilege on every workload identity. The Capital One breach happened because a WAF role had S3 list and read across 700 buckets. Read-only-on-one-bucket would have stopped it. Use IAM Access Analyzer (AWS), Microsoft Entra Privileged Identity Management, or Google's IAM Recommender to find and trim unused permissions continuously.
Eliminate long-lived access keys. Use short-lived federated credentials via OIDC for CI/CD, workload-identity federation for service-to-service, and SSO for humans. Long-lived keys are the foundation of most credential-theft incidents.
Enforce IMDSv2 on every EC2 instance. The Capital One pattern (SSRF → IMDSv1 → temporary creds → S3 exfil) is closed by IMDSv2's session-token requirement. AWS now defaults many new launches to v2, but older AMIs and launch templates can still allow v1, so enforce org-wide via SCP.
Just-in-time access for privileged operations. Standing admin = standing risk. Use Microsoft PIM, AWS IAM Identity Center session policies, or Google Cloud Just-in-Time Access for elevation with approval and time bounds.
Centralize identity with one IdP. Federate everything (cloud accounts, SaaS, internal tools) to one source of truth. Lifecycle events (joiner/mover/leaver) only need to fire in one place. Disconnected admin accounts are a common breach vector.
Audit Conditional Access / IAM policies regularly. Policies drift. Weekly automated reports of "users with no MFA, accounts with no last-sign-in, roles with unused permissions" close gaps before attackers find them.
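
The IMDSv2 bullet above can be turned into an organization-level guardrail. The following is a minimal sketch, assuming an AWS Organizations SCP that denies ec2:RunInstances unless the launch requires session tokens. The function name build_imdsv2_scp and the Sid value are illustrative, not part of any AWS API, and a policy like this should be tried in a sandbox OU before it is attached broadly.

```python
import json


def build_imdsv2_scp() -> dict:
    """Return an SCP document that denies EC2 launches not requiring IMDSv2."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RequireImdsV2OnLaunch",  # illustrative name
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                # Deny when the launch request does not set HttpTokens=required.
                "Condition": {
                    "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_imdsv2_scp(), indent=2))
```

A deny on launch only covers new instances. Instances that already exist still need their metadata options changed to require tokens.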

In the cloud, identity is the new perimeter — and most cloud breaches involve at least one identity failure. — why the section above is first, not last

Configuration & posture

Misconfigurations are the second-most-common breach root cause. The good news: posture management tools have made this category solvable for most organizations.

Continuous posture scanning across all accounts. Run a CSPM tool (Prowler and ScoutSuite are the open-source defaults; commercial CNAPP platforms bundle this). Scan against CIS benchmarks, AWS Foundational Security Best Practices, or your provider's equivalent — and FIX the findings, don't just collect them.
Default-deny on public access. S3 Block Public Access, Azure storage firewall set to deny by default, GCS public access prevention alongside uniform bucket-level access. Revoke default-public IAM grants on day one of every new account.
Encryption-at-rest defaults on every storage service. S3, EBS, RDS, Azure Storage, Cosmos DB, GCS, BigQuery — all support customer-managed keys. Enable. Enforce via SCP / Azure Policy / Org Policy so users can't disable.
Tag everything for ownership and blast radius. "Owner: team-name" + "data-classification: public/internal/confidential" tags let you scope incident response without a wiki lookup.
Use guardrail accounts, not exception lists. Each business unit gets its own AWS account / Azure subscription / GCP project. SCPs / Azure Policy / Organization Policy enforce floor-level controls. Cross-account access is explicit, not implicit.
Drift-detect Infrastructure as Code. Terraform plan in CI on every change. AWS Config, Azure Policy, or GCP Asset Inventory to catch console-mode "quick fixes" that diverge from code.
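
The tagging practice above is easy to check mechanically. Here is a minimal sketch, assuming the two tag keys and three classification values quoted in the bullet. The function name tag_violations and the resource IDs are hypothetical, and a real check would pull tags from your provider's inventory API or from a Terraform plan.

```python
REQUIRED_TAGS = {"Owner", "data-classification"}
ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential"}


def tag_violations(resource_id: str, tags: dict[str, str]) -> list[str]:
    """Return human-readable problems with a resource's tags (empty list = OK)."""
    problems = []
    # Report every required tag key that is absent, in a stable order.
    for key in sorted(REQUIRED_TAGS - set(tags)):
        problems.append(f"{resource_id}: missing tag '{key}'")
    # If a classification is present, it must be one of the known values.
    classification = tags.get("data-classification")
    if classification is not None and classification not in ALLOWED_CLASSIFICATIONS:
        problems.append(
            f"{resource_id}: unknown data-classification '{classification}'"
        )
    return problems
```

Running a check like this in CI, next to the Terraform plan, fails the change before an untagged resource exists.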

Network controls

Cloud networks aren't perimeters in the on-prem sense, but the controls still matter — especially for limiting lateral movement and exfiltration.

Default-deny security groups / NSGs. Open inbound rules only for what you actually need, and only from the source CIDRs that need to reach you. 0.0.0.0/0 on port 22 or 3389 should not exist anywhere except deliberate jump hosts.
VPC endpoints / Private Link / Private Service Connect. Keep traffic to managed services off the public internet. Reduces attack surface and gives you flow logs you can actually audit.
Egress controls and DNS filtering. Most data exfiltration goes out over HTTPS to attacker-controlled domains. Egress proxies, NAT-gateway logs, and DNS filtering (AWS Route 53 Resolver firewall, Azure DNS Private Resolver) catch this when log monitoring is wired up.
Zero-trust between services. Use mTLS via service mesh (Istio, Linkerd) or workload-identity-based authorization. The LAN is hostile; assume an attacker is already on it.
Web Application Firewalls with the modern rule sets. AWS WAF, Azure WAF, Cloudflare. Add explicit rules for SSRF (block requests targeting 169.254.169.254) and the OWASP Top 10 — defaults are not enough.
DDoS protection on internet-facing services. AWS Shield Standard is on by default; Shield Advanced is worth it for high-value targets. Azure Front Door / Cloudflare for edge protection.
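
The rule about 0.0.0.0/0 on ports 22 and 3389 can be audited with a few lines of code. This is a minimal sketch, assuming security groups in the dictionary shape returned by the EC2 DescribeSecurityGroups API (IpPermissions, IpRanges, Ipv6Ranges). The function name open_admin_rules is illustrative, and the check covers TCP and all-protocol rules only.

```python
ADMIN_PORTS = {22, 3389}
ANYWHERE = {"0.0.0.0/0", "::/0"}


def open_admin_rules(security_group: dict) -> list[tuple[str, int]]:
    """Return (cidr, port) pairs where SSH/RDP is reachable from anywhere."""
    findings = []
    for perm in security_group.get("IpPermissions", []):
        cidrs = [r["CidrIp"] for r in perm.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])]
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            exposed = sorted(ADMIN_PORTS)
        elif protocol == "tcp":
            low, high = perm.get("FromPort", 0), perm.get("ToPort", 65535)
            exposed = [p for p in sorted(ADMIN_PORTS) if low <= p <= high]
        else:
            exposed = []
        for cidr in cidrs:
            if cidr in ANYWHERE:
                findings.extend((cidr, port) for port in exposed)
    return findings
```

Deliberate jump hosts would need an explicit exception list. This sketch reports every match and leaves that decision to the caller.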

Data protection & secrets

Classify sensitive data and treat the classifications as load-bearing. Microsoft Purview, AWS Macie, Google DLP can find PII, PCI, and PHI in storage. Classifications drive access policy, encryption requirements, and incident-response priority.
Customer-managed keys (CMK) for sensitive data. Provider-managed keys already encrypt the data at rest, but CMK gives you key-rotation policy, deletion control, and a "stop the bleeding" lever during an incident.
Secrets in a managed store, not in code. AWS Secrets Manager, Azure Key Vault, Google Secret Manager, HashiCorp Vault. Auto-rotate where supported. Never commit a secret — gitleaks, trufflehog, or GitHub secret scanning in CI.
Encrypt in transit, end-to-end. TLS 1.2 minimum, 1.3 preferred. Enforce HTTPS-only on storage services. Internal service-to-service should be mTLS (see network section).
Backup and restore — and TEST the restore. Cross-region replication for catastrophic-loss recovery; immutable backups for ransomware recovery. Untested backups don't count.
DLP on data egress paths. Cloud DLP services flag bulk PII / PCI / PHI moving outside your boundary. Pair with egress controls so the alert fires before the data leaves.
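
To illustrate what the secret scanners named above do, here is a deliberately small sketch that looks for a single pattern: AWS access key IDs of the long-lived AKIA form. The function name find_aws_key_ids is hypothetical. Tools such as gitleaks and trufflehog apply many more rules, so treat this as an illustration and not a replacement.

```python
import re

# Long-lived IAM user access key IDs start with "AKIA" followed by
# 16 uppercase letters or digits.
AWS_ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_aws_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line_number, key_id) for each suspected AWS access key ID."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits
```

For example, scanning a file that contains the documentation placeholder key AKIAIOSFODNN7EXAMPLE on its second line returns a single hit with line number 2.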


Logging & detection

You can't respond to what you don't see. The SolarWinds, Capital One, and LastPass incidents all had detectable activity that nothing was watching for.

Enable cloud-native threat detection in every account. AWS GuardDuty, Microsoft Defender for Cloud, Google Security Command Center. They're the cheapest credible detection layer available — turn them on org-wide.
Centralize control-plane logs. CloudTrail, Azure Activity Log, GCP Audit Logs into a security data lake or SIEM. Retention long enough to investigate (90+ days for incident response, longer for compliance).
Build detection content as code. Sigma rules, KQL queries, Splunk SPL — versioned, peer-reviewed, regression-tested. Map each detection to MITRE ATT&CK Cloud techniques so you can measure coverage.
Alert on impossible behavior, not signatures. "Workload identity used from an unexpected IP," "S3 ListBucket volume anomaly," "Sign-in from impossible-travel locations." Behavior-based detections survive attacker tooling changes.
Test your detections. Atomic Red Team, Stratus Red Team, AWS GuardDuty Tester, Microsoft Attack Simulator. Detection content that hasn't been exercised in production conditions is detection content that probably doesn't fire.
Practice your incident response. Tabletop exercises quarterly. Real exercises (Stratus Red Team in a sandbox account, with the SOC team responding) at least annually. Rehearse the embarrassing scenarios.
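
The "workload identity used from an unexpected IP" detection above can be sketched as a filter over control-plane events. This is a minimal illustration, assuming CloudTrail-style records with a sourceIPAddress field. The function name unexpected_source_events and the CIDR list are assumptions, and a production rule would live in your SIEM's query language.

```python
import ipaddress


def unexpected_source_events(events, expected_cidrs):
    """Yield CloudTrail-style events whose source IP is outside expected ranges."""
    networks = [ipaddress.ip_network(cidr) for cidr in expected_cidrs]
    for event in events:
        source = event.get("sourceIPAddress", "")
        try:
            address = ipaddress.ip_address(source)
        except ValueError:
            # Calls made by AWS services are recorded with a DNS name
            # instead of an IP address; this sketch skips them.
            continue
        if not any(address in network for network in networks):
            yield event
```

Because the rule keys on where a credential is used and not on a tool signature, it keeps working when the attacker changes tooling.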

Supply chain & CI/CD

Modern attackers pivot through your build pipeline. SolarWinds is the canonical case and the Codecov compromise is another; npm/PyPI typosquats are routine.

Pin dependencies to specific versions, not ranges. Lockfiles (package-lock.json, poetry.lock, Pipfile.lock) committed to source. Renovate or Dependabot for automated bumps with PR review.
Software Bill of Materials (SBOM) for every artifact. SPDX or CycloneDX format. Sign with Sigstore/cosign so you can prove a binary came from your build.
SLSA-style hardened build pipelines. Builds run in isolated, hermetic environments; provenance attestations get signed; build inputs are restricted to declared sources.
OIDC, not stored credentials, for cloud deploys. GitHub Actions → AWS / Azure / GCP via OIDC eliminates long-lived deploy keys. No secret to leak.
Scope CI tokens minimally. A test-runner job doesn't need write access to production. Default-deny, then add only what's required.
Pin GitHub Actions to a full SHA. Tags can be moved silently by a compromised maintainer. SHAs cannot. Dependabot or ratchet keeps SHAs current automatically.
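
The SHA-pinning rule above lends itself to a CI check. The following is a minimal sketch that scans workflow text line by line for uses: references that do not end in a 40-character commit SHA. The function name unpinned_actions is illustrative, and the line-based matching is an assumption that holds for conventionally formatted workflows, not a full YAML parse.

```python
import re

# Capture the reference after "uses:", stopping at whitespace, quotes, or a comment.
USES_LINE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^\s'\"#]+)")
FULL_SHA = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a 40-character commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_LINE.match(line)
        if not match:
            continue
        ref = match.group(1)
        if ref.startswith("./"):
            continue  # local actions live in the repo and have no version to pin
        if ref.startswith("docker://"):
            continue  # container references are out of scope for this sketch
        if not FULL_SHA.search(ref):
            unpinned.append(ref)
    return unpinned
```

A reference such as actions/checkout@v4 would be reported, while the same action pinned to a full commit SHA with a trailing version comment would pass.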

Workload & container security

Scan container images at build AND in registry. Trivy, Grype, Snyk, Wiz, Aqua. Block builds on critical CVEs in base layers. Re-scan registry periodically — a clean image at build can rot.
Run containers as non-root with read-only root filesystems. Drop capabilities, apply a seccomp profile, and use AppArmor or SELinux. Enforce the Pod Security Standards "restricted" profile on Kubernetes.
Network Policies in Kubernetes. Default-deny inter-pod traffic; allow only what's declared. Cilium, Calico, or vanilla Kubernetes Network Policies. Without this, a compromised pod is on a flat network.
Runtime threat detection on Linux workloads. Falco or Tetragon (eBPF-based) catch process-level anomalies — unexpected shell spawned, privilege escalation, unusual syscall patterns.
Patch the host plane and the workload plane. Managed Kubernetes (EKS/AKS/GKE) handles control-plane patching; you still own node-image refreshes and worker patching.
Don't expose Kubernetes APIs directly to the internet. Private endpoints + bastion + audit logging. The kubelet API is a known attack target.
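
The non-root and read-only settings above correspond to fields in a container's securityContext. Here is a minimal sketch of a check over one container spec, given as a dictionary parsed from a manifest. The function name container_violations is hypothetical, and the sketch ignores pod-level securityContext defaults, which a real admission policy would also consider.

```python
def container_violations(container: dict) -> list[str]:
    """Check one Kubernetes container spec against a few hardening settings."""
    name = container.get("name", "<unnamed>")
    ctx = container.get("securityContext") or {}
    problems = []
    if ctx.get("runAsNonRoot") is not True:
        problems.append(f"{name}: runAsNonRoot is not true")
    if ctx.get("readOnlyRootFilesystem") is not True:
        problems.append(f"{name}: root filesystem is writable")
    if ctx.get("allowPrivilegeEscalation") is not False:
        problems.append(f"{name}: privilege escalation is not disabled")
    dropped = (ctx.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        problems.append(f"{name}: capabilities do not drop ALL")
    return problems
```

In a cluster, built-in Pod Security Admission or a policy engine would enforce this. A script like this is more useful for linting manifests in a pull request.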

AI & LLM workloads

The newest category and the one with the most movement. Treat AI workloads as agentic systems with tool access — because that's what they are.

Treat every LLM input as untrusted. Prompt injection isn't theoretical; it's the default vulnerability of any system that lets a model see attacker-controllable text. Guardrails (input/output filters) help; they're not sufficient.
Limit agent tool authority. If an agent can call APIs, scope those calls to least privilege — same discipline as IAM. Confused-deputy attacks are real.
Sandbox indirect prompt injection sources. If your model retrieves web pages, emails, or PDFs, those are attacker-controlled. Strip HTML, summarize through a constrained sub-call, never feed raw retrieved content into a tool-calling context.
Log model inputs, outputs, and tool calls. Same observability discipline as any other production workload. Tag with user identity for incident attribution.
Don't pipe sensitive data through external model APIs without review. Production traffic to OpenAI / Anthropic / Google APIs should go through DLP. Use enterprise tier endpoints with no-training guarantees for sensitive workloads.
Test against the OWASP LLM Top 10. Prompt injection, insecure output handling, training data poisoning, model DoS, supply chain — same shape as application security, different surface.
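
Limiting agent tool authority, as described above, can start with a deny-by-default dispatcher placed between the model and its tools. This is a minimal sketch under stated assumptions: make_dispatcher, ToolNotPermitted, and the example tool names are all hypothetical and do not come from any agent framework.

```python
from typing import Callable


class ToolNotPermitted(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


def make_dispatcher(tools: dict[str, Callable[..., str]], allowed: set[str]):
    """Return a function that runs a tool call only if the tool is allowlisted."""

    def dispatch(tool_name: str, **kwargs) -> str:
        # Deny by default: the tool must be both registered and allowlisted.
        if tool_name not in allowed or tool_name not in tools:
            raise ToolNotPermitted(tool_name)
        return tools[tool_name](**kwargs)

    return dispatch


if __name__ == "__main__":
    tools = {
        "search_docs": lambda query: f"results for {query}",
        "delete_user": lambda user_id: f"deleted {user_id}",
    }
    dispatch = make_dispatcher(tools, allowed={"search_docs"})
    print(dispatch("search_docs", query="mfa"))
```

The allowlist would be chosen per agent and per task, in the same way an IAM role is scoped per workload. The credentials behind each tool still need least privilege of their own.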

Apply all three at every layer of the defense-in-depth stack — at the perimeter, at the network, at identity, at workloads, at data.

Governance & people

Document the shared responsibility line for every service you use. Most "cloud security failures" are misunderstandings of who's supposed to do what. See our shared responsibility model page.
Run cloud security through your SDLC, not as a separate program. Threat modeling at design, IaC scanning at PR, posture scanning at deploy, runtime monitoring in production. Security as a separate org becomes a bottleneck nobody calls.
Train developers and admins continuously. Annual compliance training is theater. Hands-on labs (try our CTF directory), tabletop exercises, and breach-walkthrough sessions (we do these in Friday Zoom) build genuine competence.
Have an incident response plan and rehearse it. Roles, escalation paths, communication templates, legal/PR engagement triggers. The first call during a real incident shouldn't be the first time you've thought about who to call.
Vendor risk for the vendors of your vendors. The SaaS that your SaaS depends on can leak your data without you noticing. Request SOC 2 reports for subprocessors, not just direct vendors.

Most cloud configuration evidence (CloudTrail, IAM policies, encryption settings) maps to all three frameworks. Build the controls once; tag them per-framework for the auditor.

Anti-patterns: things to stop doing

The flip side. If you see any of these in your environment, fix them before working on more sophisticated controls.

Long-lived IAM access keys for service-to-service auth. Use OIDC federation or workload identity instead.
Public S3 buckets for "convenience." Use signed URLs, CloudFront with OAC, or proper authentication. There is no legitimate reason for "make it public, it's just an internal staging environment."
Wildcard IAM policies ("Action": "*", "Resource": "*"). Even for "admin" roles, scope to actions and resources actually needed.
SSH'ing into instances to debug. Use AWS Systems Manager Session Manager, Azure Bastion, or Google IAP — they leave audit trails and don't require open inbound 22.
"Just give me admin and I'll figure out what I need later." The right answer is "let's find out what you need and grant exactly that." IAM Access Advisor / Access Analyzer answer the "what do I actually need" question after a trial period.
Using the cloud-provider root account for daily work. Lock the root credentials in a vault, set hardware MFA, only break glass for billing or account closure. Day-to-day uses federated identities.
Email as MFA. Email codes are phishable, and the mailbox itself is a common account-takeover target. Use FIDO2 or platform authenticators.
"We'll add monitoring later." If you're moving prod workloads to the cloud without GuardDuty / Defender / SCC enabled, you're flying blind from day one. Turn it on before you turn the workload on.

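The wildcard-policy anti-pattern in the list above is one of the easiest to detect automatically. Here is a minimal sketch, assuming IAM policy documents already parsed into dictionaries. The function name wildcard_statements is illustrative, and the check only flags the exact "Action": "*" with "Resource": "*" combination, not narrower wildcards such as s3:*.

```python
def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant every action on every resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy with one statement may give it as an object, not a list.
        statements = [statements]
    return [
        statement
        for statement in statements
        if statement.get("Effect") == "Allow"
        and "*" in _as_list(statement.get("Action", []))
        and "*" in _as_list(statement.get("Resource", []))
    ]
```

A scheduled job can run this over every customer-managed policy in an account and open a ticket for each hit.
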
Where next

Shared responsibility model — what's yours to secure vs. what the provider does.
CSPM vs CNAPP vs CWPP — the tool categories that implement most of these practices.
Breach kill chains — see what happens when these aren't followed.
Learning path — how to build the skills that turn these from a checklist into instinct.
Friday Zoom — practitioners debate every one of these decisions. Bring your edge cases.