Cloud detection is a data engineering problem disguised as a security problem. Most teams figure this out the hard way — after they've deployed a SIEM, pointed it at their cloud logs, and ended up with either a firehose of unactionable alerts or a mostly-empty dashboard because the logs never made it in the first place.
Getting detection and response right in cloud environments comes down to four foundational pieces. Not all at once — but you need to understand all of them, because each one builds on the last.
Pillar 1: Scalable Data Pipelines
Everything starts with getting the data in. In cloud environments, this means handling logs from AWS CloudTrail, VPC flow logs, GCP Audit Logs, container runtime events, application logs, and whatever else your infrastructure generates — at scale, reliably, and in near-real-time.
The biggest mistake I see is building ingestion pipelines that work at current volume and fail under peak load or during incidents — exactly when you need them most. Resilience under peak load isn't optional.
The pipeline architecture matters. The key design decisions:
- Stream vs. batch — Stream processing (Kafka, Kinesis, Pub/Sub) for real-time detection use cases; batch for forensic backfill and cost management. Most mature pipelines do both.
- Normalization at ingest — Transform logs into a consistent schema before they hit storage or your detection layer. Doing this downstream is painful. Tools like Matano, OpenSearch Data Prepper, and Logstash handle this well at different scales.
- Agents and collectors — Elastic Beats for lightweight shipping from hosts; cloud-native forwarders for managed services. The goal is minimal friction between event and pipeline.
- Backpressure handling — The pipeline should degrade gracefully when a downstream component falls behind, not drop events silently.
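To make the normalization point concrete, here's a minimal sketch of mapping a raw CloudTrail record onto a common schema at ingest time. The CloudTrail field names on the right are real; the target schema on the left is an illustrative assumption, not a standard:

```python
def normalize_cloudtrail(event: dict) -> dict:
    """Map a raw CloudTrail record onto a minimal common schema.

    The target field names (timestamp, actor, action, resource, source)
    are a hypothetical schema for illustration -- substitute your own.
    """
    return {
        "timestamp": event.get("eventTime"),
        "actor": event.get("userIdentity", {}).get("arn"),
        "action": event.get("eventName"),
        "resource": event.get("requestParameters", {}).get("bucketName"),
        "source": "aws.cloudtrail",
    }

# A trimmed-down CloudTrail record, for demonstration.
raw = {
    "eventTime": "2024-05-01T12:00:00Z",
    "eventName": "PutBucketPolicy",
    "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"},
    "requestParameters": {"bucketName": "audit-logs"},
}
normalized = normalize_cloudtrail(raw)
```

Every downstream consumer — detection rules, ad-hoc queries, ML features — now keys off the same five fields instead of each re-parsing vendor-specific structures.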
The measure of a good ingestion pipeline is not "does it work today?" It's "does it work during an incident when volume spikes 10x and three people are querying the data at the same time?"
Pillar 2: AI-Powered Detections
Not all detections are equal. There's a meaningful difference between a rule, a detection, and a signal — and treating them the same is why most detection programs plateau.
- Rules are simple conditional logic: if X happens, fire an alert. Fast to write, easy to understand, brittle in production. Rules don't adapt to context.
- Detections are context-aware logic: if X happens, and Y is true about the entity, and Z hasn't happened in the last 24 hours, then this is worth investigating. Detections require understanding your environment.
- Signals are behavior-based: anomaly detection, clustering, sequence detection, user behavior analytics. These are where ML earns its keep — surfacing patterns that rule-based systems can't express.
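The rule/detection distinction is easier to see in code. A sketch, using hypothetical helper inputs (`entity_is_new_hire` and `recent_similar_events` stand in for lookups against your own entity store):

```python
from datetime import datetime, timedelta

# A rule: bare conditional logic. Fires on every matching event.
def rule_admin_role_assigned(event: dict) -> bool:
    return event.get("action") == "AttachRolePolicy"

# A detection: the same trigger, gated on entity context and recent history.
def detection_admin_role_assigned(
    event: dict,
    entity_is_new_hire: bool,
    recent_similar_events: list,
) -> bool:
    if not rule_admin_role_assigned(event):
        return False
    if entity_is_new_hire:  # expected during onboarding, not worth an alert
        return False
    # Only alert if nothing similar fired in the last 24 hours.
    cutoff = datetime.fromisoformat(event["timestamp"]) - timedelta(hours=24)
    return all(t < cutoff for t in recent_similar_events)

event = {"action": "AttachRolePolicy", "timestamp": "2024-05-01T12:00:00"}
fires_rule = rule_admin_role_assigned(event)
fires_detection = detection_admin_role_assigned(
    event, entity_is_new_hire=True, recent_similar_events=[]
)
```

The rule fires unconditionally; the detection suppresses the expected onboarding case. That suppression logic is exactly the "understanding your environment" the definition above demands.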
AI doesn't replace the rule layer — it extends it. The practical value of ML in detection is in two areas:
- False positive reduction — Classifying alerts by historical patterns, entity context, and environmental baseline. Not every "admin role assigned" is an incident if your onboarding process generates five of them per week.
- Cloud drift detection — Identifying when configuration state deviates from baseline. This is where rule-based systems struggle and where statistical modeling is genuinely useful.
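The baseline idea behind both areas can be sketched with nothing more than a z-score against historical counts — a deliberately crude stand-in for the statistical modeling described above (real systems use richer features than raw counts):

```python
import statistics

def anomaly_score(history: list, today: int) -> float:
    """Z-score of today's event count against a historical baseline.

    `or 1.0` guards against a zero standard deviation when the
    history is perfectly flat.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0
    return (today - mean) / stdev

# Five "admin role assigned" events per week is baseline onboarding noise...
weekly_counts = [5, 4, 6, 5, 5]
# ...so thirty in one week stands far outside the baseline.
score = anomaly_score(weekly_counts, 30)
```

The same shape — learn a baseline, score deviation from it — applies to configuration drift: model the expected state distribution and flag what falls outside it.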
The trap to avoid: deploying ML detection without a clean, normalized data foundation. Garbage in, garbage out — but slower and more expensive.
Pillar 3: Security Data Architecture
Where the data lives and how it's organized determines what you can actually do with it during an investigation. Most detection pipelines get this wrong — optimizing for ingest but not for query.
The architecture I've converged on for most environments:
- Data lake as the foundation — Store everything in a cost-efficient, queryable format. Columnar formats (Parquet, ORC) are essential for fast analytical queries at scale.
- Hot/cold storage model — Hot tier (recent 30–90 days) in high-performance storage for active detection and investigation; cold tier (90 days+) in object storage for forensic investigation and compliance retention.
- Query layer — AWS Athena or AWS Security Lake for serverless ad-hoc querying without managing infrastructure; Databricks or Matano for more complex analytical workloads.
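A concrete piece of this architecture is how objects get laid out in the lake. Hive-style partitioned keys (`year=/month=/day=`) let engines like Athena prune partitions instead of scanning the whole bucket. A sketch — the bucket layout and prefix names are illustrative assumptions:

```python
from datetime import date

def partition_key(source: str, event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned object key so date-filtered
    queries touch only the matching prefixes."""
    return (
        f"logs/source={source}/"
        f"year={event_date.year}/"
        f"month={event_date.month:02d}/"
        f"day={event_date.day:02d}/"
        f"{filename}"
    )

key = partition_key("cloudtrail", date(2024, 5, 1), "events-0001.parquet")
# A query filtered on year/month/day now scans one day's prefix,
# not a year of logs -- the difference between seconds and 45 minutes.
```

Combined with columnar formats, partition pruning is what makes "query last year's CloudTrail during an incident" a seconds-scale operation rather than a batch job.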
The operational implication: your analysts need to be able to query last year's CloudTrail logs during an incident without waiting 45 minutes for a job to run. Architecture decisions made during quiet periods determine your capability during incidents.
Pillar 4: GitOps for Alert Management
Detection rules are code. They should be treated like code: version-controlled, reviewed, tested, and deployed through a repeatable process.
The case for GitOps in detection:
- Version control — Every detection rule has a history. You can see who wrote it, why, when it was last modified, and what changed. This matters when you're debugging a false positive that appeared after a rule was "updated."
- CI validation — Rules are validated against schema, logic, and test cases before they reach production. No more broken detection logic shipping on a Friday.
- Environment testing — Rules are promoted through dev → staging → production, with synthetic attack simulation verifying that new rules fire where expected and don't fire where they shouldn't.
- Audit trail — Every deployed rule change is documented, attributable, and reversible. When a compliance auditor asks "what detection rules were active on this date?" you have an answer.
Detection rules that aren't version-controlled aren't really managed. They're inherited.
The tooling here can be as simple as a Git repository with CI/CD pipelines pushing rules to your SIEM or detection platform. The discipline matters more than the specific tooling.
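The CI validation step can be very small. A minimal sketch of a check that runs on every pull request — the required fields and severity levels are an illustrative assumption, not a standard rule format:

```python
import json

# Hypothetical minimal schema the CI job enforces on every rule file.
REQUIRED = {"id", "title", "severity", "query", "owner"}
SEVERITIES = {"low", "medium", "high", "critical"}

def validate_rule(text: str) -> list:
    """Return a list of validation errors; an empty list means the rule passes CI."""
    try:
        rule = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    errors = []
    missing = REQUIRED - rule.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if rule.get("severity") not in SEVERITIES:
        errors.append(f"unknown severity: {rule.get('severity')!r}")
    return errors

good = json.dumps({
    "id": "r1",
    "title": "Admin role attached",
    "severity": "high",
    "query": "action = AttachRolePolicy",
    "owner": "detections-team",
})
```

Fail the build when `validate_rule` returns errors and broken detection logic never reaches production — which is the whole "no more broken rules shipping on a Friday" point.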
Putting It Together
These four pillars are interdependent. A great detection layer built on a broken pipeline will miss events. A well-designed architecture with no GitOps discipline will accumulate technical debt in the form of undocumented, unreviewed rules that nobody trusts.
The order matters too. Start with the pipeline — you can't detect what you can't ingest. Then build the detection layer on top of clean, normalized data. Then invest in architecture as data volume and retention requirements grow. Then add GitOps discipline as the detection rule library grows beyond what one person can hold in their head.
The goal isn't a perfect detection platform. It's a detection program that improves incrementally and doesn't collapse when you need it most.