Network-based malware detection: what you can see from flow data alone
Every time two systems talk over a network, they leave a record: who connected to whom, when, for how long, and how much data moved. Flow logs — whether VPC Flow Logs in AWS or NetFlow on traditional infrastructure — collect these records automatically. They do not capture what was said. Just the conversation metadata.
That sounds like a limitation. For a lot of threat scenarios, it turns out to be enough.
What flow data actually is
Think of a flow record as a phone bill rather than a phone tap. You can see the number dialled, the time, and how long the call lasted. You cannot hear the conversation.
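Each record contains the source and destination IP addresses, the ports used, the protocol (as an IANA protocol number, for example 6 for TCP or 17 for UDP), how many packets were sent, how many bytes, and whether the traffic was accepted or rejected. In AWS, VPC Flow Logs capture this for every network interface covered by a flow log, which you can enable on a whole VPC, a subnet, or a single interface. Records are aggregated over a maximum interval of 10 minutes by default, which you can lower to 1 minute; for Nitro-based instances the interval is always 1 minute or less, whatever you configure.
A blocked connection attempt looks like this (shown as JSON for readability; the default VPC Flow Logs format is space-separated text):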
{
"srcaddr": "198.51.100.42",
"dstaddr": "10.0.1.15",
"srcport": 4444,
"dstport": 22,
"protocol": 6,
"packets": 3,
"bytes": 180,
"start": 1748995200,
"end": 1748995260,
"action": "REJECT"
}
Reading this: an external IP (198.51.100.42) tried to connect to an internal server on port 22 (SSH) over TCP (protocol 6). A security group or network ACL rejected the traffic (REJECT). Three packets and 180 bytes is consistent with a connection attempt that never completed: the attacker knocked, got no answer, and moved on.
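The same reading can be done mechanically. The following is a minimal sketch, assuming records arrive as JSON objects with the field names used in the example above; the `summarize` helper and the three-entry protocol table are illustrative, not part of any AWS tooling.

```python
import json

# Small subset of IANA protocol numbers; extend as needed.
PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}


def summarize(record: dict) -> str:
    """Render one flow record as a single readable line (hypothetical helper)."""
    proto = PROTOCOLS.get(record["protocol"], f"proto-{record['protocol']}")
    span = record["end"] - record["start"]  # seconds between first and last packet
    return (
        f"{record['action']} {proto} "
        f"{record['srcaddr']}:{record['srcport']} -> "
        f"{record['dstaddr']}:{record['dstport']} "
        f"({record['packets']} packets, {record['bytes']} bytes, {span}s span)"
    )


raw = """
{"srcaddr": "198.51.100.42", "dstaddr": "10.0.1.15", "srcport": 4444,
 "dstport": 22, "protocol": 6, "packets": 3, "bytes": 180,
 "start": 1748995200, "end": 1748995260, "action": "REJECT"}
"""
record = json.loads(raw)
print(summarize(record))
```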
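You can also enable additional fields. tcp-flags is a bitmask of the TCP flags seen during the aggregation interval, which helps separate a lone SYN (typical of a scan or a failed attempt) from a session that completed its handshake. flow-direction records whether traffic was entering or leaving the network interface. For container workloads on Amazon ECS, you can include the cluster and task identifiers so you know exactly which workload generated the traffic.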
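To make the tcp-flags value easier to read, it can be decoded bit by bit. This is a sketch using the standard TCP flag bit values (FIN 1, SYN 2, RST 4, ACK 16); the function names are hypothetical, and because the field is an OR of everything seen in the interval, a single value is a hint rather than proof of what happened.

```python
# Standard TCP header flag bits, as they appear OR-ed together in tcp-flags.
TCP_FLAG_BITS = {1: "FIN", 2: "SYN", 4: "RST", 16: "ACK"}


def decode_tcp_flags(value: int) -> list[str]:
    """Return the names of the flag bits set in a tcp-flags value."""
    return [name for bit, name in TCP_FLAG_BITS.items() if value & bit]


def is_lone_syn(value: int) -> bool:
    """True when SYN was the only flag recorded: an opening packet with no visible follow-up."""
    return value == 2


for flags in (2, 18, 3):
    print(flags, decode_tcp_flags(flags), is_lone_syn(flags))
```

A record with only SYN set, paired with REJECT, fits the unanswered connection attempt described earlier.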
What malware looks like in flow data
C2 beaconing is one of the cleaner signals. When malware checks in with its command-and-control server, it does so on a schedule. That regularity shows up in flow data as dozens of short connections to the same external IP, spaced evenly over time, with consistently small amounts of data transferred. A host phoning home every five minutes produces a very distinctive pattern against normal traffic.
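One way to put a number on "spaced evenly" is the coefficient of variation of the gaps between connections: the standard deviation of the gaps divided by their mean. The sketch below assumes flows have already been reduced to `(srcaddr, dstaddr, start, bytes)` tuples; the function name and all thresholds are illustrative starting points, not tuned values.

```python
from collections import defaultdict
from statistics import mean, pstdev


def beacon_candidates(flows, min_flows=6, max_cv=0.1, max_bytes=2000):
    """Return (pair, mean_gap_seconds, cv) for host pairs with small, evenly spaced flows.

    flows: iterable of (srcaddr, dstaddr, start_epoch, bytes) tuples.
    """
    by_pair = defaultdict(list)
    for src, dst, start, nbytes in flows:
        by_pair[(src, dst)].append((start, nbytes))

    hits = []
    for pair, items in by_pair.items():
        if len(items) < min_flows:
            continue  # too few connections to judge regularity
        items.sort()
        starts = [s for s, _ in items]
        gaps = [b - a for a, b in zip(starts, starts[1:])]
        avg_gap = mean(gaps)
        if avg_gap == 0:
            continue  # all flows share one timestamp; no interval to measure
        cv = pstdev(gaps) / avg_gap  # 0.0 means perfectly regular
        if cv <= max_cv and max(n for _, n in items) <= max_bytes:
            hits.append((pair, round(avg_gap), round(cv, 3)))
    return hits
```

Real beacons often add jitter, so a strict `max_cv` will miss some of them; loosening it trades missed beacons for more false positives from health checks and polling clients.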
Data exfiltration shows up as an asymmetry. On connections a host opens itself, normal traffic is mostly inbound: the host sends small requests and receives larger responses. When a machine is sending significantly more data out than it is receiving on those connections, sustained over time, that is worth investigating. Flow logs cannot tell you what the data is, but the volume and direction are there.
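Because each flow record describes one direction only, measuring the asymmetry means pairing outbound and inbound records for the same internal and external address. A minimal sketch follows, assuming a single VPC range of 10.0.0.0/16 (an assumption; substitute your own) and flows reduced to `(srcaddr, dstaddr, bytes)` tuples. `upload_ratios` is a hypothetical helper name.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

VPC_CIDR = ip_network("10.0.0.0/16")  # assumption: replace with your own range


def is_internal(addr: str) -> bool:
    return ip_address(addr) in VPC_CIDR


def upload_ratios(flows):
    """Return {(internal_ip, external_ip): bytes_sent / bytes_received}.

    flows: iterable of (srcaddr, dstaddr, bytes) tuples, one per direction.
    """
    sent = defaultdict(int)
    received = defaultdict(int)
    for src, dst, nbytes in flows:
        if is_internal(src) and not is_internal(dst):
            sent[(src, dst)] += nbytes
        elif is_internal(dst) and not is_internal(src):
            received[(dst, src)] += nbytes
    # Treat "nothing received" as 1 byte to avoid dividing by zero.
    return {pair: total / max(received.get(pair, 0), 1) for pair, total in sent.items()}
```

A ratio far above 1 sustained across many aggregation windows is the signal; a single window with a high ratio is often just an upload someone meant to make.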
Lateral movement is when an attacker who has compromised one machine starts moving to others. This shows up as new connections between internal systems that have never talked to each other before. The catch: you only know they have never talked before if you have a baseline of normal traffic to compare against. This is the part most teams skip until after an incident.
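At its simplest, the baseline comparison is a set difference over internal connections. The sketch below assumes all internal addresses fall inside 10.0.0.0/8 (an assumption) and that flows have been reduced to `(srcaddr, dstaddr, dstport)` tuples for traffic toward a service port; return traffic to ephemeral ports should be filtered out first, or it will show up as noise.

```python
from ipaddress import ip_address, ip_network

INTERNAL = ip_network("10.0.0.0/8")  # assumption: replace with your internal ranges


def internal_edges(flows):
    """Keep only flows where both endpoints are internal, as a set of edges."""
    return {
        (src, dst, port)
        for src, dst, port in flows
        if ip_address(src) in INTERNAL and ip_address(dst) in INTERNAL
    }


def new_edges(baseline_flows, recent_flows):
    """Internal (src, dst, dstport) edges seen recently but absent from the baseline."""
    return sorted(internal_edges(recent_flows) - internal_edges(baseline_flows))


baseline = [("10.0.1.15", "10.0.2.20", 5432)]
recent = [
    ("10.0.1.15", "10.0.2.20", 5432),   # known application-to-database traffic
    ("10.0.1.15", "10.0.3.7", 445),     # not in the baseline
    ("10.0.1.15", "203.0.113.9", 443),  # external, ignored here
]
print(new_edges(baseline, recent))
```

The quality of the result depends entirely on how representative the baseline period is: a week that misses the monthly batch job will flag that job as new.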
Port scanning produces a distinctive pattern: many short, blocked connections from one source, spread across many destination IPs or ports. Because AWS security groups block most scan traffic before it reaches your instances, the REJECT records in your flow logs are actually your best view of scanning activity — the attacker never got in, but they still left a trace.
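The scan pattern reduces to counting distinct rejected targets per source. A minimal sketch, assuming records are dicts with the field names from the earlier example; `scan_suspects` and the threshold of 20 targets are illustrative.

```python
from collections import defaultdict


def scan_suspects(flows, min_targets=20):
    """Return {srcaddr: distinct (dstaddr, dstport) targets} for likely scanners.

    Only REJECT records are counted, since blocked attempts are what a scan
    leaves behind.
    """
    targets = defaultdict(set)
    for flow in flows:
        if flow["action"] == "REJECT":
            targets[flow["srcaddr"]].add((flow["dstaddr"], flow["dstport"]))
    return {src: len(seen) for src, seen in targets.items() if len(seen) >= min_targets}
```

Counting distinct targets rather than raw records keeps a client that retries one blocked port repeatedly from looking like a scanner.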
DGA-based malware generates domain names algorithmically, rotating through hundreds of candidate domains to evade blocklists. Typically most of those lookups fail and never produce a connection, so flow logs show only the ones that resolve: a host making short connections to many different external IPs in a short period. The problem: that pattern is not unique to malware. A client of a CDN-backed service, or of any service that rotates addresses behind short DNS TTLs, behaves similarly. You need Route 53 Resolver query logs alongside flow data to tell them apart.
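The flow-side half of this signal is a fan-out count: how many distinct external IPs one host contacted within a time bucket. The sketch below assumes flows are already filtered to external destinations and reduced to `(srcaddr, dstaddr, start)` tuples; the helper name and the 10-minute bucket are arbitrary choices. On its own it cannot separate a DGA from a CDN client, which is exactly the limitation described above.

```python
from collections import defaultdict


def fanout(flows, window=600):
    """Return {(srcaddr, bucket_start): number of distinct destination IPs}.

    flows: iterable of (srcaddr, dstaddr, start_epoch) tuples.
    window: bucket size in seconds.
    """
    seen = defaultdict(set)
    for src, dst, start in flows:
        bucket = start - start % window  # align to the start of the bucket
        seen[(src, bucket)].add(dst)
    return {key: len(dsts) for key, dsts in seen.items()}
```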
What you cannot see
Flow logs capture no payload. You cannot tell what data was transferred, identify a malware family from traffic patterns alone, or distinguish encrypted malicious traffic from encrypted legitimate traffic. If an attacker uses standard HTTPS to a cloud storage service to exfiltrate data, the flow record looks identical to a legitimate backup job.
DNS names are also absent. Flow logs record the IP address the connection reached, not the domain name that resolved to it. If malware uses fast-flux DNS — rotating through many IP addresses for the same domain — each connection appears to go to a different destination. Without DNS query logs, you end up chasing individual IPs rather than the domain driving the behaviour.
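Once DNS query logs are available, the join is conceptually simple: build an index from resolved IP to queried name, then count flows per name instead of per IP. This sketch uses simplified `(query_name, resolved_ip)` tuples as a stand-in for real resolver log records, whose actual schema should be checked before adapting it; both function names are hypothetical.

```python
from collections import defaultdict


def domains_by_ip(dns_answers):
    """Index resolved IP -> set of domain names that returned it."""
    index = defaultdict(set)
    for name, ip in dns_answers:
        index[ip].add(name.rstrip("."))  # drop the trailing dot of fully qualified names
    return index


def flows_per_domain(flows, dns_answers):
    """Count flows per (srcaddr, domain); IPs with no DNS record are grouped together.

    flows: iterable of (srcaddr, dstaddr) tuples.
    """
    index = domains_by_ip(dns_answers)
    counts = defaultdict(int)
    for src, dst in flows:
        for name in index.get(dst, {"<unresolved>"}):
            counts[(src, name)] += 1
    return dict(counts)
```

A production version would also match on time, so that a flow is only attributed to a name that resolved to that IP shortly before the connection, and would expect shared hosting IPs to map to several names at once.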
On traditional infrastructure, NetFlow is commonly sampled: a typical configuration exports only one in every 1,000 packets, and some sample even more sparsely. A low-volume beacon sending a handful of packets every few minutes will often not appear in the data at all. VPC Flow Logs do not sample packets, but some teams disable flow logging on high-traffic interfaces to control costs, and those gaps are rarely documented.
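The effect of sampling on a small beacon can be estimated directly. Assuming each packet is sampled independently with probability 1/N (an assumption; deterministic 1-in-N samplers behave somewhat differently), the chance that none of a flow's n packets is sampled is (1 - 1/N)^n. The helper name below is hypothetical.

```python
def miss_probability(packets: int, sample_rate: int) -> float:
    """Probability that none of `packets` packets is picked by 1-in-`sample_rate` sampling."""
    return (1 - 1 / sample_rate) ** packets


# One 5-packet beacon under 1-in-1,000 sampling.
single = miss_probability(5, 1000)

# A full day of 5-packet beacons every five minutes: 288 beacons, 1,440 packets.
full_day = miss_probability(288 * 5, 1000)

print(f"single beacon missed: {single:.1%}")
print(f"entire day missed:    {full_day:.1%}")
```

Under these assumptions a single beacon is missed about 99.5% of the time, and there is still roughly a one-in-four chance that a whole day of beaconing leaves no record at all.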
Finally, flow data is not real-time. Records are aggregated over a window, then delivered with an additional delay: AWS describes typical delivery times of about 5 minutes to CloudWatch Logs and about 10 minutes to S3, on a best-effort basis. In practice, with the default 10-minute aggregation, the records you read during an active incident can describe traffic from 15 to 20 minutes ago. Useful for investigation; not useful for live containment decisions.
Querying VPC Flow Logs at scale
For historical investigation, the standard setup is to publish flow logs to S3 in Parquet format and query them with Amazon Athena. Parquet stores data in typed columns, which makes large-scale queries significantly cheaper than scanning plain text.
To find servers sending unusually large amounts of data to external destinations:
SELECT srcaddr, dstaddr, dstport, SUM(bytes) AS total_bytes, COUNT(*) AS flow_count
FROM vpc_flow_logs
WHERE flow_direction = 'egress'
AND action = 'ACCEPT'
AND start BETWEEN 1748908800 AND 1748995200
GROUP BY srcaddr, dstaddr, dstport
HAVING SUM(bytes) > 1000000000
ORDER BY total_bytes DESC;
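This finds sources sending more than 1 GB (10^9 bytes) to a single destination and port over the time window. Two caveats: flow_direction is not in the default log format and must be added to your flow log fields, and egress only means the traffic left a network interface, so the query as written does not exclude internal destinations. Adjust the threshold based on what is normal in your environment: a media server and a database instance have very different baselines.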
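A query like this does not by itself guarantee that the destination is outside your network. One way to narrow the results afterwards is to drop rows whose destination falls in the RFC 1918 private ranges. The sketch below assumes query results have been loaded as dicts with a `dstaddr` key; the sample rows are made up for illustration. It lists the ranges explicitly rather than using `ipaddress`'s `is_private`, which also treats other reserved blocks as private.

```python
from ipaddress import ip_address, ip_network

# RFC 1918 private IPv4 ranges.
RFC1918 = [ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_external(addr: str) -> bool:
    """True when the address is outside every RFC 1918 range."""
    ip = ip_address(addr)
    return not any(ip in net for net in RFC1918)


# Illustrative rows shaped like the query output; values are invented.
rows = [
    {"srcaddr": "10.0.1.15", "dstaddr": "203.0.113.9", "total_bytes": 4200000000},
    {"srcaddr": "10.0.1.15", "dstaddr": "10.0.2.20", "total_bytes": 9100000000},
]
external_rows = [row for row in rows if is_external(row["dstaddr"])]
print(external_rows)
```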
Beaconing detection needs a different approach: instead of looking for large transfers, you are looking for connections that are small, frequent, and regular over an extended period.
On-prem vs. cloud
| | On-prem NetFlow | VPC Flow Logs |
|---|---|---|
| Sampling | Common (for example, 1 in 1,000 packets) | None |
| Payload | Not captured | Not captured |
| DNS names | Requires a separate DNS logging pipeline | Route 53 Resolver query logs, configured separately |
| Aggregation | Configurable per device | 1 or 10 minutes |
| Internal traffic | Depends on where monitoring taps are placed | Traffic on every monitored network interface, with a few documented exceptions |
| Cost | Hardware and storage | Per GB of data logged |
| Process info | Not available without a host agent | Not available without a host agent |
On-prem environments often have blind spots: network devices that do not support NetFlow, or traffic paths that bypass the collection points entirely. VPC Flow Logs capture nearly everything that crosses a monitored network interface (the documented exceptions include traffic to the Amazon DNS server and to the instance metadata service), which removes most of the coverage gap but moves what remains inward. Traffic that stays on the same host (between containers on one machine, or loopback connections) does not appear in VPC Flow Logs at all.