Telemetry — SNMP, Flow Logs, Metrics
Topic 61

Telemetry

You cannot operate a network you cannot see. Three telemetry families give you that visibility, and they answer genuinely different questions. SNMP polls device counters — how many bytes and errors crossed each interface, how full a queue is. Flow records — NetFlow, sFlow, IPFIX — capture who talked to whom and how much, the conversation-level traffic matrix. Streaming metrics push time-series data off the device continuously, so you watch trends in near real time instead of asking once a minute.

The mistake is treating these as competitors when they are layers. SNMP tells you an interface is running hot; flow records tell you which source and destination are driving the traffic; streaming metrics tell you it has been climbing for ten minutes and is about to saturate. Pick the wrong one for the question and you either drown in data you cannot interpret or miss the event entirely between polls. Knowing which family answers which question is most of the skill.

SNMP: poll device counters
Pull, on UDP 161, every minute or so. Bytes, errors, and queue depth per interface — device state. Universal, but counter-oriented and blind to individual conversations.
Flow records: who talked to whom
NetFlow / sFlow / IPFIX — the traffic matrix. One record per conversation: src/dst, ports, bytes. Answers "top talkers" and "what's consuming this link," often sampled 1-in-1000.
Streaming metrics: push trends
Device pushes time-series, often over gRPC. Sub-second resolution with no poll loop — you watch a counter climb in near real time instead of asking once a minute.

SNMP — Polling Counters and MIBs

SNMP, the Simple Network Management Protocol, is the legacy workhorse. A collector polls each device on UDP port 161, asking for the value of named counters: bytes in and out per interface, input and output errors, discards, CPU and memory. The names live in a MIB, a Management Information Base, which maps human-readable objects to numeric OIDs — object identifiers like 1.3.6.1.2.1.2.2.1.10 for inbound octets on an interface. Devices can also push unsolicited traps when something crosses a threshold, instead of waiting to be polled.

SNMP's strength is universality: virtually every switch, router, firewall, and printer speaks it. Its weakness is that it is pull-based and counter-oriented. You learn an interface's byte count, but only as often as you poll, and you learn nothing about the individual conversations inside that traffic. The cardinal sin is leaving the read community string at the default "public", which lets anyone who can reach the device walk everything its MIBs expose: interfaces, addresses, routing and ARP tables, and usually enough detail to reconstruct the topology.

# walk the whole interface table by MIB name
# SNMP_RO_COMMUNITY holds your own read-only string, never the defaults "public" or "private"
snmpwalk -v2c -c "$SNMP_RO_COMMUNITY" 10.0.0.1 IF-MIB::ifTable
# IF-MIB::ifDescr.2       = STRING: GigabitEthernet0/1
# IF-MIB::ifInOctets.2    = Counter32: 842139961   (Counter32 tops out at 4294967295, then wraps)
# IF-MIB::ifInErrors.2    = Counter32: 1842        <- rising = bad cable/optic
# IF-MIB::ifOutDiscards.2 = Counter32: 90571       <- output queue dropping
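
Raw counters only become useful once two polls are turned into a rate. The sketch below is a minimal illustration of that arithmetic, not any particular collector's implementation; the function names are hypothetical, and it assumes the counter wrapped at most once between polls.

```python
COUNTER32_MODULUS = 2**32  # a Counter32 wraps to 0 after 4294967295


def counter_delta(prev: int, curr: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two readings of a wrapping counter.

    Assumes the counter wrapped at most once between the two polls.
    """
    return (curr - prev) % modulus


def bits_per_second(prev_octets: int, curr_octets: int, interval_s: float) -> float:
    """Average bit rate over the poll interval, from two octet-counter readings."""
    return counter_delta(prev_octets, curr_octets) * 8 / interval_s


def seconds_until_wrap(link_bps: float, modulus: int = COUNTER32_MODULUS) -> float:
    """How long a saturated link takes to wrap an octet counter."""
    return modulus * 8 / link_bps


# 750,000,000 octets in 60 s is 100 Mbit/s on average.
print(bits_per_second(0, 750_000_000, 60))
# A saturated 1 Gbit/s link wraps a 32-bit octet counter in about 34 seconds.
print(round(seconds_until_wrap(1e9), 1))
```

The second number is why a one-minute poll of a 32-bit octet counter cannot be trusted on gigabit links: the counter can wrap more than once between polls. IF-MIB provides 64-bit equivalents such as ifHCInOctets for that case.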

Flow Records — NetFlow, sFlow, IPFIX

Where SNMP gives device state, flow records give the traffic matrix. Each record summarizes one conversation: source and destination IP, ports, protocol, packet and byte counts, start and end time. Export this from every router and you can answer "who are my top talkers," "what is consuming this link," and "did this host suddenly start scanning the internet." NetFlow (originally Cisco) and the vendor-neutral IPFIX standard build records in a flow cache on the device, tracking every packet or, on fast links, a sampled subset. sFlow takes a different approach, sampling 1-in-N packets and exporting the raw headers.
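
A top-talker report is an aggregation over those records. The sketch below assumes the records have already been decoded into a simple in-memory structure; the FlowRecord type, its field names, and the sample addresses are illustrative, not a real NetFlow or IPFIX schema.

```python
from collections import Counter
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """Hypothetical decoded flow record; real exports carry more fields."""
    src: str
    dst: str
    dst_port: int
    protocol: str
    byte_count: int


def top_talkers(records, n=3):
    """Sum bytes per (source, destination) pair and return the n largest."""
    totals = Counter()
    for record in records:
        totals[(record.src, record.dst)] += record.byte_count
    return totals.most_common(n)


records = [
    FlowRecord("10.0.1.5", "10.0.9.20", 443, "tcp", 9_000_000),
    FlowRecord("10.0.1.7", "10.0.9.20", 443, "tcp", 1_200_000),
    FlowRecord("10.0.1.5", "10.0.9.20", 443, "tcp", 4_000_000),
    FlowRecord("10.0.2.9", "10.0.9.31", 53, "udp", 40_000),
]

for (src, dst), total in top_talkers(records, n=2):
    print(f"{src} -> {dst}: {total} bytes")
```

Keying on the source alone, or on the destination port, answers different questions from the same records, which is the practical meaning of "traffic matrix."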

Sampling is the central tradeoff. Full flow accounting on a 100 Gbps core link can generate millions of records per second and bury the collector, so high-rate interfaces are usually sampled — 1-in-1000 or 1-in-4000 packets. Sampling is statistically fine for elephant flows and top-talker reports, but it will miss small, short conversations entirely, so it is the wrong tool for security forensics where every connection matters. Match the sampling rate to the question: coarse for capacity planning, full capture where you must account for every flow.
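
The tradeoff can be put in numbers. The sketch below assumes independent random 1-in-N packet sampling, which is a simplification of how real devices sample; the function names are hypothetical.

```python
def estimate_total(sampled_count: int, sampling_rate: int) -> int:
    """Scale a sampled packet or byte count back up to an estimated total."""
    return sampled_count * sampling_rate


def miss_probability(flow_packets: int, sampling_rate: int) -> float:
    """Chance that none of a flow's packets is sampled.

    Assumes each packet is sampled independently with probability 1/N.
    """
    return (1 - 1 / sampling_rate) ** flow_packets


# A 10-packet connection at 1-in-1000 is almost always invisible.
print(f"{miss_probability(10, 1000):.4f}")
# A 100,000-packet elephant flow is essentially never missed.
print(f"{miss_probability(100_000, 1000):.1e}")
# 250 sampled packets at 1-in-1000 suggests roughly 250,000 real packets.
print(estimate_total(250, 1000))
```

Under these assumptions a ten-packet connection goes unrecorded about 99% of the time, which is why sampled flow suits capacity reports and fails forensics.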

Streaming Telemetry — Push Instead of Poll

Polling does not scale. To watch a counter at one-second resolution by polling, the collector must hit every device every second, and at thousands of devices and tens of thousands of interfaces the poller becomes the bottleneck and the data is already stale by the time it lands. Streaming telemetry inverts the flow: the collector subscribes once to the metrics it wants, and the device pushes updates on its own schedule, often via gRPC, in a structured encoding. No poll loop, no per-request overhead, sub-second resolution.
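
A back-of-envelope calculation shows how quickly the poll load grows as the interval shrinks. The fleet size below (2,000 devices, 48 interfaces each, 6 counters per interface) is an invented example, and the function name is hypothetical.

```python
def values_per_second(devices: int, interfaces_per_device: int,
                      counters_per_interface: int, interval_s: float) -> float:
    """Counter values the poller must request and process each second."""
    return devices * interfaces_per_device * counters_per_interface / interval_s


# Hypothetical fleet, chosen only to illustrate the scaling.
fleet = dict(devices=2000, interfaces_per_device=48, counters_per_interface=6)

for interval in (300, 60, 1):
    rate = values_per_second(**fleet, interval_s=interval)
    print(f"{interval:>3}s poll interval -> {rate:,.0f} counter values/s")
```

Going from a one-minute to a one-second interval multiplies the request load by 60, and every one of those values costs the poller a round trip it has to initiate and wait on.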

The model matches how modern metrics systems already work: a time series identified by labels, sampled at a fixed interval, stored for querying. The win at scale is decisive: the device sends each subscribed value on a set cadence, or only when it changes for on-change subscriptions, and the collector spends its budget storing and indexing rather than asking. The cost is that streaming telemetry is newer, less universal than SNMP, and varies more by vendor, so most real networks run both: SNMP for the long tail of old gear, streaming for the modern core.

What to Collect

Start with the signals that predict user pain. Per-interface, the load-bearing four are throughput (bytes in/out), errors (CRC errors, which mean a bad cable or optic), discards/drops (which mean an output queue filled, whether from sustained saturation or from a short burst), and utilization against capacity. Errors and discards are the ones people forget to graph, and they are precisely the ones that explain "the link isn't full but it's dropping packets."
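
Read together, the four signals narrow down the cause. The sketch below is one possible triage rule, not an established standard: the InterfaceSample type, the 0.8 utilization threshold, and the wording of the findings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class InterfaceSample:
    """Hypothetical per-interface rates derived from two counter polls."""
    utilization: float   # fraction of link capacity, 0.0 to 1.0
    error_rate: float    # input errors per second
    discard_rate: float  # output discards per second


def diagnose(sample: InterfaceSample, high_util: float = 0.8) -> list[str]:
    """Map the four core signals to a likely cause (illustrative thresholds)."""
    findings = []
    if sample.error_rate > 0:
        findings.append("errors rising: check cable or optic")
    if sample.discard_rate > 0:
        if sample.utilization >= high_util:
            findings.append("discards at high utilization: link saturated")
        else:
            findings.append("discards below line rate: suspect bursts or queue limits")
    if not findings and sample.utilization >= high_util:
        findings.append("high utilization, no drops yet")
    return findings


# A link at 40% utilization that is still taking CRC errors.
print(diagnose(InterfaceSample(utilization=0.4, error_rate=2.0, discard_rate=0.0)))
```

Without the error and discard rates, the 40% case produces no finding at all, which is the "mystery" the paragraph above describes.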

Above the interface, collect flow volume and the top-talker list so you can attribute traffic to sources, plus saturation signals that are not raw bandwidth — conntrack-table fill, NAT-pool exhaustion, queue depth. The discipline is to collect with intent: gathering every available counter and then alerting on none of them is the most common failure, producing dashboards nobody reads and outages nobody is paged for. Collect what you would act on, and graph the rest only if you will actually look at it during an incident.

SNMP vs Flow Records vs Streaming Telemetry

SNMP pulls device-state counters — interface bytes, errors, CPU — on a poll interval. Choose it for device health and for the universality of "everything speaks it," accepting that it tells you nothing about individual conversations and is only as fresh as your poll.

Flow records capture who talked to whom and how much, the conversation-level traffic matrix. Choose them to answer top-talkers, link attribution, and traffic patterns — full capture for forensics, sampled for capacity planning on high-rate links.

Streaming telemetry pushes time-series metrics off the device continuously at sub-second resolution. Choose it for real-time trends at scale, where per-poll overhead would make SNMP the bottleneck — accepting that it is newer and less uniform across vendors.

Common Mistakes
  • Leaving the SNMP read community at the default "public". Anyone who can reach UDP 161 can then walk the device's full interface list, routing table, and whatever else its MIBs expose: a topology and inventory leak handed out for free.
  • Enabling unsampled full flow export on a high-rate core link. The device floods the collector with millions of records per second, the collector drops them, and you end up with gaps in exactly the data you turned on flow to get.
  • Polling interface counters every five minutes and wondering why microbursts are invisible. A burst that saturates a link for 200 ms and drops packets averages away to nothing across a 300-second poll, so the drops show up but the cause never does.
  • Collecting every available metric and alerting on none of them. Mountains of telemetry with no thresholds or routing produce dashboards nobody watches and outages nobody is paged for — visibility you paid to store and never use.
  • Graphing throughput but not errors and discards. The link sits at 40% utilization and still drops packets from CRC errors or a full output queue, and without the error counters the trouble looks like a mystery instead of a bad optic.
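
The microburst case in the list above is simple arithmetic. The sketch below uses the 200 ms burst and 300-second poll from that item; the 40% baseline utilization and the function name are assumptions added for illustration.

```python
def average_utilization(burst_s: float, burst_util: float,
                        baseline_util: float, window_s: float) -> float:
    """Time-weighted average utilization over one measurement window."""
    quiet_s = window_s - burst_s
    return (burst_s * burst_util + quiet_s * baseline_util) / window_s


# A 200 ms burst at line rate on a link that otherwise sits at 40%.
five_minute = average_utilization(0.2, 1.0, 0.40, 300)
one_second = average_utilization(0.2, 1.0, 0.40, 1)

print(f"300 s window: {five_minute:.4f}")
print(f"  1 s window: {one_second:.4f}")
```

Over the five-minute window the burst moves the average by a few hundredths of a percent, while a one-second window shows a clear jump. Only the shorter window gives the discard counter a visible cause to sit next to.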

Best Practices
  • Change every SNMP community string off the defaults and restrict polling to the collector's source addresses, or move to SNMPv3 with authentication and encryption where the gear supports it.
  • Sample flow on high-rate links (1-in-1000 or coarser) for capacity work, and reserve full, unsampled flow for the segments where security forensics needs every connection accounted for.
  • Collect errors, discards, and utilization on every interface, not just throughput, so a link that drops packets below line rate explains itself instead of becoming an investigation.
  • Prefer streaming telemetry on the modern core for sub-second resolution and run SNMP alongside it for the long tail of devices that do not stream, rather than forcing one model everywhere.
  • Collect with intent — instrument the signals you would actually act on during an incident — and resist the urge to store every counter just because the device exposes it.
Comparable concepts: Prometheus (pull/push metrics model); cloud VPC flow logs (traffic records)

Knowledge Check

A link is at 60% utilization but a host on it complains of slowness. What does flow data reveal that SNMP interface counters cannot?

  • Which source-destination conversations make up the traffic, so you find the talker crowding the host
  • A far more accurate total byte count for the interface than the SNMP counter is able to provide
  • The internal CPU temperature of the physical device that is forwarding the traffic across the congested link
  • The ability to reprioritize the host's packets ahead of others

Why is 1-in-4000 sampled flow a poor fit for security forensics, even though it works for capacity planning?

  • Sampling misses small short-lived flows, and a malicious connection is exactly such a flow
  • Sampling overwhelms the forensic collector with far more records than full unsampled capture ever would
  • Sampled flow records omit the source and destination IP address fields needed to trace an attacker
  • Sampled flow data is encrypted and cannot be inspected after export

What is the core reason streaming telemetry scales better than SNMP polling for sub-second metrics?

  • The device pushes updates itself, removing the per-device poll loop that bottlenecks the collector
  • It collects far fewer distinct metrics than SNMP does, so there is much less data to store and index
  • It runs on a dedicated out-of-band management network that SNMP polling is unable to use at all
  • The device stores the full metric history locally, so the collector reads it once
