
[Win32] High CPU load #1278

Open
gvanem opened this issue Jan 12, 2025 · 16 comments
@gvanem (Contributor) commented Jan 12, 2025

I'm successfully running tcpdump (or windump.exe) on my main Win-10 PC (AMD, 3.9 GHz) just fine.
But after installing Win-11 on a slow Intel i3 CPU, 2.3 GHz (yes, it's possible), I find that tcpdump.exe runs with
a CPU load of approx. 24%, which almost makes the PC unusable.
I also use the latest Npcap 1.80 on this Win-11 PC.

The high CPU load is AFAICS due to PacketReceivePacket() returning immediately when there are no packets to receive. No?
So adding an option --blocking to tcpdump.exe:

--- a/tcpdump.c 2025-01-12 13:22:42
+++ b/tcpdump.c 2025-01-01 15:54:41
@@ -165,6 +165,7 @@
 #endif

 static int Bflag;                      /* buffer size */
+static int blocking;                   /* '--blocking' option used */
 static int64_t Cflag;                  /* rotate dump files after this many bytes */
 static int Cflag_count;                        /* Keep track of which file number we're writing */
 static int Dflag;                      /* list available devices and exit */
@@ -646,6 +647,7 @@
 #define OPTION_LENGTHS                 138
 #define OPTION_TIME_T_SIZE             139
 #define OPTION_SKIP                    140
+#define OPTION_BLOCKING                141

 static const struct option longopts[] = {
        { "buffer-size", required_argument, NULL, 'B' },
@@ -656,6 +658,7 @@
        { "help", no_argument, NULL, 'h' },
        { "interface", required_argument, NULL, 'i' },
        { "monitor-mode", no_argument, NULL, 'I' },
+       { "blocking", no_argument, NULL, OPTION_BLOCKING },
 #ifdef HAVE_PCAP_SET_TSTAMP_TYPE
        { "time-stamp-type", required_argument, NULL, 'j' },
        { "list-time-stamp-types", no_argument, NULL, 'J' },
@@ -1259,6 +1262,10 @@
                        return (NULL);
                error("%s", ebuf);
        }
+       status = pcap_setnonblock(pc, !blocking, ebuf);
+       if (status != 0)
+          error("%s", ebuf);
+
 #ifdef HAVE_PCAP_SET_TSTAMP_TYPE
        if (Jflag)
                show_tstamp_types_and_exit(pc, device);
@@ -1808,6 +1815,11 @@
                        ++Iflag;
                        break;

+               case OPTION_BLOCKING:
+                       blocking = 1;
+                       timeout = 0;
+                       break;
+
 #ifdef HAVE_PCAP_SET_TSTAMP_TYPE
                case 'j':
                        jflag = pcap_tstamp_type_name_to_val(optarg);

the CPU-load on the Win-11 PC decreases to only 0.1% on average.

Does this make sense?
I'm not sure about the relationship between a timeout of 0 and a call to pcap_setnonblock(pc, 0).

And BTW, I find no call to pcap_setnonblock() in Wireshark either.

@guyharris (Member)

The high CPU load is AFAICS due to PacketReceivePacket() returning immediately when there are no packets to receive.

That was d717937, which was originally done as a response to #525.

A better response would be to have, perhaps, a .1-second timeout, which is what Wireshark uses - or, at least, to do that when packets are being printed. (Perhaps that's even good enough when they're only being written to a file, given that the buffer is no longer the default system size as on *BSD capture: the faster CPUs of today should be better able to handle packets being delivered every .1 second, and the faster networks of today should be more likely to fill up even a larger kernel packet buffer within .1 second, than was the case in 1992 or so.)

@guyharris (Member)

Oh, wait, that's what I already did in 2cd0a90, in 2020.

Is this on current versions of tcpdump? Or is it in pre-2cd0a90c24ccf01ad9a034d7d5a6a651c82a4785 versions? 2cd0a90 and post-2cd0a90c24ccf01ad9a034d7d5a6a651c82a4785 versions should only run in immediate mode if the user requests it with --immediate-mode.

@gvanem (Contributor, Author) commented Jan 12, 2025

Is this on current versions of tcpdump?

Yes. From git master of yesterday.

@guyharris (Member)

Is this on current versions of tcpdump?

Yes. From git master of yesterday.

OK, so that's probably Windows-specific; I threw in a quick test fprintf() to make sure tcpdump wasn't setting immediate mode, and it printed nothing when I tested on macOS. It's probably something in pcap-npf.c that's provoking it in tcpdump but not in Wireshark's dumpcap.

@guyharris (Member)

Another possible cause of higher CPU usage when printing packets is that the Visual Studio C library's notion of "line-buffered" output is "one character written, one WriteFile() call", i.e., unbuffered. "Line-buffered" is the default when writing to a terminal, which might mean "a console" as well as "a serial port". -U, at least in the main branch, does "packet-buffering" even when printing packets, even on Windows, i.e., it does full buffering and an fflush(stdout) after each packet.
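The "packet-buffering" approach described above can be sketched in plain C: switch the stream to full buffering, then flush once per record instead of relying on the C library's line-buffering. This is an illustrative sketch, not tcpdump's actual code; the function names here are hypothetical.

```c
/* Sketch of "packet-buffering": full buffering plus an explicit flush
 * after each record, instead of per-character writes.  Illustrative only;
 * use_packet_buffering() and print_packet_line() are hypothetical names. */
#include <stdio.h>

static char outbuf[8192];

/* Switch a stream to full buffering; returns 0 on success. */
int use_packet_buffering(FILE *fp)
{
    return setvbuf(fp, outbuf, _IOFBF, sizeof(outbuf));
}

/* Print one "packet" line, then flush so the record appears promptly;
 * the result is one WriteFile()-sized write per packet, not per character. */
void print_packet_line(FILE *fp, const char *line)
{
    fputs(line, fp);
    fputc('\n', fp);
    fflush(fp);
}
```

Note that setvbuf() must be called before any other I/O on the stream for the buffering change to take effect.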

@guyharris (Member) commented Jan 12, 2025

Does this make sense? I'm not sure about the relationship between a timeout = 0 and a call to pcap_setnonblock(pc, 0).

Every packet capture mechanism is weird in its own way. (Hat tip to Tolstoy. :-)

As per https://npcap.com/guide/npcap-internals.html, the NPF code from WinPcap and Npcap has a single circular buffer (unlike the BPF capture mechanism's pair of buffers, or the Linux PF_PACKET/TPACKET_V3 ring of buffers). There's a "minimum number of bytes to copy" parameter and a read timeout; a ReadFile() from the capture device will complete if either 1) the minimum number of bytes to copy has arrived in the buffer since the last read or 2) the timeout has expired. By default, the read timeout is 1 second, and the minimum amount of data copied between the kernel and the application is 16K.
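The read-completion rule just described can be modeled as a small predicate. This is a sketch of the behavior as described, not Npcap's actual driver code:

```c
/* Model of the NPF read-completion rule: a read completes when the buffer
 * holds at least min_to_copy bytes, or when the read timeout has expired.
 * Hypothetical helper, not Npcap source. */
#include <stdbool.h>
#include <stddef.h>

bool npf_read_completes(size_t bytes_buffered, size_t min_to_copy,
                        unsigned elapsed_ms, unsigned timeout_ms)
{
    if (bytes_buffered >= min_to_copy)
        return true;                  /* enough data has arrived */
    return elapsed_ms >= timeout_ms;  /* otherwise, wait out the timeout */
}
```

With the defaults (16K minimum, 1-second timeout), a read with 20000 bytes buffered completes immediately, while one with 100 bytes buffered completes only once the full second has elapsed; with immediate mode's minimum of 0, every read completes at once.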

pcap_set_immediate_mode() causes the minimum amount to copy to be set to 0, so there is no minimum. Otherwise, it's set to 16000 bytes (yes, 16 KB, not 16 KiB; that dates back to WinPcap).

The packet buffer size on Windows, in libpcap, is 1000000 bytes (1 MB - not 1 MiB).

WinPcap dates back to at least the early 2000s, if not earlier; 16 KB seems like a rather small "minimum amount to transfer". The default buffer size from the early-90s BPF is, I think, 4 KiB, which is absurd by modern standards; the maximum, these days, is somewhere between 512 KiB and 16 MiB, and libpcap sets it as high as it can.

There's a {WinPcap,Npcap}-only libpcap API to set the minimum amount to copy, pcap_setmintocopy(). Try just cranking that up to, say, 256 KiB. If that works, that's what libpcap (and Npcap) should make the default. (The default buffer size should probably be larger as well.)

@guyharris (Member)

And BTW, I find no call to pcap_setnonblock() in Wireshark either.

There aren't any. There aren't any calls to pcap_setmintocopy(), either, so I'm not sure why Wireshark isn't seeing this.

@guyharris (Member)

I'm not sure about the relationship between a timeout = 0 and a call to pcap_setnonblock(pc, 0).

There isn't one.

If the timeout set by PacketSetReadTimeout() is 0, PacketReceivePacket() will call WaitForSingleObject() with a timeout of INFINITE before calling ReadFile() on the NPF device.

If the timeout in question is > 0, PacketReceivePacket() will call WaitForSingleObject() with the specified timeout before calling ReadFile() on the NPF device.

If the timeout in question is -1, PacketReceivePacket() will not call WaitForSingleObject() before calling ReadFile() on the NPF device.

ReadFile() from that device will not, itself, block, so setting the timeout to -1 turns non-blocking mode on.
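The three cases above can be encoded as a small sketch (hypothetical names; this is just the described dispatch, not packet.dll source):

```c
/* Sketch of the wait behavior of PacketReceivePacket() as described above,
 * keyed off the read timeout set via PacketSetReadTimeout().  The enum and
 * function names are hypothetical. */
enum wait_behavior {
    WAIT_INFINITE,   /* timeout == 0: WaitForSingleObject(..., INFINITE) */
    WAIT_TIMED,      /* timeout  > 0: WaitForSingleObject(..., timeout)  */
    WAIT_NONE        /* timeout == -1: no wait; the read won't block     */
};

enum wait_behavior wait_for_read(int timeout_ms)
{
    if (timeout_ms == 0)
        return WAIT_INFINITE;
    if (timeout_ms > 0)
        return WAIT_TIMED;
    return WAIT_NONE;   /* -1 means don't wait at all */
}
```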

@guyharris (Member)

There isn't one.

Well, I guess setting it to any non-zero value has the effect of turning off non-blocking mode, so that's the connection. opt.timeout in the pcap_t is the value to which the timeout was set. The timeout has no effect in non-blocking mode, as a timeout that specifies a maximum amount of time that a read will block doesn't do anything if the read won't block. If you call pcap_setnonblock(pc, 0) to turn non-blocking mode off, the timeout will be set to the opt.timeout value.

The pcap-npf.c code in libpcap is what calls PacketSetReadTimeout() with -1, if non-blocking mode is being turned on, or with the opt.timeout value, when the device is activated or when non-blocking mode is being turned off.
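That interaction can be summed up in one hypothetical helper (a model of the described behavior, not the pcap-npf.c source):

```c
/* Model of how non-blocking mode and opt.timeout interact, per the
 * description above: non-blocking mode forces the driver-level read
 * timeout to -1; blocking mode (re)applies the configured opt.timeout.
 * Hypothetical helper, not libpcap code. */
int effective_read_timeout(int opt_timeout, int nonblock)
{
    if (nonblock)
        return -1;          /* don't wait before ReadFile() */
    return opt_timeout;     /* wait up to opt_timeout ms (0 = forever) */
}
```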

@gvanem (Contributor, Author) commented Jan 13, 2025

OK, very well. But what's the option in tcpdump.exe to reduce the CPU load on this Intel i3 PC?
AFAICS, without the --blocking option added above, there is none.

@guyharris (Member)

But what's the option in tcpdump.exe to reduce the CPU-load for this Intel I3 PC?

The one that fixes the problem that increases the CPU time.

Ideally, the one that does so without introducing any new options.

The question is "where is that CPU time going?"

Your option sets the timeout to 0, meaning it causes reads from the pcap_t to block indefinitely. If you're just capturing packets and writing them to a capture file without dissecting them or otherwise inspecting them or modifying them, then disabling the timeout reduces the number of wakeups and reads, so that might be what makes the difference.

However, setting the timeout to 0 causes captures on Linux and, I think, all systems using the BPF capture mechanism, as well as Solaris 10 with DLPI, to block without a timeout as well.

If that significantly reduces CPU time on tcpdump -i {interface} -w {file} on all of those platforms, then it might be a good idea to use it if possible. Unfortunately:

  1. counting packets is a form of inspecting them, so tcpdump -c {count} -i {interface} -w {file} can't use it - if the number of packets in the in-kernel buffer is >= the remaining number of packets to be written, it could take an indefinite amount of time for those remaining packets to be delivered to tcpdump if there's no timeout;
  2. this requires that a ^C interrupt be able to provoke a waiting read to deliver packets and wake tcpdump up, which might be possible on Linux and Windows (where pcap_breakloop() can poke an event to make data available), but I'm not sure there's a way to do it reliably on all systems using BPF (maybe changing the packet timeout to a very small value will do it, but I'm not certain that it will, and think it might not) or on pre-11 Solaris with DLPI.

If it doesn't significantly reduce CPU time on other platforms, that may be either 1) an issue with NPF or 2) an issue with how pcap-npf.c uses NPF.

If this happens without -w, so that tcpdump is printing packets, and with -l, so that output buffering is turned off, but doesn't happen with -w, so that tcpdump isn't printing packets, it may be a result of VS's library not really implementing line-buffering, requiring us to implement it as no buffering. If that's the case, see what difference it makes to have tcpdump set _IOFBF for the -l flag and then have pretty_print_packet() do fflush(stdout) after each packet it prints (that's packet-buffering rather than line-buffering, but it's good enough for the test).

If that happens even with -w, it might be interesting to see whether setting the minimum-to-copy value to, for example, 256 KB (as in 256000), by having tcpdump call pcap_setmintocopy(p, 256000) right after pcap_activate() succeeds. That will reduce the number of wakeups that occur, even without a timeout (especially without a timeout). If that's the case, see what happens with the larger minimum-to-copy value without setting the timeout to 0. If even that improves things, it's time we made that change in libpcap, which would mean that, once that's in Npcap, there wouldn't need to be an option to reduce the CPU load, it would be standard.
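The suggested experiment could look like the sketch below. pcap_setmintocopy() exists only in the WinPcap/Npcap flavor of libpcap, so the call is guarded; the function name raise_min_to_copy() is hypothetical, and the idea is simply to call it right after pcap_activate() succeeds.

```c
/* Sketch of the suggested experiment: raise the NPF driver's
 * minimum-to-copy value to 256 KB (as in 256000 bytes) so that fewer
 * wakeups occur per byte captured.  pcap_setmintocopy() is a
 * WinPcap/Npcap-only API, hence the _WIN32 guard; raise_min_to_copy()
 * is a hypothetical helper, not tcpdump code. */
#ifdef _WIN32
#include <pcap.h>

int raise_min_to_copy(pcap_t *pc)
{
    return pcap_setmintocopy(pc, 256000);
}
#else
int raise_min_to_copy(void *pc)
{
    (void)pc;   /* no-op off Windows; this knob does not exist there */
    return 0;
}
#endif
```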

Otherwise, it'd probably be time to see whether the Npcap developers can figure out whether there's something being done inefficiently that needs to be fixed.

(I.e., I don't think the problem is that there needs to be an option to reduce the CPU load, I think the problem is that there needs to be fixes to some component of the Windows capture code, whether it's in the libpcap code or the packet.dll/NPF driver code or both, to reduce the CPU load by default.)

@gvanem (Contributor, Author) commented Jan 13, 2025

The question is "where is that CPU time going?"

Indeed. I fail to see that it's VS's line-buffering.
But more testing with this --blocking option shows it's a bad idea; packets get postponed for far too long.

@infrastation (Member)

It could help to state the exact steps to reproduce the problem on one system and the normal operation on the other.

@guyharris (Member)

It could help to state the exact steps to reproduce the problem on one system and the normal operation on the other.

For the other system, the claim was that "I'm successfully running tcpdump (or windump.exe) on my main Win-10 PC (AMD 3.9 GHz) just fine." Do "successfully" and "just fine" mean that the CPU percentage is at the 0.1% level that it was on the other machine with the change, or does it mean that it was a higher percentage but the other system had more CPU to spare?

I.e., was it using more CPU than it should on both machines, but was that less of an issue on the faster machine?

@gvanem (Contributor, Author) commented Jan 14, 2025

I.e., was it using more CPU than it should on both machines, but was that less of an issue on the faster machine?

Correct. If tcpdump.exe must show packets without a large lag, and CPU usage suffers because of that,
it's not a big deal for me. I bought this Intel i3 PC for $50. But I was hoping for a better balance of CPU usage and responsiveness.

@guyharris (Member) commented Jan 14, 2025

But I was hoping for a better balance of CPU-usage and responsiveness.

Right now, I'm hoping to figure out whether one or more of the knobs for the NPF driver is inappropriately set (whether by pcap-npf.c or something else) and causing more CPU usage than required for the desired level of responsiveness.

What level of traffic was the i3 PC receiving and sending?
