Monday, 1 February 2016

[Humbledown highlights] VLAN tag stripping in Virtualbox (actually, Intel NICs et al.)

This is historical material from my old site, but as I have just bumped into a page that linked to it, I thought I would republish it. 
I have not verified that this material is still accurate.
Feel free to post an update as a comment and I'll publish it.

VLAN tag stripping in Virtualbox (actually, Intel NICs et al.)

Mon Jan 17 21:41:46 NZDT 2011
Short version: When using VirtualBox “Internal Network” adaptors in VLAN environments, don't use the “Intel PRO/1000” family of adaptors, which are the default for some operating system types. Instead, use either the Paravirtualised adaptor (which requires your guest to have virtio drivers) or the “AMD PCNet” family of adaptors.
This is because the “Intel PRO/1000” family of adaptors strips the VLAN tags. This is not a problem specific to VirtualBox: it also occurs in other virtualisation products and on native systems, although you hear about it more often on virtualisation hosts because it is more common to want to expose such a host to a VLAN trunk. There are some Windows registry settings for changing this in Intel’s Windows drivers, but currently there are no published mechanisms for Linux systems (using the e1000 kernel module).
In preparing materials for this year’s TELE301, I found that a recent upgrade of VirtualBox had broken my VLAN lab environment. This post shows the problem, how it was diagnosed, and its solution.


I had configured the physical topology shown in the diagram below, but although C1 could ping R1, R1 could not ping R2 (or any other similar interaction among the VLANs hosted on Switch 2).
Tcpdump on R2’s eth1 (note: not eth1.10 or similar) was showing the frames received as untagged, and ARP requests would time out. Thus, the frames were either not being tagged on R1, stripped in transit to R2, stripped by R2, or simply not shown by tcpdump.
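To check whether tags are present on the wire at all, it helps to have tcpdump print the link-layer headers; a minimal sketch, assuming the interface of interest is eth1:
sudo tcpdump -e -n -i eth1         # -e prints the Ethernet header, including any 802.1Q tag
sudo tcpdump -e -n -i eth1 vlan    # show only frames that actually carry a VLAN tag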


In previous versions of VirtualBox, the “AMD PCNet FAST III” was the default adaptor, because it was widely supported. Later versions of VirtualBox added support for a few adaptors in the “Intel PRO/1000” family, for compatibility reasons (see VirtualBox User Manual Section 6: “Virtual Networking”).
Using the “AMD PCNet FAST III” works as expected; the ARP is replied to, the ping -n works, and I see vlan 10 in the tcpdump output on R2’s eth1 interface.
On guest systems with virtio (paravirtualised) driver support, the Paravirtualised adaptor ought to be used instead; it also works fine and should deliver better performance than the emulated PCNet adaptor.
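The adaptor type can be changed from the GUI or with VBoxManage; a hedged sketch, where the VM name “R1” and NIC number 2 are assumptions for illustration:
VBoxManage modifyvm "R1" --nictype2 Am79C973   # AMD PCNet FAST III
VBoxManage modifyvm "R1" --nictype2 virtio     # or the paravirtualised adaptor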
One thing to be aware of: if you attach a packet capture to an adaptor (i.e. VBoxManage modifyvm name --nictraceN=1 --nictracefileN=adaptor1.pcap), then you will not see the VLAN tags in that trace, although the tag is seen when the frame reaches the destination machine on the internal network… though I do need to double-check that this is still the case when using the paravirtualised adaptor.
In Vyatta, you can verify that you are using a pcnet-driven card with the command show interfaces ethernet ethX physical: the "driver" line should show "pcnet32" in the case of the AMD PCNet adaptor. For the paravirtualised driver you will get output resembling an error, as there is no “physical” adaptor being emulated.
Note that if you still want to use VLANs in a network with Intel adaptors, that is fine, so long as you don't expose a trunk port (one on which frames are tagged with a VLAN identifier) to the adaptor. If you want to tweak registry settings on Windows, you can do that, but at this time there appears to be no such control surface for Linux hosts.
[Update: 18 Jan 2011] Sasquatch, over on the VirtualBox user forums, points out:
The reason the VLAN tag is stripped is because the Intel adapters support VLAN tagging and needs to be set in the adapter properties. When you don’t provide such tag, the default is used (untagged). If you don’t have the VLAN tab in the adapter properties, you haven’t installed the advanced features of it. Grab the driver from the Intel website and install the full package, that will provide VLAN tagging. Using the .inf will not provide the VLAN tab or other features.
I tried this briefly today, in a Windows 7 guest in VirtualBox, but unfortunately Intel’s drivers detect that there is no real Intel adaptor installed in the system. I’ve tried two of the three available adaptors in VirtualBox, although none of them are really modern from an Intel perspective, so perhaps an older version of Windows might work better. I wonder if the installation program takes any switches…

Other notes and tidbits

  • On Vyatta, show interfaces ethernet eth1 physical reports that the ‘Intel PRO/1000 MT Server (82545EM)’ series is being driven by the ‘e1000’ driver. show version reports the Vyatta version as VC6.1-2010.10.16, with kernel version 2.6.32-1-586-vyatta-virt.
  • sudo ethtool --driver eth1 shows the driver version as 7.3.21-k5-NAPI
  • There are two separate issues worth mentioning with respect to the VLAN tag stripping behaviour on Linux. The first is the stripping of VLAN tags on incoming frames when using traffic-capture tools; this should not be a problem under Linux, as the driver automatically disables VLAN tag stripping (in hardware) when the device enters promiscuous mode.
    The other issue, discussed in this document, is tags being stripped on outgoing packets. The reason is much harder to fathom; perhaps it is due to a feature of the card called “Native VLAN”, which simply means that frames on an adaptor’s home VLAN are not tagged.
  • This issue is not particular to Vyatta, or even to Linux, but rather seems to be a common one, particularly with Intel cards but also with others, such as the Marvell Yukon. Other possibly interesting points of reference:
    VLAN testing with Cisco Catalyst 4006 not going so well
    (turns out a registry tweak was needed to turn off 'VLAN filtering')
    Network Connectivity — My sniffer is not seeing VLAN, 802.1q, or QoS tagged frames
    (Knowledge-base article from Intel; another registry tweak for “monitor mode”)
    README.txt from Intel for the e1000 driver module
    (note that it mentions that "Native VLANs" are supported in this version... that seems like a feature likely to automatically strip VLAN tags...)
    GNS3 — Turning off VLAN tag stripping on Marvell Yukon NIC cards
    This is not a problem restricted to Intel PRO/1000 cards. For example, Marvell Yukon cards can have a similar problem; the solution (another registry tweak) for that card type is discussed for Windows.
  • When e1000 is used, only outgoing frames lose their tags; tcpdump still shows tagged frames coming in from the network.
  • Even if configured using just vconfig add eth1 40 and ifconfig (i.e. no Vyatta tools used to configure it), it still fails to tag egress packets (a modern iproute2 equivalent is sketched after this list).
  • I tried removing the module and reinserting it with debugging enabled.
    /sbin/modinfo e1000
    sudo modprobe -r e1000
    sudo modprobe e1000 debug=8
    But it didn’t appear to show much at debug level 8 anyway; I didn’t turn it up to maximum debugging (debug level 16).
  • ethtool -d eth1 dumps the registers of the adaptor, and shows a number of sections. Under the ‘CTRL (Device control register)’ section, ‘VLAN mode’ is enabled. Under the ‘RCTL (Receive control register)’ section, the ‘VLAN filter’ is disabled. There appears to be no method (besides perhaps writing a kernel module) available for manually tweaking the registers: for debugging it might be useful to disable ‘VLAN mode’, which the driver automatically enables when a VLAN is added. Without looking at the driver source, I can’t be certain of everything that ‘VLAN mode’ implies, and what could break if it is turned off.
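For what it’s worth, on more recent kernels and userspace than those above (these tools and feature flags post-date this post, so treat the following as a hedged sketch rather than something verified on the 2.6.32 Vyatta kernel), the VLAN sub-interface can be created with iproute2, and ethtool exposes the NIC’s VLAN offload as toggleable features, which is the closest thing to “tweaking the registers” without writing a kernel module:
sudo ip link add link eth1 name eth1.40 type vlan id 40   # iproute2 equivalent of vconfig add eth1 40
sudo ip link set eth1.40 up
sudo ethtool -k eth1 | grep -i vlan                       # show the rx/tx VLAN offload state
sudo ethtool -K eth1 rxvlan off txvlan off                # turn off hardware VLAN tag offload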

Tuesday, 4 August 2015

Memcached logging (and others) under Systemd on RHEL7

I've been getting into RHEL7 lately (yay, a valid reason to get into it at work!) and this means learning about systemd. This post is not about systemd... at least it's not another systemd tutorial. This post is about how I got memcached to emit its logging to syslog while running under systemd, and how to configure memcached to sanely log to a file that I can expose via a file share.

It's 2:21am, so this is gonna be quick. I wrote this because I needed memcached, and in response to some bad advice on Stack Overflow.
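
The full post has the details and the reasoning; as a hedged sketch of the general shape of the approach (the drop-in path, the OPTIONS tweak and the syslog redirection are my assumptions for illustration, not necessarily what the post settles on), a systemd drop-in can hand memcached's stderr verbosity to syslog:

# /etc/systemd/system/memcached.service.d/logging.conf  (hypothetical drop-in file)
[Service]
# memcached writes its -v/-vv verbosity to stderr; tag it and send it to syslog
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=memcached

# add -vv to OPTIONS in /etc/sysconfig/memcached so there is something worth logging, then:
sudo systemctl daemon-reload
sudo systemctl restart memcached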

Tuesday, 5 May 2015

Use IPTables NOTRACK to implement stateless rules and reduce packet loss.

I recently struck a performance problem with a high-volume Linux DNS server and found a very satisfying way to overcome it. This post is not about DNS specifically; it is useful for any service with a high rate of connections/sessions (UDP or TCP), but it is especially useful for UDP-based traffic, as a stateful firewall doesn't really buy you much with UDP. It is also applicable to services such as HTTP/HTTPS or anything else where you have a lot of connections...

We observed times when DNS would not respond, but retrying very soon afterwards would generally work. For TCP, you may find that you get a connection timeout (or possibly a connection reset? I haven't checked that recently).

Observing logs, you might see the following in the kernel log:
kernel: nf_conntrack: table full, dropping packet.
You might be inclined to increase net.netfilter.nf_conntrack_max and net.nf_conntrack_max, but a better response might be found by looking at what is actually taking up those entries in your connection-tracking table.
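
The full post works through the details; as a hedged sketch of the general idea (port 53 and the rule placement here are illustrative, not a drop-in configuration), you can see how full the table is, get a rough breakdown of what is in it, and then exempt the high-volume service from tracking altogether via the raw table:

cat /proc/sys/net/netfilter/nf_conntrack_count           # how full the table is right now
sudo conntrack -L | awk '{print $1}' | sort | uniq -c    # rough breakdown by protocol (conntrack-tools)
sudo iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK   # don't track incoming DNS queries
sudo iptables -t raw -A OUTPUT -p udp --sport 53 -j NOTRACK       # ...or the replies we send

Remember that once traffic is NOTRACKed it will no longer match ESTABLISHED/RELATED, so any filter rules for that service have to be written statelessly.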

The importance of being liberal in a Cisco environment

Okay, so today I grappled with a Cisco sized gorilla and won. -- me, earlier this year, apparently feeling rather chuffed with myself.

I had recently launched a new service for a client, and a harvester on the Internet was experiencing timeouts in trying to connect, so our data was not being harvested as it needed to be. There was no evidence of a problem in the web-server logs, because no HTTP request ever had a chance to make it through.

It seems that some Cisco products (an unstudied subset, but likely firewalls and NATs) play a bit too fast and loose for Linux’s liking and change packets in ways that make the Linux firewall's (iptables) stateful connection tracking occasionally see such traffic as INVALID. This manifests in things such as connection timeouts, and as such you won’t notice it in things like webserver logs. In a traffic capture, you may recognise it as a lot of retransmissions.
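
The fix is in the full post; as a hedged sketch of the sort of knobs involved (not necessarily the exact settings the post ends up recommending), conntrack can be told to be liberal about TCP window tracking, and you can log what is being classed INVALID before deciding what to do with it:

sudo sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1    # relax conntrack's TCP window checking
sudo iptables -A INPUT -m conntrack --ctstate INVALID -j LOG --log-prefix "conntrack INVALID: "   # see what is affected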

Tuesday, 21 April 2015

How long has that command been running? (in seconds)

The ps command is fairly obtuse at the best of times, but I am very thankful for things like ps -eo uid,command. I wish it were as easy to have ps report the start time of a process in a form I can use. See my pain looking at the start-time and elapsed time (which is a duration, not a wall-clock timestamp) on a rather antiquarian system.

# ps -eo stime,etime
... selected extracts ...
Apr20  1-02:30:30
 2013 731-03:01:37
12:59    01:42:25
14:41       00:00

Yeah, I really don't want to touch that with any scripting tool. Best to head to /proc/... you may actually want to use ps or something like pgrep to determine the PID of the process of interest.

According to proc(5), we want the 28th element in /proc/PID/stat:  "starttime: ...The time in jiffies the process started after system boot.". A jiffy is explained in time(7) -- this from RHEL 5:

   The Software Clock, HZ, and Jiffies
       The  accuracy  of  many system calls and timestamps is limited by the resolution of
       the software clock, a clock  maintained  by  the  kernel  which  measures  time  in
       jiffies.  The size of a jiffy is determined by the value of the kernel constant HZ.
       The value of HZ varies across kernel versions and hardware platforms.  On  x86  the
       situation is as follows: on kernels up to and including 2.4.x, HZ was 100, giving a
       jiffy value of 0.01 seconds; starting with 2.6.0, HZ was raised to 1000,  giving  a
       jiffy of 0.001 seconds; since kernel 2.6.13, the HZ value is a kernel configuration
       parameter and can be 100, 250 (the default) or 1000, yielding a jiffies  value  of,
       respectively, 0.01, 0.004, or 0.001 seconds.

But remember that that value is jiffies since system boot, so we would have to determine the boot time, which is available via /proc/uptime... but if we're looking for epoch time, we need to take current time - uptime (which gives us the epoch time at boot) + the start time of the process in seconds... ridiculous!

Here's a sample of how to get the start time of a process given a PID (here I'm just using $$ as the PID to test). Hope it saves you some pain:

# date --date=@$(< /proc/$$/stat tr ' ' '\n' | awk -vuptime=$(awk '{print $1}' /proc/uptime) 'NR == 27 {print int(systime() - uptime + ($1 / 1000))}')
Sat Apr 20 13:01:20 NZST 2013
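
For what it's worth, the proc(5) I have to hand documents starttime as field 22 of /proc/PID/stat, measured in clock ticks (sysconf(_SC_CLK_TCK), usually 100) rather than kernel jiffies, and /proc/stat's btime line gives the boot time directly as an epoch timestamp, which avoids the uptime arithmetic. A hedged alternative sketch along those lines (field number and divisor per that man page; not verified against the system above):

pid=$$
start_ticks=$(cut -d')' -f2- /proc/$pid/stat | awk '{print $20}')   # field 22 overall, i.e. 20th after the ')' closing comm
btime=$(awk '/^btime/ {print $2}' /proc/stat)                       # boot time, in seconds since the epoch
date --date=@$(( btime + start_ticks / $(getconf CLK_TCK) ))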

Answering 'Are we there yet?' in a Unix setting

Often -- commonly during an outage window -- you might get asked "How far through is that (insert-lengthy-process-here)?". Such processes are common in outage windows, particularly unscheduled outages where filesystem-related work may be involved, but they crop up in plenty of places.

In a UNIX/Linux environment, a lot of processes are very silent about progress (certainly with regard to percentage completed), but a lot of the time we can deduce how far through an operation is. This post illustrates with a few examples, and then slaps on a very simple and easy user-interface.

But 'Are we there yet?' is rather similar in spirit to 'Where is it up to?' or 'What is it doing?', so I'll address those here too. In fact, I'll address those first, because they often lead up to the first question. And we won't just cover filesystem operations, but they will be first because that's what's on my mind as I write this.
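
As one hedged illustration of the sort of deduction involved (not necessarily a technique from the full post; the PID and file-descriptor number below are assumptions, purely for illustration), a process reading sequentially through a large file betrays its progress via the read offset recorded in /proc:

pid=1234; fd=3                                          # hypothetical process and the fd of the big file it is reading
pos=$(awk '/^pos:/ {print $2}' /proc/$pid/fdinfo/$fd)   # current offset into the file
size=$(stat -Lc %s /proc/$pid/fd/$fd)                   # total size of the file (-L follows the fd symlink)
echo "$(( pos * 100 / size ))% complete"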

Wednesday, 15 April 2015

Please don't use dig etc. in reporting scripts... use 'getent hosts' instead (gawk example)

Okay, so the excrement of the day is flying from the fan and you need to get some quick analytics of who/what is the most-wanted of the day. Perhaps you don't currently have the analytics you need at your fingertips, so you hack up a huge 'one-liner' that perhaps takes a live sample of traffic (tcpdump -p -nn ...), extracts some data using sed, collates the data using awk etc. and then does some sort of reporting step. Yay, that's what we call agile analytics (see 'hyperbole'); it's the all-too-common go-to fallback, but it does prove to be immensely useful.

Okay, so you've got some report; perhaps it lacks a bit of polish, but it contains IP addresses, and we'd like to see something more recognisable (mind you, some IP addresses can become pretty recognisable). So, you scratch your head a bit and have the usual internal debate: "do I do this in bash, or step up to awk or perl/python?". At this point (if you go with bash or awk), you'll perhaps think of using `dig +short -x $IP` to get (you hope) the canonical DNS name associated with it.

Uh oh. A common point of trouble at this point is that you end up calling `dig` many times... often for the same name in a short period of time. Perhaps you'll request things too fast and lookups might fail (rate controls and limits on DNS servers, etc.). As someone who looks after DNS servers, I urge you to stop. There is a much better way, and that way is the cache-friendly `getent hosts`.
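
The full post has the worked example; as a hedged sketch of the idea (the input file name report.txt and the assumption that the IP address sits in the first column are mine, purely for illustration), gawk can call `getent hosts` once per distinct address and cache the result:

gawk '{
    ip = $1
    if (!(ip in cache)) {                 # only look each distinct address up once
        cmd = "getent hosts " ip          # reverse lookup via NSS, so hosts-file entries and local caching apply
        name = ip                         # fall back to the bare address if there is no reverse entry
        if ((cmd | getline line) > 0) {
            split(line, parts)            # e.g. "192.0.2.10  host.example.org"
            name = parts[2]
        }
        close(cmd)
        cache[ip] = name
    }
    $1 = cache[ip]                        # swap the address for its name in the output
    print
}' report.txt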