Friday, January 7, 2011

Communicating with the Kernel via Netlink, Part I

This is a simple C++ class wrapper around the Linux Netlink interface. It's useful on its own or as a jumping-off point for your own code. This particular implementation was designed to work at OSI layers 2 and 3 (links and IP addresses).

Just to give you an idea, here's an example of the output this program generates at startup, when it dumps the state of the interfaces:

results for lo(1)
  type: NEWLINK
  state: UP
  running: yes
  enabled: yes
  mtu: 16436
  mac: 00:00:00:00:00:00

results for eth0(2)
  type: NEWLINK
  state: UP
  running: yes
  enabled: yes
  mtu: 1500
  mac: 08:00:27:41:DE:23

results for lo(1)
  type: NEWADDR
  mask length: 8

results for eth0(2)
  type: NEWADDR
  mask length: 24

results for eth0(2)
  type: NEWADDR
  mask length: 20

So what is Netlink? Netlink is a mechanism for communication between Linux user space and the kernel. You might recognize some of its functionality, since it overlaps in many areas with the older ioctl interface, but it was intended to be more cohesive than the admittedly grab-bag feel of ioctl. One big difference from ioctl is that instead of issuing ioctl() calls against a descriptor, user space exchanges messages with the kernel over a special netlink socket.

Netlink sockets are created in pretty much the old familiar way:
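
A minimal sketch of that call, assuming a raw socket type (netlink accepts either SOCK_RAW or SOCK_DGRAM); the wrapper name open_route_socket is hypothetical, not taken from the original code:

```cpp
#include <cassert>
#include <sys/socket.h>     // socket(), AF_NETLINK, SOCK_RAW
#include <linux/netlink.h>  // NETLINK_ROUTE
#include <unistd.h>         // close(), for releasing the descriptor when done

// Hypothetical helper: open a routing (rtnetlink) socket.
// Returns the new descriptor, or -1 with errno set on failure.
int open_route_socket() {
    return socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
}
```

No special privileges are normally needed to open a NETLINK_ROUTE socket and read link or address state.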


Probably the main thing that jumps out in the snippet above is the protocol argument, NETLINK_ROUTE. This is the protocol that lets us interact with the kernel's routing and link subsystem (rtnetlink). But just to give you an idea, there are many other netlink protocols that can be passed as the socket protocol (from /usr/include/linux/netlink.h):

#define NETLINK_ROUTE           0       /* Routing/device hook        */
#define NETLINK_UNUSED          1       /* Unused number              */
#define NETLINK_USERSOCK        2       /* Reserved for user mode socket protocols */
#define NETLINK_FIREWALL        3       /* Firewalling hook           */
#define NETLINK_INET_DIAG       4       /* INET socket monitoring     */
#define NETLINK_NFLOG           5       /* netfilter/iptables ULOG    */
#define NETLINK_XFRM            6       /* ipsec */
#define NETLINK_SELINUX         7       /* SELinux event notifications */
#define NETLINK_ISCSI           8       /* Open-iSCSI */
#define NETLINK_AUDIT           9       /* auditing */
#define NETLINK_FIB_LOOKUP      10      
#define NETLINK_CONNECTOR       11
#define NETLINK_NETFILTER       12      /* netfilter subsystem */
#define NETLINK_IP6_FW          13
#define NETLINK_DNRTMSG         14      /* DECnet routing messages */
#define NETLINK_KOBJECT_UEVENT  15      /* Kernel messages to userspace */
#define NETLINK_GENERIC         16
/* leave room for NETLINK_DM (DM Events) */
#define NETLINK_SCSITRANSPORT   18      /* SCSI Transports */
#define NETLINK_ECRYPTFS        19

The example code that I'll be pulling snippets from can run as a standalone program, but it is designed to be used as a library whose purpose is to watch for system-level interface changes and report back. The library listens for netlink messages from the kernel and can dump Layer 2 and Layer 3 details when first starting up. As always, behavior can be sliced, diced, and extended as necessary.

The design is stupid simple: there are two action objects (the sender and the listener) that interact with the kernel via Netlink. The common currency between these objects is the event object, and an event manager decodes incoming messages and builds the event objects. The sender and listener are essentially facades over the underlying netlink interfaces: the sender supports sending arbitrary messages to the kernel, while the listener provides a blocking interface for receiving netlink messages. Each event object therefore represents a single netlink message.

Since the NetlinkListener receives and decodes messages from the kernel, it owns a single NetlinkEventManager that builds NetlinkEvent objects and hands them back to the listener, which in turn makes them available to the calling process through its process() method.

Let's pull this apart and see how it all works:

The Netlink Listener

There are two main public methods: init() and process(). init() must be called before use; this work isn't done in the constructor because we want to be able to return an error code on failure. The listener's init() method creates the socket (as in the earlier socket snippet) and binds it to the AF_NETLINK family, using the calling program's pid as the netlink port ID. The kernel uses this port ID to identify the recipient of messages.

The socket also subscribes to the RTMGRP_LINK and RTMGRP_IPV4_IFADDR multicast groups, which deliver link (layer 2) and IPv4 address (layer 3) change notifications.

memset(&snl, 0, sizeof(snl));  // clear nl_pad and any unused fields
snl.nl_family = AF_NETLINK;
snl.nl_pid    = getpid();  // use our pid as the netlink port ID (0 would let the kernel assign one)
snl.nl_groups = RTMGRP_LINK | RTMGRP_IPV4_IFADDR;  // link and IPv4 address notifications
if (bind(_fd, reinterpret_cast<struct sockaddr*>(&snl), sizeof(snl)) < 0) {
    // bind failed: clean up and return an error code from init()
}

After a successful initialization of the socket, the other significant public method may be called: process(). In process(), the listener pulls netlink messages from the kernel for processing.

Because we don't know the size of the netlink message in advance, we first peek at it to determine its size and grow the buffer if the message doesn't fit (see the snippet below). The "peeking" is done via a call to recv() with the MSG_PEEK flag set, which returns the data without removing it from the socket's receive queue.

When peeking, we are effectively asking whether a netlink message is pending and whether it fits in our buffer. Two conditions matter here: whether the buffer is large enough (if the peek fills the buffer completely, the message may have been truncated, so we grow the buffer and peek again), and whether the call was interrupted by a signal (EINTR), in which case we simply retry.

do {
  got = recv(_fd, &buffer[0], buffer.size(), MSG_DONTWAIT | MSG_PEEK);
  if ((got < 0) && (errno == EINTR))
    continue;       // the receive was interrupted by a signal: retry
  if ((got < 0) || (got < (ssize_t)buffer.size()))
    break;          // error (e.g. EAGAIN, nothing pending) or the buffer is big enough
  buffer.resize(buffer.size() + NLSOCK_BYTES);  // peek filled the buffer: grow and peek again
} while (true);

A second call to recv() is then made, this time without MSG_PEEK. This call actually consumes the message from the socket (again non-blocking, via MSG_DONTWAIT) so that we can continue processing it.
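
Putting the peek-and-grow loop and the second recv() together gives the self-contained sketch below. The function name read_whole_datagram and the default step of 4096 bytes (standing in for NLSOCK_BYTES) are assumptions. Because it works on any datagram socket, it can be exercised with an AF_UNIX socketpair instead of a live netlink socket:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>   // recv(), MSG_PEEK, MSG_DONTWAIT
#include <unistd.h>       // ssize_t, close()
#include <vector>

// Hypothetical helper: read one whole datagram from fd without blocking.
// The buffer grows by `step` bytes until a MSG_PEEK read no longer fills it.
// Returns an empty vector if nothing is pending or an error occurs.
std::vector<char> read_whole_datagram(int fd, std::size_t step = 4096) {
    assert(step > 0);
    std::vector<char> buffer(step);
    ssize_t got;
    for (;;) {
        got = recv(fd, buffer.data(), buffer.size(), MSG_DONTWAIT | MSG_PEEK);
        if (got < 0 && errno == EINTR)
            continue;                          // interrupted by a signal: retry
        if (got < 0 || static_cast<std::size_t>(got) < buffer.size())
            break;                             // error/no data, or the buffer had room to spare
        buffer.resize(buffer.size() + step);   // peek filled the buffer: may be truncated
    }
    if (got < 0)
        return {};
    // Second call without MSG_PEEK: actually dequeue the datagram.
    do {
        got = recv(fd, buffer.data(), buffer.size(), MSG_DONTWAIT);
    } while (got < 0 && errno == EINTR);
    if (got < 0)
        return {};
    buffer.resize(static_cast<std::size_t>(got));
    return buffer;
}
```

For a 10000-byte datagram the buffer grows from 4096 to 8192 to 12288 bytes before the peek comes back short, and only then is the message consumed.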

The initial parsing of the buffer is done with the netlink helper macros (NLMSG_OK, NLMSG_NEXT), which make it easy to iterate over the messages it contains; each message begins with a struct nlmsghdr header:
size_t new_size = off - last_mh_off;
const struct nlmsghdr* mh;
for (mh = reinterpret_cast<const struct nlmsghdr*>(&buffer[last_mh_off]);
     NLMSG_OK(mh, new_size);
     mh = NLMSG_NEXT(const_cast<struct nlmsghdr*>(mh), new_size)) {
    // ... examine each message header (see below)
}

last_mh_off is an integer offset into the receive buffer where the loop starts, and mh points at the current message header as the loop walks the buffer. In the listener, this loop is essentially used to make sure a complete (possibly multipart) netlink reply has been received, which is signaled by a message whose type is NLMSG_DONE:
if (mh->nlmsg_type == NLMSG_DONE) {
    // the complete reply is in the buffer: hand it to the event manager
}

Once NLMSG_DONE is seen, the buffer can be handed to the event manager for further processing and interpretation. If it hasn't arrived yet, control returns to recv() to append further messages to the buffer. The event manager produces event objects from the received netlink messages.
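
A sketch of that walk over a receive buffer, assuming the buffer starts on a message boundary and is suitably aligned; WalkResult and walk_messages are hypothetical names. It keeps the remaining length in a signed int: NLMSG_NEXT subtracts the aligned message length, which can exceed what's left if the final message isn't padded, and an unsigned size_t would then wrap around instead of going negative and stopping NLMSG_OK:

```cpp
#include <cassert>
#include <cstddef>
#include <linux/netlink.h>    // nlmsghdr, NLMSG_OK, NLMSG_NEXT, NLMSG_DONE
#include <linux/rtnetlink.h>  // RTM_NEWLINK and friends
#include <vector>

// Hypothetical result of walking a buffer of netlink messages.
struct WalkResult {
    std::vector<unsigned short> types;  // nlmsg_type of each message seen before NLMSG_DONE
    bool done = false;                  // true once an NLMSG_DONE message was seen
};

WalkResult walk_messages(const char* data, std::size_t len) {
    WalkResult r;
    int remaining = static_cast<int>(len);  // signed on purpose, see above
    for (const struct nlmsghdr* mh = reinterpret_cast<const struct nlmsghdr*>(data);
         NLMSG_OK(mh, remaining);
         mh = NLMSG_NEXT(const_cast<struct nlmsghdr*>(mh), remaining)) {
        if (mh->nlmsg_type == NLMSG_DONE) {  // end of a multipart reply
            r.done = true;
            break;
        }
        if (mh->nlmsg_type == NLMSG_ERROR)   // a real listener would inspect struct nlmsgerr
            break;
        r.types.push_back(mh->nlmsg_type);
    }
    return r;
}
```

If walk_messages() returns with done still false, the caller would go back to recv() and append more data before walking again, matching the flow described above.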

As mentioned, completed messages are handed to the event manager. The listener code that does this is:
_nl_event_mgr.process(&message[0], off);
return _nl_event_mgr.pop(e);

Basically, this hands the whole buffer to the event manager for decoding, then pops a copy of the oldest netlink event (first in, first out) to pass up to the calling program.
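
Since pop() hands back the oldest event first, the event manager's queue behaves as a FIFO. A simplified sketch of such a queue follows; the NetlinkEvent fields and the EventQueue class are hypothetical stand-ins, not the post's actual classes:

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical, simplified stand-in for a decoded netlink event.
struct NetlinkEvent {
    unsigned short type = 0;  // e.g. RTM_NEWLINK or RTM_NEWADDR
    int ifindex = 0;          // interface index, e.g. 2 for eth0
    std::string ifname;       // interface name, e.g. "eth0"
};

// Hypothetical FIFO of decoded events, modeled on the process()/pop() pair above.
class EventQueue {
public:
    void push(const NetlinkEvent& e) { _events.push_back(e); }

    // Copy the oldest event into e and remove it; returns false when empty.
    bool pop(NetlinkEvent& e) {
        if (_events.empty())
            return false;
        e = _events.front();
        _events.pop_front();
        return true;
    }

    bool empty() const { return _events.empty(); }

private:
    std::deque<NetlinkEvent> _events;  // oldest at the front, newest at the back
};
```

A std::deque gives constant-time insertion at the back and removal from the front, which is all a FIFO of events needs.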

The Netlink Sender

The Netlink Sender is simpler: it only sends a request when the caller asks for a dump.

The snippet below, from the NetlinkSend::send() method, supports only a single kind of request: asking the kernel to dump (NLM_F_DUMP) every entry of the requested message type, such as RTM_GETLINK for links or RTM_GETADDR for addresses. The other flag, NLM_F_REQUEST, marks the message as a request. You can see the result in the output at the top of this post, where the data for every interface is printed to the screen.

snl.nl_family = AF_NETLINK;
req.nlh.nlmsg_len = sizeof req;
req.nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST;
req.nlh.nlmsg_pid = getpid();
req.nlh.nlmsg_type = type;
req.nlh.nlmsg_seq = time(NULL);
req.g.rtgen_family = AF_UNSPEC;
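
The fragment above leaves out the declarations of req and snl as well as the sendto() call itself. Here is a self-contained sketch that fills them in, assuming req is the common nlmsghdr-plus-rtgenmsg pair; DumpRequest, make_dump_request, and send_dump_request are hypothetical names:

```cpp
#include <sys/socket.h>       // sendto(), AF_NETLINK, AF_UNSPEC
#include <linux/netlink.h>    // nlmsghdr, sockaddr_nl, NLM_F_* flags
#include <linux/rtnetlink.h>  // rtgenmsg, RTM_GETLINK, RTM_GETADDR
#include <unistd.h>           // getpid(), ssize_t
#include <cassert>
#include <cstdint>
#include <cstring>
#include <ctime>

// Hypothetical layout of a dump request: header followed by a generic rtnetlink body.
struct DumpRequest {
    struct nlmsghdr nlh;
    struct rtgenmsg g;
};

// Fill in a request asking the kernel to dump all entries of `type`.
DumpRequest make_dump_request(unsigned short type, std::uint32_t seq) {
    DumpRequest req;
    std::memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len   = sizeof(req);
    req.nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST;
    req.nlh.nlmsg_pid   = getpid();
    req.nlh.nlmsg_type  = type;
    req.nlh.nlmsg_seq   = seq;
    req.g.rtgen_family  = AF_UNSPEC;  // all address families
    return req;
}

// Send the request to the kernel; returns sendto()'s result (-1 with errno on failure).
ssize_t send_dump_request(int fd, unsigned short type) {
    struct sockaddr_nl snl;
    std::memset(&snl, 0, sizeof(snl));
    snl.nl_family = AF_NETLINK;  // nl_pid = 0 and nl_groups = 0: unicast to the kernel
    DumpRequest req = make_dump_request(type, static_cast<std::uint32_t>(std::time(nullptr)));
    return sendto(fd, &req, sizeof(req), 0,
                  reinterpret_cast<struct sockaddr*>(&snl), sizeof(snl));
}
```

For example, send_dump_request(fd, RTM_GETLINK) asks for the link list that shows up as the NEWLINK entries in the output above, and RTM_GETADDR produces the NEWADDR entries.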

And that's pretty much the whole enchilada for the sending class.

Next up, Part II: the event object, the main event loop, and, of course, the source code.
