Javier Martinez Canillas | 10 Feb 15:04 2012

[RFCv3] 0/10 af_unix: Multicast and filtering features on AF_UNIX

Hello,

Following is an extension to AF_UNIX datagram and seqpacket sockets to
support multicast communication. This is the result of research we
have been doing to improve the performance of the D-bus IPC system. The
first approach was to create a new AF_DBUS socket address family and
move the routing logic of the D-bus daemon to the kernel. The
motivation behind that approach and the thread where the patches were
posted can be found in [1] and [2] respectively.

The feedback was that having D-bus specific code in the kernel is a bad
idea, so the second approach was to implement multicast Unix domain
sockets so that clients can send messages directly to peers, bypassing
the D-bus daemon. A previous version of the patches was already posted
[3] by Alban Crequy, who also has a good explanation of the
implementation on his blog [4].

The stable and development versions of the patches can be pulled from [5]
and [6] respectively. It is a work in progress, so not everything works
properly yet.

We didn't want to send the full patches since we are more interested in
discussing the proposed architecture and ABI than the kernel
implementation (which can always be reworked to meet upstream code quality).

[1]http://alban-apinc.blogspot.com/2011/12/d-bus-in-kernel-faster.html
[2]http://thread.gmane.org/gmane.linux.kernel/1040481
[3]http://thread.gmane.org/gmane.linux.network/178772
[4]http://alban-apinc.blogspot.com/2011/12/introducing-multicast-unix-sockets.html
[5]http://cgit.collabora.com/git/user/javier/linux.git/log/?h=multicast-unix-socket-stable
[6]http://cgit.collabora.com/git/user/javier/linux.git/log/?h=multicast-unix-socket-unstable

Multicast Unix sockets summary
==============================

Multicast is implemented on SOCK_DGRAM and SOCK_SEQPACKET Unix sockets.

A userspace application can create a multicast group with:

  struct unix_mreq mreq = {0,};
  mreq.address.sun_family = AF_UNIX;
  mreq.address.sun_path[0] = '\0';
  strcpy(mreq.address.sun_path + 1, "socket-address");

  sockfd = socket(AF_UNIX, SOCK_DGRAM, 0);
  ret = setsockopt(sockfd, SOL_UNIX, UNIX_CREATE_GROUP, &mreq,
                   sizeof(mreq));

This allocates a struct unix_mcast_group, which is reference counted and
exists as long as the socket that created it exists or the group has at
least one member.

SOCK_DGRAM sockets can join a multicast group with:

  ret = setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, &mreq, sizeof(mreq));

This allocates a struct unix_mcast, which holds the settings of the
membership, mainly whether loopback is enabled. A socket can be a member
of several multicast groups.

Since SOCK_SEQPACKET is connection-oriented, the semantics are
different. A client cannot join a group itself; it can only connect. The
server side, which owns the multicast listening socket, then adds the
accepted peer to the group with:

  ret = setsockopt(groupfd, SOL_UNIX, UNIX_CREATE_GROUP, &val, vallen);
  ret = listen(groupfd, 10);
  connfd = accept(groupfd, NULL, NULL);
  ret = setsockopt(connfd, SOL_UNIX, UNIX_ACCEPT_GROUP, &mreq,
                   sizeof(mreq));

The socket is part of the multicast group until it is released, shut down
with RCV_SHUTDOWN, or explicitly leaves the group:

  ret = setsockopt(sockfd, SOL_UNIX, UNIX_LEAVE_GROUP, &mreq, sizeof(mreq));

Struct unix_mcast nodes are linked in two RCU lists:
- (struct unix_sock)->mcast_subscriptions
- (struct unix_mcast_group)->mcast_members

              unix_mcast_group  unix_mcast_group
                      |                 |
                      v                 v
unix_sock  ---->  unix_mcast  ----> unix_mcast
                      |
                      v
unix_sock  ---->  unix_mcast
                      |
                      v
unix_sock  ---->  unix_mcast
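
The double linkage in the diagram can be pictured with a simplified
userspace model. This is only a sketch: it uses plain singly linked lists
instead of RCU lists, and every name with a model_ prefix is hypothetical
rather than taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; the kernel patches use RCU lists instead. */
struct model_sock;
struct model_group;

/* One membership node, linked into two lists at once. */
struct model_mcast {
    struct model_sock  *sock;          /* owning socket */
    struct model_group *group;         /* group that was joined */
    struct model_mcast *next_in_sock;  /* next subscription of the same socket */
    struct model_mcast *next_in_group; /* next member of the same group */
};

struct model_sock  { struct model_mcast *mcast_subscriptions; };
struct model_group { struct model_mcast *mcast_members; };

/* Link a membership node into both the socket's and the group's list. */
static void model_join(struct model_sock *s, struct model_group *g,
                       struct model_mcast *m)
{
    assert(s && g && m);
    m->sock = s;
    m->group = g;
    m->next_in_sock = s->mcast_subscriptions;
    s->mcast_subscriptions = m;
    m->next_in_group = g->mcast_members;
    g->mcast_members = m;
}

/* Count the members of a group by walking its member list. */
static int model_group_size(const struct model_group *g)
{
    int n = 0;
    for (const struct model_mcast *m = g->mcast_members; m; m = m->next_in_group)
        n++;
    return n;
}
```

Walking next_in_group answers "who receives a message sent to this
group", while walking next_in_sock answers "which groups must this socket
be removed from when it is released".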

SOCK_DGRAM semantics
====================

          G          The socket which created the group
       /  |  \
     P1  P2  P3      The member sockets

Messages sent to the group are received by all members except the sender
itself unless the sending socket has UNIX_MREQ_LOOPBACK set.

Non-members can also send to the group socket G, and the message will be
broadcast to the group members. However, socket G itself does not
receive messages sent to the group through it.
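
The delivery rule for SOCK_DGRAM groups can be written down as a small
userspace model. It is a sketch under assumptions: struct dgram_member
and dgram_recipients() are hypothetical names, and sockets are
represented by integer ids.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one group member (not a kernel structure). */
struct dgram_member {
    int  id;        /* stands in for the member socket */
    bool loopback;  /* models UNIX_MREQ_LOOPBACK on this membership */
};

/*
 * Fill out[] with the ids that receive a message sent by sender_id.
 * A sender_id that matches no member models a non-member sender.
 * The group socket G is never in members[], so it never receives.
 * Returns the number of recipients.
 */
static int dgram_recipients(const struct dgram_member *members, int n,
                            int sender_id, int *out)
{
    int count = 0;

    assert(members && out);
    for (int i = 0; i < n; i++) {
        bool is_sender = members[i].id == sender_id;

        if (is_sender && !members[i].loopback)
            continue;   /* the sender is skipped unless loopback is set */
        out[count++] = members[i].id;
    }
    return count;
}
```

With members P1, P2 (loopback set) and P3, a message from P1 reaches P2
and P3, a message from P2 reaches all three, and a message from a
non-member reaches all three.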

SOCK_SEQPACKET semantics
========================

When a connection is performed on a SOCK_SEQPACKET multicast socket, a
new socket is created and its file descriptor is received by accept().

          L          The listening socket
       /  |  \
     A1  A2  A3      The accepted sockets
      |   |   |
     C1  C2  C3      The connected sockets

Messages sent on the C1 socket are received by:
- C1 itself if UNIX_MREQ_LOOPBACK is set.
- The peer socket A1 if UNIX_MREQ_SEND_TO_PEER is set.
- The other members of the multicast group C2 and C3.

Only members can send to the group in this case.
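
The three delivery cases for a message sent on a connected socket can be
summarised in a few lines. This is a sketch only: the MODEL_ flag values
are placeholders chosen for illustration (the real values are whatever
the patches define), and seq_delivers() is a hypothetical helper.

```c
#include <assert.h>

/* Placeholder flag values for this sketch, not the ones from the patches. */
#define MODEL_LOOPBACK     0x1u  /* models UNIX_MREQ_LOOPBACK */
#define MODEL_SEND_TO_PEER 0x2u  /* models UNIX_MREQ_SEND_TO_PEER */

enum seq_target {
    SEQ_SELF,          /* the sending socket, e.g. C1 */
    SEQ_PEER,          /* its accepted peer, e.g. A1 */
    SEQ_OTHER_MEMBER   /* another member, e.g. C2 or C3 */
};

/* Return 1 if a message sent with the given flags reaches the target. */
static int seq_delivers(unsigned int sender_flags, enum seq_target target)
{
    switch (target) {
    case SEQ_SELF:
        return (sender_flags & MODEL_LOOPBACK) != 0;
    case SEQ_PEER:
        return (sender_flags & MODEL_SEND_TO_PEER) != 0;
    case SEQ_OTHER_MEMBER:
        return 1;   /* other members always receive */
    }
    assert(!"unknown target");
    return 0;
}
```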

Atomic delivery and ordering
============================

Each message sent is delivered atomically to either none of the
recipients or all the recipients, even with interruptions and errors.

Locking is used in order to keep the ordering consistent on all
recipients. We want to avoid the following scenario, with two emitters,
A and B, and two recipients, C and D:

           C    D
A -------->|    |    Step 1: A's message is delivered to C
B -------->|    |    Step 2: B's message is delivered to C
B ---------|--->|    Step 3: B's message is delivered to D
A ---------|--->|    Step 4: A's message is delivered to D

Result: - C received (A, B)
        - D received (B, A)

Although A and B had a list of recipients (C, D) in the same order, C
and D received the messages in a different order. To avoid this
scenario, we need a locking mechanism while the messages are being
delivered with skb_queue_tail().

Solution 1:
The easiest implementation would be to use a global spinlock on the
group, but it creates avoidable contention, especially when there are
two independent streams set up with socket filters; e.g. if A sends
messages received only by C, and B sends messages received only by D.

Solution 2:
Fine-grained locking could be implemented with a spinlock on each recipient.
Before delivering the message to the recipients, the sender takes a
spinlock on each recipient at the same time.

Taking several spinlocks of the same type at once is dangerous and can
lead to deadlocks. This is prevented by sorting the list of sockets by memory
address and taking the spinlocks in that order. The ordered list of
recipients is computed on demand when a message is sent and the list is
cached for performance. When the group membership changes, the
generation of the membership is incremented and the ordered recipient
list is invalidated.
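
The address-ordering idea of solution 2 can be illustrated in userspace
with pthread mutexes standing in for spinlocks. This is a sketch, not
the kernel code: struct model_recipient and deliver_ordered() are
hypothetical, and the caching of the sorted list and the generation
counter are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a recipient socket. */
struct model_recipient {
    pthread_mutex_t lock;
    int queued;     /* number of messages queued so far */
};

/* Order recipient pointers by address, compared as integers. */
static int cmp_by_address(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(struct model_recipient *const *)a;
    uintptr_t y = (uintptr_t)*(struct model_recipient *const *)b;

    return (x > y) - (x < y);
}

/* Sort, lock every recipient in address order, deliver, then unlock. */
static void deliver_ordered(struct model_recipient **rcpts, size_t n)
{
    assert(rcpts != NULL || n == 0);
    qsort(rcpts, n, sizeof(*rcpts), cmp_by_address);

    for (size_t i = 0; i < n; i++)
        pthread_mutex_lock(&rcpts[i]->lock);
    for (size_t i = 0; i < n; i++)
        rcpts[i]->queued++;     /* stands in for skb_queue_tail() */
    for (size_t i = n; i-- > 0; )
        pthread_mutex_unlock(&rcpts[i]->lock);
}
```

Because every sender acquires the locks in the same global order, two
senders with overlapping recipient lists cannot each hold a lock the
other is waiting for, and while one sender holds all of its locks its
message is queued on every recipient before any competing message.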

With this solution, the number of spinlocks taken simultaneously can be
arbitrarily large. Whilst it works, it breaks the lockdep mechanism.

Solution 3:
The current implementation is similar to solution 2 but with a limit on
the number of spinlocks taken simultaneously (8), so lockdep works fine.
A hash function and a bit array with n=8 specify which spinlocks to
take. Contention on independent streams can still happen but it is less
likely.
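
One way to picture solution 3 is a fixed pool of 8 lock slots selected
by hashing each recipient. The summary does not describe the actual
hash function or the bit array layout, so everything below is an
assumption made for illustration: model_lock_slot() is an invented hash
and the mask is a plain 8-bit integer.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NLOCKS 8  /* matches the limit of 8 stated above */

/* Invented hash: map a socket address to one of the 8 lock slots. */
static unsigned int model_lock_slot(const void *sock)
{
    uintptr_t v = (uintptr_t)sock;

    v ^= v >> 7;    /* fold in some higher bits */
    return (unsigned int)((v >> 4) % MODEL_NLOCKS);
}

/* Build the bit array: bit i set means lock slot i must be taken. */
static uint8_t model_lock_mask(void *const *rcpts, int n)
{
    uint8_t mask = 0;

    for (int i = 0; i < n; i++)
        mask |= (uint8_t)(1u << model_lock_slot(rcpts[i]));
    return mask;
}

/* Count the locks a sender would take; never more than MODEL_NLOCKS. */
static int model_locks_needed(uint8_t mask)
{
    int count = 0;

    assert(MODEL_NLOCKS <= 8);  /* the mask is a single byte */
    for (int i = 0; i < MODEL_NLOCKS; i++)
        if (mask & (1u << i))
            count++;
    return count;
}
```

A sender would then take the locks whose bits are set, in ascending slot
order. Two recipients that hash to the same slot share a lock, which is
where the residual contention between independent streams comes from.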

Flow control
============

When a socket's receiving queue is full, the default behavior is to
block senders (or to return -EAGAIN on non-blocking sockets). The socket
can also join a multicast group with the flag UNIX_MREQ_DROP_WHEN_FULL.
In this case, messages sent to the group will not be delivered to that
socket when its receiving queue is full.

Messages are still delivered atomically to all members who don't have
the flag UNIX_MREQ_DROP_WHEN_FULL. If send() returns -EAGAIN, nobody
received the message. If send() blocks because of one member, the other
members don't receive the message until all sockets (except those with
UNIX_MREQ_DROP_WHEN_FULL set) can receive at the same time.

poll/epoll/select on POLLOUT events have a consistent behavior; they
block if at least one member of the multicast group without
UNIX_MREQ_DROP_WHEN_FULL has a full receiving queue.
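
The all-or-nothing rule for non-blocking senders can be modelled as a
check pass followed by a delivery pass. This is a userspace sketch with
hypothetical names (struct fc_member, fc_send_nonblock()); queues are
reduced to counters and locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a member's receive queue. */
struct fc_member {
    int  queued;          /* messages currently queued */
    int  capacity;        /* queue limit */
    bool drop_when_full;  /* models UNIX_MREQ_DROP_WHEN_FULL */
};

static bool fc_is_full(const struct fc_member *m)
{
    return m->queued >= m->capacity;
}

/*
 * Non-blocking send: returns 0 on success, or -EAGAIN if any member
 * without drop_when_full is full, in which case nobody receives.
 */
static int fc_send_nonblock(struct fc_member *members, int n)
{
    assert(members || n == 0);

    /* Pass 1: a single full "strict" member blocks the whole send. */
    for (int i = 0; i < n; i++)
        if (fc_is_full(&members[i]) && !members[i].drop_when_full)
            return -EAGAIN;

    /* Pass 2: deliver; full drop_when_full members are skipped. */
    for (int i = 0; i < n; i++)
        if (!fc_is_full(&members[i]))
            members[i].queued++;
    return 0;
}
```

The condition tested in the first pass is also the one that decides
whether POLLOUT is reported: the group is writable only when no member
without UNIX_MREQ_DROP_WHEN_FULL has a full queue.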

Multicast socket reference counting
===================================

A poller for POLLOUT events can block for any member of the group. The
poller can use the wait queue "peer_wait" of any member. So it is
important that Unix sockets are not released before all pollers exit.
This is achieved by:

- Incrementing the reference counter of a socket when it joins a
multicast group.
- Decrementing it when the group is destroyed, that is when all sockets
keeping a reference on the group have released their reference on the group.

struct unix_mcast_group keeps track of both current members and previous
members. When a socket leaves a group, it is removed from the members
list and put in the dead members list. This is done in order to take
advantage of RCU lists, which reduces lock contention.
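
The lifetime rule above can be sketched with plain counters. This is a
simplified userspace model under assumptions: the rc_ names are
hypothetical, fixed-size arrays replace the RCU lists, and the group's
own reference count is not modelled.

```c
#include <assert.h>

#define RC_MAX 8   /* arbitrary capacity for this sketch */

struct rc_sock { int refcnt; };

struct rc_group {
    struct rc_sock *members[RC_MAX];
    int nmembers;
    struct rc_sock *dead[RC_MAX];
    int ndead;
};

/* Joining pins the socket: the group holds a reference on it. */
static void rc_join(struct rc_group *g, struct rc_sock *s)
{
    assert(g->nmembers < RC_MAX);
    s->refcnt++;
    g->members[g->nmembers++] = s;
}

/* Leaving moves the socket to the dead list; the reference is kept. */
static void rc_leave(struct rc_group *g, struct rc_sock *s)
{
    for (int i = 0; i < g->nmembers; i++) {
        if (g->members[i] != s)
            continue;
        g->members[i] = g->members[--g->nmembers];
        assert(g->ndead < RC_MAX);
        g->dead[g->ndead++] = s;
        return;
    }
}

/* Destroying the group drops the references on current and former members. */
static void rc_destroy(struct rc_group *g)
{
    for (int i = 0; i < g->nmembers; i++)
        g->members[i]->refcnt--;
    for (int i = 0; i < g->ndead; i++)
        g->dead[i]->refcnt--;
    g->nmembers = g->ndead = 0;
}
```

A socket that has left the group therefore stays allocated until the
group itself goes away, so a poller sleeping on its "peer_wait" queue
never sees freed memory.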

=====================================

diff stat:

 Documentation/networking/multicast-unix-sockets.txt	|  171 ++++
 include/linux/socket.h					|    1 +
 include/net/af_unix.h					|   79 ++
 net/unix/Kconfig					|    9 +
 net/unix/af_unix.c					| 1027

patch-set:

01/10 af_unix: Documentation on multicast unix sockets
02/10 Add constant for unix socket options level
03/10 unix: add setsockopt on unix sockets
04/10 af_unix: create, join and leave multicast groups with setsockopt
05/10 af_unix: find the recipients of a multicast group
06/10 af_unix: Deliver message to several recipients in multicast
07/10 af_unix: implement poll(POLLOUT) for multicast sockets
08/10 af_unix: Unsubscribe sockets from multicast groups on RCV_SHUTDOWN
09/10 Allow server side of SOCK_SEQPACKET sockets to accept a new member
10/10 Attach remote socket filter

Regards,
Javier
