
Allow indirect descriptor tables to exceed the queue size #122

@cschoenebeck

Update the virtio spec by adding an optional new feature that allows an indirect descriptor table to be longer than the Queue Size.

The virtio Queue Size defines the total number of vring slots in the two virtio ring buffers used for communication between driver and device, which basically is the max. number of messages either side can push into the FIFO before it may have to wait for the other side to pull some of the pending messages out of the FIFO.

So far, however, the Queue Size unfortunately also defined the maximum number of memory segments per vring slot, and therefore the max. bulk data size that could be transferred per message between the two sides.

The proposed virtio spec changes would decouple those two different limits from each other, such that Queue Size would only define the total number of vring slots (max. number of pending messages), while the proposed new Queue Indirect Size would define the max. number of memory segments per vring slot (per message). Backward compatibility is preserved by introducing a new feature flag VIRTIO_RING_F_INDIRECT_SIZE to negotiate whether both sides support this new concept. New bus-specific config fields (e.g. queue_indirect_size for PCI) would negotiate the precise max. number of segments supported by both sides.
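
For illustration only (the authoritative layout and offsets are defined by the spec patches linked below), the per-virtqueue part of the PCI common configuration structure might gain a field roughly like this; its exact placement and access semantics here are an assumption:

/* Sketch only, based on the proposal above; the authoritative layout
 * is defined by the spec patches linked below. */
struct virtio_pci_common_cfg {
        /* ... existing device-wide fields ... */

        /* About a specific virtqueue. */
        __le16 queue_select;            /* read-write */
        __le16 queue_size;              /* read-write */
        /* ... further existing per-queue fields ... */

        /* Proposed: max. number of descriptors in one indirect descriptor
         * table of the currently selected queue; only meaningful if
         * VIRTIO_RING_F_INDIRECT_SIZE was negotiated. */
        __le16 queue_indirect_size;
};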

Latest version of proposed spec changes:

https://lists.oasis-open.org/archives/virtio-comment/202203/msg00043.html
https://github.com/cschoenebeck/virtio-spec/tree/long_indirect_descr


Implementation situation before these virtio spec changes:

Linux kernel

This is what happens (so far) when sending a virtio message via a split queue, using the Linux kernel's virtio implementation as an example:

For each bulk message sent guest <-> host, exactly one of the pre-allocated descriptors is taken and placed (subsequently) into exactly one position of the two available/used ring buffers. The actual descriptor table though, containing all the DMA addresses of the message bulk data, is allocated just in time for each round trip message. Say it is the first message sent; this yields the following structure:

Ring Buffer   Descriptor Table      Bulk Data Pages

   +-+              +-+           +-----------------+
   |D|------------->|d|---------->| Bulk data block |
   +-+              |d|--------+  +-----------------+
   | |              |d|------+ |
   +-+               .       | |  +-----------------+
   | |               .       | +->| Bulk data block |
    .                .       |    +-----------------+
    .               |d|-+    |
    .               +-+ |    |    +-----------------+
   | |                  |    +--->| Bulk data block |
   +-+                  |         +-----------------+
   | |                  |                 .
   +-+                  |                 .
                        |                 .
                        |         +-----------------+
                        +-------->| Bulk data block |
                                  +-----------------+
Legend:
D: pre-allocated descriptor
d: just in time allocated descriptor
-->: memory pointer (DMA)

The bulk data blocks are allocated by the respective device driver, above the virtio subsystem level (guest side).

There are exactly as many descriptors pre-allocated (D) as the size of a ring buffer.

A "descriptor" is more or less just a chainable DMA memory pointer; defined as:

/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
struct vring_desc {
        /* Address (guest-physical). */
        __virtio64 addr;
        /* Length. */
        __virtio32 len;
        /* The flags as indicated above. */
        __virtio16 flags;
        /* We chain unused descriptors via this, too */
        __virtio16 next;
};
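
To make the diagram above concrete, here is a simplified sketch (not the actual drivers/virtio/virtio_ring.c code) of how such a just-in-time allocated indirect descriptor table could be filled and chained; dma_addrs[] and lens[] stand in for the driver's scatter-gather list:

/* Simplified sketch, not the actual virtio_ring.c implementation.
 * Needs <linux/slab.h>, <linux/virtio_config.h>, <linux/virtio_ring.h>. */
static struct vring_desc *build_indirect_table(struct virtio_device *vdev,
                                               const dma_addr_t *dma_addrs,
                                               const u32 *lens,
                                               unsigned int n)
{
        struct vring_desc *table;
        unsigned int i;

        /* Allocated just in time, one table per message (the 'd' entries). */
        table = kmalloc_array(n, sizeof(*table), GFP_ATOMIC);
        if (!table)
                return NULL;

        for (i = 0; i < n; i++) {
                table[i].addr  = cpu_to_virtio64(vdev, dma_addrs[i]);
                table[i].len   = cpu_to_virtio32(vdev, lens[i]);
                table[i].flags = cpu_to_virtio16(vdev,
                                        i + 1 < n ? VRING_DESC_F_NEXT : 0);
                table[i].next  = cpu_to_virtio16(vdev, i + 1 < n ? i + 1 : 0);
        }

        /* The one pre-allocated ring descriptor (D) then points at this
         * table, with VRING_DESC_F_INDIRECT set and
         * len = n * sizeof(*table). */
        return table;
}

Under the current spec, n here must not exceed the Queue Size; with the proposed changes it would instead be bounded by the negotiated Queue Indirect Size.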

There are 2 ring buffers: the "available" ring buffer is for sending a message guest->host (it transmits the DMA addresses of guest-allocated bulk data blocks used for data sent to the device, plus separate guest-allocated bulk data blocks the host side will use to place its response bulk data), and the "used" ring buffer is for sending host->guest, to let the guest know about the host's response so that it can then safely consume and deallocate the bulk data blocks.
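
For reference, the corresponding split-ring layouts from the Linux uapi header (include/uapi/linux/virtio_ring.h) look like this; note that each ring entry is merely an index referring back to one of the pre-allocated descriptors (D):

struct vring_avail {
        __virtio16 flags;
        __virtio16 idx;            /* next free slot in ring[] */
        __virtio16 ring[];         /* indices of head descriptors (D) */
};

struct vring_used_elem {
        __virtio32 id;             /* index of head descriptor of the chain */
        __virtio32 len;            /* total bytes written by the device */
};

struct vring_used {
        __virtio16 flags;
        __virtio16 idx;
        struct vring_used_elem ring[];
};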

Linux drivers

Since torvalds/linux@44ed808 the Linux kernel tolerates individual drivers exceeding the Queue Size (thereby violating the spec). Likewise, some devices use device-specific config fields to negotiate a max. bulk data size that is lower than the Queue Size. The proposed spec changes would address both use cases in a clean way (see the sketch below).
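
Purely as a sketch of what the proposal would enable on the driver side (VIRTIO_RING_F_INDIRECT_SIZE and read_queue_indirect_size() are assumptions taken from the proposal, not existing kernel API; virtio_has_feature() and virtqueue_get_vring_size() are existing helpers):

/* Hypothetical sketch only: how a driver could derive its per-message
 * segment limit once the proposed feature exists. */
static unsigned int max_segments_per_message(struct virtio_device *vdev,
                                             struct virtqueue *vq)
{
        if (virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_SIZE))
                /* New limit, negotiated independently of the Queue Size
                 * (hypothetical transport helper for queue_indirect_size). */
                return read_queue_indirect_size(vdev, vq);

        /* Without the feature: stay within the Queue Size, as per spec. */
        return virtqueue_get_vring_size(vq);
}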

QEMU

QEMU likewise tolerates guest drivers exceeding the Queue Size. However, it currently has a hard-coded limit of max. 1024 memory segments:
#define VIRTQUEUE_MAX_SIZE 1024
which limits the max. bulk data size per virtio message to slightly below 4 MB.
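
For context, a rough back-of-the-envelope calculation (assuming 4 KiB pages and non-contiguous buffers, i.e. one page per memory segment):

    1024 segments x 4 KiB = 4 MiB

and since a few of those segments are presumably consumed by request/response headers, the usable payload per message ends up slightly below 4 MB.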
