Throughput vs. latency

There are two implementations of subwabbit.base.VowpalWabbitBaseModel. Both run a vw subprocess and communicate with it through pipes, but they differ in whether the pipe is blocking or nonblocking.

Blocking

subwabbit.blocking.VowpalWabbitProcess

The blocking implementation uses buffered binary I/O. When the predict() method is called, it runs a loop (sketched below) that:

  • creates a batch of VW lines
  • sends this batch to Vowpal and flushes the Python-side buffer into the system pipe buffer
  • waits for predictions from the previous batch (writing stays one batch ahead, so Vowpal should always be busy processing lines)
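
For illustration, here is a minimal sketch of such a batched blocking loop, written directly against subprocess pipes rather than through the subwabbit API; the vw arguments, the batch size and the prediction parsing are illustrative assumptions, not the library's actual internals:

    import subprocess

    BATCH_SIZE = 500  # illustrative value, not the library's default

    def predict_blocking(vw_lines):
        """Yield predictions; `vw_lines` are newline-terminated VW input lines."""
        vw = subprocess.Popen(
            ['vw', '--quiet', '-t', '-p', '/dev/stdout'],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        )
        pending = 0  # lines already written whose predictions were not read yet
        batch = []
        for line in vw_lines:
            batch.append(line)
            if len(batch) == BATCH_SIZE:
                # Send the batch and flush the Python-side buffer into the pipe.
                vw.stdin.write(''.join(batch).encode())
                vw.stdin.flush()
                # Read predictions for the previous batch; vw keeps working on
                # the batch we just sent while we prepare the next one.
                for _ in range(pending):
                    yield float(vw.stdout.readline().split()[0])
                pending = BATCH_SIZE
                batch = []
        # Drain: send the last partial batch and read all remaining predictions.
        if batch:
            vw.stdin.write(''.join(batch).encode())
        vw.stdin.close()
        for _ in range(pending + len(batch)):
            yield float(vw.stdout.readline().split()[0])
        vw.wait()

Every write() and readline() here may block, which gives high throughput but also means a single slow system call can overrun a deadline.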

There is also a train() method that works very similarly, but training is usually run on an instance created with write_only=True, so there is no need to wait for predictions.

Nonblocking

Warning

The nonblocking implementation is only available on Linux-based systems.

Warning

Training is not implemented for the nonblocking variant.

The blocking implementation has great throughput; depending on your features and the arguments of the vw process, it can even be optimal, with Vowpal itself as the bottleneck. However, because of blocking system calls it can miss the timeout, which is unacceptable when there is an SLO with low-latency requirements.

The nonblocking implementation works similarly to the blocking one, but it does not block on system calls when there are no predictions to read or when the system-level buffer for VW lines is full, which helps keep latencies very stable.
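
To make the idea concrete, here is a minimal sketch of a deadline-bounded, nonblocking read/write loop; it is not the subwabbit internals, and the vw arguments, the polling interval and the prediction parsing are assumptions:

    import os
    import select
    import subprocess
    import time

    # Long-lived vw subprocess; the arguments are illustrative only.
    vw = subprocess.Popen(
        ['vw', '--quiet', '-t', '-p', '/dev/stdout'],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0,
    )
    # Put both pipe ends into nonblocking mode (Linux only).
    os.set_blocking(vw.stdin.fileno(), False)
    os.set_blocking(vw.stdout.fileno(), False)
    recv_buffer = b''  # partially received prediction line carried between calls

    def predict_nonblocking(vw_lines, timeout=0.010):
        """Yield as many predictions as fit into `timeout` seconds."""
        global recv_buffer
        deadline = time.perf_counter() + timeout
        to_send = ''.join(vw_lines).encode()
        expected = to_send.count(b'\n')  # one prediction per sent line
        received = 0
        while received < expected and time.perf_counter() < deadline:
            if to_send:
                try:
                    # Write only as much as the kernel pipe buffer accepts now.
                    written = os.write(vw.stdin.fileno(), to_send)
                    to_send = to_send[written:]
                except BlockingIOError:
                    pass  # pipe buffer is full; retry later instead of blocking
            # Read whatever predictions are already available, without blocking.
            ready, _, _ = select.select([vw.stdout], [], [], 0.001)
            if ready:
                recv_buffer += os.read(vw.stdout.fileno(), 65536)
                *complete, recv_buffer = recv_buffer.split(b'\n')
                for line in complete:
                    received += 1
                    yield float(line.split()[0])
        # In this sketch unsent lines are simply dropped at the deadline; unread
        # predictions stay in the pipe, so a later call may spend part of its
        # budget draining them.

Because each write and read either completes immediately or returns control to the loop, the call cannot get stuck inside a system call past its deadline.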

Here is a comparison of the running time (in seconds) of the predict() method with the timeout set to 10 ms:

            pyvw  blocking  nonblocking
mean    0.010039  0.010929     0.009473
min     0.010012  0.010054     0.009049
25%     0.010025  0.010130     0.009142
50%     0.010036  0.010312     0.009355
75%     0.010048  0.010630     0.009804
90%     0.010063  0.010950     0.010024
99%     0.010091  0.013289     0.010140
max     0.010138  0.468903     0.010999

The nonblocking implementation reduces latency peaks significantly: in the worst case it overran the 10 ms timeout by roughly 1 ms, while the blocking implementation overran it by almost 460 ms.

The nonblocking implementation makes more system calls with smaller batches than the blocking implementation, and this comes at the price of slightly lower throughput.

Number of predicted lines per predict() call:

              pyvw    blocking  nonblocking
mean    239.461000  1033.70000   911.890000
min      83.000000   100.00000     0.000000
25%     192.750000   650.00000   552.000000
50%     240.000000  1000.00000   841.500000
75%     288.000000  1350.00000  1271.750000
90%     316.000000  1600.00000  1574.000000
99%     349.000000  1900.00000  1900.130000
max     362.000000  2050.00000  2022.000000

Note

The nonblocking implementation may even return zero predictions for a call. This can happen when the previous call did not have enough time to clean its buffers before the timeout, so the next call has to clean the buffers first, which can take all of its time. See the metrics argument of predict() for details on how to monitor this behavior.
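
As an illustration only, here is a hedged sketch of how a caller might watch for such zero-prediction calls; `model` and `lines` are placeholders, and the exact predict() signature and the contents of the metrics dict are assumptions, so consult the API reference for the real interface:

    import logging

    logger = logging.getLogger(__name__)

    # `model` and `lines` are placeholders; the exact predict() signature and
    # the keys in `metrics` are assumptions, see the API reference.
    metrics = {}
    predictions = list(model.predict(lines, timeout=0.010, metrics=metrics))
    if not predictions:
        # Likely the previous call left unread predictions in the pipe and this
        # call spent its whole budget draining them.
        logger.warning('predict() returned 0 predictions, metrics: %s', metrics)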