Throughput vs. latency¶

There are two implementations of subwabbit.base.VowpalWabbitBaseModel. Both implementations run vw subprocess and communicates with subprocess through pipes, but implementations differ in whether pipe is blocking or nonblocking.

Blocking¶

subwabbit.blocking.VowpalWabbitProcess

Blocking implementation use buffered binary IO. When predict() method is called, there is loop that:

creates batch of VW lines
sends this batch to Vowpal and flush Python-side buffer into system pipe buffer
waits for predictions from last but one batch (writing is one batch ahead, so Vowpal should always be busy with processing lines)

There is also train() method that looks very similar, but usually you run training on instance with write_only=True so there is no need to wait for predictions.

Nonblocking¶

subwabbit.nonblocking.VowpalWabbitNonBlockingProcess

Warning

Nonblocking implementation is only available for Linux based systems.

Warning

Training is not implemented for nonblocking variant.

Blocking implementation has great throughput, depends on features you have and arguments of vw process, it can be even optimal, so Vowpal itself is a bottleneck. However, due to blocking system calls, it can miss timeout. That is unacceptable if there is SLO with low-latency requirements.

Nonblocking implementation works similar to blocking, but it does not block for system calls when there are no predictions to read or system level buffer for VW lines is full, which helps to keep latencies very stable.

There is comparison of running time of predict() method with timeout set to 10ms:

	pyvw	blocking	nonblocking
mean	0.010039	0.010929	0.009473
min	0.010012	0.010054	0.009049
25%	0.010025	0.010130	0.009142
50%	0.010036	0.010312	0.009355
75%	0.010048	0.010630	0.009804
90%	0.010063	0.010950	0.010024
99%	0.010091	0.013289	0.010140
max	0.010138	0.468903	0.010999

Nonblocking implementation reduced latency peaks significantly, from almost 460ms to just 1ms.

Nonblocking implementation makes more system calls with smaller batches then blocking implementation and it comes with price of slightly lower throughput.

Predicted lines per request:

	pyvw	blocking	nonblocking
mean	239.461000	1033.70000	911.890000
min	83.000000	100.00000	0.000000
25%	192.750000	650.00000	552.000000
50%	240.000000	1000.00000	841.500000
75%	288.000000	1350.00000	1271.750000
90%	316.000000	1600.00000	1574.000000
99%	349.000000	1900.00000	1900.130000
max	362.000000	2050.00000	2022.000000

Note

Nonblocking implementation may have even zero predictions per call. It can happen due to previous call not having enough time to clean buffers before timeout, thus next call has to clean buffers and that can take all of it’s time. See predict() metrics argument for details how to monitor this behavior.