Low-Latency and High-Throughput Inference for Long Context with Sequence Parallelism (aka Arctic Ulysses)

Low latency and high throughput are essential for deploying large language models (LLMs) in enterprise AI. Low latency powers interactive experiences like chat and enables timely responses in complex agentic workflows. High throughput, on the other hand, reduces operational costs — making LLMs more affordable to deploy at scale and more accessible to customers.
Yet, achieving both low latency and high throughput remains a core challenge — especially for long-context inference tasks, such as retrieval-augmented generation (RAG), summarization or code generation. Longer inputs demand significantly more computation, leading to increased time-to-first-token (TTFT) and degraded user experience (see Figure 2).
From a systems perspective, reducing TTFT for long sequences typically requires aggressive tensor parallelism (TP), as implemented in inference libraries like vLLM and TensorRT-LLM. TP splits the model’s compute across multiple GPUs to reduce latency — but this comes at a steep cost. TP introduces heavy inter-GPU communication via all-reduce collectives, limiting its scalability beyond a few GPUs. As a result, while TP can reduce latency, it may incur a significant drop in throughput per GPU (see Figure 2), resulting in higher costs.

To break this trade-off between latency and throughput, today we are excited to share the release of Arctic Ulysses. Ulysses is a form of sequence parallelism (SP) originally developed by DeepSpeed for long-context training; we have implemented and optimized it from the ground up for inference, and we refer to this implementation as Arctic Ulysses. Unlike TP, which partitions each token's model computation across GPUs and relies on costly, high-data-volume all-reduce collectives to communicate between them, Ulysses partitions the input sequence and uses low-data-volume all-to-all communication. This allows Arctic Ulysses to reduce latency by up to 6.82x while achieving up to 1.46x better compute efficiency than TP in our experiments (see Figures 4 and 5), in effect achieving both low latency and high throughput.
In the rest of the blog, we take a deep dive into the fundamental communication bottleneck incurred by TP, describe Arctic Ulysses and explain how it overcomes this bottleneck. We also discuss our evaluation of Ulysses and share how users can get started with our open-sourced, vLLM-compatible implementation to achieve both low latency and high throughput.
Understanding the latency-throughput trade-off with TP
Figure 2 shows the trade-off between latency and throughput for Meta’s Llama 70B FP8 model with various input sequence lengths. We start with TP=2, which is the minimum TP degree needed to fit the model in memory for inference, and then scale it up to TP=8.
In Figure 2 (left), with a 32K input sequence and TP=2, TTFT is approximately 4.5 seconds. Scaling up to TP=8 (utilizing the full node) reduces TTFT to 1.5 seconds, a 2.7x speedup — greatly improving user experience for long-context inputs.
However, Figure 2 (right) shows that increasing TP to 8 significantly diminishes throughput. The combined input/output token throughput drops by up to 40% compared to TP=2, especially for smaller input sequences. In fact, running four independent TP=2 engines in parallel is significantly more compute-efficient, although each individual engine suffers higher latency.
This highlights a key systems-level trade-off: TP=8 improves latency but reduces throughput per GPU, while TP=2 optimizes throughput but increases latency.

The throughput trade-off stems from a communication-to-compute ratio that grows with the TP degree. TP partitions the model's linear layers along either the output dimension or the reduction dimension, linearly reducing both memory and computation per GPU. However, when the reduction dimension is partitioned across GPUs, each GPU produces partial results that must be combined using all-reduce operations. This happens twice per transformer layer: once after the attention and once after the multilayer perceptron (MLP).
Crucially, the communication volume of these all-reduce operations depends only on the number of input sequence tokens and the hidden dimension of the model, not on the number of GPUs used. Therefore, for a given model and a fixed number of input tokens, the computation per GPU shrinks as TP grows while the communication does not, so the communication-to-compute ratio increases (shown in Table 1). A growing fraction of inference time is spent on communication, degrading both latency scalability and throughput efficiency as more GPUs are added.
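To make this scaling argument concrete, here is a small back-of-the-envelope sketch. The hidden and intermediate sizes are illustrative assumptions roughly matching a Llama-70B-class layer, FP8 activations are assumed, and details such as grouped-query attention are ignored:

# Back-of-the-envelope estimate of TP's communication-to-compute ratio per layer.
# All numbers below are illustrative assumptions, not measurements from this blog.
HIDDEN = 8192          # assumed model hidden dimension
INTERMEDIATE = 28672   # assumed MLP intermediate dimension
BYTES_PER_ELEM = 1     # FP8 activations
TOKENS = 32 * 1024     # a 32K-token prompt

def per_gpu_stats(tp: int):
    # Two all-reduces per layer (after attention and after the MLP); their size
    # depends only on tokens x hidden, not on the TP degree.
    comm_bytes = 2 * TOKENS * HIDDEN * BYTES_PER_ELEM
    # Rough per-layer GEMM FLOPs (attention projections + MLP), split across TP ranks.
    layer_flops = 2 * TOKENS * (4 * HIDDEN * HIDDEN + 3 * HIDDEN * INTERMEDIATE)
    return comm_bytes, layer_flops / tp

for tp in (2, 4, 8):
    comm, compute = per_gpu_stats(tp)
    print(f"TP={tp}: comm/compute = {comm / compute:.2e} bytes per FLOP")
# The ratio grows linearly with TP: per-GPU compute shrinks, communication does not.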

Arctic Ulysses: Communication-efficient sequence parallelism
To overcome the scalability limitations of TP, we take inspiration from DeepSpeed Ulysses, a form of sequence parallelism (SP) originally designed for training very long sequences that would not otherwise fit in GPU memory, and adapt it to speed up inference. We refer to this inference implementation of Ulysses as Arctic Ulysses.
Unlike TP, which increases the communication-to-compute ratio as more GPUs are added, Arctic Ulysses maintains a constant ratio by reducing both compute and communication per GPU proportionally. This leads to much better scalability in terms of latency and throughput.
Let’s dive into how it works.
Both TP and SP can leverage multiple GPUs to process a single sequence and hence reduce TTFT. However, the two methods differ fundamentally in how they partition work:
TP splits the linear layers across the hidden dimension or the reduction dimension.
SP splits the input sequence across GPUs and leaves the hidden and reduction dimensions untouched.
This is shown in Figure 3 (1) with SP=2, where the input sequence is split into two halves. Note that in a transformer architecture, other than the core attention computation, there are no dependencies between two tokens in a sequence. Therefore, except for the core attention, these partial sequences can be computed entirely in parallel on two GPUs.
But this prompts the question: How do we handle the dependency between tokens in the core attention computation?
Arctic Ulysses attention via all-to-all communication
During attention there is a dependency across tokens within a sequence, but individual attention heads can be computed independently. Ulysses leverages this observation along with the all-to-all communication collective to transition from sequence parallelism to attention-head parallelism before the attention computation, as shown in Figure 3 (2).
There are four attention heads in this example. Before the all-to-all communication, each GPU has a partial sequence for all attention heads; after the all-to-all communication, each GPU has the entire sequence, but only for a subset of the attention heads. This allows each GPU to compute, in parallel, the attention for the subset of heads that it owns. After the attention, Ulysses performs another all-to-all communication to switch back to the original SP layout (Figure 3 (3)), where each GPU once again has the full embedding (all attention heads) but only a partial sequence. As mentioned before, the rest of the transformer remains parallel across the sequence.
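A simplified sketch of this layout switch using torch.distributed primitives is shown below. It illustrates the idea only and is not the ArcticInference implementation; the function names and tensor layouts are assumptions, and batching, GQA and dtype handling are omitted:

import torch
import torch.distributed as dist

def seq_to_head_parallel(x: torch.Tensor, sp_group) -> torch.Tensor:
    """Per rank: [seq/sp, heads, head_dim] -> [seq, heads/sp, head_dim]."""
    sp = dist.get_world_size(sp_group)
    seq_shard, heads, head_dim = x.shape
    # all_to_all_single sends chunk i (along dim 0) to rank i, so we index the
    # head groups first: head group i goes to rank i.
    x = x.reshape(seq_shard, sp, heads // sp, head_dim).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=sp_group)
    # Chunks received from ranks 0..sp-1 are consecutive sequence shards.
    return out.reshape(sp * seq_shard, heads // sp, head_dim)

def head_to_seq_parallel(x: torch.Tensor, sp_group) -> torch.Tensor:
    """Per rank: [seq, heads/sp, head_dim] -> [seq/sp, heads, head_dim]."""
    sp = dist.get_world_size(sp_group)
    seq, heads_shard, head_dim = x.shape
    # Sequence shard i goes back to rank i.
    x = x.reshape(sp, seq // sp, heads_shard, head_dim).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=sp_group)
    # Received chunks are the head groups from every rank; fold them back together.
    return out.permute(1, 0, 2, 3).reshape(seq // sp, sp * heads_shard, head_dim)

In this sketch, each rank would call seq_to_head_parallel on its Q, K and V shards, run its usual attention kernel over the full sequence for its head subset, and then call head_to_seq_parallel on the attention output to return to the sequence-parallel layout.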

Communication-to-compute ratio analysis
Arctic Ulysses eliminates the all-reduce communications, whose volume is fixed regardless of the parallelism degree, and replaces them with two all-to-all communications whose per-GPU volume decreases with the SP degree. As shown in Table 2, SP reduces both the communication and the computation per GPU, so their ratio remains constant.
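The same argument in a minimal, unit-free sketch: constant factors are ignored, and the point is only how the per-GPU communication-to-compute ratio evolves with the parallelism degree p under TP versus Ulysses SP:

# Unit-free scaling comparison (illustrative, constant factors ignored).
ACTIVATION = 1.0  # stands in for one layer's tokens x hidden_dim of data

for p in (2, 4, 8):
    tp_comm = ACTIVATION       # all-reduce volume per GPU does not shrink with p
    sp_comm = ACTIVATION / p   # all-to-all volume per GPU shrinks with p
    compute = 1.0 / p          # per-GPU compute shrinks with p in both schemes
    print(f"p={p}: TP ratio = {tp_comm / compute:.1f}, SP ratio = {sp_comm / compute:.1f}")
# TP's communication-to-compute ratio grows linearly with p; Ulysses' stays constant.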

Combining SP with TP for memory efficiency
While Arctic Ulysses offers better scalability, one key difference from TP is that it does not reduce memory usage for model weights, since each GPU holds a full copy of the model.
To address this, we combine SP with TP:
TP partitions the model weights across GPUs, enabling large LLMs to fit in aggregate memory.
SP is then layered on top to reduce TTFT and improve scalability, thanks to its favorable communication-to-compute characteristics (see Row 3 in Table 2, where lowercase tp and sp denote the tensor and sequence parallelism degrees, respectively, and tp x sp is the total parallelism degree).
This hybrid approach allows us to unlock both memory efficiency and scalable performance, making it ideal for long-context, low-latency inference at scale.
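To make the hybrid layout concrete, here is a hypothetical rank-grouping sketch for 8 GPUs with tp=2 and sp=4. The actual group assignment in ArcticInference may differ; this only illustrates how every GPU ends up in one weight-sharded TP group and one sequence-sharded SP group:

# Hypothetical rank layout for 8 GPUs with tp=2 and sp=4 (illustrative only).
TP, SP = 2, 4

for rank in range(TP * SP):
    tp_rank = rank % TP   # position within the TP group (which weight shard it holds)
    sp_rank = rank // TP  # position within the SP group (which sequence shard it processes)
    tp_group = [sp_rank * TP + r for r in range(TP)]  # ranks holding the other weight shards
    sp_group = [s * TP + tp_rank for s in range(SP)]  # ranks holding the other sequence shards
    print(f"rank {rank}: tp_group={tp_group}, sp_group={sp_group}")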
Optimized implementation
We created an optimized implementation of Arctic Ulysses and integrated it with vLLM using vLLM's plug-in system. In doing so, we enable vLLM's compiler optimizations as well as its CUDA Graph capture mechanism to hide latency overheads. It also lets us leverage vLLM's custom all-reduce implementation, which is specialized for capturing communications effectively on NVLink systems. For the Ulysses attention itself, we create an NCCL communicator for SP, which is used to implement the all-to-all communications depicted for SP=2 in Figure 3. To minimize the added overhead of all-to-all, we fuse the multiple smaller communications required for Q, K and V into a single large communication operation. As a result, Arctic Ulysses incurs only a small communication overhead of no more than 7%, as shown in Figure 6.
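As a rough illustration of this fusion (not the ArcticInference kernel; it assumes Q, K and V have the same number of heads and ignores GQA), the three projection outputs can be stacked and exchanged with a single all-to-all instead of three:

import torch
import torch.distributed as dist

def fused_qkv_all_to_all(q, k, v, sp_group):
    """One all-to-all for Q, K and V instead of three smaller ones.

    Inputs are [seq_shard, heads, head_dim] per rank; outputs are
    [seq, heads // sp, head_dim] per rank (head-parallel layout).
    """
    sp = dist.get_world_size(sp_group)
    seq_shard, heads, head_dim = q.shape
    # Stack Q, K, V and group by destination rank (head group) up front.
    qkv = torch.stack([q, k, v], dim=1)                # [seq_shard, 3, heads, head_dim]
    qkv = qkv.reshape(seq_shard, 3, sp, heads // sp, head_dim)
    qkv = qkv.permute(2, 0, 1, 3, 4).contiguous()      # [sp, seq_shard, 3, heads/sp, head_dim]
    out = torch.empty_like(qkv)
    dist.all_to_all_single(out, qkv, group=sp_group)   # single fused exchange
    out = out.reshape(sp * seq_shard, 3, heads // sp, head_dim)
    q_full, k_full, v_full = out.unbind(dim=1)
    return q_full, k_full, v_full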
Evaluation
In this section, we demonstrate that Arctic Ulysses scales better and outperforms both ends of the trade-off: up to 6.8x lower latency than the best throughput-optimized configuration, and up to 1.5x higher throughput than the best latency-optimized configuration.
Methodology: We compare time-to-first-token and combined throughput of Llama-3.1-70B FP8 and Qwen-2.5-32B running on a DGX box with 8xH100-80GB GPUs. We measure time-to-first-token in a low-traffic regime where requests arrive one at a time, using a latency-optimized TP baseline with TP=8. We measure combined throughput in a high-traffic regime, where the arrival rate is high enough to saturate the system. As our throughput-optimized baseline we use four replicas of TP=2 running vLLM V1, which achieves up to 612 TFLOPS/GPU (Figure 4), indicating that the baseline is strong. For Arctic Ulysses, we use TP=2, SP=4 for both the time-to-first-token and combined-throughput measurements.


Results: Figure 4 shows that for Meta Llama 3.1-70B, Arctic Ulysses enables a latency reduction of up to 3.7x compared to the throughput-optimized TP config (TP=2), while achieving up to 41% higher throughput than the latency-optimized TP config (TP=8). Similarly, Figure 5 shows that for the Qwen2.5-32B model, Arctic Ulysses enables a latency reduction of up to 6.8x compared to the throughput-optimized TP config (TP=1), while achieving up to 1.5x higher throughput than the latency-optimized TP config (TP=8). Unlike TP, which exhibits a significant trade-off between TTFT and throughput, Ulysses achieves both low TTFT and higher combined throughput.
Overall performance trend: First, our results show that Arctic Ulysses achieves significantly better latency than throughput-optimized TP across the entire range of sequence lengths we tested. Second, the combined-throughput improvements over latency-optimized TP are significant at shorter sequences but diminish at very long sequences. This is not a limitation of Arctic Ulysses but rather a consequence of the quadratic complexity of attention, which makes the attention computation the primary bottleneck for very long sequences (see Figure 6) regardless of the parallelism used. We discuss this further in the concluding remarks below.
And finally, while Arctic Ulysses achieves much better throughput than latency-optimized TP, there is still a gap compared to throughput-optimized TP. This gap comes from a combination of the all-to-all overhead and vLLM's token-processing overhead at larger parallelism degrees (see Figure 6).

Real-world latency and cost considerations
Given the overall performance trends described above, a natural question arises: since TTFT is fairly low for shorter sequences even with throughput-optimized TP, and Arctic Ulysses does not improve combined throughput significantly for longer sequences, why not just use latency-optimized TP for long sequences and throughput-optimized TP for short sequences? Wouldn't that be the way to get the best of both worlds?
The answer lies in the real-world cost of deployment. While it is fairly straightforward to optimize for latency/throughput and long/short sequences using different TP configurations, doing so requires multiple specialized deployments of the same model plus an appropriate routing setup, so that latency-sensitive requests with long sequences go to the replica optimized for them, while throughput-sensitive, short-sequence requests go to the replicas optimized for throughput. Having more replicas means higher deployment costs, especially when traffic isn't high enough to saturate all specialized deployments, and the routing setup adds engineering overhead and system complexity.
Arctic Ulysses addresses this challenge with a single parallelism configuration that balances both low latency and high throughput — thereby eliminating the need for multiple specialized deployments. While its throughput for short sequences is slightly lower than that of a throughput-optimized TP setup, Arctic Ulysses still delivers significantly better throughput than latency-optimized configurations. More importantly, it simplifies deployment and reduces cost by avoiding the need for multiple replicas tailored to different performance profiles, along with the complex routing logic they require.
Getting Started with Arctic Ulysses
Arctic Ulysses is implemented as part of the vLLM-compatible ArcticInference project. ArcticInference is a new library from Snowflake AI Research that contains current and future LLM inference optimizations developed at Snowflake. It is integrated with vLLM v0.8.1 using vLLM's custom plugin feature, allowing us to develop and integrate inference optimizations quickly into vLLM and make them available to the community.
Once installed, ArcticInference automatically patches vLLM to use Arctic Ulysses and other optimizations implemented in ArcticInference, and users can continue to use their familiar vLLM APIs and CLI. It’s easy to get started!
Install vLLM and ArcticInference:
pip install arctic-inference[vllm]
Note: Currently, Arctic Ulysses works only with certain model architectures, namely Llama and Qwen. We are working on enabling Ulysses more broadly across any model architecture, which will be part of a later release of ArcticInference.
ArcticInference adds an additional configuration parameter, sequence_parallel_size, to the existing vLLM installation. For example, the following short script runs vLLM with Arctic Ulysses with sequence_parallel_size = 2 and tensor_parallel_size = 2 in batch inference mode (requires 4 GPUs total):
import vllm
from vllm import LLM, SamplingParams

# Manually load the ArcticInference plugin.
vllm.plugins.load_general_plugins()

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    sequence_parallel_size=2,
)

print("=" * 80)

conversation = [
    {
        "role": "user",
        "content": "Hello",
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?",
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

sampling_params = SamplingParams(temperature=0.1, max_tokens=800)
outputs = llm.chat(conversation, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Upon running the script above, the arctic_inference plugin will first be loaded into vLLM, yielding lines of output like:
$ python offline_inference_ulysses.py
INFO 04-01 16:14:44 [__init__.py:256] Automatically detected platform cuda.
INFO 04-01 16:14:47 [__init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-01 16:14:47 [__init__.py:32] name=arctic_inference, value=arctic_inference.vllm.plugins:arctic_inference_plugin
INFO 04-01 16:14:47 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-01 16:14:47 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-01 16:14:47 [__init__.py:44] plugin arctic_inference loaded.
The log shows that vLLM has detected the ArcticInference plugin. vLLM then starts with Arctic Ulysses and generates the expected output:
The Importance of Higher Education
Higher education is a vital component of personal and societal development. It plays a significant role in shaping individuals into capable and informed members of society, equipped with the knowledge, skills, and critical thinking abilities necessary to succeed in an increasingly complex and interconnected world. In this essay, we will explore the importance of higher education and its far-reaching benefits for individuals, communities, and the economy.
...
Using Arctic Ulysses in online serving mode: ArcticInference also adds a --sequence-parallel-size parameter to the serving CLIs, so Ulysses can be used easily in online serving mode.
$ vllm serve meta-llama/Llama-3.1-8B-Instruct --sequence-parallel-size 2 ...
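Once the server is up, requests can be sent through vLLM's OpenAI-compatible API, for example with the standard openai Python client (assuming the default port 8000; adjust if you pass --port):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write an essay about the importance of higher education."}],
    max_tokens=256,
)
print(response.choices[0].message.content)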
Concluding remarks and looking ahead
From an AI systems perspective, Arctic Ulysses offers a leap forward for long-sequence inference by unlocking both low latency and high throughput. The former improves user experience and enables new latency-sensitive workloads, while the latter reduces cost. While this is a good step toward improving AI inference systems for long sequences, there are still significant opportunities for improvement.
For example, in this blog we have primarily focused on TTFT, which is a key metric for enabling a better user experience, especially for long sequences. Another important latency metric is time per output token (TPOT). For chat-based use cases, we want the generation speed to be faster than the rate at which humans can read, which is around 10 tokens/sec, or 100 ms per token. While Ulysses achieves a generation speed significantly faster than this (~30 tokens/sec) and is up to 1.6x faster than throughput-optimized TP for long sequences, it can also be noticeably slower than latency-optimized TP. This is shown in Table 3.
Input length | 2K | 8K | 32K | 64K | 128K
TP=2 | 23.0 | 22.5 | 28.8 | 37.2 | 53.2
TP=8 | 15.4 | 13.7 | 13.7 | 16.0 | 20.5
(TP=2, SP=4) | 33.7 | 33.2 | 33.7 | 30.7 | 33.88
Table 3. Comparing TPOT (ms per output token) across TP and SP configurations.
As a different example, at very long sequences the quadratic nature of attention becomes dominant, and a significant fraction of inference latency goes into attention (Figure 6). This is true regardless of the form of parallelism used, which is why overall throughput drops as sequence length increases. Improving overall throughput here requires reducing the computational complexity of the attention itself.
While Ulysses demonstrates strong performance on metrics like TTFT and combined throughput across sequence lengths, the broader challenge remains: no single parallelism strategy solves all problems simultaneously. Each method optimizes for a particular bottleneck but often introduces trade-offs elsewhere. Similarly, the throughput degradation at very long sequences caused by the quadratic complexity of attention shows that systems optimization alone is not always sufficient to cover all metrics across all workloads; algorithmic optimizations may also be needed.
Our work with Ulysses highlights the need for more unified AI systems and algorithmic designs that can balance performance across latency, throughput, and scalability holistically. We’re actively exploring directions in this space and look forward to sharing future progress soon. Stay tuned!