Parallelize speculative decoding with P

As large language models (LLMs) grow in size and complexity, maximizing inference throughput while minimizing latency remains a critical challenge for enterprise production deployments. Speculative decoding is one effective strategy to address this, utilizing a lightweight draft model to guess future tokens which are then verified by the target LLM in a single forward pass. While state-of-the-art frameworks like Extrapolation Algorithm for Greater Language-model Efficiency (EAGLE) have achieved impressive speedups, they encounter a hidden architectural ceiling: their draft tokens are generated autoregressively. Because each draft token depends on the output of the previous one, producing K candidates requires K sequential forward passes through the draft head, creating a latency cost that grows linearly with speculation depth. EAGLE-3, the latest iteration, improved upon earlier versions by predicting tokens directly rather than features and by combining representations from multiple layers of the target model, boosting draft accuracy and allowing the method to benefit from larger training datasets. However, even with these gains, the fundamental sequential drafting constraint remains. The deeper you speculate, the more drafting overhead you accumulate, eventually eating into your performance gains.

To overcome this bottleneck, AWS invented Parallel-EAGLE (P-EAGLE) and contributed it to open source, a breakthrough method that transforms speculative decoding from an iterative process into a fully parallelized operation. P-EAGLE completely eliminates the nested sequential drafting phase by predicting all speculative draft tokens simultaneously in a single forward pass. To illustrate: if the target model generates the token “Paris,” EAGLE needs four sequential drafter passes to propose the next four tokens (“, known for its”). P-EAGLE instead fills positions 2–4 with learnable placeholders and predicts all four tokens at once (see Figure in Solution Overview). By decoupling the draft token count from the number of sequential forward passes, P-EAGLE allows for deeper speculation without scaling up latency overhead. On real-world benchmarks running on advanced high-performance hardware, this highly parallelized approach delivers up to a 1.69x throughput speedup over vanilla EAGLE frameworks.

Today, Amazon SageMaker JumpStart now natively supports P-EAGLE for an array of popular foundation models. SageMaker JumpStart provides a curated hub of state-of-the-art open-weight models that can be deployed with a single click or a few lines of code. By combining the model optimization of P-EAGLE with the fully managed environment of Amazon SageMaker AI, developers can now deploy P-EAGLE-accelerated inference endpoints that are up to 1.69x faster than EAGLE-3, without managing complex underlying CUDA kernels or distributed serving setups.

This post walks you through how to use P-EAGLE directly within Amazon SageMaker AI. It will demonstrate how to select a compatible model from the SageMaker JumpStart catalog, configure the parallel drafting specifications, and deploy a highly optimized real-time SageMaker AI endpoint to accelerate your generative AI applications.

The following benchmarks compare P-EAGLE, EAGLE-3, and standard inference (no speculation) on Qwen3-Coder-30B-A3B-Instruct running on NVIDIA B200 GPUs with FP8 quantization. Results are measured in estimated total output tokens per second (OTPS).

Output tokens per second comparison across concurrency levels. P-EAGLE (best K) consistently outperforms EAGLE-3 and baseline across both benchmarks.

P-EAGLE / EAGLE-3 ratio compares the best P-EAGLE configuration against the best EAGLE-3 configuration at each concurrency level.

The following screen recording demonstrates P-EAGLE in action on Qwen3-Coder-30B-A3B-Instruct.

Qwen3-Coder-30B-A3B-Instruct on Amazon SageMaker AI endpoints running on ml.g7e.2xlarge. P-EAGLE Parallel K=3 (left) compared to standard inference (right) in tokens per second.

Getting started with P-EAGLE on SageMaker JumpStart

Amazon SageMaker JumpStart provides a one-click deployment experience for foundation models with P-EAGLE parallel speculative decoding. At launch, the following four models are available with pre-trained P-EAGLE heads:

You can deploy each of these models directly from the JumpStart model hub with P-EAGLE pre-configured. No manual drafter training, custom containers, or vLLM configuration is required. This walkthrough demonstrates the deployment process using Qwen3-Coder-30B-A3B-Instruct.

To follow this walkthrough, you need:

Amazon SageMaker Studio home page with JumpStart / Models in the left navigation.

In the JumpStart model hub, search for Qwen3-Coder-30B-A3B-Instruct. This is a high-performance reasoning model with a 3-billion-parameter active mixture-of-experts configuration, making it a candidate for speculative decoding acceleration.

Searching for “qwen3-coder-30b” in the JumpStart model hub.

Choose the model to open its card page. Here you can review the model’s highlights, license information, and supported deployment options. Choose the Deploy button in the top-right corner. This opens the one-click deployment flow with P-EAGLE pre-configured.

Model card for Qwen3-Coder-30B-A3B-Instruct showing Evaluate, Deploy, and Train actions.

After choosing Deploy, the endpoint configuration page appears. Under the Models section at the bottom, the model is tagged as Inference Optimized, indicating that P-EAGLE speculative decoding is pre-configured. Choose the right arrow next to the model name to expand and view the environment variables.

Deployment configuration page with instance type, count, and inference type settings.

Scroll down to the Environment variables section. The key configuration for P-EAGLE is the SM_VLLM_SPECULATIVE_CONFIG environment variable, which is pre-populated with the following:

This tells the vLLM inference server to load the pre-trained P-EAGLE drafter head. P-EAGLE is integrated natively as a parallel-drafting extension of the EAGLE-3 architecture. Specifying "parallel_drafting": true activates the P-EAGLE pipeline, which automatically performs parallel multi-token drafting under the hood. The num_speculative_tokens parameter controls how many tokens are drafted in each single forward pass.

Environment variables showing SM_VLLM_SPECULATIVE_CONFIG with the P-EAGLE drafter configuration.

Choose Deploy to create the endpoint. SageMaker AI provisions the instance, downloads the model artifacts and P-EAGLE drafter head, and starts the vLLM inference server. After a few minutes, the endpoint status transitions to In service (green), confirming that the model is ready to accept inference requests.

Endpoint summary showing “In service” status on ml.g7e.2xlarge with real-time inference type.

Navigate to the Playground tab on the endpoint page to test inference directly from the AWS Management Console. Use a payload that is in vLLM-compatible chat completion format, such as the following:

Choose Send Request to invoke the endpoint. The response appears in the right-hand Inference Result panel, showing the model’s generated completion along with latency metrics.

Inference result showing a successful response in 3,318 ms with P-EAGLE speculative decoding active.

The endpoint is now ready to serve production traffic with improved throughput compared to standard autoregressive decoding.

Important: SageMaker AI real-time inference endpoints incur charges while running, regardless of whether they are actively serving requests. To avoid unnecessary costs, delete the endpoint when it’s no longer needed.

To delete the endpoint, follow these steps.

Endpoint delete confirmation dialog in Amazon SageMaker Studio.

P-EAGLE achieves parallel draft generation by replacing the sequential dependency chain in autoregressive EAGLE with learnable placeholder representations. These placeholders let all draft positions be computed at the same time, removing the linear relationship between speculation depth and drafter latency.

In autoregressive EAGLE, drafting a single token requires two inputs: (1) the token embedding of the previously predicted token, and (2) the hidden state produced by the drafter at the previous position. To predict token t1, the drafter takes the token embedding of the target model’s last generated token and the hidden state the target model produced when generating it. To predict t2, it needs the embedding of t1 and the hidden state used to predict t1, both of which only become available after the first forward pass completes. This chain repeats for each subsequent position. Producing K draft tokens requires K sequential forward passes.

P-EAGLE resolves this dependency by introducing two learnable parameters that stand in for the missing inputs at future positions:

With these placeholders, all K draft positions can be constructed in parallel and processed through the drafter’s transformer layers in a single forward pass.

Each P-EAGLE drafting iteration proceeds in two steps.

Step 1 – Target model forward pass. The target model processes the current context and generates a new token (standard autoregressive generation). During this pass, P-EAGLE captures hidden states from multiple layers of the target model (layers 2, L/2, and L−1, concatenated to 3d dimensions). These hidden states encode the target model’s contextual understanding at the most recently generated position.

Step 2 – Parallel draft generation. The drafter constructs K input positions at the same time:

All K positions pass together through N transformer layers (the drafter uses 4 layers in practice, comprising only 2–5 percent of target model parameters), and then through the language model head to produce K draft token predictions at the same time. The target model then verifies all K candidates in a single verification pass using standard speculative decoding acceptance criteria.

The shift from sequential to parallel drafting has several practical implications for deployment:

EAGLE compared to P-EAGLE architecture. In EAGLE (top), each draft position requires the token embedding and hidden state from the previous position, creating a sequential dependency chain that requires K forward passes to produce K=4 draft tokens. P-EAGLE (bottom) breaks this chain by substituting learnable placeholders ([MASK] token embedding and a shared hidden state h_shared) at positions 2–K. All draft tokens are generated in a single forward pass with no sequential dependencies.

P-EAGLE represents a fundamental shift in how speculative decoding handles draft generation. By replacing the sequential autoregressive drafting pipeline with parallel multi-token prediction, P-EAGLE removes the linear relationship between speculation depth and drafter latency. This supports deeper, more aggressive speculation at no additional cost. The result is up to 1.69× throughput improvement over EAGLE-3 on production workloads, with no compromise to output quality.

With native support in Amazon SageMaker JumpStart, deploying P-EAGLE-accelerated models is now a one-click experience. The combination of a lightweight drafter architecture, scalable long-context training, and SageMaker AI integration makes P-EAGLE a practical path to faster inference for production AI applications. To get started, open the Amazon SageMaker AI console, navigate to JumpStart, and deploy one of the supported P-EAGLE models. For more information on the P-EAGLE architecture and training methodology, see the P-EAGLE paper on arXiv and the vLLM integration blog post. To learn more about model deployment on Amazon SageMaker AI, see the Amazon SageMaker AI documentation. To train an EAGLE head on your own data, Amazon SageMaker AI also supports that capability, which launched last year.

We would like to acknowledge the contributions and collaboration from Kyle Ulrich, Hemant Singh, Ashish Khetan, Evan Kravitz, Mike James, Xu Deng, and Kareem Syed-Mohammed.

Parallelize speculative decoding with P

Related Stories

Ukraine war briefing: Drones strike Russia’s Tyumen oil refinery 2,000km away, says Zelenskyy

Colombia’s runoff election expected to trigger shift in decades

Public event held to promote ban of smartphones

Flags to be flown ahead of Armed Forces Day

Election candidates reflect on negative social media

'Two Lads' stone monuments on moors to be rebuilt

Libraries to open on Sundays again after 15 years

Indian buyers return cautiously to Dubai realty market after US