Research Note · Azure · AI Pricing

Azure OpenAI Service: commitment & pricing.

Azure OpenAI is bought two ways: consumption-based pay-as-you-go tokens, or committed Provisioned Throughput Units (PTUs) reserved hourly, monthly, or annually. This note explains how each model is priced, where the reservation break-even sits, and the buyer-side levers that keep generative-AI spend predictable.

By James Hill-Wood · Updated Jul 2024 · 7 min read · Azure AI cost cluster

Bottom line

Azure OpenAI is priced on two fundamentally different models, and choosing the wrong one is the most common cause of runaway generative-AI cost. Pay-as-you-go per-token pricing wins for pilots and spiky, low-volume workloads; Provisioned Throughput Units win only when a high, steady stream of tokens runs around the clock. The right answer is almost always a blend — commit the proven steady baseline to a monthly or annual reservation, leave overflow on consumption, and recalibrate quarterly.

01 Key findings

Two pricing models, not two tiers of one. Pay-as-you-go bills per input and output token with no floor; Provisioned Throughput reserves dedicated capacity (PTUs) that bills continuously whether or not you use it.
Azure OpenAI has no seat count. Unlike a per-user SKU such as Microsoft 365 E5, cost scales with token volume and throughput profile. The procurement question is not "how many licences" but "which model, at what committed level, on what reservation term."
PTU economics are a function of utilisation. Reserved capacity beats consumption only above a break-even utilisation. Size PTUs against sustained peak throughput, not theoretical maximum demand — idle PTUs are billed in full.
Model choice is the largest lever. Smaller, efficient models cost materially less per token than frontier models; matching model to task beats defaulting to the most capable model everywhere.
AI draws down your MACC. Azure OpenAI consumption counts against your Microsoft Azure Consumption Commitment (MACC), so fast AI growth can be used against you at renewal. Treat it as a managed line, not an unmanaged cost.

02 The two pricing models

Pay-as-you-go (standard or consumption pricing) bills for the tokens your application sends to and receives from the model. Input tokens (your prompt) and output tokens (the response) are priced separately, and rates differ by model family and version. There is no floor and no commitment: a workload that stops generating stops billing. That elasticity is why almost every deployment begins here.
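As a minimal sketch of the billing arithmetic described above, the function below prices input and output tokens separately. The per-million-token rates are placeholders chosen for illustration, not published Azure prices, and the function name is hypothetical.

```python
def consumption_cost(input_tokens: int, output_tokens: int,
                     input_rate_per_1m: float, output_rate_per_1m: float) -> float:
    """Pay-as-you-go charge: input and output tokens are billed at separate rates."""
    input_cost = input_tokens / 1_000_000 * input_rate_per_1m
    output_cost = output_tokens / 1_000_000 * output_rate_per_1m
    return input_cost + output_cost


# Placeholder rates (assumed, not real prices): 5.00 per 1M input, 15.00 per 1M output.
active_month = consumption_cost(40_000_000, 10_000_000, 5.00, 15.00)
# No floor and no commitment: a workload that stops generating stops billing.
idle_month = consumption_cost(0, 0, 5.00, 15.00)
print(f"active month: {active_month:.2f}, idle month: {idle_month:.2f}")
```

Substitute the current rates for the model and version you actually deploy before using the figures for planning.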

Provisioned Throughput reserves a fixed amount of model-processing capacity dedicated to your deployment. You buy PTUs, and that capacity delivers a predictable throughput ceiling (tokens per minute) with more consistent latency than the shared consumption pool. Because the capacity is reserved, you pay for it continuously for the term, regardless of how much you actually use.

Dimension	Pay-as-you-go (consumption)	Provisioned Throughput (PTU)
Billing basis	Per input and output token used	Per reserved PTU for the term
Commitment	None	Hourly, monthly, or annual
Cost when idle	Zero	Full reserved capacity still billed
Latency profile	Shared pool, more variable	Dedicated, more consistent
Best fit	Pilots, spiky or low volume	High, steady production throughput
Discount lever	None (rate fixed by model)	Monthly / annual reservation discount

03 Token rates & model choice

On consumption pricing the per-token rate is set by the model, not negotiated. Output tokens are priced higher than input tokens, and the largest frontier models cost several times more per token than smaller, efficient variants. Because rate is fixed, the only consumption-side lever is which model you run — and many workloads run perfectly well on a smaller model once prompts are tuned.

Model tier	Relative token rate	Input vs output	Typical use
Frontier (largest reasoning models)	Highest per token	Output priced well above input	Complex reasoning, low volume
Balanced (general-purpose)	Mid	Output above input	Mainstream production workloads
Small / efficient (mini class)	Lowest per token	Output above input	High-volume, routine tasks

Highest-return optimisation

Match the model to the task. Defaulting to the most capable model everywhere is the single most expensive habit in generative-AI deployment. Rate is fixed by model, so moving routine, high-volume calls to a smaller tier cuts cost directly with no negotiation required.
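The effect of moving routine calls to a smaller tier can be sketched as follows. The two tiers, their rates, and the 80% routine share are assumptions for illustration only; real rates differ by model family and version.

```python
# Hypothetical (input, output) rates per 1M tokens. These are not published prices.
RATES = {"frontier": (10.00, 30.00), "small": (0.50, 1.50)}


def tier_cost(tier: str, input_tokens: float, output_tokens: float) -> float:
    """Consumption cost of running the given token volume on one model tier."""
    input_rate, output_rate = RATES[tier]
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate


def routed_cost(input_tokens: float, output_tokens: float, routine_share: float) -> float:
    """Send `routine_share` of the traffic to the small tier and the rest to frontier."""
    small = tier_cost("small", input_tokens * routine_share, output_tokens * routine_share)
    frontier = tier_cost("frontier", input_tokens * (1 - routine_share),
                         output_tokens * (1 - routine_share))
    return small + frontier


all_frontier = routed_cost(100_000_000, 20_000_000, routine_share=0.0)
mostly_small = routed_cost(100_000_000, 20_000_000, routine_share=0.8)
print(f"all frontier: {all_frontier:.2f}, 80% routed to small: {mostly_small:.2f}")
```

The sketch ignores quality differences between tiers; in practice the routine share is whatever portion of traffic still meets your accuracy bar on the smaller model.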

04 PTUs & reservation terms

A PTU is a unit of dedicated capacity. The number a deployment needs depends on the model, prompt size, generation length, and the calls-per-minute you must sustain at peak. Microsoft publishes minimum PTU deployment sizes per model and a capacity calculator that estimates PTUs for a given throughput target. Size against sustained peak throughput, not theoretical maximum demand — capacity that sits idle is pure waste.
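For a rough first pass before opening the capacity calculator, the sizing inputs above can be combined as in the sketch below. This is a back-of-envelope simplification, not Microsoft's method: the tokens-per-minute-per-PTU figure, the minimum deployment size, and the purchase increment are all hypothetical parameters that you would replace with the published values for your model, and the official calculator may weight prompt and generation tokens differently.

```python
import math


def estimate_ptus(peak_calls_per_min: int, prompt_tokens: int, generation_tokens: int,
                  tokens_per_min_per_ptu: float, min_ptus: int, increment: int) -> int:
    """Naive PTU estimate: sustained peak tokens per minute divided by per-PTU throughput,
    rounded up to the minimum deployment size and then to the purchase increment."""
    required_tpm = peak_calls_per_min * (prompt_tokens + generation_tokens)
    raw = math.ceil(required_tpm / tokens_per_min_per_ptu)
    if raw <= min_ptus:
        return min_ptus
    steps = math.ceil((raw - min_ptus) / increment)
    return min_ptus + steps * increment


# All numbers below are assumed for illustration.
ptus = estimate_ptus(peak_calls_per_min=300, prompt_tokens=1500, generation_tokens=500,
                     tokens_per_min_per_ptu=2500, min_ptus=50, increment=50)
print(ptus)
```

Feed the function the sustained peak you have observed, not the theoretical maximum, so the estimate does not bake in idle capacity.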

Provisioned Throughput is bought on three commitment horizons. Hourly (on-demand provisioned) capacity spins up and down with no long-term lock-in; monthly and annual reservations trade flexibility for a lower effective hourly rate.

Commitment horizon	Lock-in	Effective rate	Best fit
Hourly / on-demand provisioned	None; scale up and down freely	Highest per PTU-hour	Predictable daily peaks, burst capacity
Monthly reservation	Full month, fixed PTU count	Discounted vs hourly	Established but evolving baselines
Annual reservation	Full year, fixed PTU count	Deepest discount vs hourly	Stable, around-the-clock production
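One way to compare hourly provisioned capacity with a monthly reservation is to find the number of running hours at which the two cost the same. The sketch below assumes a hypothetical hourly rate and monthly reservation price per PTU; neither is a published figure.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def hourly_provisioned_cost(ptus: int, hours_used: float, hourly_rate: float) -> float:
    """Hourly provisioned capacity bills only for the hours it is deployed."""
    return ptus * hours_used * hourly_rate


def reserved_monthly_cost(ptus: int, monthly_price_per_ptu: float) -> float:
    """A monthly reservation bills the full month for the fixed PTU count."""
    return ptus * monthly_price_per_ptu


def breakeven_hours(monthly_price_per_ptu: float, hourly_rate: float) -> float:
    """Running hours per month above which the reservation is cheaper than hourly."""
    return monthly_price_per_ptu / hourly_rate


# Assumed prices: 1.00 per PTU-hour on demand, 365.00 per PTU per month reserved.
crossover = breakeven_hours(365.00, 1.00)
business_hours = hourly_provisioned_cost(100, 8 * 30, 1.00)
around_the_clock = hourly_provisioned_cost(100, HOURS_PER_MONTH, 1.00)
reserved = reserved_monthly_cost(100, 365.00)
print(crossover, business_hours, around_the_clock, reserved)
```

With these assumed prices, capacity that runs only during business hours stays cheaper on hourly billing, while around-the-clock capacity favours the reservation.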

05 Cost at scale

Provisioned Throughput is only cheaper than consumption above a break-even utilisation of the reserved capacity. Bill the same tokens against a reservation and the effective cost per token falls as utilisation rises — below the crossover, idle capacity makes PTUs more expensive than paying per token. Illustrative effective cost per token on a reservation, relative to consumption, by utilisation:

Reservation utilisation	Effective cost per token vs consumption
25% utilised	~4× consumption
50% utilised	~2× consumption
75% utilised	near break-even
100% utilised	below consumption
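The shape of that curve follows from a simple model: the reservation cost is fixed, so the effective cost per token is inversely proportional to utilisation. The sketch below assumes a break-even at 75% utilisation, read off the illustrative figures; the function name is hypothetical, and under this model the exact multiples depend on the assumed break-even and need not match the rounded figures above.

```python
def relative_cost(utilisation: float, breakeven: float = 0.75) -> float:
    """Effective reserved cost per token as a multiple of the consumption rate.

    Assumes a fixed reservation cost and a consumption bill proportional to tokens,
    so the ratio is breakeven / utilisation (1.0 exactly at the break-even point).
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return breakeven / utilisation


for u in (0.25, 0.50, 0.75, 1.00):
    print(f"{u:.0%} utilised: {relative_cost(u):.2f}x consumption")
```

A ratio above 1 means the reservation costs more per token than pay-as-you-go; a ratio below 1 means the reservation is earning its discount.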

The crossover principle

Consumption wins when usage is spiky, low-volume, or still being validated. Provisioned Throughput wins when a workload runs a high, steady volume around the clock, at which point dedicated capacity costs less per token and delivers steadier latency as a bonus. The question to model is not headline rate but: what fraction of the reserved capacity will actually be used?

06 The overcommit trap

The mistake that costs most

Committing to Provisioned Throughput before usage is understood. A reservation sized against optimistic projections that never materialise locks in idle capacity, and idle PTUs are billed in full for the entire term. The trap also runs the other way: leaving a high-volume, around-the-clock production workload on pure consumption forever leaves the reservation discount on the table. Both are avoidable by validating real token volumes on consumption first, then committing only the proven steady baseline.

07 Commitment sizing framework

Four steps drive the reservation decision. Work them in sequence before committing any PTUs.

Step 01

Validate on consumption

Build and run on pay-as-you-go first so you learn real token volumes and latency requirements before committing a dollar. No projection substitutes for observed usage.

Step 02

Separate baseline from overflow

Instrument usage to split steady, around-the-clock throughput from spiky peaks. Only the steady baseline is a candidate for reservation; overflow stays on consumption.

Step 03

Size PTUs to the baseline

Reserve against the sustained peak of the steady portion, not the theoretical maximum. Model the reservation discount at realistic utilisation, never best case.

Step 04

Recalibrate quarterly

Revisit the consumption-plus-reservation split each quarter as volume grows or model choices change. The optimal blend moves; a set-and-forget reservation drifts into waste.
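Steps 02 to 04 can be sketched in a few lines. Everything here is an assumption for illustration: the baseline is taken as a low quantile of observed hourly throughput, and the 75% break-even and 120% growth trigger are hypothetical thresholds you would replace with your own.

```python
def split_baseline(hourly_tpm: list[int], quantile: float = 0.10) -> tuple[int, int, int]:
    """Split observed throughput into a steady baseline and spiky overflow.

    Returns (baseline level, volume under the baseline, volume above it).
    Using a low quantile as the baseline is an assumed heuristic.
    """
    ordered = sorted(hourly_tpm)
    baseline = ordered[int(quantile * (len(ordered) - 1))]
    steady = sum(min(x, baseline) for x in hourly_tpm)
    overflow = sum(max(0, x - baseline) for x in hourly_tpm)
    return baseline, steady, overflow


def recalibrate(reserved_tpm: float, observed_baseline_tpm: float,
                breakeven: float = 0.75, grow_trigger: float = 1.2) -> str:
    """Quarterly check: compare the observed steady baseline with reserved capacity."""
    utilisation = observed_baseline_tpm / reserved_tpm
    if utilisation < breakeven:
        return "shrink"  # reservation is below break-even and drifting into waste
    if utilisation > grow_trigger:
        return "grow"    # steady demand has outgrown the reservation
    return "hold"


samples = [100, 120, 110, 400, 105, 115, 900, 100]  # made-up tokens-per-minute samples
baseline, steady, overflow = split_baseline(samples)
print(baseline, steady, overflow, recalibrate(reserved_tpm=100, observed_baseline_tpm=90))
```

Only the steady volume is a reservation candidate; the overflow volume stays on consumption, and the quarterly check tells you when the split has moved.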

08 Buyer-side cost levers

Several levers control Azure OpenAI spend independent of the pricing model. Model choice is the largest — smaller models cost materially less per token. Prompt and output engineering is next: because you pay per token, trimming system prompts, capping output length, and dropping unnecessary context all cut cost directly. Caching of repeated input context, where supported, reduces tokens re-billed on every call, and batch processing is typically priced below real-time calls for latency-tolerant work such as document classification.
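The caching and batch levers reduce to simple arithmetic, sketched below. The cached share, the discount on cached input, and the batch discount are hypothetical values; check which models support each feature and at what rate before relying on them.

```python
def input_cost_with_caching(input_tokens: int, rate_per_1m: float,
                            cached_share: float, cached_discount: float) -> float:
    """Input cost when `cached_share` of input tokens is billed at a discounted rate."""
    full_price = input_tokens * (1 - cached_share) / 1e6 * rate_per_1m
    cached = input_tokens * cached_share / 1e6 * rate_per_1m * (1 - cached_discount)
    return full_price + cached


def batch_cost(realtime_cost: float, batch_discount: float) -> float:
    """Cost of the same work submitted as batch, for latency-tolerant jobs."""
    return realtime_cost * (1 - batch_discount)


# Assumed inputs: 50M input tokens at 5.00 per 1M, 60% cached at a 50% discount.
uncached = input_cost_with_caching(50_000_000, 5.00, cached_share=0.0, cached_discount=0.5)
cached = input_cost_with_caching(50_000_000, 5.00, cached_share=0.6, cached_discount=0.5)
batched = batch_cost(cached, batch_discount=0.5)
print(f"uncached {uncached:.2f}, cached {cached:.2f}, cached and batched {batched:.2f}")
```

Trimming prompts and capping output length act on the token counts themselves, so they compound with both discounts.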

Commitment structure is the lever procurement owns. Azure OpenAI consumption draws down against your Microsoft Azure Consumption Commitment (MACC), so generative-AI spend counts toward the cloud commitment you have already negotiated. That linkage matters at renewal: a fast-growing AI workload can consume a commitment faster than forecast, and Microsoft will use AI growth as a reason to push for a larger forward commitment. Our Microsoft licensing experts model AI consumption against the wider Azure and Microsoft 365 estate so AI growth strengthens, rather than weakens, your position.
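To see how quickly a growing AI workload can exhaust a commitment, a simple projection is enough. The sketch below assumes a constant month-on-month growth rate and treats AI as the only spend drawing on the remaining balance; the figures and function name are hypothetical.

```python
from typing import Optional


def months_until_exhausted(remaining_commitment: float, monthly_spend: float,
                           monthly_growth: float, horizon_months: int = 60) -> Optional[int]:
    """Return the month in which cumulative spend reaches the remaining commitment,
    or None if the commitment outlasts the horizon."""
    remaining = remaining_commitment
    spend = monthly_spend
    for month in range(1, horizon_months + 1):
        remaining -= spend
        if remaining <= 0:
            return month
        spend *= 1 + monthly_growth
    return None


# Assumed figures: 1,000,000 of commitment left, 50,000 of AI spend per month today.
flat = months_until_exhausted(1_000_000, 50_000, monthly_growth=0.0)
growing = months_until_exhausted(1_000_000, 50_000, monthly_growth=0.10)
print(flat, growing)
```

Comparing the flat and growing cases shows how much earlier the renewal conversation arrives if AI growth goes unmanaged.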

Distinguish Azure OpenAI Service (metered infrastructure you build on) from packaged, per-user AI assistants such as Microsoft 365 Copilot or Security Copilot. The former is consumption; the latter are seat-based SKUs. Many enterprises run both, and the licensing logic differs — a point we develop in the complete Microsoft licensing guide.

09 Our recommendation

Stay on consumption

When usage is unproven or spiky

Pilots, low-volume, and bursty workloads belong on pay-as-you-go. There is no floor, so you pay only for what you use — and you gather the real token data that any future reservation must be sized against.

Reserve the baseline

When throughput is high and steady

Once a workload runs a proven, around-the-clock baseline above the utilisation break-even, convert that steady portion to a monthly or annual PTU reservation and leave spiky overflow on consumption. Capture the discount without paying for idle peaks.

Optimise the model first

Before you commit anything

Right-size the model, tune prompts and output length, and enable caching and batch where they fit. Cutting per-token cost lowers both the consumption bill and the number of PTUs any reservation needs.
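The three recommendations can be compared on cost with a small model. The sketch below uses a single blended per-token rate and fixed reservation costs, all of which are assumed values rather than quoted prices, and the option labels are ours.

```python
def compare_options(baseline_tokens: int, overflow_tokens: int, rate_per_1m: float,
                    baseline_reservation_cost: float,
                    peak_reservation_cost: float) -> tuple[str, dict[str, float]]:
    """Monthly cost of three purchasing options, plus the name of the cheapest one."""
    options = {
        "consumption only": (baseline_tokens + overflow_tokens) / 1e6 * rate_per_1m,
        "reserve to peak": peak_reservation_cost,
        "reserve baseline, overflow on consumption":
            baseline_reservation_cost + overflow_tokens / 1e6 * rate_per_1m,
    }
    cheapest = min(options, key=options.get)
    return cheapest, options


# Assumed figures: steady production workload with a large proven baseline.
steady_choice, steady_costs = compare_options(800_000_000, 200_000_000, 10.00,
                                              baseline_reservation_cost=6_000,
                                              peak_reservation_cost=15_000)
# Assumed figures: small, spiky workload where the same reservation would sit idle.
spiky_choice, spiky_costs = compare_options(100_000_000, 200_000_000, 10.00,
                                            baseline_reservation_cost=6_000,
                                            peak_reservation_cost=15_000)
print(steady_choice, "|", spiky_choice)
```

Under these assumptions the blend wins for the steady workload and pure consumption wins for the spiky one, which is the pattern the recommendations describe.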

Keep your Azure AI spend predictable

Our AI Procurement Advisory practice models consumption against Provisioned Throughput, sizes reservations to real baselines, and folds AI growth into your wider Microsoft commitment.
