Steering Language Models to Follow JSON Instructions

Mar 17, 2026


How we used activation steering, a lightweight, inference-time technique, to make language models reliably output valid JSON, without retraining or prompt tricks.

Introduction

If you’ve ever built a production system on top of a large language model, you’ve hit this wall: you ask the model to return JSON, and it responds with something almost right. Maybe it wraps the JSON in a conversational intro (“Sure! Here’s the JSON:”). Maybe it drops a closing brace. Maybe it hallucinates a trailing comma that breaks your parser. The model understood the request; it just didn’t follow through cleanly.

This is the format adherence problem, and it’s one of the biggest gaps between an LLM demo and an LLM product. Models are trained to be helpful conversationalists, not strict data serializers. When you ask for JSON, you’re fighting against that training.

The common fixes all come with costs. Few-shot prompting eats up your context window. Constrained decoding adds latency. Retry loops are wasteful and unpredictable. Fine-tuning is expensive and locks you into a specific behavior permanently.

This article presents a different approach: Activation Steering. The idea is simple but powerful: format instructions like “output as JSON” aren’t just words the model reads. They create a specific, measurable shift inside the model’s internal representations. If we can capture that shift, we can replay it directly during inference, making the model produce JSON without ever asking it to in the prompt.

In our experiments on Phi-3B, this technique boosted JSON adherence by over 20 percentage points from baseline, and pushed already-prompted outputs from ~70% to ~87% validity. No fine-tuning. No prompt engineering. Just one vector addition per forward pass.

What Is Instruction Steering?

Instruction steering is a way to control what a language model does by modifying its internal computations, not by changing the prompt and not by retraining the model.

The core insight is that when a model processes an instruction like “Format the output as JSON,” it doesn’t store that instruction as a rule. Instead, it creates an internal activation pattern: a directional shift in its hidden states that biases everything the model generates afterward toward JSON structure. That pattern turns out to be consistent across different prompts, independent of the actual topic, and extractable with basic linear algebra.

Once extracted, this pattern becomes a steering vector: a compact representation of the instruction’s effect that we can inject directly into the model’s forward pass at inference time. The model then behaves as if it had seen the instruction, even though no instruction appeared in the prompt.

How It Compares to Other Approaches

Comparison: Prompt Engineering vs. Fine-Tuning vs. Activation Steering

Prompt Engineering works by writing instructions in natural language. It’s flexible, but it’s also unreliable: the model might follow the instruction, partially follow it, or ignore it entirely. And the more constraints you stack into a prompt, the more likely the model is to drop some of them.

Fine-Tuning modifies the model’s weights through additional training. It’s effective, but expensive to run, hard to reverse, and difficult to compose. Teaching a model to handle JSON and be concise and use formal tone requires careful multi-objective training; you can’t just layer independent fine-tunes.

Activation Steering sits in a different category entirely. It intervenes during the forward pass by adding a pre-computed vector to the model’s hidden states. There’s no retraining, no architecture changes, and no dependence on how the prompt is worded. The intervention is lightweight, fully reversible, and naturally composable: in principle, you can apply multiple steering vectors for different behaviors at the same time.

Modern transformers are built around a residual stream: a shared representation that flows through every layer of the model. Each layer reads from this stream, does its computation, and adds its result back. This additive structure is exactly what makes steering possible: adding a small vector to the stream integrates naturally, just like another layer’s contribution.

The residual stream is also high-dimensional (4,096 dimensions in Phi-3B), which means there’s plenty of room for different behavioral directions (JSON formatting, tone, verbosity) to coexist without stepping on each other.

Here’s the key reframe: when the model reads “Format the output as JSON” in a prompt, it converts that text into an internal representational shift. Instruction steering replays that exact shift without the text. Instead of hoping the model interprets the instruction correctly through the noisy channel of language comprehension, we inject the result directly.

This turns instruction-following from a language understanding problem into a geometry problem, and geometry is much more predictable.

Defining the Instruction: JSON

We chose “Format the output as JSON” as our target instruction for several practical reasons.

First, JSON adherence is binary and machine-checkable. An output either parses as valid JSON or it doesn’t; there’s no subjective judgment involved. This gives us a clean, unambiguous evaluation metric.

Second, JSON is the most commonly requested structured format in production LLM systems. APIs, agent frameworks, tool-calling pipelines: they all depend on valid JSON. A reliable fix for this problem has immediate, real-world value.

Third, the JSON instruction is purely about format, not content. “Output as JSON” can be applied to any topic (sports stats, cooking recipes, scientific explanations) without changing what the model needs to say, only how it structures the response. This content-independence is what makes it possible to extract as a single steering vector.

Our working assumption, validated by the results, is that the model’s internal representation of “I should output JSON” occupies its own direction in activation space, separate from whatever the model is actually talking about.

Building Paired Queries (x, x⁺)

The first step in extracting a steering vector is building a dataset of minimal contrast pairs: inputs that are identical except for the presence of the JSON instruction.

Building Paired Queries: Isolating the Format Signal

For each base query, we create two versions:

| Variant | Example |
| --- | --- |
| Base query (x) | “List some facts about Lionel Messi.” |
| Instructed query (x⁺) | “List some facts about Lionel Messi. Format the output as JSON.” |

The semantic content is the same in both. The only difference is the appended instruction. This means that any difference we observe in the model’s internal activations between processing x and x⁺ must come from the model’s representation of the JSON instruction, not from any change in topic, complexity, or phrasing.

We build hundreds of these pairs across diverse topics to make sure the signal we extract is generalizable: a universal “output JSON” direction, not an artifact of any particular subject matter.
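As a concrete sketch, the pair construction is just string concatenation. The example topics below are illustrative, not our actual dataset:

```python
# Minimal contrast pairs: identical queries, with and without the instruction.
INSTRUCTION = "Format the output as JSON."

def build_pairs(base_queries):
    """Return (x, x_plus) tuples that differ only by the appended instruction."""
    return [(q, f"{q} {INSTRUCTION}") for q in base_queries]

pairs = build_pairs([
    "List some facts about Lionel Messi.",
    "Explain how sourdough bread rises.",
])
```

Because each pair shares every token except the instruction, any activation difference between the two variants can be attributed to the instruction alone.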

Extracting Residual Stream Activations

With paired queries in hand, the next step is to look inside the model and capture exactly what changes when the JSON instruction is present. This is where we move from experimental design into the mechanics of transformer internals.

The Forward Pass

We feed each query, in both its base and instructed versions, through Phi-3B in inference mode with no gradient computation. As the input tokens flow through the model’s 32 transformer layers, each layer reads from the residual stream, applies its attention and feed-forward computations, and writes its output back. The residual stream is a running sum: every layer’s contribution gets added to what came before, building an increasingly rich representation of the input.

To capture what’s happening at each layer, we register forward hooks: lightweight callback functions that intercept the residual stream tensor as it passes through each transformer block. These hooks don’t modify anything during extraction; they simply copy the activation tensor and store it. By the time the forward pass completes, we have a complete snapshot of the residual stream at every layer for that input.

Why the Last Token Position

Phi-3B is an autoregressive model, meaning each token can only attend to tokens that came before it. This creates an information funnel: the very last token in the input sequence has the richest representation because it has attended to every preceding token through the causal attention mask. By the time the model reaches the final token position, it has processed the entire query (including the format instruction, if one is present) and compressed its “plan” for the upcoming generation into that single position’s activation vector.

This is why we extract activations specifically at the last token position. It’s not an arbitrary choice; it’s where the model’s understanding of the complete input is most concentrated. Think of it as the model’s internal summary of everything it has read, encoding both what the user asked about and how it intends to structure the response.

The Extraction Pipeline

In concrete terms, for each query in our dataset, the pipeline works as follows:

  1. Tokenize the query into input IDs and feed it into the model.
  2. Hook every layer to capture the residual stream tensor. Each hook sees a tensor of shape [1, seq_len, 4096]: one vector per token position, per layer.
  3. Slice the last token: from each layer’s captured tensor, extract the vector at position seq_len - 1, yielding a single vector of shape [4096].
  4. Store this vector, indexed by query ID and layer number.
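The pipeline above can be sketched with PyTorch forward hooks. To keep the example self-contained and runnable, a tiny stack of linear layers stands in for Phi-3B's 32 transformer blocks, and all sizes are toy values:

```python
import torch
import torch.nn as nn

# Toy stand-in for Phi-3B: 4 linear "blocks" over hidden size 8.
# (In the real setup, hooks sit on the decoder blocks, whose output is the
# updated residual stream; the shapes and mechanics are the same.)
HIDDEN, N_LAYERS, SEQ_LEN = 8, 4, 5
layers = nn.ModuleList([nn.Linear(HIDDEN, HIDDEN) for _ in range(N_LAYERS)])

captured = {}  # layer index -> last-token activation vector

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output: [1, seq_len, hidden]; keep only the last token position
        captured[layer_idx] = output[0, -1, :].detach().clone()
    return hook

for i, layer in enumerate(layers):
    layer.register_forward_hook(make_hook(i))

with torch.no_grad():
    h = torch.randn(1, SEQ_LEN, HIDDEN)  # stand-in for token embeddings
    for layer in layers:
        h = h + layer(h)  # additive residual stream
```

After one forward pass, `captured` holds one [hidden]-sized vector per layer, exactly the per-layer, last-token snapshot the pipeline describes.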

We repeat this for every base query x and every instructed query x⁺ in our paired dataset. The result is a structured collection of activation vectors:

$$h_{i,l} \in \mathbb{R}^{4096} \quad \text{— activation for base query } x_i \text{ at layer } l$$

$$h^{+}_{i,l} \in \mathbb{R}^{4096} \quad \text{— activation for instructed query } x^{+}_i \text{ at layer } l$$

For N query pairs and L layers (32 in Phi-3B), we end up with 2 × N × L vectors in total: a comprehensive map of how the model’s internal representations differ between base and instructed inputs at every depth of processing. The difference between \( h^{+}_{i,l} \) and \( h_{i,l} \), for the same query content at the same layer, isolates exactly what the JSON instruction added to the model’s representation. Everything else (the topic, the phrasing, the complexity) is held constant by the paired design.

Computing the JSON Steering Vector

Now we get to the core operation: turning those paired activations into a single steering vector.

The Math

For each layer l, we compute:
$$v_l = \mathbb{E}_i\left[h^{+}_{i,l}\right] - \mathbb{E}_i\left[h_{i,l}\right]$$

In plain terms: take the average activation with the JSON instruction, subtract the average activation without it. By averaging across hundreds of diverse query pairs, we cancel out all the content-specific signals (Messi, bread, quantum physics) and isolate the format-only direction the consistent shift that the JSON instruction creates.

We then normalize to unit length:

$$\hat{v}_l = \dfrac{v_l}{\lVert v_l \rVert}$$

This separates the direction of the effect from its magnitude, so we can control intensity independently through a scaling parameter during inference.

The resulting \( \hat{v}_l \) is a single tensor matching the model’s hidden size (4,096 for Phi-3B). We store one per layer, but in practice we only use the one at the optimal layer (more on that below).
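In code, the extraction is a mean difference followed by a normalization. The arrays below are synthetic stand-ins for one layer's captured activations, constructed so the "instructed" set is the base set shifted along a known direction:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 16                                    # 100 pairs, toy hidden size
shift = np.zeros(D)
shift[0] = 1.0                                    # a planted "JSON direction"
h_base = rng.normal(size=(N, D))                  # h_{i,l}
h_inst = h_base + shift + 0.01 * rng.normal(size=(N, D))  # h+_{i,l}

v = h_inst.mean(axis=0) - h_base.mean(axis=0)     # v_l: difference of means
v_hat = v / np.linalg.norm(v)                     # unit-length steering vector
```

Averaging before subtracting is what cancels the content-specific signal; in this synthetic example, the recovered `v_hat` points almost exactly along the planted direction.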

When we project this vector onto the model’s vocabulary space to see which tokens it aligns with, the results are telling:

The top-aligned tokens are all structural JSON tokens: {, }, ", :, json, object, [. No content words anywhere near the top. This confirms the vector is a pure format signal.
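The projection itself amounts to multiplying the vector by the model's unembedding matrix and reading off the top-scoring tokens. A toy NumPy sketch, where the vocabulary and matrix are fabricated purely for illustration:

```python
import numpy as np

# Fabricated mini-vocabulary and unembedding matrix: one direction per token.
vocab = ["{", "\"", ":", "Messi", "bread"]
W_U = np.eye(5)

# A steering direction aligned with the structural tokens (illustrative).
v_hat = np.array([0.9, 0.5, 0.2, 0.0, 0.0])
v_hat = v_hat / np.linalg.norm(v_hat)

logits = W_U @ v_hat                              # alignment per token
top_tokens = [vocab[i] for i in np.argsort(-logits)[:3]]
```

In the real analysis, `W_U` is the model's own output embedding matrix, and the top-aligned tokens are exactly the structural JSON tokens listed above.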

Applying Steering During Inference

With the steering vector extracted and the optimal layer identified, the inference-time intervention is remarkably simple to implement. But understanding exactly how it integrates with the model’s forward pass builds important intuition about why it works and where it can go wrong.

Step-by-Step: What Happens During a Steered Forward Pass

Here’s the complete flow when a prompt enters the model with steering active:

  1. Tokenization: The prompt is converted into input IDs; no JSON instruction appears anywhere in the tokens.
  2. Layers 0–5 (unmodified): The input passes through the first six layers normally. Each layer applies multi-head self-attention and a feed-forward network, adding its output to the residual stream. By Layer 5, the model has built a rich contextual representation of the input.
  3. Layer 6 (intervention point): Layer 6 computes its normal output. Then our hook fires. The steering vector \( \hat{v}_6 \), scaled by \( \alpha \), is added:
    \( h'_6 = h_6 + \alpha \cdot \hat{v}_6 \)
  4. Layers 7–31 (downstream propagation): The modified activation \( h'_6 \) flows through the remaining 25 layers. These layers don’t “know” anything was changed; they process \( h'_6 \) using their existing weights exactly as they would any other activation. But because \( h'_6 \) now carries the directional signature of the JSON instruction, every downstream computation is subtly biased toward JSON-structured output.
  5. Output head: The final layer’s output feeds into the language model head, which converts the 4096-dimensional hidden state into a probability distribution over the entire vocabulary. The steering has shifted this distribution: structural tokens like {, ", and : now receive higher probabilities than they otherwise would.
  6. Autoregressive generation: The model samples the first output token (often {), appends it to the context, and repeats the forward pass for each subsequent token, with the steering hook active on every pass, continuously reinforcing the JSON structure.
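A minimal sketch of the intervention, again with a toy layer stack in place of Phi-3B (the steering vector and all sizes are placeholders): a PyTorch forward hook that returns a tensor replaces the layer's output, which is exactly how the addition reaches every downstream layer.

```python
import torch
import torch.nn as nn

HIDDEN, SEQ_LEN, STEER_LAYER, ALPHA = 8, 5, 1, 3.0
torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(HIDDEN, HIDDEN) for _ in range(4)])
v_hat = torch.zeros(HIDDEN)
v_hat[0] = 1.0                                   # placeholder unit vector

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output,
    # so the shift propagates through the residual stream to all later layers.
    return output + ALPHA * v_hat

handle = layers[STEER_LAYER].register_forward_hook(steering_hook)

def forward(h):
    with torch.no_grad():
        for layer in layers:
            h = h + layer(h)                     # additive residual stream
    return h

x = torch.randn(1, SEQ_LEN, HIDDEN)
steered = forward(x)
handle.remove()                                  # steering is fully reversible
unsteered = forward(x)
```

Removing the hook restores default behavior instantly, which is the reversibility property discussed below.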

Steering adds negligible overhead. The forward pass through all 32 layers happens exactly as it normally would; there’s no second pass, no gradient computation, and no optimization loop. The only addition is a single vector addition (4,096 floating-point adds) at one layer per forward pass. In wall-clock terms, this amounts to less than 0.1% overhead on a typical inference call. Compare this to constrained decoding, which must validate token candidates against a grammar at every generation step, or retry loops, which may re-run the entire generation multiple times.

The Role of α

The scaling parameter α controls steering intensity, and getting it right matters for output quality. At low values (0.5–1.0), the nudge is gentle: the model occasionally produces JSON but isn’t strongly committed to it. This range is useful when you want to slightly increase JSON likelihood without forcing the format. At moderate values (2.0–4.0), the model reliably opens with { and maintains valid JSON structure while still producing meaningful, accurate content inside the fields. This is the sweet spot for most production use cases. At high values (above 6.0), over-steering becomes a problem: the model locks onto structure so aggressively that content quality degrades, producing syntactically perfect JSON with repetitive, shallow, or garbled values inside.

Here’s what this looks like end to end. The prompt has no JSON instruction at all:

“List some facts about Lionel Messi.”

Without steering, the model writes free-form paragraphs: natural and readable, but useless for an API that expects structured data. With the steering vector applied at Layer 6 (α = 3.0):

{
  "name": "Lionel Messi",
  "profession": "Professional Footballer",
  "nationality": "Argentine",
  "achievements": [
    "FIFA World Cup Winner (2022)",
    "7-time Ballon d'Or Winner",
    "All-time top scorer for Argentina"
  ]
}

Valid JSON. Semantically accurate. No prompt instruction needed. The model “decided” to use JSON structure because its internal state, after the Layer 6 intervention, was indistinguishable from a state where it had been explicitly asked for JSON.

Choosing the Optimal Steering Layer

Where you inject the steering vector matters a lot. Not all layers respond to steering the same way.

Early layers (0–3) handle low-level token and positional processing. Steering here is too abstract: the model hasn’t built enough understanding of the input yet for a format nudge to land properly.

Late layers (18+) are close to the final output. By this point, the model has already committed to most of its generation decisions. Forcing a format change here overrides those decisions destructively, leading to garbled output and broken fluency.

The sweet spot is in the middle-early layers, where the model has understood the input but hasn’t locked in its generation strategy yet. To find it, we run a sweep: apply the steering vector at each layer independently and measure JSON validity (what percentage of outputs parse as valid JSON?) and fluency (does the output still read coherently, measured via perplexity?). The optimal layer maximizes validity while keeping fluency intact.

For Phi-3B, Layer 6 is the clear winner. It produces the highest JSON adherence gains with negligible impact on output quality. Layers beyond 18 consistently hurt both accuracy and readability.
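The sweep logic itself is simple. Here is a sketch in which `evaluate_steering_at` is a hypothetical stand-in for running steered generations at one layer and scoring them; its placeholder scores merely mimic the shape of our findings (validity peaking in the middle-early layers, fluency collapsing late):

```python
def evaluate_steering_at(layer):
    """Hypothetical stand-in: returns (json_validity, fluency) for one layer."""
    validity = max(0.0, 1.0 - abs(layer - 6) / 10)   # placeholder: peaks at 6
    fluency = 1.0 if layer < 18 else 0.3             # placeholder: late layers hurt
    return validity, fluency

def pick_layer(n_layers=32, min_fluency=0.8):
    """Sweep all layers; keep only fluent ones, then maximize validity."""
    scored = []
    for layer in range(n_layers):
        validity, fluency = evaluate_steering_at(layer)
        if fluency >= min_fluency:                   # discard layers that wreck fluency
            scored.append((validity, layer))
    return max(scored)[1]                            # highest-validity survivor

best_layer = pick_layer()
```

In a real calibration, the scoring function would parse generated outputs for validity and measure perplexity for fluency; only the selection logic carries over.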

Dataset and Experimental Setup

To test the steering approach rigorously, we built a clean experimental framework designed to answer one central question: does injecting a steering vector produce meaningful, reliable improvements in JSON adherence and does it do so without damaging the quality of the model’s answers?

Our evaluation data comes from two complementary sources. The first is IFEval, a widely used benchmark for instruction-following evaluation, from which we extract the subset of tasks that specifically require JSON-formatted output. IFEval gives us a standardized, community-recognized test bed. The second source is a collection of synthetic prompt pairs that we generate ourselves: diverse base queries spanning topics from sports biographies to scientific explanations, each paired with a JSON-instructed variant. This synthetic set ensures we’re testing generalization across a wide range of content, not just the narrow slice that IFEval covers.

For metrics, we prioritize two dimensions. JSON Validity is measured by passing every model output through a strict JSON parser; the output either parses cleanly or it doesn’t. There’s no partial credit for an output that’s almost valid but has a trailing comma or a missing closing brace. This binary metric is deliberately unforgiving because production JSON consumers (APIs, pipelines, databases) are equally unforgiving. The second metric is Answer Quality, which ensures that steering doesn’t improve formatting at the cost of content. We evaluate outputs for factual accuracy and coherence using automated rubric scoring, checking that the model still provides substantive, relevant answers inside the JSON structure.
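The validity metric is deliberately trivial to implement, which is part of its appeal. A sketch of the strict, binary check:

```python
import json

def is_valid_json(text: str) -> bool:
    """Strict, binary check: the output parses cleanly or it doesn't."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def adherence(outputs):
    """Fraction of outputs that parse as valid JSON."""
    return sum(is_valid_json(o) for o in outputs) / len(outputs)
```

A trailing comma or a conversational preamble fails the check outright, mirroring how a production consumer would fail.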

To isolate what each component contributes, we run the model under four configurations. The Baseline uses no instruction and no steering; this tells us how often the model spontaneously outputs valid JSON. Instruction Only adds “Format the output as JSON” to the prompt but applies no steering, establishing what prompt engineering alone achieves. Steering Only applies the steering vector with no mention of JSON in the prompt, testing whether the vector works independently. Finally, Instruction + Steering combines both, revealing whether the two approaches are complementary or redundant.

Results

JSON Adherence

The results tell a clear story.

JSON Adherence Results Comparison

Without any prompt instruction, baseline JSON adherence is roughly 10%: the model occasionally produces JSON-like output by chance, but the vast majority of responses are plain text. Applying the steering vector bumps this to approximately 30%, a +20.16 percentage point absolute improvement. The model is producing valid JSON three times as often, and it was never asked to.

With a prompt instruction (“Format the output as JSON”), baseline adherence is about 69.86%: the model follows the instruction most of the time, but still fails about 30% of the time. Adding steering on top pushes this to roughly 85–90% valid JSON, a +15–20 percentage point gain. Steering and prompting are complementary: the vector catches the cases where the model would have otherwise fumbled the instruction.

Projecting the steering vector onto the vocabulary space confirms it encodes format, not content. The highest-aligned tokens are structural: {, }, ", json, object, :. Semantic content words show no meaningful alignment. The vector is a clean format signal.

When we visualize the activations in a 2D projection, a striking pattern emerges:

JSON-instructed activations cluster tightly together, regardless of the topic. Non-instructed activations are scattered across the space. This tells us that the JSON instruction creates a consistent, separable geometric signature inside the model, which is exactly why a simple vector subtraction can capture it.

Steering isn’t free of side effects, but the trade-offs are manageable. At moderate \( \alpha \) values, we occasionally see minor artifacts: an extra closing brace, slight awkwardness in text values within the JSON, or minor redundancy. At very high \( \alpha \) values, the model over-commits to structure at the expense of content quality.

But within the optimal range, semantic correctness stays intact. The model still gives accurate, relevant answers, just wrapped in valid JSON. For production systems where JSON validity is a hard requirement, this is a very favorable trade-off.

Why This Works

The success of activation steering for JSON isn’t a lucky coincidence. It works because of how transformers process information.

Format instructions are global constraints. Unlike factual recall or reasoning, “output as JSON” applies uniformly to the entire response, regardless of content. This universality means it can be represented as a single, stable direction in activation space.

The residual stream is additive. Each layer’s contribution gets added to a running total. This means an external vector addition integrates naturally: the downstream layers process the modified representation using their existing weights, seamlessly adapting as if the shift had always been there.

Steering replays the instruction’s effect. When the model reads “Format the output as JSON” in a prompt, it converts that text into an internal representational shift. Our steering vector is a direct estimate of that same shift. Injecting it is functionally equivalent to the model having read the instruction, just without the noise and unreliability of language comprehension.

Limitations

While the results are promising, there are several important boundaries to this work that deserve honest acknowledgment.

The most obvious constraint is format scope. Everything in this article applies to JSON and only JSON. Each new target format (XML, YAML, Markdown tables, CSV) would require its own paired dataset, its own vector extraction pipeline, and its own layer optimization sweep. The underlying method generalizes in principle, but the practical work of building and validating each format vector is non-trivial. We haven’t yet tested whether formats with less rigid syntax (like Markdown) produce steering vectors that are as clean and separable as JSON’s.

There’s also the question of model specificity. Layer 6 is optimal for Phi-3B, but that number is an empirical finding, not a universal constant. A different model architecture, or even a different size within the same family, will almost certainly have a different optimal injection point. This means every new deployment target requires a calibration step: a layer sweep on held-out data to find the sweet spot for that particular model. The cost of this calibration is modest (a few hours of compute), but it’s a step that can’t be skipped.

Over-steering is a real risk in practice. When the scaling parameter \( \alpha \) is set too high, the model becomes obsessively structural: it produces syntactically perfect JSON, but the content inside the structure becomes repetitive, shallow, or irrelevant. Finding the right \( \alpha \) for a given use case requires experimentation, and the optimal value may differ depending on the complexity of the expected output. A system that generates simple key-value responses may tolerate stronger steering than one expected to produce deeply nested JSON with rich text values.

Finally, it’s critical to understand that steering is a formatting tool, not a reasoning enhancer. The steering vector ensures the output looks like valid JSON. It makes no guarantees about whether the content inside that JSON is accurate, complete, or relevant. A model that hallucinates facts will still hallucinate them, just in well-formatted braces and brackets. Steering should be viewed as one layer in a production stack, complementing (not replacing) techniques for factual grounding and content quality.

It’s also worth noting that independent research by Panahi (2026) tested activation steering for JSON generation and found the opposite result: steering reduced valid JSON from 86.8% to just 24.4%. The divergence comes down to three things: a much smaller model (0.5B vs. 3B), a significantly harder task (strict PII schema extraction vs. general fact formatting), and no layer sweep to find the optimal injection point. These aren’t contradictory findings; they define the boundary conditions of where this technique works and where it doesn’t.

What Comes Next

The results from this work suggest several natural extensions, each of which could significantly expand the practical utility of activation steering.

The most immediate next step is building a multi-format vector library. If we can extract a clean steering vector for JSON, the same pipeline should work for XML, Markdown, CSV, YAML, and other structured formats. The vision is a switchable library of pre-computed vectors that lets a single model deployment serve different output formats on demand: an API caller requests JSON, an analytics dashboard requests CSV, a documentation system requests Markdown, all handled by swapping one vector, with no prompt changes and no model reloading. This would dramatically simplify production architectures that currently maintain separate prompt templates or even separate fine-tuned models for each format.

Another promising direction is dynamic \( \alpha \) scheduling: varying the steering strength within a single generation rather than using a fixed value throughout. The intuition is that the opening tokens of a JSON response (the initial { or [) are the most critical for establishing structure, so early tokens would benefit from strong steering. Once the structural scaffold is in place, reducing \( \alpha \) mid-generation would give the model more freedom to focus on content quality, producing richer and more accurate values within the JSON fields. This is analogous to how curriculum learning varies difficulty during training, but applied at inference time.

The composability of steering vectors opens up an especially exciting possibility: combining multiple behavioral vectors into a single compound intervention. If JSON formatting, formal tone, and conciseness each correspond to independent directions in activation space (and our clustering analysis suggests they might), they could theoretically be summed into one composite vector that enforces all three constraints simultaneously. This would enable complex behavioral specifications (“respond in formal, concise JSON”) with a single vector addition, no compounding prompts required.

Beyond formatting, the same framework naturally extends to safety and policy steering. If harmful content, factual grounding, or policy compliance have their own recoverable directions in the residual stream, steering could provide a fast, reversible guardrail layer that operates beneath the prompt level. This would be particularly valuable for deployment scenarios where prompt injection is a concern, since activation-level interventions are much harder to circumvent than prompt-level instructions.

Finally, scaling to larger models is an open and important question. Our experiments use Phi-3B, a relatively compact model. Testing on architectures like LLaMA-3 (8B and 70B), Qwen-2, and Mixtral will reveal whether format instructions remain cleanly separable at much larger scales or whether the increased capacity and entanglement of larger models requires more sophisticated extraction techniques.

The Bottom Line

Instruction-following doesn’t have to depend on how well a model interprets natural language prompts. JSON formatting, one of the most practically important and most commonly broken output constraints, corresponds to a stable, recoverable direction in the model’s residual stream. By extracting that direction from paired activations and replaying it during inference, we achieve substantial, measurable gains: +20 percentage points from baseline, and another +15–20 percentage points on top of prompt instructions.

What makes this approach particularly compelling for production use is the combination of properties it offers. The computational cost is negligible: a single vector addition per forward pass, with no additional model calls, no retry loops, and no extra tokens in the context window. The intervention is fully reversible: remove the vector, and the model returns to its default behavior instantly, with no residual effects. Multiple steering vectors can be composed together, opening the door to simultaneous control over format, tone, and style through simple vector arithmetic. And because the technique operates at the activation level rather than the prompt level, it’s completely independent of how the prompt is worded: there are no fragile instruction templates to maintain, and no context tokens consumed by formatting directives.

As LLM-powered systems move deeper into production, powering APIs, agent workflows, and structured data pipelines, techniques like activation steering represent a fundamental shift in how we think about model control. Instead of crafting better prompts and hoping the model cooperates, we can engineer the model’s internal representations directly, with the precision and predictability that production systems demand.

The instruction doesn’t have to be in the prompt. It can be in the activations.

Built by Ionio

This research is part of the applied AI work we do every day at Ionio, an AI consulting firm that helps mid-market SaaS platforms build the one capability that makes them impossible to compare and harder to leave.

Activation steering didn’t come from a whitepaper we read. It came from solving a real production problem for a real client: their LLM pipeline was breaking on JSON formatting 30% of the time, and neither prompt engineering nor retry loops were cutting it. We went deeper into the model’s residual stream and built something that actually worked.

That’s how we operate. Our engineering team has published 100+ technical articles, shipped models on Hugging Face, and documented 30+ case studies across AI-powered features for platforms in e-commerce, subscriptions, marketing automation, and catalog management. When a platform needs AI that doesn’t just demo well but actually moves revenue, they call us.

Whether it’s activation-level interventions like the one in this article, custom retention engines that predict churn 60 days early, or AI enrichment pipelines that turn a flat product catalog into a competitive advantage, we find the wedge that separates you from the other nine platforms on the buyer’s shortlist. Then we build it, ship it, and make sure it pays for itself.

If this kind of applied AI research is relevant to what you’re building, we’d love to talk. Book a call with our team →


Behind the Blog

Ellendula Sai Manideep
Writer

Manideep is a Machine Learning Engineer at Ionio

Pranav Patel
Editor

Good boi. He is a good boi & does ML/AI. AI Lead.


The core insight is that when a model processes an instruction like “Format the output as JSON,” it doesn’t store that instruction as a rule. Instead, it creates an internal activation pattern: a directional shift in its hidden states that biases everything the model generates afterward toward JSON structure. That pattern turns out to be consistent across different prompts, independent of the actual topic, and extractable with basic linear algebra.

Once extracted, this pattern becomes a steering vector: a compact representation of the instruction’s effect that we can inject directly into the model’s forward pass at inference time. The model then behaves as if it had seen the instruction even though no instruction appeared in the prompt.

How It Compares to Other Approaches

Comparison: Prompt Engineering vs. Fine-Tuning vs. Activation Steering

Prompt Engineering works by writing instructions in natural language. It’s flexible, but it’s also unreliable: the model might follow the instruction, partially follow it, or ignore it entirely. And the more constraints you stack into a prompt, the more likely the model is to drop some of them.

Fine-Tuning modifies the model’s weights through additional training. It’s effective, but expensive to run, hard to reverse, and difficult to compose. Teaching a model to handle JSON and be concise and use formal tone requires careful multi-objective training; you can’t just layer independent fine-tunes.

Activation Steering sits in a different category entirely. It intervenes during the forward pass by adding a pre-computed vector to the model’s hidden states. There’s no retraining, no architecture changes, and no dependence on how the prompt is worded. The intervention is lightweight, fully reversible, and naturally composable: in principle, you can apply multiple steering vectors for different behaviors at the same time.

Modern transformers are built around a residual stream: a shared representation that flows through every layer of the model. Each layer reads from this stream, does its computation, and adds its result back. This additive structure is exactly what makes steering possible: adding a small vector to the stream integrates naturally, just like another layer’s contribution.

The residual stream is also high-dimensional (4,096 dimensions in Phi-3B), which means there’s plenty of room for different behavioral directions (JSON formatting, tone, verbosity) to coexist without stepping on each other.

Here’s the key reframe: when the model reads “Format the output as JSON” in a prompt, it converts that text into an internal representational shift. Instruction steering replays that exact shift without the text. Instead of hoping the model interprets the instruction correctly through the noisy channel of language comprehension, we inject the result directly.

This turns instruction-following from a language understanding problem into a geometry problem, and geometry is much more predictable.

Defining the Instruction: JSON

We chose “Format the output as JSON” as our target instruction for several practical reasons.

First, JSON adherence is binary and machine-checkable. An output either parses as valid JSON or it doesn’t; there’s no subjective judgment involved. This gives us a clean, unambiguous evaluation metric.

Second, JSON is the most commonly requested structured format in production LLM systems. APIs, agent frameworks, tool-calling pipelines: they all depend on valid JSON. A reliable fix for this problem has immediate, real-world value.

Third, the JSON instruction is purely about format, not content. “Output as JSON” can be applied to any topic (sports stats, cooking recipes, scientific explanations) without changing what the model needs to say, only how it structures the response. This content-independence is what makes it possible to extract as a single steering vector.

Our working assumption, validated by the results, is that the model’s internal representation of “I should output JSON” occupies its own direction in activation space, separate from whatever the model is actually talking about.

Building Paired Queries (x, x⁺)

The first step in extracting a steering vector is building a dataset of minimal contrast pairs: inputs that are identical except for the presence of the JSON instruction.

Building Paired Queries: Isolating the Format Signal

For each base query, we create two versions:

Base query (x): “List some facts about Lionel Messi.”
Instructed query (x⁺): “List some facts about Lionel Messi. Format the output as JSON.”

The semantic content is the same in both. The only difference is the appended instruction. This means that any difference we observe in the model’s internal activations between processing x and x⁺ must come from the model’s representation of the JSON instruction, not from any change in topic, complexity, or phrasing.

We build hundreds of these pairs across diverse topics to make sure the signal we extract is generalizable: a universal “output JSON” direction, not an artifact of any particular subject matter.
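In code, the paired dataset is a one-line transform over the base queries. This is a minimal sketch; the query list and the `INSTRUCTION` constant are illustrative stand-ins, not the exact dataset used in our experiments:

```python
# Minimal contrast pairs: identical queries with and without the
# JSON instruction appended. The topics here are illustrative stand-ins.
INSTRUCTION = "Format the output as JSON."

base_queries = [
    "List some facts about Lionel Messi.",
    "Explain how sourdough bread rises.",
    "Summarize the rules of chess.",
]

# Each pair (x, x_plus) differs only by the appended instruction, so any
# activation difference between them is attributable to the instruction.
pairs = [(q, f"{q} {INSTRUCTION}") for q in base_queries]
```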

Extracting Residual Stream Activations

With paired queries in hand, the next step is to look inside the model and capture exactly what changes when the JSON instruction is present. This is where we move from experimental design into the mechanics of transformer internals.

The Forward Pass

We feed each query, both the base version and the instructed version, through Phi-3B in inference mode with no gradient computation. As the input tokens flow through the model’s 32 transformer layers, each layer reads from the residual stream, applies its attention and feed-forward computations, and writes its output back. The residual stream is a running sum: every layer’s contribution gets added to what came before, building an increasingly rich representation of the input.

To capture what’s happening at each layer, we register forward hooks: lightweight callback functions that intercept the residual stream tensor as it passes through each transformer block. These hooks don’t modify anything during extraction; they simply copy the activation tensor and store it. By the time the forward pass completes, we have a complete snapshot of the residual stream at every layer for that input.

Why the Last Token Position

Phi-3B is an autoregressive model, meaning each token can only attend to tokens that came before it. This creates an information funnel: the very last token in the input sequence has the richest representation, because it has attended to every preceding token through the causal attention mask. By the time the model reaches the final token position, it has processed the entire query, including the format instruction if one is present, and compressed its “plan” for the upcoming generation into that single position’s activation vector.

This is why we extract activations specifically at the last token position. It’s not an arbitrary choice; it’s where the model’s understanding of the complete input is most concentrated. Think of it as the model’s internal summary of everything it has read, encoding both what the user asked about and how it intends to structure the response.

The Extraction Pipeline

In concrete terms, for each query in our dataset, the pipeline works as follows:

  1. Tokenize the query into input IDs and feed it into the model.
  2. Hook every layer to capture the residual stream tensor. Each hook sees a tensor of shape [1, seq_len, 4096]: one vector per token position, per layer.
  3. Slice the last token: from each layer’s captured tensor, extract the vector at position seq_len - 1, yielding a single vector of shape [4096].
  4. Store this vector, indexed by query ID and layer number.
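The hook-based capture can be sketched as follows. This is a toy stand-in, not Phi-3B: `ToyBlock`, the 8-dimensional hidden size, and the 4-layer stack are assumptions chosen so the example runs instantly, but the mechanics (`register_forward_hook`, last-token slicing, removing handles afterward) are the same ones used on a real model:

```python
import torch
import torch.nn as nn

HIDDEN = 8  # toy stand-in for Phi-3B's 4,096-dim residual stream

class ToyBlock(nn.Module):
    """Stand-in for a transformer layer: read the stream, add back into it."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, HIDDEN)

    def forward(self, h):
        return h + self.proj(h)  # additive residual update

layers = nn.ModuleList([ToyBlock() for _ in range(4)])
captured = {}  # layer index -> last-token activation, shape [1, HIDDEN]

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output has shape [batch, seq_len, HIDDEN]; slice the last token,
        # where the model's summary of the full input is concentrated.
        captured[layer_idx] = output[:, -1, :].detach().clone()
    return hook

handles = [blk.register_forward_hook(make_hook(i)) for i, blk in enumerate(layers)]

h = torch.randn(1, 5, HIDDEN)  # stand-in embeddings for a 5-token query
with torch.no_grad():          # extraction only, no gradients
    for blk in layers:
        h = blk(h)

for handle in handles:  # always detach hooks after extraction
    handle.remove()
```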

We repeat this for every base query x and every instructed query x⁺ in our paired dataset. The result is a structured collection of activation vectors:

$$h_{i,l} \in \mathbb{R}^{4096} \quad \text{— activation for base query } x_i \text{ at layer } l$$

$$h^{+}_{i,l} \in \mathbb{R}^{4096} \quad \text{— activation for instructed query } x^{+}_i \text{ at layer } l$$

For N query pairs and L layers (32 in Phi-3B), we end up with 2 × N × L vectors in total: a comprehensive map of how the model’s internal representations differ between base and instructed inputs at every depth of processing. The difference between \( h^{+}_{i,l} \) and \( h_{i,l} \), for the same query content at the same layer, isolates exactly what the JSON instruction added to the model’s representation. Everything else (the topic, the phrasing, the complexity) is held constant by the paired design.

Computing the JSON Steering Vector

Now we get to the core operation: turning those paired activations into a single steering vector.

The Math

For each layer l, we compute:
$$v_l = \mathbb{E}_i\left[h^{+}_{i,l}\right] - \mathbb{E}_i\left[h_{i,l}\right]$$

In plain terms: take the average activation with the JSON instruction, subtract the average activation without it. By averaging across hundreds of diverse query pairs, we cancel out all the content-specific signals (Messi, bread, quantum physics) and isolate the format-only direction the consistent shift that the JSON instruction creates.

We then normalize to unit length:

$$\hat{v}_l = \dfrac{v_l}{\lVert v_l \rVert}$$

This separates the direction of the effect from its magnitude, so we can control intensity independently through a scaling parameter during inference.
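The two formulas above amount to a few lines of array code. In this sketch the activations are synthetic (we plant a known “JSON direction” and check that averaging recovers it); in the real pipeline, `h` and `h_plus` come from the extraction step:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 8  # toy stand-ins for (number of pairs, hidden size)

# Synthetic activations at one layer: h = base queries, h_plus = instructed.
# We plant a known direction so the recovery is checkable.
h = rng.normal(size=(N, D))
json_direction = np.eye(D)[0]  # hypothetical "output JSON" direction
h_plus = h + 2.0 * json_direction + 0.1 * rng.normal(size=(N, D))

# v_l = E[h_plus] - E[h]: content-specific signal cancels in the averages,
# leaving only the shift the instruction consistently adds.
v = h_plus.mean(axis=0) - h.mean(axis=0)

# Normalize to unit length so intensity is controlled separately by alpha.
v_hat = v / np.linalg.norm(v)
```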

The resulting \( \hat{v}_l \) is a single vector matching the model’s hidden size: 4,096 dimensions for Phi-3B. We store one per layer, but in practice we only use the one at the optimal layer (more on that below).

When we project this vector onto the model’s vocabulary space to see which tokens it aligns with, the results are telling:

The top-aligned tokens are all structural JSON tokens: {, }, ", :, json, object, [. No content words anywhere near the top. This confirms the vector is a pure format signal.
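This vocabulary-space check can be sketched with a toy example. Everything here is synthetic and hypothetical (the tiny vocabulary, the unembedding matrix `W_U`, the planted alignment along the steering direction); the point is the operation itself, projecting the vector through the unembedding and reading off the top-scoring tokens:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # toy hidden size

vocab = ["{", "}", '"', ":", "json", "[", "the", "cat", "runs", "blue", "sky", "eats"]
structural = {"{", "}", '"', ":", "json", "["}

# Toy unembedding matrix: one row per token. We plant extra weight along
# dimension 0 for structural tokens so the projection is checkable.
W_U = rng.normal(scale=0.1, size=(len(vocab), D))
for i, tok in enumerate(vocab):
    if tok in structural:
        W_U[i, 0] += 1.0

v_hat = np.zeros(D)
v_hat[0] = 1.0  # stand-in steering direction

# Alignment score per token: how much the steering vector boosts its logit.
scores = W_U @ v_hat
top_tokens = [vocab[i] for i in np.argsort(scores)[::-1][:3]]
```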

Applying Steering During Inference

With the steering vector extracted and the optimal layer identified, the inference-time intervention is remarkably simple in implementation, but understanding exactly how it integrates with the model’s forward pass builds important intuition about why it works and where it can go wrong.

Step-by-Step: What Happens During a Steered Forward Pass

Here’s the complete flow when a prompt enters the model with steering active:

  1. Tokenization: The prompt is converted into input IDs; no JSON instruction appears anywhere in the tokens.
  2. Layers 0–5 (unmodified): The input passes through the first six layers normally. Each layer applies multi-head self-attention and a feed-forward network, adding its output to the residual stream. By Layer 5, the model has built a rich contextual representation of the input.
  3. Layer 6 (intervention point): Layer 6 computes its normal output. Then our hook fires. The steering vector \( \hat{v}_6 \), scaled by \( \alpha \), is added:
    \( h'_6 = h_6 + \alpha \cdot \hat{v}_6 \)
  4. Layers 7–31 (downstream propagation): The modified activation \( h'_6 \) flows through the remaining 25 layers. These layers don’t “know” anything was changed; they process \( h'_6 \) using their existing weights exactly as they would any other activation. But because \( h'_6 \) now carries the directional signature of the JSON instruction, every downstream computation is subtly biased toward JSON-structured output.
  5. Output head: The final layer’s output feeds into the language model head, which converts the 4,096-dimensional hidden state into a probability distribution over the entire vocabulary. The steering has shifted this distribution: structural tokens like {, ", and : now receive higher probabilities than they otherwise would.
  6. Autoregressive generation: The model samples the first output token (often {), appends it to the context, and repeats the forward pass for each subsequent token, with the steering hook active on every pass, continuously reinforcing the JSON structure.
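The intervention itself fits in a few lines. This is a toy stand-in for the real model (layer count, sizes, and the steering direction are illustrative assumptions), showing the one real mechanic: a PyTorch forward hook that returns a modified output, which the framework then substitutes into the residual stream:

```python
import torch
import torch.nn as nn

HIDDEN, N_LAYERS, STEER_LAYER, ALPHA = 8, 8, 3, 3.0  # toy stand-in sizes

class ToyBlock(nn.Module):
    """Stand-in transformer layer with an additive residual update."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, HIDDEN)

    def forward(self, h):
        return h + self.proj(h)

torch.manual_seed(0)
layers = nn.ModuleList([ToyBlock() for _ in range(N_LAYERS)])

v_hat = torch.zeros(HIDDEN)
v_hat[0] = 1.0  # stand-in unit steering vector for this layer

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output:
    # h' = h + alpha * v_hat, broadcast over batch and sequence positions.
    return output + ALPHA * v_hat

x = torch.randn(1, 5, HIDDEN)

def forward(h):
    with torch.no_grad():
        for blk in layers:
            h = blk(h)
    return h

handle = layers[STEER_LAYER].register_forward_hook(steering_hook)
steered = forward(x)
handle.remove()      # the intervention is fully reversible
plain = forward(x)   # same input, default behavior restored
```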

Steering adds negligible overhead. The forward pass through all 32 layers happens exactly as it normally would: there’s no second pass, no gradient computation, and no optimization loop. The only addition is a single vector addition (4,096 floating-point adds) at one layer per forward pass. In wall-clock terms, this amounts to less than 0.1% overhead on a typical inference call. Compare this to constrained decoding, which must validate token candidates against a grammar at every generation step, or to retry loops, which may re-run the entire generation multiple times.

The Role of α

The scaling parameter α controls steering intensity, and getting it right matters for output quality. At low values (0.5–1.0), the nudge is gentle: the model occasionally produces JSON but isn’t strongly committed to it. This range is useful when you want to slightly increase JSON likelihood without forcing the format. At moderate values (2.0–4.0), the model reliably opens with { and maintains valid JSON structure while still producing meaningful, accurate content inside the fields. This is the sweet spot for most production use cases. At high values (above 6.0), over-steering becomes a problem: the model locks onto structure so aggressively that content quality degrades, producing syntactically perfect JSON with repetitive, shallow, or garbled values inside.

Here’s what this looks like end to end. The prompt has no JSON instruction at all:

“List some facts about Lionel Messi.”

Without steering, the model writes free-form paragraphs: natural and readable, but useless for an API that expects structured data. With the steering vector applied at Layer 6 (\( \alpha = 3.0 \)):

{
  "name": "Lionel Messi",
  "profession": "Professional Footballer",
  "nationality": "Argentine",
  "achievements": [
    "FIFA World Cup Winner (2022)",
    "7-time Ballon d'Or Winner",
    "All-time top scorer for Argentina"
  ]
}

Valid JSON. Semantically accurate. No prompt instruction needed. The model “decided” to use JSON structure because its internal state, after the Layer 6 intervention, was indistinguishable from a state where it had been explicitly asked for JSON.

Choosing the Optimal Steering Layer

Where you inject the steering vector matters a lot. Not all layers respond to steering the same way.

Early layers (0–3) handle low-level token and positional processing. Steering here is too abstract: the model hasn’t built enough understanding of the input yet for a format nudge to land properly.

Late layers (18+) are close to the final output. By this point, the model has already committed to most of its generation decisions. Forcing a format change here overrides those decisions destructively, leading to garbled output and broken fluency.

The sweet spot is in the middle-early layers, where the model has understood the input but hasn’t locked in its generation strategy yet. To find it, we run a sweep: apply the steering vector at each layer independently and measure JSON validity (what percentage of outputs parse as valid JSON?) and fluency (does the output still read coherently, measured via perplexity?). The optimal layer maximizes validity while keeping fluency intact.
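The sweep reduces to a loop like the one below. `generate_with_steering` is a hypothetical stub standing in for the real steered generation call (hard-coded here so the example runs); the validity score via a strict parse is the real metric, and a full sweep would also track perplexity for fluency:

```python
import json

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def generate_with_steering(prompt: str, layer: int) -> str:
    # Hypothetical stub for the real steered generation call. It mimics
    # the observed pattern: middle-early layers work, others don't.
    return '{"ok": true}' if 4 <= layer <= 10 else "Sure! Here are some facts..."

prompts = ["List some facts about Lionel Messi.", "Explain photosynthesis."]

# Apply steering at each layer independently and score JSON validity.
validity = {
    layer: sum(is_valid_json(generate_with_steering(p, layer)) for p in prompts)
           / len(prompts)
    for layer in range(32)
}
best_layer = max(validity, key=validity.get)  # would also check fluency
```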

For Phi-3B, Layer 6 is the clear winner. It produces the highest JSON adherence gains with negligible impact on output quality. Layers beyond 18 consistently hurt both accuracy and readability.

Dataset and Experimental Setup

To test the steering approach rigorously, we built a clean experimental framework designed to answer one central question: does injecting a steering vector produce meaningful, reliable improvements in JSON adherence, and does it do so without damaging the quality of the model’s answers?

Our evaluation data comes from two complementary sources. The first is IFEval, a widely used benchmark for instruction-following evaluation, from which we extract the subset of tasks that specifically require JSON-formatted output. IFEval gives us a standardized, community-recognized test bed. The second source is a collection of synthetic prompt pairs that we generate ourselves: diverse base queries spanning topics from sports biographies to scientific explanations, each paired with a JSON-instructed variant. This synthetic set ensures we’re testing generalization across a wide range of content, not just the narrow slice that IFEval covers.

For metrics, we prioritize two dimensions. JSON Validity is measured by passing every model output through a strict JSON parser: the output either parses cleanly or it doesn’t. There’s no partial credit for an output that’s almost valid but has a trailing comma or a missing closing brace. This binary metric is deliberately unforgiving because production JSON consumers (APIs, pipelines, databases) are equally unforgiving. The second metric is Answer Quality, which ensures that steering doesn’t improve formatting at the cost of content. We evaluate outputs for factual accuracy and coherence using automated rubric scoring, checking that the model still provides substantive, relevant answers inside the JSON structure.
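The strictness is worth making concrete. Python’s `json.loads` (a reasonable stand-in for any production parser) rejects exactly the near-misses described in the introduction:

```python
import json

def parses(text: str) -> bool:
    """Binary validity: the output parses cleanly or it doesn't."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Near-misses get zero credit, exactly as a production consumer would score them.
checks = {
    '{"name": "Messi"}': True,         # clean JSON
    '{"name": "Messi",}': False,       # trailing comma: no partial credit
    '{"name": "Messi"': False,         # dropped closing brace
    'Sure! {"name": "Messi"}': False,  # conversational preamble
}
```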

To isolate what each component contributes, we run the model under four configurations. The Baseline uses no instruction and no steering; this tells us how often the model spontaneously outputs valid JSON. Instruction Only adds “Format the output as JSON” to the prompt but applies no steering, establishing what prompt engineering alone achieves. Steering Only applies the steering vector with no mention of JSON in the prompt, testing whether the vector works independently. Finally, Instruction + Steering combines both, revealing whether the two approaches are complementary or redundant.
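The four conditions form a 2×2 grid over (instruction present, steering active) and can be enumerated directly. A sketch; the helper and field names are illustrative, not the actual harness:

```python
from itertools import product

INSTRUCTION = "Format the output as JSON."
NAMES = {
    (False, False): "Baseline",
    (True, False): "Instruction Only",
    (False, True): "Steering Only",
    (True, True): "Instruction + Steering",
}

def make_condition(base_prompt: str, with_instruction: bool, with_steering: bool):
    # The prompt carries the instruction (or not); steering is a separate flag
    # consumed by the generation loop, never visible in the tokens.
    prompt = f"{base_prompt} {INSTRUCTION}" if with_instruction else base_prompt
    return {"name": NAMES[(with_instruction, with_steering)],
            "prompt": prompt,
            "apply_steering": with_steering}

conditions = [
    make_condition("List some facts about Lionel Messi.", instr, steer)
    for instr, steer in product([False, True], repeat=2)
]
```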

Results

JSON Adherence

The results tell a clear story.

JSON Adherence Results Comparison

Without any prompt instruction, baseline JSON adherence is roughly 10%: the model occasionally produces JSON-like output by chance, but the vast majority of responses are plain text. Applying the steering vector bumps this to approximately 30%, a +20.16-percentage-point absolute improvement. The model is producing valid JSON three times as often, and it was never asked to.

With a prompt instruction (“Format the output as JSON”), baseline adherence is about 69.86%: the model follows the instruction most of the time, but still fails about 30% of the time. Adding steering on top pushes this to roughly 85–90% valid JSON, a +5–7 percentage point gain. Steering and prompting are complementary: the vector catches the cases where the model would otherwise have fumbled the instruction.

Projecting the steering vector onto the vocabulary space confirms it encodes format, not content. The highest-aligned tokens are structural: {, }, ", json, object, :. Semantic content words show no meaningful alignment. The vector is a clean format signal.

When we visualize the activations in a 2D projection, a striking pattern emerges:

JSON-instructed activations cluster tightly together, regardless of the topic. Non-instructed activations are scattered across the space. This tells us that the JSON instruction creates a consistent, separable geometric signature inside the model, which is exactly why a simple vector subtraction can capture it.

Steering isn’t free of side effects, but the trade-offs are manageable. At moderate \( \alpha \) values, we occasionally see minor artifacts: an extra closing brace, slight awkwardness in text values within the JSON, or minor redundancy. At very high \( \alpha \) values, the model over-commits to structure at the expense of content quality.

But within the optimal range, semantic correctness stays intact. The model still gives accurate, relevant answers, just wrapped in valid JSON. For production systems where JSON validity is a hard requirement, this is a very favorable trade-off.

Why This Works

The success of activation steering for JSON isn’t a lucky coincidence. It works because of how transformers process information.

Format instructions are global constraints. Unlike factual recall or reasoning, “output as JSON” applies uniformly to the entire response, regardless of content. This universality means it can be represented as a single, stable direction in activation space.

The residual stream is additive. Each layer’s contribution gets added to a running total. This means an external vector addition integrates naturally: the downstream layers process the modified representation using their existing weights, seamlessly adapting as if the shift had always been there.

Steering replays the instruction’s effect. When the model reads “Format the output as JSON” in a prompt, it converts that text into an internal representational shift. Our steering vector is a direct estimate of that same shift. Injecting it is functionally equivalent to the model having read the instruction, just without the noise and unreliability of language comprehension.

Limitations

While the results are promising, there are several important boundaries to this work that deserve honest acknowledgment.

The most obvious constraint is format scope. Everything in this article applies to JSON and only JSON. Each new target format (XML, YAML, Markdown tables, CSV) would require its own paired dataset, its own vector extraction pipeline, and its own layer optimization sweep. The underlying method generalizes in principle, but the practical work of building and validating each format vector is non-trivial. We haven’t yet tested whether formats with less rigid syntax (like Markdown) produce steering vectors that are as clean and separable as JSON’s.

There’s also the question of model specificity. Layer 6 is optimal for Phi-3B, but that number is an empirical finding, not a universal constant. A different model architecture, or even a different size within the same family, will almost certainly have a different optimal injection point. This means every new deployment target requires a calibration step: a layer sweep on held-out data to find the sweet spot for that particular model. The cost of this calibration is modest (a few hours of compute), but it’s a step that can’t be skipped.

Over-steering is a real risk in practice. When the scaling parameter \( \alpha \) is set too high, the model becomes obsessively structural: it produces syntactically perfect JSON, but the content inside the structure becomes repetitive, shallow, or irrelevant. Finding the right \( \alpha \) for a given use case requires experimentation, and the optimal value may differ depending on the complexity of the expected output. A system that generates simple key-value responses may tolerate stronger steering than one expected to produce deeply nested JSON with rich text values.

Finally, it’s critical to understand that steering is a formatting tool, not a reasoning enhancer. The steering vector ensures the output looks like valid JSON. It makes no guarantees about whether the content inside that JSON is accurate, complete, or relevant. A model that hallucinates facts will still hallucinate them, just in well-formatted braces and brackets. Steering should be viewed as one layer in a production stack, complementing (not replacing) techniques for factual grounding and content quality.

It's also worth noting that independent research by Panahi (2026) tested activation steering for JSON generation and found the opposite result: steering reduced valid JSON from 86.8% to just 24.4%. The divergence comes down to three things: a much smaller model (0.5B vs. 3B), a significantly harder task (strict PII schema extraction vs. general fact formatting), and no layer sweep to find the optimal injection point. These aren't contradictory findings; they define the boundary conditions of where this technique works and where it doesn't.

What Comes Next

The results from this work suggest several natural extensions, each of which could significantly expand the practical utility of activation steering.

The most immediate next step is building a multi-format vector library. If we can extract a clean steering vector for JSON, the same pipeline should work for XML, Markdown, CSV, YAML, and other structured formats. The vision is a switchable library of pre-computed vectors that lets a single model deployment serve different output formats on demand: an API caller requests JSON, an analytics dashboard requests CSV, a documentation system requests Markdown, all handled by swapping one vector, with no prompt changes and no model reloading. This would dramatically simplify production architectures that currently maintain separate prompt templates or even separate fine-tuned models for each format.

Another promising direction is dynamic \( \alpha \) scheduling: varying the steering strength within a single generation rather than using a fixed value throughout. The intuition is that the opening tokens of a JSON response (the initial { or [) are the most critical for establishing structure, so early tokens would benefit from strong steering. Once the structural scaffold is in place, reducing \( \alpha \) mid-generation would give the model more freedom to focus on content quality, producing richer and more accurate values within the JSON fields. This is analogous to how curriculum learning varies difficulty during training, but applied at inference time.
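One simple instantiation of this idea is a linear decay over the first few generated tokens. The specific numbers here are hypothetical; the point is the shape (strong early, relaxed later):

```python
def alpha_schedule(step: int, alpha_max: float = 3.0, alpha_min: float = 1.0,
                   decay_tokens: int = 16) -> float:
    """Steer hard while the structural scaffold (opening braces, first keys)
    is being emitted, then decay linearly to give content more freedom."""
    frac = min(step / decay_tokens, 1.0)
    return alpha_max + frac * (alpha_min - alpha_max)
```

At generation time, the steering hook would read `alpha_schedule(step)` on each forward pass instead of a fixed \( \alpha \).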

The composability of steering vectors opens up an especially exciting possibility: combining multiple behavioral vectors into a single compound intervention. If JSON formatting, formal tone, and conciseness each correspond to independent directions in activation space (and our clustering analysis suggests they might), they could theoretically be summed into one composite vector that enforces all three constraints simultaneously. This would enable complex behavioral specifications (“respond in formal, concise JSON”) with a single vector addition, no compounding prompts required.
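In vector terms, the composition is just a weighted sum, renormalized. The three directions below are orthogonal by construction, an assumption that real behavior vectors would only approximate:

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

D = 8  # toy hidden size
# Hypothetical independent behavior directions (orthogonal in this toy).
v_json, v_formal, v_concise = np.eye(D)[0], np.eye(D)[1], np.eye(D)[2]

# One compound intervention: weight each behavior, sum, renormalize so
# overall intensity stays controlled by a single alpha at inference time.
weights = {"json": 3.0, "formal": 1.5, "concise": 1.0}
v_combo = unit(weights["json"] * v_json
               + weights["formal"] * v_formal
               + weights["concise"] * v_concise)
```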

Beyond formatting, the same framework naturally extends to safety and policy steering. If harmful content, factual grounding, or policy compliance have their own recoverable directions in the residual stream, steering could provide a fast, reversible guardrail layer that operates beneath the prompt level. This would be particularly valuable for deployment scenarios where prompt injection is a concern, since activation-level interventions are much harder to circumvent than prompt-level instructions.

Finally, scaling to larger models is an open and important question. Our experiments use Phi-3B, a relatively compact model. Testing on architectures like LLaMA-3 (8B and 70B), Qwen-2, and Mixtral will reveal whether format instructions remain cleanly separable at much larger scales, or whether the increased capacity and entanglement of larger models require more sophisticated extraction techniques.

The Bottom Line

Instruction-following doesn’t have to depend on how well a model interprets natural language prompts. JSON formatting, one of the most practically important and most commonly broken output constraints, corresponds to a stable, recoverable direction in the model’s residual stream. By extracting that direction from paired activations and replaying it during inference, we achieve substantial, measurable gains: +20 percentage points from baseline, and another +5–7 percentage points on top of prompt instructions.

What makes this approach particularly compelling for production use is the combination of properties it offers. The computational cost is negligible: a single vector addition per forward pass, with no additional model calls, no retry loops, and no extra tokens in the context window. The intervention is fully reversible: remove the vector, and the model returns to its default behavior instantly, with no residual effects. Multiple steering vectors can be composed together, opening the door to simultaneous control over format, tone, and style through simple vector arithmetic. And because the technique operates at the activation level rather than the prompt level, it’s completely independent of how the prompt is worded: there are no fragile instruction templates to maintain, and no context tokens consumed by formatting directives.

As LLM-powered systems move deeper into production, powering APIs, agent workflows, and structured data pipelines, techniques like activation steering represent a fundamental shift in how we think about model control. Instead of crafting better prompts and hoping the model cooperates, we can engineer the model’s internal representations directly, with the precision and predictability that production systems demand.

The instruction doesn’t have to be in the prompt. It can be in the activations.

Built by Ionio

This research is part of the applied AI work we do every day at Ionio, an AI consulting firm that helps mid-market SaaS platforms build the one capability that makes them impossible to compare and harder to leave.

Activation steering didn’t come from a whitepaper we read. It came from solving a real production problem for a real client: their LLM pipeline was breaking on JSON formatting 30% of the time, and neither prompt engineering nor retry loops were cutting it. We went deeper into the model’s residual stream and built something that actually worked.

That’s how we operate. Our engineering team has published 100+ technical articles, shipped models on Hugging Face, and documented 30+ case studies across AI-powered features for platforms in e-commerce, subscriptions, marketing automation, and catalog management. When a platform needs AI that doesn’t just demo well but actually moves revenue, they call us.

Whether it’s activation-level interventions like the one in this article, custom retention engines that predict churn 60 days early, or AI enrichment pipelines that turn a flat product catalog into a competitive advantage, we find the wedge that separates you from the other nine platforms on the buyer’s shortlist. Then we build it, ship it, and make sure it pays for itself.

If this kind of applied AI research is relevant to what you’re building, we’d love to talk. Book a call with our team →