Explaining AI agent decisions with the Kiji Inspector™

When an AI agent receives your request and decides to search a database instead of browsing the web, what's actually happening inside the model? Not what the model says is happening, but what's actually going on in the neural network at the moment of decision?

That's the question Kiji Inspector™ answers. Built by Dataiku's 575 Lab, Kiji Inspector™ is an open-source toolkit for mechanistic interpretability that cracks open the internal representations of language models to explain why an agent picks one tool over another. It's the first pipeline purpose-built for agentic decision explainability.

The growing deployment of increasingly capable AI, particularly in critical systems, raises concerns about behavior, including potential deception, that is difficult to detect without proper interpretability. Kiji Inspector™ addresses this challenge by helping to make the decisions of AI agents explainable.

Big picture: How does the Kiji Inspector™ work?

Kiji Inspector™ follows a six-step pipeline that goes from raw model internals to human-readable explanations:

1. Contrastive pair generation: Synthetic pairs of user requests are created that share the same intent but require different tools. For example: "What is the latest version of Product X?" (requires file read) vs. "Set the latest version of Product X to v3.2.1" (requires file write). Same topic, different tool decision.

2. Activation extraction: The model processes each prompt, and Kiji Inspector™ captures the hidden-state activations at the exact decision token, the precise position in the output where the model commits to a tool choice. This is the neural "snapshot" of the decision-making process.

3. SAE training: A JumpReLU Sparse Autoencoder (SAE) is trained on those raw activations to decompose them into interpretable, monosemantic features: individual directions in the model's representation space that correspond to human-understandable concepts.

4. Contrastive analysis: The contrastive pairs are used as post-hoc statistical probes to determine which learned features are associated with specific tool decisions. The SAE itself is trained unsupervised; the pairs only help identify which features matter.

5. Feature interpretation: An LLM assigns human-readable descriptions to identified features and generates decision reports.

6. Fuzzing evaluation: A token-level A/B testing methodology (adapted from Eleuther AI's autointerp) validates whether the feature labels actually identify the correct tokens driving each feature, catching explanations that sound plausible but are wrong.

The result: You get a report that tells you which internal computational features drove the agent to pick a specific tool, grounded in the model's actual representations, not in a post-hoc narrative.
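The data flow through the six steps can be sketched in miniature. Everything below is illustrative: `ContrastivePair`, `run_pipeline`, and the stubbed activations are hypothetical stand-ins for exposition, not the Kiji Inspector™ API.

```python
from dataclasses import dataclass

@dataclass
class ContrastivePair:
    """Two prompts with the same intent but different required tools."""
    intent: str
    prompt_a: str
    tool_a: str
    prompt_b: str
    tool_b: str

pair = ContrastivePair(
    intent="Product X version",
    prompt_a="What is the latest version of Product X?",
    tool_a="file_read",
    prompt_b="Set the latest version of Product X to v3.2.1",
    tool_b="file_write",
)

def run_pipeline(pairs):
    """Skeleton of the six stages, with steps 2-4 stubbed."""
    reports = []
    for p in pairs:
        # 2. Extract activations at the decision token (stubbed toy values).
        acts = {p.tool_a: [0.9, 0.2], p.tool_b: [0.1, 0.3]}
        # 3-4. SAE encoding + contrastive analysis (stubbed): keep the
        # feature whose activation differs most across the pair.
        diff = [abs(a - b) for a, b in zip(acts[p.tool_a], acts[p.tool_b])]
        top_feature = diff.index(max(diff))
        # 5. LLM labeling and 6. fuzzing validation would follow here.
        reports.append({"intent": p.intent, "top_feature": top_feature})
    return reports

print(run_pipeline([pair]))
```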

Why is using sparse autoencoders different from chain of thought and prompt engineering?

If you've worked with AI agents, you've probably tried the usual approaches to understanding their behavior:

Prompt engineering lets you tweak system prompts and observe what changes. This reveals correlations between instructions and outputs, but it tells you nothing about the actual computational mechanisms inside the model. You're poking the box from the outside.

Chain-of-thought inspection reads the model's generated reasoning. The problem? Models can produce perfectly plausible-sounding explanations that don't reflect their true decision process. The reasoning text is another output, not a window into the computation.

Behavioral testing probes the model with test inputs to characterize input-output relationships. Useful, but it never explains the intermediate computations; the how between input and output remains a black box.

Kiji Inspector™ takes a fundamentally different approach: It directly examines the model's internal representations. Rather than inferring intent from behavior, it extracts and decomposes the actual hidden states at the moment of decision. The SAE doesn't learn from supervision what tools are "correct"; it discovers the model's natural feature vocabulary unsupervised, and contrastive analysis identifies which pre-existing features correlate with different decisions.

This is the difference between asking someone why they made a choice (and hoping they're honest and self-aware) versus looking at an fMRI of their brain at the moment they decided.

What's novel: agentic decision explainability

Sparse autoencoders for interpretability aren't new; there's excellent prior work from Anthropic, EleutherAI, and others on decomposing language model activations into interpretable features. But that work has focused on next-token prediction in general language modeling.

What Kiji Inspector™ brings that's genuinely novel is applying mechanistic interpretability specifically to agentic decision-making: the structured tool-selection choices that AI agents make in real-world deployments.

The key innovations include:

Decision token extraction. Rather than analyzing random positions in the output, Kiji Inspector™ identifies the exact token position at which the model commits to a tool choice and captures activations at that position. This focuses the analysis on the moment that matters.
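As a rough illustration of the idea, locating the decision token amounts to finding where the tool name begins in the generated sequence. The `<tool_call>` marker and word-level tokenization below are invented for the sketch; in practice, the hidden state at that index would be captured with a forward hook during the model's forward pass.

```python
def decision_token_index(tokens, call_marker="<tool_call>"):
    """Return the position of the first token after the tool-call marker,
    i.e., where the model commits to a specific tool name."""
    for i, tok in enumerate(tokens):
        if tok == call_marker:
            return i + 1
    raise ValueError("no tool call in sequence")

# Toy generated sequence; real decoding operates on model token IDs.
tokens = ["The", "user", "asks", "...", "<tool_call>", "file_read", "(", "path", ")"]
idx = decision_token_index(tokens)
print(tokens[idx])  # the committed tool name
```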

Contrastive pairs as post-hoc probes, not a training signal. The SAE learns the model's natural representations without any contrastive supervision. The carefully designed pairs (spanning five domains: general tool selection, investment analysis, manufacturing, supply chain, and customer support) serve only to statistically identify which pre-existing features are decision-relevant. This avoids biasing the learned features toward any particular notion of "correct."
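A minimal sketch of the probing step, assuming you already have SAE feature activations for both sides of each pair: rank features by how far their mean activation differs across the contrast. The function name and toy numbers are illustrative, not the toolkit's actual statistics.

```python
def decision_relevant_features(side_a, side_b, top_k=2):
    """Rank features by mean activation difference between the two sides.

    side_a, side_b: lists of per-prompt feature activation vectors."""
    n_feat = len(side_a[0])
    mean_a = [sum(v[j] for v in side_a) / len(side_a) for j in range(n_feat)]
    mean_b = [sum(v[j] for v in side_b) / len(side_b) for j in range(n_feat)]
    diffs = [(j, abs(mean_a[j] - mean_b[j])) for j in range(n_feat)]
    return sorted(diffs, key=lambda t: t[1], reverse=True)[:top_k]

# Toy activations: feature 2 cleanly separates read vs. write prompts.
reads = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8]]
writes = [[0.1, 0.5, 0.0], [0.2, 0.6, 0.1]]
print(decision_relevant_features(reads, writes))
```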

Token-level fuzzing evaluation. Standard interpretability evaluations check whether feature labels predict which texts activate a feature. Kiji Inspector™ goes deeper, testing whether labels predict which specific tokens drive activation, catching labels that are right for the wrong reasons.
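To see why token-level checking is stricter, consider a toy scorer that asks what fraction of a feature's top-activating tokens the label's predicted tokens actually cover. This scoring scheme is a simplification for illustration, not EleutherAI's autointerp methodology itself.

```python
def token_level_score(per_token_activations, predicted_tokens, top_n=2):
    """Fraction of the top-activating tokens covered by the label's prediction.

    per_token_activations: {token: activation strength}."""
    ranked = sorted(per_token_activations, key=per_token_activations.get, reverse=True)
    true_drivers = set(ranked[:top_n])
    hits = true_drivers & set(predicted_tokens)
    return len(hits) / top_n

acts = {"set": 0.9, "version": 0.7, "the": 0.1, "Product": 0.2}
# A plausible-sounding but wrong label predicts "Product"-related tokens:
print(token_level_score(acts, ["Product", "X"]))    # → 0.0, flags the label
print(token_level_score(acts, ["set", "version"]))  # → 1.0, label validated
```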

Cross-domain applicability. The pipeline works across five scenario configurations with 32 tools and 37 contrast types, demonstrating that the approach generalizes beyond a single use case.

For organizations deploying AI agents in production, especially in regulated industries, this kind of decision explainability is critical. When an agent makes an unexpected tool choice, Kiji Inspector™ can tell you which internal features drove that decision, grounded in the model's actual computation.

Quick start: using a pretrained SAE

Getting started with Kiji Inspector™ is straightforward. Install the package and load a pretrained SAE in just a few lines:

```shell
$ pip install kiji-inspector
```

```python
from kiji_inspector import SAE

sae, feature_descriptions = SAE.from_pretrained(
    base_model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    layer=20,
)

# `activations` are hidden states captured at the decision token
features = sae.encode(activations)
reconstruction = sae.decode(features)

themes = sae.describe(features, top_k=5)
```

The SAE.from_pretrained() method loads a trained sparse autoencoder along with human-readable descriptions of its learned features. You pass in the base model identifier and the transformer layer you want to inspect (layer 20 is the recommended default for Nemotron — it offers the best balance of reconstruction fidelity and decision-relevant representations).

From there, encode() decomposes your activations into sparse, interpretable features, and decode() reconstructs the original activations from those features. The gap between original and reconstruction tells you how well the SAE captures the model's representations.
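That gap can be quantified with a simple reconstruction error. The toy vectors below stand in for real hidden states; in practice, `reconstructed` would come from `sae.decode(sae.encode(x))`.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between original and reconstructed activations."""
    assert len(original) == len(reconstructed)
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

original = [0.5, -1.2, 0.8, 0.0]          # hidden state at the decision token
reconstructed = [0.45, -1.1, 0.85, 0.05]  # SAE round-trip of that state
print(reconstruction_error(original, reconstructed))
```

A low error means the sparse features retain most of what the model represented at the decision token, so explanations built on them are grounded rather than lossy.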

Supported models

Kiji Inspector™ currently supports pretrained SAEs for the following models:

NVIDIA

  • nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  • nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF8
  • nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

Google

  • google/gemma3-27b-it (experimental)

In a preprint paper, we summarized our findings using the Nemotron-3-Nano-30B model, which uses a Mixture-of-Experts architecture with 30 billion total parameters and 3 billion active per token. Its SAE is trained on the post-residual hidden state after expert outputs are aggregated, focusing on the model's combined representations rather than on expert-routing behavior.

Training your own Kiji Inspector™

If you're working with a model not on the supported list, or want to inspect tool-selection behavior in your own agent setup, you can train a custom SAE from scratch. The process has two major phases.

Phase 1: custom pair generation

The quality of your contrastive pairs directly impacts the quality of your analysis. You define scenarios as JSON configurations that specify a domain (e.g., "customer support"), the available tools, and the types of contrasts you want to explore (e.g., "escalate vs. resolve," "self-service vs. agent-assisted").
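A scenario configuration along these lines might look as follows. The exact schema (field names and keys) is a hypothetical reconstruction from the description above, not copied from the Kiji Inspector™ docs.

```python
import json

# Hypothetical scenario config for pair generation; field names are illustrative.
scenario = {
    "domain": "customer support",
    "tools": ["escalate_ticket", "resolve_ticket", "send_kb_article"],
    "contrast_types": [
        "escalate vs. resolve",
        "self-service vs. agent-assisted",
    ],
    "pairs_per_contrast": 500,
}

config_json = json.dumps(scenario, indent=2)
print(config_json)
```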

A generation model (the paper uses Qwen3-VL-235B, but you can substitute any capable LLM, including API-based models like Claude or GPT) then produces thousands of paired prompts. Each pair shares the same user intent but requires different tools due to subtle differences in the request. The generation pipeline includes robust JSON parsing with fallback recovery to handle LLM output variability.
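Fallback recovery for LLM-generated JSON typically layers progressively looser parsers: strict parse first, then fence stripping, then brace extraction. The sketch below shows this common pattern and is illustrative, not the pipeline's actual implementation.

```python
import json
import re

def parse_llm_json(raw):
    """Parse JSON from LLM output, recovering from common formatting noise."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fallback 1: strip markdown code fences the LLM may have added.
    stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        pass
    # Fallback 2: extract the first {...} span from surrounding chatter.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("unrecoverable LLM output")

messy = 'Sure! Here is the pair:\n```json\n{"prompt_a": "What is...", "tool_a": "file_read"}\n```'
print(parse_llm_json(messy))
```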

Phase 2: running the training pipeline

With pairs in hand, you run the multi-step pipeline: format prompts in ChatML with your tool descriptions and system prompt; extract hidden-state activations at the decision token for each prompt; train the JumpReLU SAE on the collected activations (in our internal experiments, we use a dictionary size of 4× the hidden dimension, a batch size of 8,192, and a cosine learning rate schedule); run contrastive activation analysis to identify decision-relevant features; and finally use an LLM to label features and run fuzzing evaluation to validate the labels.
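The JumpReLU nonlinearity at the heart of the SAE is easy to state: a feature fires only if its pre-activation clears a learned per-feature threshold, which is what produces sparse, mostly-zero feature vectors. The dimensions and threshold values below are toy choices for illustration.

```python
def jumprelu(pre_activations, thresholds):
    """JumpReLU: keep a unit's value only if it exceeds its learned threshold."""
    return [x if x > t else 0.0 for x, t in zip(pre_activations, thresholds)]

# Encoder pre-activations for one token (i.e., W_enc @ h + b_enc, precomputed):
pre = [0.05, 0.9, -0.3, 0.4]
theta = [0.1, 0.1, 0.1, 0.5]  # per-feature learned thresholds
features = jumprelu(pre, theta)
print(features)  # → [0.0, 0.9, 0.0, 0.0]
```

Unlike a plain ReLU, small positive activations below the threshold are zeroed too, which sharpens sparsity and tends to make the surviving features more monosemantic.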

The full pipeline is designed for GPU-accelerated environments; the paper's experiments ran on GB200 NVL4 hardware with four B200 GPUs. That said, smaller-scale experiments with fewer pairs are possible on more modest hardware.

Early learnings

Agent interpretability is opening up significant opportunities. Our initial testing with Kiji Inspector™ has already surfaced valuable insights, such as instances where an agent selected the correct tool for the wrong reason, or where the underlying LLM was misled by context and chose the wrong tool.

Kiji Inspector™ removes much of that guesswork, offering a clearer understanding of AI agent decisions and a basis for optimizing them. While sufficiently capable systems may still attempt to deceive, Kiji Inspector™ adds transparency and confidence to your tool calls.

Resources

To dive deeper into Kiji Inspector™, check out these resources:

  • Source code & documentation: github.com/dataiku/kiji-inspector/tree/main/docs
    These step-by-step guides cover pair generation, activation extraction, SAE training, contrastive analysis, feature interpretation, and fuzzing evaluation.
  • Presentation slides: github.com/dataiku/kiji-inspector/tree/main/presentation
    The conference presentation includes visual overviews of the architecture and results.
  • Research paper: The full paper, "Opening the Black Box: Mechanistic Interpretability for AI Agent Tool Selection Using Sparse Autoencoders" by Hannes Hapke and David Cardozo, is available in the repository's paper/ directory. It includes the complete methodology, mathematical formulations, layer sweep analysis, baseline comparisons, and evaluation results.

Kiji Inspector™ is open source under the Apache 2.0 license. Built by Dataiku's 575 Lab.

Leave a GitHub star, fork the project, or join the conversation.
