SimplexRL  ·  branch searchpi  ·  a deep explainer

SearchPI

How SGTS and SGSPI turn an exact LP solver into a policy-improvement operator — and what happens when the loop feeding search back into the network goes wrong.

src/simplexrl/searchpi/ written 2026-06-10 assumes: simplex, RL, this repo
SGTS — Sampled Gumbel Tree Search SGSPI — Simplex Graph Search Policy Iteration

1The big idea

Everything you already train in this repo — PPO, GRPO, IQL — improves the policy through gradients on sampled experience. SearchPI improves it through planning. At each pivot decision during collection, instead of just sampling the network's policy, a planner spends a budget of real solver queries exploring what each candidate entering variable would actually lead to. The planner's conclusion — "from this basis, variable 17 is genuinely the best pivot, and here's how good each alternative looked" — is then converted into supervised targets, stored in replay, and distilled back into the network. That's the PI in SearchPI: policy iteration, where the improvement step is a search procedure and the evaluation/distillation step is plain supervised regression.

This is the AlphaZero family of algorithms, but the domain twists it in a way worth internalizing before any details:

  • Transitions are exact and cheap. A pivot step queried through the HiGHS oracle is deterministic and costs far less than a network forward pass. Nobody learns a dynamics model here (that would be MuZero); the real solver is the planning model. The network only supplies guidance: policy logits, a value estimate, optionally per-candidate Q estimates — all from one fused policy_value_q call.
  • The problem is a deterministic shortest-path problem. Reward is −1 per pivot, terminal at optimality. So "act well" literally means "take the shortest pivot path to the optimal basis," and search methods for shortest paths (best-first / A*) become natural competitors to MCTS.
  • Ground truth is purchasable. From any basis you can run a classical pivot rule (Dantzig, devex, steepest edge) to termination and observe the exact remaining pivot count under that rule. This gives search access to verified, terminal-grounded value evidence that most RL domains never have — and the distinction between verified and network-estimated evidence turns out to be the single most important idea in the whole system (§5.5, §7.2).

SearchPI contains two planners that share all of this infrastructure but answer differently the question "how should a limited search budget be spent?":

SGTS — Sampled Gumbel Tree Searchplanners/sgts.py

A simplex specialization of Sampled MuZero + Gumbel MuZero ("Policy improvement by planning with Gumbel," Danihelka et al. 2022). It samples a small subset of root actions with the Gumbel-top-k trick, races them with Sequential Halving on exact solver branches, and builds completed-Q policy targets: searched actions get their measured values, unsearched actions get a neutral baseline. Optimizes simple regret at the root — pick the best next pivot among a sample.

SGSPI — Simplex Graph Search Policy Iterationplanners/sgspi.py

Treats phase-2 pivoting as deterministic shortest-path on a (cyclic) graph of bases and runs a bounded best-first graph search with the learned value as heuristic: f = g + α·hθ. It merges transpositions, attributes discovered evidence back to every root action that can reach it, optionally completes leaves with a classical pivot rule to get terminal-grounded values, and distills searched-Q softmax targets gated by evidence fidelity.

A useful slogan for the whole document: SGTS asks "which of these sampled arms is best?"; SGSPI asks "what is the cheapest path out of here, and which first step starts it?" Both then face the same second problem — turning what search found into training targets that improve the network without poisoning it. The empirical history (§7) shows that this second problem, not the search itself, is where both methods initially failed: SGSPI's search was excellent on day one and trained itself into collapse, while SGTS's improvement operator was accidentally parameterized into a mathematical no-op.

2The decision problem & shared machinery

Four pieces of shared infrastructure shape everything both planners do. They're worth getting precise before the algorithms, because most of the subtle correctness constraints in the codebase live here.

2.1Values are negative pivot counts

During phase 2, each state is the current simplex basis for a fixed LP; an action chooses one legal entering variable; the transition is one exact HiGHS pivot; reward is −1 per pivot (0 at the optimal terminal). With no discounting, the optimal value function is simply:

$$V^*(s) \;=\; -\big(\text{minimum remaining phase-2 pivots from } s\big), \qquad Q^*(s,a) \;=\; -\,\Delta_{\text{iters}}(s,a) \;+\; V^*\!\big(T(s,a)\big)$$
value convention — larger (less negative) is better; a two-pivot completion has value −2, not 2design doc · unified_search_policy_iteration_context.md

where \(\Delta_{\text{iters}}\) is the iteration count HiGHS reports for the step (almost always 1). Both planners, both target builders, the replay buffer, and the loss all speak this unit — raw pivots, undiscounted. Keep five quantities mentally separate, because conflating them is the documented way to corrupt the method: root_value (network V at the root), root_q_values (network Q head at the root), q_target (search-derived per-action evidence), value_target (scalar supervised target), and mc_return (the realized episode suffix). And one more for SGTS: v_mix is a policy-improvement baseline, never a value target (§4.3).

Why undiscounted matters

The original IQL warm-start checkpoint was trained with γ=0.99, so its value head predicts discounted returns: at 35 true remaining pivots it says ≈ −29.6; at 100 it says ≈ −63. Systematically optimistic — and optimism is precisely the failure mode that argmax-driven search amplifies (§7.2). Fix A5 retrains IQL at γ=1.0 so warm-start values share SearchPI's units.

2.2The phase-2 oracle: search by replaying prefixes

HiGHS doesn't expose "clone this solver state." What it does support, through the pivot-callback machinery you already know from SimplexPivotEnv, is deterministic replay: from the phase-2 start of a given problem, feeding the same sequence of entering variable ids reproduces the same basis. So search identifies any state by

$$\big(\texttt{problem\_index},\ \texttt{phase2\_prefix\_var\_ids}\big)$$
the universal state address — a prefix of internal entering-variable ids from phase-2 startsearchpi/oracle.py

The oracle (Phase2Oracle, searchpi/oracle.py) maintains pooled HiGHS sessions plus a budgeted checkpoint cache of basis snapshots, so "step once from this node" usually restores a nearby checkpoint and replays a short suffix rather than replaying from scratch. Planners receive checkpoint_id hints in every step result and thread them into child nodes. SGTS additionally caches each root action's first step (root_step_cache) so re-simulating the same root arm doesn't repeat the pivot. The practical consequence: tree/graph search over solver states is feasible, but each node expansion is a real solver interaction — search budgets are counted in oracle queries and wallclock, and the metrics (oracle_actions_replayed_sum, checkpoint_*) exist to keep replay amplification visible.

2.3Candidate slots vs var-ids — the #1 bug source

Every observation exposes candidates as a tensor of up to max_candidates slots, and slot indices are node-local: after one pivot, the same HiGHS variable can sit in a different slot, or vanish from the candidate list entirely. The codebase therefore enforces a strict identity discipline, called out in the design doc as the origin of "most subtle target bugs":

  • Candidate slots index tensors aligned to one specific observation (logits, masks, targets).
  • Internal entering-variable ids name actions durably: search paths, replay prefixes, root-action evidence, successor metadata.
  • Every node remaps explicitly (legal_var_to_slot); target builders construct candidate-aligned tensors from var-id keyed maps and fail loudly if evidence names a var-id that isn't legal at the root.

When you read any SearchPI code, this is why dictionaries are keyed by int(var_id) everywhere while tensors appear only at the boundary.

2.4The shared pipeline

Both methods are presets over one framework (searchpi/types.py · targets.py · replay.py · loss.py · trainer.py):

SearchRequestroot state, prefix, checkpoint, episode, padding
PlannerSGTS or SGSPI; queries oracle + network
SearchResultselected action + per-root-action evidence
TargetBuildercompleted-Q or searched-Q-softmax
Replayquality labels, state-key index, reanalyse
SearchPILossmasked policy/value/Q/rank/BC terms

The contract that makes the two planners interchangeable is RootActionEvidence types.py: for each root action the planner searched, a record (value, source, depth, via_transposition, quality) in value-space. Sources are strings like terminal, terminal_non_optimal, completion_steepest, estimate_at_depth_1; qualities form the ladder that the rest of the system keys on:

QualityMeaningProduced byLoss weight
exactProven: an optimal terminal was reached through this actionterminal hit in search1.00
boundedVerified dead end / limit (non-optimal terminal)infeasible · unbounded · iteration limit0.75
rolloutTerminal-grounded via a classical rule finishing the suffixcompletion (SGSPI) · classical leaf rollout (SGTS)0.75
approximateNetwork value bootstrap — could be optimistic fictionVθ at a search leaf0.25
failedCycle, oracle replay errorerror paths0.00

Quality ladderquality.py:11–18. Verified means the first three rungs (evidence_is_verified, targets.py:741). This split — verified vs approximate — gates SGSPI's improvement operator (§5.5) and stratifies replay sampling (§8).

One more shared piece: root action proposals proposals.py. Neither planner ever considers all legal candidates (often hundreds). Both build a proposal set as a union of sources — Gumbel-top-k samples from a tempered policy, top-k by the network's Q head, classical heuristic anchors, the incumbent, a random tail — with each chosen action tagged by where it came from (source_tags_by_var_id, visible in WandB as root_source_tag_*). How that union is configured is one of the main practical differences between the two methods.

3The toolbox: four techniques you need first

SGTS and SGSPI are both assembled from a small set of techniques that don't appear elsewhere in this repo. Each gets a compact treatment here; the method sections then read as composition.

3.1The Gumbel-max / Gumbel-top-k trick

Take a categorical distribution \(\beta\) with logits \(\log\beta_i\). Draw one i.i.d. Gumbel per action, \(g_i = -\log(-\log U_i)\) with \(U_i\sim\mathrm{Uniform}(0,1)\). Then:

$$\arg\max_i\,\big(\log\beta_i + g_i\big)\ \sim\ \beta \qquad\text{and}\qquad \operatorname*{arg\,top-}k_i\,\big(\log\beta_i + g_i\big)\ \sim\ \text{sampling }k\text{ without replacement from }\beta$$
the Gumbel-max trick and its top-k extensionproposals.py:92 · gumbel_topk_without_replacement

Why a planner cares, beyond "it's a clean way to sample a diverse subset":

  • One noise draw, reused everywhere. The planner draws the Gumbels once per root and keeps them. The same \(g_i\) that decided which actions got sampled later participates in ranking the survivors. This is the heart of Gumbel MuZero's policy-improvement guarantee: since \(\arg\max_i(g_i + \log\beta_i)\) is an exact sample from \(\beta\), then \(\arg\max_i(g_i + \log\beta_i + \sigma(\hat q_i))\) — adding a monotone bonus \(\sigma\) of measured action values — picks an action whose expected value is at least that of a fresh sample from \(\beta\). Search can only help, in expectation, no matter how few actions were sampled. Drop the stored Gumbels and re-noise, and that guarantee evaporates.
  • Without replacement matters. With hundreds of candidates and a budget of ~12 sampled arms, you can't afford duplicates, and you want the sample to respect \(\beta\)'s mass without being a deterministic top-k (which would never explore the tail).

In SearchPI, the distribution being sampled is not the raw policy but a mixed proposal \(\beta\) proposals.py:336 · _proposal_probs:

$$\beta \;=\; \big(1-\varepsilon_u-\varepsilon_h\big)\,\operatorname{softmax}\!\big(z/T\big)\;+\;\varepsilon_h\,\rho_{\text{heur}}\;+\;\varepsilon_u\,\mathrm{Unif}(\text{legal})$$
z = policy logits; T = policy_temperature (1.5); εh = heuristic_floor (0.25 SGTS run), ρheur = uniform over steepest-edge top-k; εu = uniform_floor (0.01)configs/2026-06-10/*.toml [planner.root_proposal]

So a quarter of the proposal mass in the current SGTS config sits on the classical steepest-edge shortlist — the planner is never fully at the mercy of a bad learned prior.

Widget · Gumbel-top-k sampler8 actions · k = 3
Each bar stacks the fixed log-proposal \( \log\beta_i \) (dark) with a fresh Gumbel draw \(g_i\) (light). The top-3 totals (outlined) are the sampled set. Hit draw ×500 and compare the tally row with \(\beta\): inclusion frequencies track the proposal — that's sampling without replacement, not greedy truncation.

3.2Sequential Halving: spending a budget to find the best arm

Once SGTS has sampled \(m\) root actions, it faces a pure-exploration bandit: with a total budget of \(N\) simulations, identify the best arm. You don't care about cumulative reward during the search (no UCB-style regret balancing); you care only about the quality of the final pick — simple regret. Sequential Halving (Karnin et al. 2013) is the standard answer:

  1. Keep a surviving set, initially all \(m\) arms. Plan \(\lceil\log_2 m\rceil\) rounds.
  2. Each round, split the round's share of the budget evenly across survivors, pull (simulate) accordingly, then eliminate the worse half by current estimated value.
  3. The last survivor is the answer. Early rounds are wide and shallow; later rounds concentrate the budget on near-finalists, which therefore have low-variance estimates exactly where it matters.

The implementation sgts.py:307 · sequential_halving follows this with two domain adaptations: an initial pass guarantees every sampled arm at least one simulation, and ranking between rounds is not by raw \(\bar q\) but by the full Gumbel MuZero score (next subsection), so elimination respects both the prior and the noise draw that selected the arms. Budget bookkeeping: round budget = remaining ÷ rounds-left, leftovers to the top-ranked arms, survivors = \(\lceil |S|/2 \rceil\).

3.3Policy improvement by planning: σ-scores and completed-Q

This is the conceptual core of SGTS, and it answers two questions at once.

Question 1 — how should search evidence influence ranking and acting? Gumbel MuZero's answer: transform measured Q-values monotonically and add them to the logits-plus-Gumbel score. SearchPI uses exactly its σ transform:

$$\hat q(a) \;=\; \frac{q(a) - \min_b q(b)}{\max_b q(b) - \min_b q(b)} \in [0,1] \qquad \sigma\big(q(a)\big) \;=\; \big(c_{\text{visit}} + \max_b N(b)\big)\cdot c_{\text{scale}}\cdot \hat q(a)$$
min–max over the actions being compared; c_visit = 50, c_scale = 1 ⇒ the best-vs-worst spread is worth ≥ 50 logitstargets.py:330–362 · sigma_transform · sgts.py:289 _sigma_bonus

Read the magnitude: with \(c_{\text{visit}}=50\), the action with the best measured Q gets a bonus ~50 logits above the worst. Gumbel noise has standard deviation \(\pi/\sqrt6 \approx 1.28\); prior logits live on a similar scale. So when search has actually measured a difference, σ dominates — the ranking becomes search-driven, with prior and Gumbel breaking ties. The \(+\max_b N(b)\) term grows the dominance as evidence accumulates. (Hold this thought: the original SearchPI configuration replaced σ with a transform whose maximum contribution was ±0.031 logits, drowning under the Gumbel noise — that's failure story §7.3.)

Question 2 — what should the policy target be when search only measured a few actions? Naive options both fail: train only on searched actions and you starve the rest of the distribution; fill unsearched actions with the network's own Q estimates and you teach the network its own guesses. The completed-Q construction threads the needle targets.py:293–327:

$$v_{\text{mix}} \;=\; \alpha\, v_\theta \;+\; (1-\alpha)\,\frac{\sum_{a\,\in\,\text{searched}} \pi(a)\,\bar q(a)}{\sum_{a\,\in\,\text{searched}} \pi(a)}, \qquad \alpha = \frac{1}{1+N_{\text{total}}}$$ $$Q_{\text{completed}}(a) \;=\; \begin{cases} \bar q(a) & a \text{ searched}\\[2pt] v_{\text{mix}} & a \text{ legal, unsearched} \end{cases} \qquad\quad \pi_{\text{target}} \;=\; \operatorname{softmax}\!\big(z \;+\; \sigma(Q_{\text{completed}})\big)$$
π = prior probs; N_total = total root visits; z = prior logits; σ as above, min–max over all legal actionstargets.py:293 compute_v_mix · :315 complete_q_values · :365 build_pi_target_logits

Three properties to internalize:

  • \(v_{\text{mix}}\) interpolates from the network's value (no evidence, \(\alpha=1\)) toward the prior-weighted mean of measured values as visits accumulate. It is the "what a typical action is worth here" baseline.
  • Unsearched actions completed to \(v_{\text{mix}}\) have zero advantage against that baseline — after min-max and σ they sit in the middle, so their target probability ≈ their prior probability. Conservative neutrality: search silence neither promotes nor punishes an action, and crucially, the Q head's opinion alone can never promote an unsearched action into the target. (Contrast with SGSPI, which instead leaves unsearched actions out of the target support entirely — two different valid answers, §6.)
  • The target is a full distribution over legal actions, so the policy CE trains all logits coherently, not just the searched subset.

The whole AlphaZero-style loop is then: act with the search-improved decision → distill \(\pi_{\text{target}}\) into the policy head → a better prior makes the next search better. The fixed point is a policy that search can no longer improve.

3.4Best-first search with a learned, inadmissible heuristic

For SGSPI, the relevant classical algorithm is A*: maintain a priority queue (the frontier) over discovered nodes, scored by

$$f(n) \;=\; \underbrace{g(n)}_{\text{exact pivots root}\to n} \;+\; \alpha_h\cdot \underbrace{\operatorname{clamp}\!\big(-V_\theta(n),\,0,\,h_{\max}\big)}_{h_\theta(n)\;=\;\text{estimated pivots } n\to\text{optimal}}$$
α_h = 1.0, h_max = 200 in the current config; the learned value head is the heuristicsgspi.py:1511 · :1945 _heuristic

and always expand the lowest-\(f\) node. With an admissible heuristic (never overestimates remaining cost), A* expanding a goal proves optimality. Two things break that purity here, both deliberate:

  • \(h_\theta\) is a neural estimate — not admissible. So frontier_bound_stop (halt when the best frontier \(f\) is no better than the best terminal found) is a heuristic stopping rule, not a proof. SGSPI is "bounded best-first," not exact A*.
  • The state space is a graph with massive transposition structure: different pivot orders reach the same basis. A search that treats it as a tree re-pays for repeated states exponentially. So SGSPI keeps a transposition table keyed by canonical basis identity (StateKey / SearchStateKey, state_key.py): revisits with worse-or-equal \(g\) are pruned, revisits with strictly better \(g\) reopen the node. The subtle obligation this creates — evidence discovered through a transposition must still be credited to every root action that reaches it — is §5.3.

Why does best-first make sense here when SGTS exists? Because in a deterministic shortest-path problem, value information is shared across root actions: if actions \(a\) and \(c\) both funnel into the same successor basis two pivots deep, one discovery prices both. Sequential Halving treats arms as independent and can't exploit that; a graph frontier gets it for free. That trade — independent-arm statistics vs shared-descendant evidence — is the deepest difference between the two methods.

4SGTS — Sampled Gumbel Tree Search

SGTS is the conservative, theory-backed planner: a Gumbel MuZero root over exact simplex branches. Configuration anchors for this section come from the live experiment config configs/2026-06-10/sgts-sigma-01.toml: root_sample_size = 12, root_simulations = 96, max_depth = 4, nonroot_sample_size = 4, c_visit = 50, c_scale = 1, leaf evaluation by pure steepest-edge rollout.

4.1Anatomy of one root decision

Everything happens inside SGTSPlanner.search() sgts.py:456. Walking it in order:

  1. One fused network call at the root. policy_value_q(root_obs) returns logits \(z\), value \(v_\theta\), and Q estimates in a single encoder pass sgts.py:467. (Q is used by proposals if k_q>0; the tree statistics themselves never trust it.)
  2. Neighborhood mask. An optional pre-filter restricts legal actions to a near-classical neighborhood — in the current config, the steepest-edge top-6 plus a budget of at most 1 "deviation" from that shortlist along any root-to-leaf path ([planner.sgts.neighborhood], mode top_k_deviation_budget). This keeps the search inside a region where the warm-started network is trustworthy.
  3. Mixed-β proposal, Gumbel-top-k sample. Build \(\beta\) (§3.1) and sample root_sample_size = 12 actions without replacement sgts.py:485–508. Note the override: SGTS forces k_prior = root_sample_size sgts.py:1512 _effective_root_proposal_config — the sampled set is the arm set. For each sampled var-id it stores the score \(g(a) + \log\beta(a)\) for later ranking sgts.py:530.
  4. Race the arms with Sequential Halving (§3.2), total budget root_simulations = 96. Each evaluate(a) call simulates root action \(a\) once (next subsection) and returns the running mean \(\bar q(a) = \frac{\text{q\_sum}(a)}{N(a)}\) sgts.py:1166. Ranking between rounds scores survivors by $$\text{score}(a) = g(a) + \log\beta(a) + \sigma\big(\bar q(a)\big)$$ with ties broken by visit count, then Gumbel sgts.py:352–377 rank_actions. The min–max inside σ is taken over the current survivor pool, and \(\max_b N(b)\) is the pool's max visit count — dominance grows as the race narrows.
  5. The winner is the action that gets executed (in action_selection = "planner" mode, fix A6): \(\arg\max\) of the same score among final survivors — exactly Gumbel MuZero's acting rule \(\arg\max\, g + \text{logits} + \sigma(\hat q)\).
  6. Targets. Per-arm running means become RootActionEvidence; CompletedQTargetBuilder builds \(v_{\text{mix}}\), completed-Q, and \(\pi_{\text{target}}\) (§4.3). The sample is shipped to replay.

A budget-shape intuition for 12 arms / 96 sims: every arm gets one scouting sim, then rounds of roughly 21 → 21 → 21 → 21 sims are split over 12 → 6 → 3 → 2 survivors — the finalists end up with ~15–20 sims each while early discards cost only 1–3. That asymmetry is the entire point of halving.

What one "simulation" of a root arm does

Each evaluation descends an explicit tree of _SGTSNodes sgts.py:134, each holding the solver state, its own network outputs, per-action visit/value statistics, and the replay prefix + checkpoint needed to regenerate it:

  1. Step the root action through the oracle (cached after the first time — root_step_cache). If the pivot terminates the LP: optimal/infeasible/unbounded leaves are worth their terminal value (0 for optimal, so \(q = -\Delta_{\text{iters}} + 0\)); iteration/time limits get a pessimistic −iteration_hint; the evidence source/quality is set accordingly (terminal/exact or terminal_non_optimal/bounded) sgts.py:1674–1698.
  2. Descend up to max_depth further pivots. At each non-root node, lazily sample a small child set (nonroot_sample_size = 4, Gumbel from the child proposal) and pick among sampled children deterministically (§4.4). A per-simulation visited set of state keys detects cycles; revisiting a basis poisons that path with \(q = -10^6\) (quality failed) sgts.py:811–822.
  3. Evaluate the leaf when depth runs out (or candidates do). Three modes sgts.py:661 leaf_value: terminal value; network \(V_\theta\); or mixed_rollout — run a classical pivot rule from the leaf for up to max_depth × 20 steps and blend: $$v_{\text{leaf}} = (1-w)\cdot v_{\text{rollout}} + w\cdot V_\theta, \qquad w = \texttt{leaf\_mixing\_weight}$$ The live config sets w = 0 with rule steepest — leaves are valued by pure steepest-edge rollout (capped at 80 pivots, then \(V_\theta\) at the capped state). SGTS leaf values are therefore mostly terminal-grounded, the same medicine SGSPI takes via completions.
  4. Back up. Values flow up the recursion as \(q_{\text{parent}}(a) = -\Delta_{\text{iters}} + v_{\text{child}}\) sgts.py:878–891, accumulating into visits[a] / q_sum[a] at every node along the path sgts.py:1625 _update_node_stats. No max-backup, no UCT — plain running means, consumed by σ at the root.

One optimization worth knowing: when max_depth ≤ 1 with network leaves, the planner pre-steps all sampled root actions and batch-evaluates the children in a single network call (prewarm_depth_one_network_leaves sgts.py:996) — the entire search collapses to one oracle sweep plus two network calls.

4.2The race, live

Widget · Sequential Halving race8 arms · 28 sims · c_visit=50
Eight sampled arms with hidden true values \(q^*\) (revealed at the end). Each simulation observes \(q^* + \text{noise}\); bars show the running mean \(\bar q\) with one dot per visit. The score column is \(g + \log\beta + \sigma(\bar q)\) over the surviving pool — watch σ (the colored share) overwhelm prior + Gumbel as visits accumulate, and notice how a lucky early arm can survive round 1 but rarely wins the final: later rounds buy the precision that matters. The executed action and the policy target both come from this winner's statistics.

4.3From statistics to targets: v_mix, completed-Q, σ

After the race, CompletedQTargetBuilder targets.py:96 assembles the training sample exactly as §3.3 described: candidate-aligned \(\bar q\)/visit tensors from the evidence map → \(v_{\text{mix}}\) → completed-Q over all legal actions → \(\pi_{\text{target}} = \operatorname{softmax}(z + \sigma(Q_{\text{completed}}))\). The widget below is that computation — drag the evidence and watch the operator respond.

Widget · Completed-Q target labtargets.py:293–438
transform:
Six legal actions; the prior favors a₁. Toggle which actions were searched and drag their measured \(\bar q\) (pivots-to-go, so −22 ⇒ 22 more pivots). Unsearched actions are completed to \(v_{\text{mix}}\) (dashed line) — note their target probability barely moves from the prior no matter what the searched arms do: that's conservative neutrality. Then hit legacy advantage/τ: the pre-fix transform (advantage clipped to ±0.25, divided by τ=8 ⇒ max ±0.031 logits) and watch the target collapse onto the prior — §7.3's inert operator, reproduced live.

4.4Below the root: sampled descent, not UCT

Non-root nodes don't run their own halving race (budget wouldn't survive it). Instead, _select_nonroot_action sgts.py:1324 implements Gumbel MuZero's deterministic in-tree policy: at each node, compute the same completed-Q policy improvement locally over the node's sampled children, then pick

$$a^\dagger \;=\; \operatorname*{arg\,max}_{a\,\in\,\text{sampled}} \;\Big[\, \pi'_{\text{node}}(a) \;-\; \frac{N(a)}{1+\sum_b N(b)} \,\Big]$$
π′ = local completed-Q improved policy at the node; the visit-ratio penalty makes repeated descents spread over children in proportion to π′sgts.py:1410–1421

This rule has a neat property: over successive visits it allocates child visits to approximate π′ itself (it's a derandomized sampling of the improved policy), so deeper search refines exactly the branches the improved policy would actually take. Child proposals there are small (4 Gumbel samples from the tempered child policy), every node carries its own cycle-checked state, and depth is capped at 4 pivots — SGTS in this repo is a local improvement operator: look a handful of pivots ahead, with terminal-grounded steepest rollouts beyond that horizon.

What SGTS does not do

No transposition table — only per-simulation cycle detection. Two simulations reaching the same basis through different prefixes pay twice. The design doc treats this as intentional: a global table would change the arms' budget accounting and break Sequential Halving's independent-arm statistics. The corollary is that SGTS wastes budget on shared structure — which is precisely the inefficiency SGSPI is built around (§3.4, §6).

Finally, the value target. The search itself doesn't set one — SearchTargets.value_target passes through as None for SGTS — so the scalar value supervision is resolved later by the loss's priority chain (search → A* → MC return, loss.py:517 resolve_value_targets), which for SGTS means the realized Monte-Carlo return of the collected episode (or an injected A* label when available). This is the trap the design doc warns about by name: \(v_{\text{mix}}\) is the improvement baseline, and using it as the value target would teach the value head its own blend — a different (and wrong) method.

5SGSPI — Simplex Graph Search Policy Iteration

SGSPI starts from the observation that SGTS's bandit abstraction throws away the structure of this domain. Phase-2 pivoting is a deterministic shortest-path problem on a graph of bases; root actions are not independent arms — they share descendants. So: run a bounded best-first graph search, let any discovery (a terminal! a cheap completion! a transposition into known territory!) flow back to every root action that can reach it, and distill the per-root-action evidence directly. Config anchors from configs/2026-06-10/sgspi-fidelity-01.toml: root_node_budget = 64, max_depth = 0, proposals k_prior = 24 / k_q = 8 / k_rand = 4, alpha_h = 1.0, h_max = 200, target_tau_policy = 0.5, steepest completion pinned at probability 1.0.

5.1Depth semantics and the root-only regime

One config gotcha before the loop: SGSPI's max_depth counts expanded depths, and children one level deeper still get evaluated. So max_depth = 0 — the live setting — means: expand only the root (propose ~36 actions, pivot each once through the oracle), evaluate every depth-1 child with the network and/or a completion, and stop. The frontier machinery, transpositions, and deeper expansion all activate when max_depth ≥ 1 (this is the "tree search mode" lineage of recent commits — and note merge_by_state_key = false exists to key nodes by path instead of basis, turning the graph search into a pure tree for debugging/ablation). The explainer walks the general loop; keep in mind the production run is currently the depth-0 special case where SGSPI is essentially "one exact pivot per candidate + steepest completion from each child, then pick the best total." That simple procedure is the thing that beat the steepest baseline by 4.3 pivots at update 0 (§7.1).

5.2The search loop

Inside SGSPIPlanner.search() sgspi.py:681:

  1. Root setup. Fused network call; capture a root checkpoint; compute the root's canonical StateKey/SearchStateKey; push it on a heap keyed by \(f\) with an insertion-order tiebreaker sgspi.py:752–758.
  2. Pop the lowest-\(f\) node. Skip it if the transposition table already closed this basis at an equal-or-better \(g\); record terminals (an optimal terminal updates best_terminal_cost); skip nodes already costlier than the best terminal; skip nodes past max_depth sgspi.py:1038–1056. Stop on node budget (64 expansions), wallclock (40 s), or — with frontier_bound_stop — when the frontier's best \(f\) can no longer beat the best terminal found.
  3. Propose actions. At the root: the full union (Gumbel-top-k of tempered policy ∪ Q-top-k ∪ random tail; the steepest anchor arrives via the proposal's heuristic mixing), each tagged with its source. At deeper nodes: a slimmer union (k_child = 6 prior + k_child_q = 4) sgspi.py:1072–1120. If every proposed root action fails oracle replay, a fallback batch expands all remaining legal slots — the root must end up with evidence.
  4. Pivot each proposed action through the oracle. Terminal children immediately produce evidence (next subsection). Nonterminal children are batch-evaluated in one network call sgspi.py:1319–1331 · 1759 _evaluate_children (full policy/value/Q forward only if the child can itself be expanded; value-only otherwise — the depth-0 fast path).
  5. For each nonterminal child: compute its keys; check the transposition table (discovered_best_g); record the bookkeeping that drives evidence flow — its arrival (which root action reached it, at what \(g\), whether via a transposition), its predecessor edge, and its state evidence (a completion result if one ran, else the network estimate); then, unless pruned as a worse-path duplicate, push it with \(f = g + \alpha_h\,h_\theta\) sgspi.py:1370–1550. A strictly better \(g\) to a known basis reopens it (reopen_count).
  6. Wrap up. Root evidence map → selection → target builder → replay sample, with a descriptor recording why search stopped (terminal_found / budget_exhausted / frontier_bound_stop / depth_limited).
Widget · Best-first graph search, step by stepdemo: max_depth=1 · budget=3 · α_h=1
Frontier (priority queue)
Root-action evidence
A hand-sized instance: root R with three proposed actions. Watch four things. (1) The frontier always pops the lowest \(f = g + h_\theta\) — the optimistic bootstrap at C jumps the queue. (2) Completions (◎) convert leaves into rollout evidence; C's completion fails, leaving only an approximate estimate. (3) The pause at step 3 shows the fidelity gate: ungated argmax would pick c on bootstrap fiction; the gate picks within verified evidence. (4) When A is expanded, its child D is a transposition — already discovered via c — so the duplicate is pruned, but the arrival is still recorded and D's completion evidence flows to root action a too (the ⇄ flag).

5.3Evidence: discovery, valuation, and propagation

Everything SGSPI learns from search is funneled into per-root-action evidence, in plain pivot units. The valuation rules sgspi.py:1212–1501:

Discovery at a node with cost gRoot-action valueSource / quality
Optimal terminal−gterminal · exact
Non-optimal terminal (infeasible/unbounded/limit)−(g + αh·hmax)terminal_non_optimal · bounded
Classical completion finishes in k pivots from the node−(g + k)completion_steepest · rollout
Completion hits a verified dead end after k pivots−(g + k + αh·hmax)terminal_non_optimal · bounded
Network estimate only−(g + αh·clamp(−Vθ, 0, hmax))estimate_at_depth_k · approximate

Evidence valuation — the hmax=200 penalty makes verified dead ends rank below any plausible completion; the estimate row is the only one whose value can be fiction.

When the same root action accumulates several pieces of evidence, the better one wins by a strict ordering sgspi.py:1949–1993: source priority first (terminalcompletion_*terminal_non_optimalestimate), then higher value, then shallower depth, then non-transposed provenance. A bootstrap estimate can never displace completion- or terminal-backed evidence for the same action, no matter how optimistic.

Propagation through the graph — the load-bearing part

In a graph search, evidence rarely appears where it's needed. A completion runs at some depth-2 state \(D\); the root needs to know what that implies for root actions \(a\) and \(c\), both of which reach \(D\). SGSPI maintains three index structures and a fixpoint-ish propagation sgspi.py:815–994:

  • state_arrivals[key] — every (root action, g, via-transposition) tuple that has reached this basis. Recorded before transposition pruning, which is exactly why pruned duplicate paths still earn their root action credit.
  • state_evidence[key] — the best known suffix evidence at this basis: (suffix_cost, suffix_depth, source_kind). A completion at the node sets suffix_cost = k; a network estimate sets suffix_cost = αh·hθ.
  • predecessors[key] — parent edges, so suffix evidence can also roll upward: a parent's suffix is \(1 + \) child's suffix sgspi.py:930 propagate_state_evidence_to_parent.

Whenever a state's evidence improves, it re-fires across all recorded arrivals (updating root targets as \(-(g_{\text{arrival}} + \text{suffix})\)), across state-action arrival edges (feeding the consistency targets for child-state Q supervision), and recursively to predecessors. The via_transposition flag is carried along the whole way and lands in the final evidence — diagnostics can tell exactly how much of the target was earned through merged paths. The design doc flags this machinery with an explicit warning: collapse it to a naive state → evidence map without arrival/predecessor records, and root attribution through transpositions silently breaks.

5.4Classical completion: buying ground truth at the leaves

The single most consequential design choice in SGSPI. A learned value head — especially early — is unreliable; but this domain offers an oracle of a different kind: let steepest edge finish the job and count the pivots. The ClassicalCompletionOracle sgspi.py:247 replays prefix + path-to-leaf, then runs the configured rule (steepest) to termination with caps (max_pivots = 1000, timeout_ms = 20000), backed by an LRU cache. Outcomes map to the evidence table above; truncations and timeouts fall back to the network estimate.

Two subtleties make this more than "just use steepest":

  • It evaluates first actions, not the classical rule's own choice. The evidence for root action \(a\) is "take \(a\) first, then let steepest finish." If some non-steepest first pivot leads to a basin where steepest's suffix is shorter, completion finds and certifies that — this is exactly the headroom the search exploits to beat the steepest baseline from the warm start. The improvement is real, verified, and immediately distillable.
  • It is a target-evidence source, not an actor override. The deployed/raw policy never calls steepest; completions only shape what the network is trained toward.
Why probability 1.0, pinned

Completion probability is an episode-driven schedule (completion_probability_for_episode sgspi.py:183). The original configs decayed it — wean off the crutch as the value head improves. §7.2 shows what actually happened: each decayed leaf became a coin flip between truth and optimism, and optimism won the argmax. The fixed config pins probability at 1.0 (decay_episodes = 0) and adds an eval-side override so eval/search/* always measures completion-on search. Weaning is deferred until dist_gap/v_error_mae sits at ~1–2 pivots, and then only with pessimistic value estimates.

5.5Gated searched-Q targets

SGSPI's target builder, SearchedQSoftmaxTargetBuilder targets.py:207, is philosophically opposite to completed-Q: only actions search actually valued get supervised; unsearched actions are undefined, not neutral. The construction, in order:

$$S \;=\; \begin{cases} \{a : \text{evidence}(a)\ \text{is verified}\} & \text{if any verified evidence exists} \\[2pt] \{a : \text{any evidence}\} & \text{otherwise — all-bootstrap, sample labeled approximate} \end{cases}$$ $$\pi_{\text{target}}(a) \;=\; \operatorname{softmax}_{a\in S}\!\big(q(a)/\tau_\pi\big), \qquad v_{\text{target}} \;=\; \textstyle\sum_{a\in S}\pi_{\text{target}}(a)\,q(a)$$
the fidelity gate (A2) then softmax at τ_π = 0.5; finally an optional uniform floor ε=0.01 over all legal actions: π ← (1−ε)π + ε/|legal|targets.py:233–265

Unpack each piece:

  • The gate (fix A2). Bootstrap evidence still flows to the Q-head target (with its 0.25 action-quality weight) — the Q head should keep learning from everything — but it is barred from the improvement operator: it cannot define the policy softmax or the value target while any verified evidence exists. The planner's executed action obeys the same rule (_initial_selected_var_id sgspi.py:2091: greedy over verified first; and the final selection is re-derived as argmax(policy_target) after gating sgspi.py:1680). The principle generalizes: the learned heuristic may order the search (f = g + αh), but it may not win the target.
  • τ = 0.5 is sharp by design. Values are pivot counts; a 2-pivot gap yields a probability ratio of \(e^{2/0.5} = e^4 \approx 55\). The target is near-one-hot on the best verified action unless alternatives are within ~a pivot. Try it: Δq = 2.0 pivots, τ = 0.5 ⇒ ratio 54.6×
  • The value target is the search's own appraisal — the softmax-expected Q over the support, ≈ the best verified value at this τ. Unlike SGTS there's no MC-return dependence: search appraises, replay stores, reanalyse refreshes.
  • Consistency targets. From the state-action evidence recorded at expanded interior states, the builder also extracts per-child-state Q targets (ConsistencyTargets targets.py:611) — a Bellman-consistency signal for the Q head at successor states. Wired but currently disabled in the live config (consistency_coeff = 0).

5.6Reanalyse: the policy-iteration flywheel

Replay samples store everything needed to reconstruct their root — problem index, phase-2 prefix, state key, the original observation. Reanalyse reanalyse.py periodically picks stored samples (priority = weakest targets first), replays the prefix through a ReplayOracle, verifies the state key and observation match what was stored (fail loudly, never silently relabel), re-runs search with the current network and budget, and replaces the targets in place while preserving replay identity. Old experience keeps getting re-appraised by an ever-better searcher — MuZero Reanalyze, adapted to exact dynamics. This is what makes SGSPI a genuine policy-iteration loop rather than one-shot target generation. (Both 2026-06-10 configs ship with it disabled while the A1–A8 fixes are validated; the machinery and its strict verification discipline are the part to know about.)

6SGTS vs SGSPI, side by side

DimensionSGTSSGSPI
Search objectSampled tree from the rootBounded graph from the root
Budget logicSequential Halving over sampled arms (simple regret)Best-first frontier by f = g + αhhθ
Root question"Best arm in this sample?""Cheapest verified path out, and its first step?"
Repeated statesPer-simulation cycle check only; duplicates cost twiceMandatory transposition table; prune / reopen; shared-descendant evidence
Unsearched actionsCompleted to vmix — neutral in target, prior preservedOutside target support — undefined (ε uniform floor only)
Policy targetsoftmax(prior logits + σ(completed-Q))softmax(searched Q / τ) on fidelity-gated support
Value targetNone at search time → MC return / A* via loss prioritySearch's own appraisal Σπq; refreshed by reanalyse
Ground truth at leavesClassical rollout leaf evaluator (steepest, w=0)Classical completion per leaf, prob. 1.0, cached
Trust in network Q headProposals only (k_q=0 in live config) — never in tree statsProposals (k_q=8) + ordering; gated out of targets
Improvement guaranteeGumbel MuZero expected-improvement argument at the rootNone formal — hθ inadmissible; verified evidence is the safeguard
Replay profileStep replay; uniform samplingRich replay: state-key index, quality buckets, stratified sampling, in-place target replacement
Failure mode observedInert improvement operator (Δ search−raw ≡ 0) — §7.3Winner's-curse collapse from bootstrap optimism — §7.2
Cost profile~4.5 s / root decision at 96 sims, depth 4Cheap at depth 0; completions cached; collapse was never a cost problem

When should each win? SGTS is the safer operator: its target construction is conservative, its acting rule carries an expected-improvement argument, and it never lets any estimate — network or otherwise — promote an action search didn't measure. It is the right shape when branching is huge, budgets are small, and you want robust local improvement. SGSPI is the domain-shaped operator: it exploits determinism, transpositions, and purchasable ground truth to produce sharply better-than-baseline decisions immediately, and its rich replay + reanalyse make it a true policy-iteration loop. It is also the one with more moving parts and more ways to silently break — evidence attribution, replay identity, fidelity gating are all load-bearing. The 2026-06 experiments effectively ask: can SGTS be made alive (A1/A6) and SGSPI be made safe (A2/A3/A4) without losing what made each attractive?

7The failure story — and why the fixes are what they are

The 2026-06-04 WandB post-mortem docs/wandb-searchpi-analysis-2026-06-04.md covers ten runs; its two headline phenomena are worth studying because each is a clean lesson about search-based RL, taught by this exact codebase.

7.1What the runs showed

Figure · The SGSPI collapserun h1b2rc6g · eval/search vs eval/raw
Return ≈ −pivots (higher is better). Solid markers are logged evals from h1b2rc6g (sgspi r3-long): search starts at −26.90, clearly above steepest (−31.15), holds through update 200, then degrades monotonically to −61.80 by update 1400 — worse than Dantzig. The dotted segment interpolates the documented monotone fall between logged points. Raw policy (teal, flat ≈ −31.5) never collapses; total training loss fell from 52 to 13 the whole way. Every fresh SGSPI run reproduced this; best checkpoint was at update 0 or 200 in all of them.
  • SGSPI: search eval at update 0 ≈ −26.8 vs steepest −31.15 and raw ≈ −32 — the search procedure, on day one, beat the best classical baseline by ~4.3 pivots. Then, in every fresh run (h1b2rc6g, mra6a2vo, 59v4ci71, msaxwies, wffdti4l, nk703lou), it collapsed to −43…−62 within 600–1400 updates while losses improved and the raw policy stayed flat. Collapse began before greedy collection ramped (greedy probability still 0 at update 400 in r3-long) — the planner/target feedback itself was the cause, not exploration policy.
  • SGTS: perfectly stable — and perfectly inert. eval/delta/search_minus_raw_return_mean was exactly 0.00 at every eval of both runs; the planner cost ~4.6 s per root decision for it.

The fix plan's framing docs/searchpi-fix-plan-2026-06.md: the search works; the learning loop destroys or ignores its signal. Two mechanisms, one per method.

7.2SGSPI's winner's curse

The original configs decayed completion probability over episodes. Once it dropped below 1, each leaf was a coin flip: exact steepest completion, or raw network value estimate. Both landed in the same evidence pool, and three argmax-shaped operations sat downstream of that pool:

  1. the policy target — softmax at τ=0.5 is effectively argmax (a 2-pivot edge ⇒ 55× ratio);
  2. the value target — Σπq ≈ max over the pool;
  3. eval acting — literally argmax(policy_target) at every step.

The warm-start value head was systematically optimistic (γ=0.99 discounting, §2.1) and never became accurate (value loss plateaued at ~2.5 smooth-L1 ⇒ several pivots of error). Take a max over a mixed pool of one unbiased measurement and several optimistically-biased estimates, and the biased ones win — the winner's curse. Worse, it self-reinforces: the value target ≈ the optimistic winner, so the value head is trained toward its own optimism; sharper optimism wins more argmaxes; evidence quality degrades further. The smoking gun in the logs: in all four analyzed runs, collapse began at exactly the eval where root_evidence_source_estimate_at_depth_1_count first became nonzero.

Hence the layered fix (A2): the fidelity gate in the target builder and in action selection (§5.5) makes bootstrap evidence structurally unable to win the improvement operator; completion pinned at 1.0 removes the coin flip; the eval-side override makes eval/search/* measure search capability rather than the decay schedule. And A5 (γ=1.0 IQL) attacks the optimism at its source. The general lesson transfers far beyond this repo: any argmax over evidence of heterogeneous fidelity is a winner's-curse engine; gate by fidelity before you max.

7.3SGTS's inert improvement operator

SGTS's failure was quieter: a parameterization that reduced the method to an expensive identity. The pre-fix target was \( \operatorname{softmax}(z + \text{adv}/\tau) \) where the advantage had been routed through the AdvantageNormalizer in per_state mode with advantage_scale_c = 0.25 (capping |adv| at 0.25) and then divided by completed_q_temperature = 8.0. Maximum possible logit shift: ±0.031. Recall the scales it had to compete with: prior logits O(1), Gumbel noise std 1.28, and σ's intended multiplier ≈ 50. Consequences, all visible in the logs:

  • πtarget ≈ prior ± 3% — each gradient step distilled the policy into itself; eval search action argmax(πtarget) never differed from the raw argmax ⇒ delta ≡ 0.00.
  • The same capped advantage ranked Sequential Halving — so the "winner" was, to first order, a Gumbel sample from the proposal. Measured: selected_action_evidence_rank_mean ≈ 4.2 out of ~10 arms with evidence; the race's conclusion was uncorrelated with its own measurements.

Fix A1 replaces the transform with the real Gumbel MuZero σ in all three places it must act — target construction, SH ranking, non-root selection — and Widget B (§4.3) lets you feel the difference directly. A6 then makes the collector actually execute the SH winner (action_selection = "planner") instead of sampling the (previously inert) target. A second-order lesson worth keeping: flat-but-stable can be a worse failure than collapse, because nothing in the loss curves complains. Only a metric that directly measures the improvement operator (Δ search−raw; argmax-changed fraction) could have caught it — which is what A7 adds.

7.4The fix matrix

FixMechanism repairedWatch in WandB
A1σ transform replaces advantage/τ targets.py:352SGTS improvement operator strength (×0.031 → ×~50)pi_target_argmax_changed · pi_target_logit_delta_mean_abs
A2Fidelity gate in targets + selection; completion pinned; eval override targets.py:233 · sgspi.py:2091SGSPI winner's curse — bootstrap can't win the operatorpi_target_argmax_evidence_approximate ≈ 0
A3Training replay sampling → quality_stratified replay.py · trainer.pyPre-fix "priority" reused the reanalyse formula (priority ∝ 1−quality) for training — exact samples were drawn with weight 1e-6, i.e. neverreplay_batch_sample_quality_exact_fraction > 0
A4BC-anchor decay 1.0 → 0 over 1500 updates loss configbc_coeff = policy_coeff = 1.0 pinned the policy at the 50/50 mixture of search target and frozen IQL — foreverbc_coeff_effective · loss_bc rises then falls
A5IQL warm start retrained at γ = 1.0 configs/*/iql.tomlOptimistic discounted values feeding the A2 loopdist_gap/v_error_mae
A6action_selection = "planner" collectors.pyCollector executes the search winner instead of sampling the target (modes: stochastic · greedy · annealed · planner)collect/action_planner_count
A7Operator-liveness diagnosticsMake A1/A2/A3-class failures visible in one evaldist_gap/* · evidence-source mixes
A8configs/2026-06-10/ experiment setB3 (SGTS-sigma) and B4 (SGSPI-fidelity) runs wiring all of the aboveeval/delta/search_minus_raw_return_mean

Success criteria are deliberately first-few-evals observable (compute nodes are wallclock-limited): for B3, Δ(search−raw) > 0 from the first eval and a clearly nonzero argmax-changed fraction; for B4, no collapse — eval/search/return_mean holding ≥ ≈−27 — and, the real prize, eval/raw/return_mean climbing past steepest toward the search level: the first configuration in which the raw policy has an undistorted path to absorbing the search signal.

8Training loop & implementation map

Brief by design — the methods were the point — but these are the pieces you'll touch.

The loss

SearchPILoss loss.py:80 is a sum of independently masked terms; no sample is ever forced to fabricate a label for a term it doesn't have:

$$\mathcal{L} \;=\; c_\pi\,\underbrace{\mathrm{CE}\big(\pi_\theta,\ \pi_{\text{target}}\big)}_{\text{on target support only}} \;+\; c_v\,\mathrm{sL1}\big(v_\theta, v_{\text{target}}\big) \;+\; c_q\,\underbrace{\mathrm{sL1}\big(q_\theta, q_{\text{target}}\big)}_{\text{searched actions only}} \;+\; c_{\text{rank}}\,\mathcal{L}_{\text{rank}} \;+\; c_{\text{bc}}(t)\,\mathrm{KL}\big(\pi_{\text{IQL}}\,\|\,\pi_\theta\big) \;+\; c_{\text{ref}}\,\mathcal{L}_{\text{ref-v}} \;+\; c_{\text{cons}}\,\mathcal{L}_{\text{cons}}$$
SGTS run: c = (0.75, 0.25, 0.50, 0, 0, 0, 0) · SGSPI run: c = (1.0, 0.5, 1.0, 0.25, 1.0→0 over 1500, 0.5, 0)loss.py:163 forward · configs/2026-06-10/*.toml [loss]

Details that matter: the policy CE masks logits to the target support before the log-softmax (so SGSPI's gated support is airtight, while SGTS's support is all legal actions); value targets resolve by priority chain search → A* → MC-return; the rank loss is a pairwise margin on searched-Q ordering; the BC anchor (full_kl mode) pulls toward the frozen IQL checkpoint with the A4-decayed coefficient; and every term is weighted by the quality ladder — per-sample weights from sample quality, per-action weights from action quality (quality.py; note _weighted_mean renormalizes within the batch, which is why bad sampling composition couldn't be rescued by weights alone — the A3 lesson).

dist_gap: is the network catching up to its teacher?

Logged from every update loss.py:368–372 — the panel to read before any other when judging a run:

dist_gap/policy_ce_to_targetdistillation gap; should fall while the teacher keeps moving — falling CE with collapsing eval = misaligned targets (the SGSPI signature, §7.1)
policy_top1_matches_target_argmaxwould the raw policy already make the search's decision? The raw→search convergence metric.
policy_matches_collector_actionagreement with what was actually executed (planner winner under A6)
target_entropyhow decisive the teacher is (τ=0.5 targets are sharp; rising entropy ⇒ weak/ambiguous evidence)
v_error_maevalue-head calibration in pivots — the explicit gate for ever re-enabling completion decay (~1–2 pivots)

Collection & evaluation

Collection runs sync (SearchPIEpisodeCollector) or async (AsyncSearchPICollector): 4 worker processes execute episodes (each step = one full planner search), while a central batched inference server inference.py serves all network calls (max 256 items / 2 ms batching window, root-priority queueing), with a watchdog restarting silent workers. Evaluation eval.py · process_eval.py measures both deployment modes per checkpoint — eval/raw/* (network argmax, no search) and eval/search/* (planner + argmax(πtarget) each step) — against Dantzig and steepest on a fixed problem set; eval/delta/search_minus_raw_return_mean is the one-number health check of the whole enterprise. Both 2026-06-10 runs train solver_feature_attention with the dueling Q head enabled, warm-started from IQL.

File map

File (src/simplexrl/searchpi/)Role
planners/sgts.pySGTS: proposal → Gumbel-top-k → Sequential Halving → completed-Q evidence (§4)
planners/sgspi.pySGSPI: best-first frontier, transpositions, evidence propagation, completions (§5)
targets.pyBoth target builders + v_mix / completed-Q / σ / gating math
proposals.pyMixed-β proposals, Gumbel-top-k, heuristic anchors, source tags
oracle.pyPhase-2 HiGHS sessions, prefix replay, checkpoint cache, branch cursors
state_key.pyCanonical basis identity (transpositions, replay verification)
quality.pyQuality ladders, loss weights, source→quality mapping
types.pySearchRequest / SearchResult / RootActionEvidence contracts
replay.py · dataset.pyRich replay (state-key index, quality buckets, stratified sampling) → TensorDict batches
loss.pySearchPILoss, value-target priority, dist_gap metrics
reanalyse.pyTarget refresh with strict state/observation verification
collectors.py · inference.pySync/async collection, action-selection modes, batched inference server
trainer.py · cli.py · eval.pysimplexrl-searchpi-train <toml> entrypoint, training loop, raw/search eval harness

9Glossary

candidate slot / var-id
Node-local tensor index vs durable HiGHS entering-variable id. Tensors use slots; search paths, prefixes, and evidence use var-ids; every node remaps (§2.3).
phase-2 prefix
The sequence of entering var-ids from phase-2 start that deterministically reconstructs a state via oracle replay.
state key
Canonical basis identity; powers transposition detection and replay/reanalyse verification.
root action evidence
Per-root-action record (value, source, depth, via_transposition, quality) — the planner-agnostic currency the target builders consume.
verified / approximate
Evidence fidelity split: terminal- or completion-backed (exact/bounded/rollout) vs network bootstrap (approx). Verified gates the improvement operator (A2).
v_mix
SGTS's improvement baseline: visit-weighted blend of network value and prior-weighted searched-Q mean. Never a value target.
completed-Q
SGTS target construction: searched actions keep measured values, unsearched legal actions are filled with v_mix (zero advantage ⇒ neutral).
σ transform
(c_visit + max-visits)·c_scale·q̂ with q̂ min-max normalized — the monotone value bonus added to logits for ranking, acting, and targets.
searched-Q softmax
SGSPI target construction: softmax(q/τ) over the (gated) searched support only; unsearched actions undefined.
completion
Evaluating a search leaf by letting a classical pivot rule (steepest) finish the suffix and counting pivots — terminal-grounded evidence on demand.
transposition / reopening
Reaching a known basis by another path. Worse-or-equal g ⇒ pruned (arrival still credited); strictly better g ⇒ node reopened.
reanalyse
Re-running search from stored replay roots with the current network to refresh targets in place, under strict identity verification.
raw vs search eval
Deploy the bare network argmax vs deploy planner-amplified decisions. Their delta measures whether search still has anything to teach.