SearchPI
How SGTS and SGSPI turn an exact LP solver into a policy-improvement operator — and what happens when the loop feeding search back into the network goes wrong.
1The big idea
Everything you already train in this repo — PPO, GRPO, IQL — improves the policy through gradients on sampled experience. SearchPI improves it through planning. At each pivot decision during collection, instead of just sampling the network's policy, a planner spends a budget of real solver queries exploring what each candidate entering variable would actually lead to. The planner's conclusion — "from this basis, variable 17 is genuinely the best pivot, and here's how good each alternative looked" — is then converted into supervised targets, stored in replay, and distilled back into the network. That's the PI in SearchPI: policy iteration, where the improvement step is a search procedure and the evaluation/distillation step is plain supervised regression.
This is the AlphaZero family of algorithms, but the domain twists it in a way worth internalizing before any details:
- Transitions are exact and cheap. A pivot step queried through the HiGHS oracle is deterministic and costs far less than a network forward pass. Nobody learns a dynamics model here (that would be MuZero); the real solver is the planning model. The network only supplies guidance: policy logits, a value estimate, optionally per-candidate Q estimates — all from one fused
policy_value_qcall. - The problem is a deterministic shortest-path problem. Reward is −1 per pivot, terminal at optimality. So "act well" literally means "take the shortest pivot path to the optimal basis," and search methods for shortest paths (best-first / A*) become natural competitors to MCTS.
- Ground truth is purchasable. From any basis you can run a classical pivot rule (Dantzig, devex, steepest edge) to termination and observe the exact remaining pivot count under that rule. This gives search access to verified, terminal-grounded value evidence that most RL domains never have — and the distinction between verified and network-estimated evidence turns out to be the single most important idea in the whole system (§5.5, §7.2).
SearchPI contains two planners that share all of this infrastructure but answer differently the question "how should a limited search budget be spent?":
A simplex specialization of Sampled MuZero + Gumbel MuZero ("Policy improvement by planning with Gumbel," Danihelka et al. 2022). It samples a small subset of root actions with the Gumbel-top-k trick, races them with Sequential Halving on exact solver branches, and builds completed-Q policy targets: searched actions get their measured values, unsearched actions get a neutral baseline. Optimizes simple regret at the root — pick the best next pivot among a sample.
Treats phase-2 pivoting as deterministic shortest-path on a (cyclic) graph of bases and runs a bounded best-first graph search with the learned value as heuristic: f = g + α·hθ. It merges transpositions, attributes discovered evidence back to every root action that can reach it, optionally completes leaves with a classical pivot rule to get terminal-grounded values, and distills searched-Q softmax targets gated by evidence fidelity.
A useful slogan for the whole document: SGTS asks "which of these sampled arms is best?"; SGSPI asks "what is the cheapest path out of here, and which first step starts it?" Both then face the same second problem — turning what search found into training targets that improve the network without poisoning it. The empirical history (§7) shows that this second problem, not the search itself, is where both methods initially failed: SGSPI's search was excellent on day one and trained itself into collapse, while SGTS's improvement operator was accidentally parameterized into a mathematical no-op.
2The decision problem & shared machinery
Four pieces of shared infrastructure shape everything both planners do. They're worth getting precise before the algorithms, because most of the subtle correctness constraints in the codebase live here.
2.1Values are negative pivot counts
During phase 2, each state is the current simplex basis for a fixed LP; an action chooses one legal entering variable; the transition is one exact HiGHS pivot; reward is −1 per pivot (0 at the optimal terminal). With no discounting, the optimal value function is simply:
where \(\Delta_{\text{iters}}\) is the iteration count HiGHS reports for the step (almost always 1). Both planners, both target builders, the replay buffer, and the loss all speak this unit — raw pivots, undiscounted. Keep five quantities mentally separate, because conflating them is the documented way to corrupt the method: root_value (network V at the root), root_q_values (network Q head at the root), q_target (search-derived per-action evidence), value_target (scalar supervised target), and mc_return (the realized episode suffix). And one more for SGTS: v_mix is a policy-improvement baseline, never a value target (§4.3).
The original IQL warm-start checkpoint was trained with γ=0.99, so its value head predicts discounted returns: at 35 true remaining pivots it says ≈ −29.6; at 100 it says ≈ −63. Systematically optimistic — and optimism is precisely the failure mode that argmax-driven search amplifies (§7.2). Fix A5 retrains IQL at γ=1.0 so warm-start values share SearchPI's units.
2.2The phase-2 oracle: search by replaying prefixes
HiGHS doesn't expose "clone this solver state." What it does support, through the pivot-callback machinery you already know from SimplexPivotEnv, is deterministic replay: from the phase-2 start of a given problem, feeding the same sequence of entering variable ids reproduces the same basis. So search identifies any state by
The oracle (Phase2Oracle, searchpi/oracle.py) maintains pooled HiGHS sessions plus a budgeted checkpoint cache of basis snapshots, so "step once from this node" usually restores a nearby checkpoint and replays a short suffix rather than replaying from scratch. Planners receive checkpoint_id hints in every step result and thread them into child nodes. SGTS additionally caches each root action's first step (root_step_cache) so re-simulating the same root arm doesn't repeat the pivot. The practical consequence: tree/graph search over solver states is feasible, but each node expansion is a real solver interaction — search budgets are counted in oracle queries and wallclock, and the metrics (oracle_actions_replayed_sum, checkpoint_*) exist to keep replay amplification visible.
2.3Candidate slots vs var-ids — the #1 bug source
Every observation exposes candidates as a tensor of up to max_candidates slots, and slot indices are node-local: after one pivot, the same HiGHS variable can sit in a different slot, or vanish from the candidate list entirely. The codebase therefore enforces a strict identity discipline, called out in the design doc as the origin of "most subtle target bugs":
- Candidate slots index tensors aligned to one specific observation (logits, masks, targets).
- Internal entering-variable ids name actions durably: search paths, replay prefixes, root-action evidence, successor metadata.
- Every node remaps explicitly (
legal_var_to_slot); target builders construct candidate-aligned tensors from var-id keyed maps and fail loudly if evidence names a var-id that isn't legal at the root.
When you read any SearchPI code, this is why dictionaries are keyed by int(var_id) everywhere while tensors appear only at the boundary.
2.4The shared pipeline
Both methods are presets over one framework (searchpi/types.py · targets.py · replay.py · loss.py · trainer.py):
The contract that makes the two planners interchangeable is RootActionEvidence types.py: for each root action the planner searched, a record (value, source, depth, via_transposition, quality) in value-space. Sources are strings like terminal, terminal_non_optimal, completion_steepest, estimate_at_depth_1; qualities form the ladder that the rest of the system keys on:
| Quality | Meaning | Produced by | Loss weight |
|---|---|---|---|
| exact | Proven: an optimal terminal was reached through this action | terminal hit in search | 1.00 |
| bounded | Verified dead end / limit (non-optimal terminal) | infeasible · unbounded · iteration limit | 0.75 |
| rollout | Terminal-grounded via a classical rule finishing the suffix | completion (SGSPI) · classical leaf rollout (SGTS) | 0.75 |
| approximate | Network value bootstrap — could be optimistic fiction | Vθ at a search leaf | 0.25 |
| failed | Cycle, oracle replay error | error paths | 0.00 |
Quality ladder — quality.py:11–18. Verified means the first three rungs (evidence_is_verified, targets.py:741). This split — verified vs approximate — gates SGSPI's improvement operator (§5.5) and stratifies replay sampling (§8).
One more shared piece: root action proposals proposals.py. Neither planner ever considers all legal candidates (often hundreds). Both build a proposal set as a union of sources — Gumbel-top-k samples from a tempered policy, top-k by the network's Q head, classical heuristic anchors, the incumbent, a random tail — with each chosen action tagged by where it came from (source_tags_by_var_id, visible in WandB as root_source_tag_*). How that union is configured is one of the main practical differences between the two methods.
3The toolbox: four techniques you need first
SGTS and SGSPI are both assembled from a small set of techniques that don't appear elsewhere in this repo. Each gets a compact treatment here; the method sections then read as composition.
3.1The Gumbel-max / Gumbel-top-k trick
Take a categorical distribution \(\beta\) with logits \(\log\beta_i\). Draw one i.i.d. Gumbel per action, \(g_i = -\log(-\log U_i)\) with \(U_i\sim\mathrm{Uniform}(0,1)\). Then:
Why a planner cares, beyond "it's a clean way to sample a diverse subset":
- One noise draw, reused everywhere. The planner draws the Gumbels once per root and keeps them. The same \(g_i\) that decided which actions got sampled later participates in ranking the survivors. This is the heart of Gumbel MuZero's policy-improvement guarantee: since \(\arg\max_i(g_i + \log\beta_i)\) is an exact sample from \(\beta\), then \(\arg\max_i(g_i + \log\beta_i + \sigma(\hat q_i))\) — adding a monotone bonus \(\sigma\) of measured action values — picks an action whose expected value is at least that of a fresh sample from \(\beta\). Search can only help, in expectation, no matter how few actions were sampled. Drop the stored Gumbels and re-noise, and that guarantee evaporates.
- Without replacement matters. With hundreds of candidates and a budget of ~12 sampled arms, you can't afford duplicates, and you want the sample to respect \(\beta\)'s mass without being a deterministic top-k (which would never explore the tail).
In SearchPI, the distribution being sampled is not the raw policy but a mixed proposal \(\beta\) proposals.py:336 · _proposal_probs:
So a quarter of the proposal mass in the current SGTS config sits on the classical steepest-edge shortlist — the planner is never fully at the mercy of a bad learned prior.
3.2Sequential Halving: spending a budget to find the best arm
Once SGTS has sampled \(m\) root actions, it faces a pure-exploration bandit: with a total budget of \(N\) simulations, identify the best arm. You don't care about cumulative reward during the search (no UCB-style regret balancing); you care only about the quality of the final pick — simple regret. Sequential Halving (Karnin et al. 2013) is the standard answer:
- Keep a surviving set, initially all \(m\) arms. Plan \(\lceil\log_2 m\rceil\) rounds.
- Each round, split the round's share of the budget evenly across survivors, pull (simulate) accordingly, then eliminate the worse half by current estimated value.
- The last survivor is the answer. Early rounds are wide and shallow; later rounds concentrate the budget on near-finalists, which therefore have low-variance estimates exactly where it matters.
The implementation sgts.py:307 · sequential_halving follows this with two domain adaptations: an initial pass guarantees every sampled arm at least one simulation, and ranking between rounds is not by raw \(\bar q\) but by the full Gumbel MuZero score (next subsection), so elimination respects both the prior and the noise draw that selected the arms. Budget bookkeeping: round budget = remaining ÷ rounds-left, leftovers to the top-ranked arms, survivors = \(\lceil |S|/2 \rceil\).
3.3Policy improvement by planning: σ-scores and completed-Q
This is the conceptual core of SGTS, and it answers two questions at once.
Question 1 — how should search evidence influence ranking and acting? Gumbel MuZero's answer: transform measured Q-values monotonically and add them to the logits-plus-Gumbel score. SearchPI uses exactly its σ transform:
Read the magnitude: with \(c_{\text{visit}}=50\), the action with the best measured Q gets a bonus ~50 logits above the worst. Gumbel noise has standard deviation \(\pi/\sqrt6 \approx 1.28\); prior logits live on a similar scale. So when search has actually measured a difference, σ dominates — the ranking becomes search-driven, with prior and Gumbel breaking ties. The \(+\max_b N(b)\) term grows the dominance as evidence accumulates. (Hold this thought: the original SearchPI configuration replaced σ with a transform whose maximum contribution was ±0.031 logits, drowning under the Gumbel noise — that's failure story §7.3.)
Question 2 — what should the policy target be when search only measured a few actions? Naive options both fail: train only on searched actions and you starve the rest of the distribution; fill unsearched actions with the network's own Q estimates and you teach the network its own guesses. The completed-Q construction threads the needle targets.py:293–327:
Three properties to internalize:
- \(v_{\text{mix}}\) interpolates from the network's value (no evidence, \(\alpha=1\)) toward the prior-weighted mean of measured values as visits accumulate. It is the "what a typical action is worth here" baseline.
- Unsearched actions completed to \(v_{\text{mix}}\) have zero advantage against that baseline — after min-max and σ they sit in the middle, so their target probability ≈ their prior probability. Conservative neutrality: search silence neither promotes nor punishes an action, and crucially, the Q head's opinion alone can never promote an unsearched action into the target. (Contrast with SGSPI, which instead leaves unsearched actions out of the target support entirely — two different valid answers, §6.)
- The target is a full distribution over legal actions, so the policy CE trains all logits coherently, not just the searched subset.
The whole AlphaZero-style loop is then: act with the search-improved decision → distill \(\pi_{\text{target}}\) into the policy head → a better prior makes the next search better. The fixed point is a policy that search can no longer improve.
3.4Best-first search with a learned, inadmissible heuristic
For SGSPI, the relevant classical algorithm is A*: maintain a priority queue (the frontier) over discovered nodes, scored by
and always expand the lowest-\(f\) node. With an admissible heuristic (never overestimates remaining cost), A* expanding a goal proves optimality. Two things break that purity here, both deliberate:
- \(h_\theta\) is a neural estimate — not admissible. So
frontier_bound_stop(halt when the best frontier \(f\) is no better than the best terminal found) is a heuristic stopping rule, not a proof. SGSPI is "bounded best-first," not exact A*. - The state space is a graph with massive transposition structure: different pivot orders reach the same basis. A search that treats it as a tree re-pays for repeated states exponentially. So SGSPI keeps a transposition table keyed by canonical basis identity (
StateKey/SearchStateKey, state_key.py): revisits with worse-or-equal \(g\) are pruned, revisits with strictly better \(g\) reopen the node. The subtle obligation this creates — evidence discovered through a transposition must still be credited to every root action that reaches it — is §5.3.
Why does best-first make sense here when SGTS exists? Because in a deterministic shortest-path problem, value information is shared across root actions: if actions \(a\) and \(c\) both funnel into the same successor basis two pivots deep, one discovery prices both. Sequential Halving treats arms as independent and can't exploit that; a graph frontier gets it for free. That trade — independent-arm statistics vs shared-descendant evidence — is the deepest difference between the two methods.
4SGTS — Sampled Gumbel Tree Search
SGTS is the conservative, theory-backed planner: a Gumbel MuZero root over exact simplex branches. Configuration anchors for this section come from the live experiment config configs/2026-06-10/sgts-sigma-01.toml: root_sample_size = 12, root_simulations = 96, max_depth = 4, nonroot_sample_size = 4, c_visit = 50, c_scale = 1, leaf evaluation by pure steepest-edge rollout.
4.1Anatomy of one root decision
Everything happens inside SGTSPlanner.search() sgts.py:456. Walking it in order:
- One fused network call at the root.
policy_value_q(root_obs)returns logits \(z\), value \(v_\theta\), and Q estimates in a single encoder pass sgts.py:467. (Q is used by proposals ifk_q>0; the tree statistics themselves never trust it.) - Neighborhood mask. An optional pre-filter restricts legal actions to a near-classical neighborhood — in the current config, the steepest-edge top-6 plus a budget of at most 1 "deviation" from that shortlist along any root-to-leaf path (
[planner.sgts.neighborhood], modetop_k_deviation_budget). This keeps the search inside a region where the warm-started network is trustworthy. - Mixed-β proposal, Gumbel-top-k sample. Build \(\beta\) (§3.1) and sample
root_sample_size = 12actions without replacement sgts.py:485–508. Note the override: SGTS forcesk_prior = root_sample_sizesgts.py:1512 _effective_root_proposal_config — the sampled set is the arm set. For each sampled var-id it stores the score \(g(a) + \log\beta(a)\) for later ranking sgts.py:530. - Race the arms with Sequential Halving (§3.2), total budget
root_simulations = 96. Eachevaluate(a)call simulates root action \(a\) once (next subsection) and returns the running mean \(\bar q(a) = \frac{\text{q\_sum}(a)}{N(a)}\) sgts.py:1166. Ranking between rounds scores survivors by $$\text{score}(a) = g(a) + \log\beta(a) + \sigma\big(\bar q(a)\big)$$ with ties broken by visit count, then Gumbel sgts.py:352–377 rank_actions. The min–max inside σ is taken over the current survivor pool, and \(\max_b N(b)\) is the pool's max visit count — dominance grows as the race narrows. - The winner is the action that gets executed (in
action_selection = "planner"mode, fix A6): \(\arg\max\) of the same score among final survivors — exactly Gumbel MuZero's acting rule \(\arg\max\, g + \text{logits} + \sigma(\hat q)\). - Targets. Per-arm running means become
RootActionEvidence;CompletedQTargetBuilderbuilds \(v_{\text{mix}}\), completed-Q, and \(\pi_{\text{target}}\) (§4.3). The sample is shipped to replay.
A budget-shape intuition for 12 arms / 96 sims: every arm gets one scouting sim, then rounds of roughly 21 → 21 → 21 → 21 sims are split over 12 → 6 → 3 → 2 survivors — the finalists end up with ~15–20 sims each while early discards cost only 1–3. That asymmetry is the entire point of halving.
What one "simulation" of a root arm does
Each evaluation descends an explicit tree of _SGTSNodes sgts.py:134, each holding the solver state, its own network outputs, per-action visit/value statistics, and the replay prefix + checkpoint needed to regenerate it:
- Step the root action through the oracle (cached after the first time —
root_step_cache). If the pivot terminates the LP: optimal/infeasible/unbounded leaves are worth their terminal value (0 for optimal, so \(q = -\Delta_{\text{iters}} + 0\)); iteration/time limits get a pessimistic−iteration_hint; the evidence source/quality is set accordingly (terminal/exact orterminal_non_optimal/bounded) sgts.py:1674–1698. - Descend up to
max_depthfurther pivots. At each non-root node, lazily sample a small child set (nonroot_sample_size = 4, Gumbel from the child proposal) and pick among sampled children deterministically (§4.4). A per-simulationvisitedset of state keys detects cycles; revisiting a basis poisons that path with \(q = -10^6\) (quality failed) sgts.py:811–822. - Evaluate the leaf when depth runs out (or candidates do). Three modes sgts.py:661 leaf_value: terminal value; network \(V_\theta\); or
mixed_rollout— run a classical pivot rule from the leaf for up tomax_depth × 20steps and blend: $$v_{\text{leaf}} = (1-w)\cdot v_{\text{rollout}} + w\cdot V_\theta, \qquad w = \texttt{leaf\_mixing\_weight}$$ The live config setsw = 0with rulesteepest— leaves are valued by pure steepest-edge rollout (capped at 80 pivots, then \(V_\theta\) at the capped state). SGTS leaf values are therefore mostly terminal-grounded, the same medicine SGSPI takes via completions. - Back up. Values flow up the recursion as \(q_{\text{parent}}(a) = -\Delta_{\text{iters}} + v_{\text{child}}\) sgts.py:878–891, accumulating into
visits[a]/q_sum[a]at every node along the path sgts.py:1625 _update_node_stats. No max-backup, no UCT — plain running means, consumed by σ at the root.
One optimization worth knowing: when max_depth ≤ 1 with network leaves, the planner pre-steps all sampled root actions and batch-evaluates the children in a single network call (prewarm_depth_one_network_leaves sgts.py:996) — the entire search collapses to one oracle sweep plus two network calls.
4.2The race, live
4.3From statistics to targets: v_mix, completed-Q, σ
After the race, CompletedQTargetBuilder targets.py:96 assembles the training sample exactly as §3.3 described: candidate-aligned \(\bar q\)/visit tensors from the evidence map → \(v_{\text{mix}}\) → completed-Q over all legal actions → \(\pi_{\text{target}} = \operatorname{softmax}(z + \sigma(Q_{\text{completed}}))\). The widget below is that computation — drag the evidence and watch the operator respond.
4.4Below the root: sampled descent, not UCT
Non-root nodes don't run their own halving race (budget wouldn't survive it). Instead, _select_nonroot_action sgts.py:1324 implements Gumbel MuZero's deterministic in-tree policy: at each node, compute the same completed-Q policy improvement locally over the node's sampled children, then pick
This rule has a neat property: over successive visits it allocates child visits to approximate π′ itself (it's a derandomized sampling of the improved policy), so deeper search refines exactly the branches the improved policy would actually take. Child proposals there are small (4 Gumbel samples from the tempered child policy), every node carries its own cycle-checked state, and depth is capped at 4 pivots — SGTS in this repo is a local improvement operator: look a handful of pivots ahead, with terminal-grounded steepest rollouts beyond that horizon.
No transposition table — only per-simulation cycle detection. Two simulations reaching the same basis through different prefixes pay twice. The design doc treats this as intentional: a global table would change the arms' budget accounting and break Sequential Halving's independent-arm statistics. The corollary is that SGTS wastes budget on shared structure — which is precisely the inefficiency SGSPI is built around (§3.4, §6).
Finally, the value target. The search itself doesn't set one — SearchTargets.value_target passes through as None for SGTS — so the scalar value supervision is resolved later by the loss's priority chain (search → A* → MC return, loss.py:517 resolve_value_targets), which for SGTS means the realized Monte-Carlo return of the collected episode (or an injected A* label when available). This is the trap the design doc warns about by name: \(v_{\text{mix}}\) is the improvement baseline, and using it as the value target would teach the value head its own blend — a different (and wrong) method.
5SGSPI — Simplex Graph Search Policy Iteration
SGSPI starts from the observation that SGTS's bandit abstraction throws away the structure of this domain. Phase-2 pivoting is a deterministic shortest-path problem on a graph of bases; root actions are not independent arms — they share descendants. So: run a bounded best-first graph search, let any discovery (a terminal! a cheap completion! a transposition into known territory!) flow back to every root action that can reach it, and distill the per-root-action evidence directly. Config anchors from configs/2026-06-10/sgspi-fidelity-01.toml: root_node_budget = 64, max_depth = 0, proposals k_prior = 24 / k_q = 8 / k_rand = 4, alpha_h = 1.0, h_max = 200, target_tau_policy = 0.5, steepest completion pinned at probability 1.0.
5.1Depth semantics and the root-only regime
One config gotcha before the loop: SGSPI's max_depth counts expanded depths, and children one level deeper still get evaluated. So max_depth = 0 — the live setting — means: expand only the root (propose ~36 actions, pivot each once through the oracle), evaluate every depth-1 child with the network and/or a completion, and stop. The frontier machinery, transpositions, and deeper expansion all activate when max_depth ≥ 1 (this is the "tree search mode" lineage of recent commits — and note merge_by_state_key = false exists to key nodes by path instead of basis, turning the graph search into a pure tree for debugging/ablation). The explainer walks the general loop; keep in mind the production run is currently the depth-0 special case where SGSPI is essentially "one exact pivot per candidate + steepest completion from each child, then pick the best total." That simple procedure is the thing that beat the steepest baseline by 4.3 pivots at update 0 (§7.1).
5.2The search loop
Inside SGSPIPlanner.search() sgspi.py:681:
- Root setup. Fused network call; capture a root checkpoint; compute the root's canonical
StateKey/SearchStateKey; push it on a heap keyed by \(f\) with an insertion-order tiebreaker sgspi.py:752–758. - Pop the lowest-\(f\) node. Skip it if the transposition table already closed this basis at an equal-or-better \(g\); record terminals (an optimal terminal updates
best_terminal_cost); skip nodes already costlier than the best terminal; skip nodes pastmax_depthsgspi.py:1038–1056. Stop on node budget (64 expansions), wallclock (40 s), or — withfrontier_bound_stop— when the frontier's best \(f\) can no longer beat the best terminal found. - Propose actions. At the root: the full union (Gumbel-top-k of tempered policy ∪ Q-top-k ∪ random tail; the steepest anchor arrives via the proposal's heuristic mixing), each tagged with its source. At deeper nodes: a slimmer union (
k_child = 6prior +k_child_q = 4) sgspi.py:1072–1120. If every proposed root action fails oracle replay, a fallback batch expands all remaining legal slots — the root must end up with evidence. - Pivot each proposed action through the oracle. Terminal children immediately produce evidence (next subsection). Nonterminal children are batch-evaluated in one network call sgspi.py:1319–1331 · 1759 _evaluate_children (full policy/value/Q forward only if the child can itself be expanded; value-only otherwise — the depth-0 fast path).
- For each nonterminal child: compute its keys; check the transposition table (
discovered_best_g); record the bookkeeping that drives evidence flow — its arrival (which root action reached it, at what \(g\), whether via a transposition), its predecessor edge, and its state evidence (a completion result if one ran, else the network estimate); then, unless pruned as a worse-path duplicate, push it with \(f = g + \alpha_h\,h_\theta\) sgspi.py:1370–1550. A strictly better \(g\) to a known basis reopens it (reopen_count). - Wrap up. Root evidence map → selection → target builder → replay sample, with a descriptor recording why search stopped (
terminal_found/budget_exhausted/frontier_bound_stop/depth_limited).
5.3Evidence: discovery, valuation, and propagation
Everything SGSPI learns from search is funneled into per-root-action evidence, in plain pivot units. The valuation rules sgspi.py:1212–1501:
| Discovery at a node with cost g | Root-action value | Source / quality |
|---|---|---|
| Optimal terminal | −g | terminal · exact |
| Non-optimal terminal (infeasible/unbounded/limit) | −(g + αh·hmax) | terminal_non_optimal · bounded |
| Classical completion finishes in k pivots from the node | −(g + k) | completion_steepest · rollout |
| Completion hits a verified dead end after k pivots | −(g + k + αh·hmax) | terminal_non_optimal · bounded |
| Network estimate only | −(g + αh·clamp(−Vθ, 0, hmax)) | estimate_at_depth_k · approximate |
Evidence valuation — the hmax=200 penalty makes verified dead ends rank below any plausible completion; the estimate row is the only one whose value can be fiction.
When the same root action accumulates several pieces of evidence, the better one wins by a strict ordering sgspi.py:1949–1993: source priority first (terminal ≻ completion_* ≻ terminal_non_optimal ≻ estimate), then higher value, then shallower depth, then non-transposed provenance. A bootstrap estimate can never displace completion- or terminal-backed evidence for the same action, no matter how optimistic.
Propagation through the graph — the load-bearing part
In a graph search, evidence rarely appears where it's needed. A completion runs at some depth-2 state \(D\); the root needs to know what that implies for root actions \(a\) and \(c\), both of which reach \(D\). SGSPI maintains three index structures and a fixpoint-ish propagation sgspi.py:815–994:
state_arrivals[key]— every (root action, g, via-transposition) tuple that has reached this basis. Recorded before transposition pruning, which is exactly why pruned duplicate paths still earn their root action credit.state_evidence[key]— the best known suffix evidence at this basis: (suffix_cost, suffix_depth, source_kind). A completion at the node sets suffix_cost = k; a network estimate sets suffix_cost = αh·hθ.predecessors[key]— parent edges, so suffix evidence can also roll upward: a parent's suffix is \(1 + \) child's suffix sgspi.py:930 propagate_state_evidence_to_parent.
Whenever a state's evidence improves, it re-fires across all recorded arrivals (updating root targets as \(-(g_{\text{arrival}} + \text{suffix})\)), across state-action arrival edges (feeding the consistency targets for child-state Q supervision), and recursively to predecessors. The via_transposition flag is carried along the whole way and lands in the final evidence — diagnostics can tell exactly how much of the target was earned through merged paths. The design doc flags this machinery with an explicit warning: collapse it to a naive state → evidence map without arrival/predecessor records, and root attribution through transpositions silently breaks.
5.4Classical completion: buying ground truth at the leaves
The single most consequential design choice in SGSPI. A learned value head — especially early — is unreliable; but this domain offers an oracle of a different kind: let steepest edge finish the job and count the pivots. The ClassicalCompletionOracle sgspi.py:247 replays prefix + path-to-leaf, then runs the configured rule (steepest) to termination with caps (max_pivots = 1000, timeout_ms = 20000), backed by an LRU cache. Outcomes map to the evidence table above; truncations and timeouts fall back to the network estimate.
Two subtleties make this more than "just use steepest":
- It evaluates first actions, not the classical rule's own choice. The evidence for root action \(a\) is "take \(a\) first, then let steepest finish." If some non-steepest first pivot leads to a basin where steepest's suffix is shorter, completion finds and certifies that — this is exactly the headroom the search exploits to beat the steepest baseline from the warm start. The improvement is real, verified, and immediately distillable.
- It is a target-evidence source, not an actor override. The deployed/raw policy never calls steepest; completions only shape what the network is trained toward.
Completion probability is an episode-driven schedule (completion_probability_for_episode sgspi.py:183). The original configs decayed it — wean off the crutch as the value head improves. §7.2 shows what actually happened: each decayed leaf became a coin flip between truth and optimism, and optimism won the argmax. The fixed config pins probability at 1.0 (decay_episodes = 0) and adds an eval-side override so eval/search/* always measures completion-on search. Weaning is deferred until dist_gap/v_error_mae sits at ~1–2 pivots, and then only with pessimistic value estimates.
5.5Gated searched-Q targets
SGSPI's target builder, SearchedQSoftmaxTargetBuilder targets.py:207, is philosophically opposite to completed-Q: only actions search actually valued get supervised; unsearched actions are undefined, not neutral. The construction, in order:
Unpack each piece:
- The gate (fix A2). Bootstrap evidence still flows to the Q-head target (with its 0.25 action-quality weight) — the Q head should keep learning from everything — but it is barred from the improvement operator: it cannot define the policy softmax or the value target while any verified evidence exists. The planner's executed action obeys the same rule (
_initial_selected_var_idsgspi.py:2091: greedy over verified first; and the final selection is re-derived asargmax(policy_target)after gating sgspi.py:1680). The principle generalizes: the learned heuristic may order the search (f = g + αh), but it may not win the target. - τ = 0.5 is sharp by design. Values are pivot counts; a 2-pivot gap yields a probability ratio of \(e^{2/0.5} = e^4 \approx 55\). The target is near-one-hot on the best verified action unless alternatives are within ~a pivot. Try it: Δq = 2.0 pivots, τ = 0.5 ⇒ ratio 54.6×
- The value target is the search's own appraisal — the softmax-expected Q over the support, ≈ the best verified value at this τ. Unlike SGTS there's no MC-return dependence: search appraises, replay stores, reanalyse refreshes.
- Consistency targets. From the state-action evidence recorded at expanded interior states, the builder also extracts per-child-state Q targets (
ConsistencyTargetstargets.py:611) — a Bellman-consistency signal for the Q head at successor states. Wired but currently disabled in the live config (consistency_coeff = 0).
5.6Reanalyse: the policy-iteration flywheel
Replay samples store everything needed to reconstruct their root — problem index, phase-2 prefix, state key, the original observation. Reanalyse reanalyse.py periodically picks stored samples (priority = weakest targets first), replays the prefix through a ReplayOracle, verifies the state key and observation match what was stored (fail loudly, never silently relabel), re-runs search with the current network and budget, and replaces the targets in place while preserving replay identity. Old experience keeps getting re-appraised by an ever-better searcher — MuZero Reanalyze, adapted to exact dynamics. This is what makes SGSPI a genuine policy-iteration loop rather than one-shot target generation. (Both 2026-06-10 configs ship with it disabled while the A1–A8 fixes are validated; the machinery and its strict verification discipline are the part to know about.)
6SGTS vs SGSPI, side by side
| Dimension | SGTS | SGSPI |
|---|---|---|
| Search object | Sampled tree from the root | Bounded graph from the root |
| Budget logic | Sequential Halving over sampled arms (simple regret) | Best-first frontier by f = g + αhhθ |
| Root question | "Best arm in this sample?" | "Cheapest verified path out, and its first step?" |
| Repeated states | Per-simulation cycle check only; duplicates cost twice | Mandatory transposition table; prune / reopen; shared-descendant evidence |
| Unsearched actions | Completed to vmix — neutral in target, prior preserved | Outside target support — undefined (ε uniform floor only) |
| Policy target | softmax(prior logits + σ(completed-Q)) | softmax(searched Q / τ) on fidelity-gated support |
| Value target | None at search time → MC return / A* via loss priority | Search's own appraisal Σπq; refreshed by reanalyse |
| Ground truth at leaves | Classical rollout leaf evaluator (steepest, w=0) | Classical completion per leaf, prob. 1.0, cached |
| Trust in network Q head | Proposals only (k_q=0 in live config) — never in tree stats | Proposals (k_q=8) + ordering; gated out of targets |
| Improvement guarantee | Gumbel MuZero expected-improvement argument at the root | None formal — hθ inadmissible; verified evidence is the safeguard |
| Replay profile | Step replay; uniform sampling | Rich replay: state-key index, quality buckets, stratified sampling, in-place target replacement |
| Failure mode observed | Inert improvement operator (Δ search−raw ≡ 0) — §7.3 | Winner's-curse collapse from bootstrap optimism — §7.2 |
| Cost profile | ~4.5 s / root decision at 96 sims, depth 4 | Cheap at depth 0; completions cached; collapse was never a cost problem |
When should each win? SGTS is the safer operator: its target construction is conservative, its acting rule carries an expected-improvement argument, and it never lets any estimate — network or otherwise — promote an action search didn't measure. It is the right shape when branching is huge, budgets are small, and you want robust local improvement. SGSPI is the domain-shaped operator: it exploits determinism, transpositions, and purchasable ground truth to produce sharply better-than-baseline decisions immediately, and its rich replay + reanalyse make it a true policy-iteration loop. It is also the one with more moving parts and more ways to silently break — evidence attribution, replay identity, fidelity gating are all load-bearing. The 2026-06 experiments effectively ask: can SGTS be made alive (A1/A6) and SGSPI be made safe (A2/A3/A4) without losing what made each attractive?
7The failure story — and why the fixes are what they are
The 2026-06-04 WandB post-mortem docs/wandb-searchpi-analysis-2026-06-04.md covers ten runs; its two headline phenomena are worth studying because each is a clean lesson about search-based RL, taught by this exact codebase.
7.1What the runs showed
- SGSPI: search eval at update 0 ≈ −26.8 vs steepest −31.15 and raw ≈ −32 — the search procedure, on day one, beat the best classical baseline by ~4.3 pivots. Then, in every fresh run (
h1b2rc6g,mra6a2vo,59v4ci71,msaxwies,wffdti4l,nk703lou), it collapsed to −43…−62 within 600–1400 updates while losses improved and the raw policy stayed flat. Collapse began before greedy collection ramped (greedy probability still 0 at update 400 in r3-long) — the planner/target feedback itself was the cause, not exploration policy. - SGTS: perfectly stable — and perfectly inert.
eval/delta/search_minus_raw_return_meanwas exactly 0.00 at every eval of both runs; the planner cost ~4.6 s per root decision for it.
The fix plan's framing docs/searchpi-fix-plan-2026-06.md: the search works; the learning loop destroys or ignores its signal. Two mechanisms, one per method.
7.2SGSPI's winner's curse
The original configs decayed completion probability over episodes. Once it dropped below 1, each leaf was a coin flip: exact steepest completion, or raw network value estimate. Both landed in the same evidence pool, and three argmax-shaped operations sat downstream of that pool:
- the policy target — softmax at τ=0.5 is effectively argmax (a 2-pivot edge ⇒ 55× ratio);
- the value target — Σπq ≈ max over the pool;
- eval acting — literally argmax(policy_target) at every step.
The warm-start value head was systematically optimistic (γ=0.99 discounting, §2.1) and never became accurate (value loss plateaued at ~2.5 smooth-L1 ⇒ several pivots of error). Take a max over a mixed pool of one unbiased measurement and several optimistically-biased estimates, and the biased ones win — the winner's curse. Worse, it self-reinforces: the value target ≈ the optimistic winner, so the value head is trained toward its own optimism; sharper optimism wins more argmaxes; evidence quality degrades further. The smoking gun in the logs: in all four analyzed runs, collapse began at exactly the eval where root_evidence_source_estimate_at_depth_1_count first became nonzero.
Hence the layered fix (A2): the fidelity gate in the target builder and in action selection (§5.5) makes bootstrap evidence structurally unable to win the improvement operator; completion pinned at 1.0 removes the coin flip; the eval-side override makes eval/search/* measure search capability rather than the decay schedule. And A5 (γ=1.0 IQL) attacks the optimism at its source. The general lesson transfers far beyond this repo: any argmax over evidence of heterogeneous fidelity is a winner's-curse engine; gate by fidelity before you max.
7.3SGTS's inert improvement operator
SGTS's failure was quieter: a parameterization that reduced the method to an expensive identity. The pre-fix target was \( \operatorname{softmax}(z + \text{adv}/\tau) \) where the advantage had been routed through the AdvantageNormalizer in per_state mode with advantage_scale_c = 0.25 (capping |adv| at 0.25) and then divided by completed_q_temperature = 8.0. Maximum possible logit shift: ±0.031. Recall the scales it had to compete with: prior logits O(1), Gumbel noise std 1.28, and σ's intended multiplier ≈ 50. Consequences, all visible in the logs:
- πtarget ≈ prior ± 3% — each gradient step distilled the policy into itself;
evalsearch action argmax(πtarget) never differed from the raw argmax ⇒ delta ≡ 0.00. - The same capped advantage ranked Sequential Halving — so the "winner" was, to first order, a Gumbel sample from the proposal. Measured:
selected_action_evidence_rank_mean ≈ 4.2out of ~10 arms with evidence; the race's conclusion was uncorrelated with its own measurements.
Fix A1 replaces the transform with the real Gumbel MuZero σ in all three places it must act — target construction, SH ranking, non-root selection — and Widget B (§4.3) lets you feel the difference directly. A6 then makes the collector actually execute the SH winner (action_selection = "planner") instead of sampling the (previously inert) target. A second-order lesson worth keeping: flat-but-stable can be a worse failure than collapse, because nothing in the loss curves complains. Only a metric that directly measures the improvement operator (Δ search−raw; argmax-changed fraction) could have caught it — which is what A7 adds.
7.4The fix matrix
| Fix | Mechanism repaired | Watch in WandB | |
|---|---|---|---|
| A1 | σ transform replaces advantage/τ targets.py:352 | SGTS improvement operator strength (×0.031 → ×~50) | pi_target_argmax_changed · pi_target_logit_delta_mean_abs |
| A2 | Fidelity gate in targets + selection; completion pinned; eval override targets.py:233 · sgspi.py:2091 | SGSPI winner's curse — bootstrap can't win the operator | pi_target_argmax_evidence_approximate ≈ 0 |
| A3 | Training replay sampling → quality_stratified replay.py · trainer.py | Pre-fix "priority" reused the reanalyse formula (priority ∝ 1−quality) for training — exact samples were drawn with weight 1e-6, i.e. never | replay_batch_sample_quality_exact_fraction > 0 |
| A4 | BC-anchor decay 1.0 → 0 over 1500 updates loss config | bc_coeff = policy_coeff = 1.0 pinned the policy at the 50/50 mixture of search target and frozen IQL — forever | bc_coeff_effective · loss_bc rises then falls |
| A5 | IQL warm start retrained at γ = 1.0 configs/*/iql.toml | Optimistic discounted values feeding the A2 loop | dist_gap/v_error_mae |
| A6 | action_selection = "planner" collectors.py | Collector executes the search winner instead of sampling the target (modes: stochastic · greedy · annealed · planner) | collect/action_planner_count |
| A7 | Operator-liveness diagnostics | Make A1/A2/A3-class failures visible in one eval | dist_gap/* · evidence-source mixes |
| A8 | configs/2026-06-10/ experiment set | B3 (SGTS-sigma) and B4 (SGSPI-fidelity) runs wiring all of the above | eval/delta/search_minus_raw_return_mean |
Success criteria are deliberately first-few-evals observable (compute nodes are wallclock-limited): for B3, Δ(search−raw) > 0 from the first eval and a clearly nonzero argmax-changed fraction; for B4, no collapse — eval/search/return_mean holding ≥ ≈−27 — and, the real prize, eval/raw/return_mean climbing past steepest toward the search level: the first configuration in which the raw policy has an undistorted path to absorbing the search signal.
8Training loop & implementation map
Brief by design — the methods were the point — but these are the pieces you'll touch.
The loss
SearchPILoss loss.py:80 is a sum of independently masked terms; no sample is ever forced to fabricate a label for a term it doesn't have:
Details that matter: the policy CE masks logits to the target support before the log-softmax (so SGSPI's gated support is airtight, while SGTS's support is all legal actions); value targets resolve by priority chain search → A* → MC-return; the rank loss is a pairwise margin on searched-Q ordering; the BC anchor (full_kl mode) pulls toward the frozen IQL checkpoint with the A4-decayed coefficient; and every term is weighted by the quality ladder — per-sample weights from sample quality, per-action weights from action quality (quality.py; note _weighted_mean renormalizes within the batch, which is why bad sampling composition couldn't be rescued by weights alone — the A3 lesson).
dist_gap: is the network catching up to its teacher?
Logged from every update loss.py:368–372 — the panel to read before any other when judging a run:
| dist_gap/policy_ce_to_target | distillation gap; should fall while the teacher keeps moving — falling CE with collapsing eval = misaligned targets (the SGSPI signature, §7.1) |
|---|---|
| policy_top1_matches_target_argmax | would the raw policy already make the search's decision? The raw→search convergence metric. |
| policy_matches_collector_action | agreement with what was actually executed (planner winner under A6) |
| target_entropy | how decisive the teacher is (τ=0.5 targets are sharp; rising entropy ⇒ weak/ambiguous evidence) |
| v_error_mae | value-head calibration in pivots — the explicit gate for ever re-enabling completion decay (~1–2 pivots) |
Collection & evaluation
Collection runs sync (SearchPIEpisodeCollector) or async (AsyncSearchPICollector): 4 worker processes execute episodes (each step = one full planner search), while a central batched inference server inference.py serves all network calls (max 256 items / 2 ms batching window, root-priority queueing), with a watchdog restarting silent workers. Evaluation eval.py · process_eval.py measures both deployment modes per checkpoint — eval/raw/* (network argmax, no search) and eval/search/* (planner + argmax(πtarget) each step) — against Dantzig and steepest on a fixed problem set; eval/delta/search_minus_raw_return_mean is the one-number health check of the whole enterprise. Both 2026-06-10 runs train solver_feature_attention with the dueling Q head enabled, warm-started from IQL.
File map
| File (src/simplexrl/searchpi/) | Role |
|---|---|
| planners/sgts.py | SGTS: proposal → Gumbel-top-k → Sequential Halving → completed-Q evidence (§4) |
| planners/sgspi.py | SGSPI: best-first frontier, transpositions, evidence propagation, completions (§5) |
| targets.py | Both target builders + v_mix / completed-Q / σ / gating math |
| proposals.py | Mixed-β proposals, Gumbel-top-k, heuristic anchors, source tags |
| oracle.py | Phase-2 HiGHS sessions, prefix replay, checkpoint cache, branch cursors |
| state_key.py | Canonical basis identity (transpositions, replay verification) |
| quality.py | Quality ladders, loss weights, source→quality mapping |
| types.py | SearchRequest / SearchResult / RootActionEvidence contracts |
| replay.py · dataset.py | Rich replay (state-key index, quality buckets, stratified sampling) → TensorDict batches |
| loss.py | SearchPILoss, value-target priority, dist_gap metrics |
| reanalyse.py | Target refresh with strict state/observation verification |
| collectors.py · inference.py | Sync/async collection, action-selection modes, batched inference server |
| trainer.py · cli.py · eval.py | simplexrl-searchpi-train <toml> entrypoint, training loop, raw/search eval harness |
9Glossary
- candidate slot / var-id
- Node-local tensor index vs durable HiGHS entering-variable id. Tensors use slots; search paths, prefixes, and evidence use var-ids; every node remaps (§2.3).
- phase-2 prefix
- The sequence of entering var-ids from phase-2 start that deterministically reconstructs a state via oracle replay.
- state key
- Canonical basis identity; powers transposition detection and replay/reanalyse verification.
- root action evidence
- Per-root-action record (value, source, depth, via_transposition, quality) — the planner-agnostic currency the target builders consume.
- verified / approximate
- Evidence fidelity split: terminal- or completion-backed (exact/bounded/rollout) vs network bootstrap (approx). Verified gates the improvement operator (A2).
- v_mix
- SGTS's improvement baseline: visit-weighted blend of network value and prior-weighted searched-Q mean. Never a value target.
- completed-Q
- SGTS target construction: searched actions keep measured values, unsearched legal actions are filled with v_mix (zero advantage ⇒ neutral).
- σ transform
- (c_visit + max-visits)·c_scale·q̂ with q̂ min-max normalized — the monotone value bonus added to logits for ranking, acting, and targets.
- searched-Q softmax
- SGSPI target construction: softmax(q/τ) over the (gated) searched support only; unsearched actions undefined.
- completion
- Evaluating a search leaf by letting a classical pivot rule (steepest) finish the suffix and counting pivots — terminal-grounded evidence on demand.
- transposition / reopening
- Reaching a known basis by another path. Worse-or-equal g ⇒ pruned (arrival still credited); strictly better g ⇒ node reopened.
- reanalyse
- Re-running search from stored replay roots with the current network to refresh targets in place, under strict identity verification.
- raw vs search eval
- Deploy the bare network argmax vs deploy planner-amplified decisions. Their delta measures whether search still has anything to teach.