SGSPI — Simplex Graph Search Policy Iteration

1Background and context

The primal simplex algorithm solves a linear program by walking from vertex to vertex of the feasible polytope. At each iteration it must choose an entering variable: a non-basic variable to bring into the basis, which fixes the edge the algorithm traverses next. The choice does not change the optimum the algorithm eventually reaches, but it can change the number of iterations needed to reach it by a large factor. Classical pivot rules make this choice with a fixed heuristic — Dantzig's rule takes the most negative reduced cost; Devex and steepest-edge normalize that by an approximate column norm. SimplexRL asks instead whether a learned policy can choose entering variables so that the solver finishes in fewer pivots, on a target distribution of problems.

The project reduces this to a reinforcement-learning problem: the agent observes the solver's internal state at a pivot and selects an index into the list of eligible entering candidates; the environment wraps the HiGHS LP solver and returns a reward of -1 per pivot, so fewer pivots is better. Because only phase 2 (optimization, once a feasible basis exists) is under the agent's control, all of the methods below concern phase-2 pivoting and measure themselves in phase-2 pivot counts.

Most of the project's other work trains this policy with model-free RL — PPO, A2C, GRPO — or with imitation learning from a classical rule (BC, IQL, A*-supervised). Those methods treat the solver as an opaque environment: they roll out trajectories and adjust the policy from the rewards they happen to observe. SGSPI takes a different stance. It exploits a property that makes this environment unusual for RL:

The key structural fact

The transition dynamics are exact, deterministic, and cheap to query locally — a pivot is a single HiGHS phase-2 step — while neural-network inference is comparatively expensive. We do not need to learn a world model; we already have a perfect one. The bottleneck is deciding which transitions are worth simulating.

This is exactly the regime in which search-based policy improvement — the family that includes AlphaZero and MuZero — pays off. Such methods use the network only as a guide (a prior over actions and an estimate of value), spend a search budget probing the real dynamics to compute a better local decision than the raw network would make, and then train the network to imitate that improved decision. Iterating this is a form of expert iteration: search is the expert; the network is the apprentice; over time the apprentice's prior makes the expert's search cheaper and sharper.

SimplexRL has two methods in this family, developed as siblings under a shared package simplexrl.searchpi ("search policy iteration"):

SGTS — Sampled Gumbel Tree Search. A root simple-regret planner: it samples a subset of root actions with Gumbel-Top-k and races them with Sequential Halving over a sampled tree. It is the simplex specialization of Gumbel AlphaZero / Sampled MuZero.
SGSPI — Simplex Graph Search Policy Iteration. The subject of this note. It treats the pivot problem as a shortest-path search on a graph and uses a learned-heuristic best-first expansion to find good first pivots.

The two share their infrastructure — the solver oracle, observation construction, the network interface, root-action proposals, the replay buffer, the reanalyse machinery, the loss components, and evaluation — but they keep separate search loops, because a Sequential-Halving tree race and a transposition-aware graph search have genuinely different mechanics and correctness concerns. This note focuses on SGSPI; §10 returns to how it differs from SGTS.

2The core idea

SGSPI is one outer algorithm with four repeating parts. The network is never replaced by search at deployment — the policy you ship is the raw network's argmax. Search is purely a training-time operator that manufactures better-than-policy targets.

1 · Search (the expert)

Improve one decision by graph search

From a state encountered during collection, run a bounded best-first search over the exact solver, guided by the network's policy/value/Q heads. The search yields value-space evidence about which entering variables lead to short completions.

2 · Targets

Turn evidence into supervised labels

Convert the per-action search evidence into a policy target (a softmax over searched action values), a scalar value target, and per-candidate Q targets — each tagged with provenance and a quality label.

3 · Replay + distill

Store, then regress the network toward the targets

Push a rich replay sample keyed by problem and basis identity. Sample minibatches and take gradient steps so the network's heads move toward the search-derived targets.

4 · Reanalyse

Refresh old targets with the improved network

Periodically re-run search from stored replay roots using the current (stronger) network or a larger budget, and replace the targets in place. This is what turns one-shot target generation into genuine policy iteration.

The expert-iteration loop. The network guides search (purple); search produces evidence that becomes replay targets (green); minibatches distil those targets back into the network (blue); reanalyse periodically re-runs search on old roots with the improved network and overwrites stale targets (amber). The deployed policy is θ alone.

The rest of this note unpacks each part, but the spine is worth holding onto: exact-dynamics search policy iteration with replay and reanalyse. Everything specific to SGSPI lives in how the search is structured (a transposition-aware graph search, §5) and how its evidence becomes targets (searched-Q softmax, §6).

3The simplex MDP as a shortest-path problem

To see why a graph search is the natural planner, it helps to write the phase-2 decision problem out as a Markov decision process and then notice its special structure.

Element	In phase-2 simplex
State `s`	The current simplex basis (and the derived solver state) for a fixed LP.
Action `a`	Choose one legal entering variable from the current candidate set.
Transition `T`(`s`,`a`)	The exact, deterministic HiGHS pivot that results. No stochasticity, no learned model.
Reward	`-1` per pivot; `0` at an optimal terminal. (Status mismatches are penalized, but on-policy phase-2 search stays on the legal manifold.)
Objective	Minimize the number of remaining phase-2 pivots to reach optimality.

With a per-step reward of -1 and a terminal reward of 0, the optimal value function has a direct combinatorial meaning. The value of a state is the negative of the fewest pivots needed to finish from it, and the action-value is one step worse than the value of the state it leads to:

V^*(s) = − (minimum remaining pivots from s) Q^*(s,a) = −1 + V^*(T(s,a))

Two conventions follow and are used everywhere in the code, so they are worth fixing now:

Values are in negative-pivot units and are non-positive; larger is better. A state two pivots from optimal has value -2. A target of -2 means "two pivots remain," never "+2."
This is a deterministic shortest-path problem. Minimizing pivots from s is exactly finding a shortest path from s to any optimal-basis terminal, on the graph whose nodes are bases and whose unit-cost edges are legal pivots.

The graph is cyclic: pivoting can return to a basis already visited (degeneracy and cycling are real possibilities), and — crucially — different action sequences can reach the same basis. The latter are transpositions. A tree search would explore each path to a shared basis independently, redoing work; a graph search can recognize that two frontier paths have converged and keep only the cheaper one. SGSPI is built around exploiting exactly this.

Action identity

One implementation invariant pervades the search and replay code. The observation presents candidates in candidate slots — local tensor positions that change every pivot. Search paths, replay keys, transposition keys, and per-action evidence are instead keyed by the internal HiGHS entering-variable id, which is stable across pivots. A phase-2 prefix — a tuple of entering var-ids replayed from the phase-2 start — uniquely identifies a state and is how any state is reconstructed for search or reanalyse. Mixing the two identities is the source of most subtle target bugs, so they are kept strictly separate and remapped explicitly at every node.

4The network and its three heads

SGSPI is architecture-agnostic: it works with any of the project's four pivot-selection models (transformer-GNN, column-attention, tableau, solver-feature-attention). What it requires of the model is a fused three-head interface, exposed through a shared SearchModel wrapper and computed in a single encoder pass per state:

Head	Output	Role in SGSPI
Policy	masked logits over candidates → prior π_θ(·\|`s`)	Proposes which actions are worth searching (via Gumbel-Top-k); also a distillation target.
Value	scalar `V`_θ(`s`)	The search heuristic: an estimate of remaining pivots, used to rank the frontier and to evaluate leaves.
Q	per-candidate `Q`_θ(`s`,·)	Proposes high-value actions to search; supervised by searched evidence. Required for SGSPI (`[model.q_head].enabled = true`).

The "fused" requirement matters because search calls the model heavily — once per expanded node — so the cores expose a single policy_value_q path that produces all three heads from one encoder pass rather than three separate forwards. The wrapper also owns evaluation/train-mode boundaries and stamps every inference with a model version, so that replay samples can record which network produced their targets (used later by reanalyse and by priority sampling).

A deliberate design point: the value head learns a heuristic for a shortest-path search, not just a critic baseline. The search reads V_θ as "how many pivots remain," so it is most useful when negated and clamped into a non-negative cost-to-go estimate, which is how it enters the frontier priority next.

5The search: bounded best-first graph search

This is the heart of SGSPI. Given a single root state encountered during collection, the planner runs a bounded, best-first expansion over the real solver to discover, for each candidate root action, the best completion it can cheaply find. The output is not a policy — it is root-action evidence: value-space numbers attached to root var-ids, which §6 turns into targets.

5.1 The frontier priority

The search is an A*-style best-first expansion. Each node n on the frontier is ordered by a priority that combines the exact cost paid to reach it with a learned estimate of the cost still to go:

frontier priority g(n) = exact pivots from the search root to n
h_θ(n) = clamp( −V_θ(n), 0, h_max )
f(n) = g(n) + α_h · h_θ(n)

The node with smallest f is expanded first (a min-heap frontier). g is genuine — it counts real pivots taken from the root. h_θ is the negated value head, clamped to [0, h_max] so a single bad value estimate cannot dominate the queue (h_max defaults to 200). The weight α_h (default 1.0) trades exploration breadth against heuristic trust. The policy prior does not appear in the priority; it is used only to propose which actions get pushed.

Not admissible

In classical A*, an admissible (never-overestimating) heuristic guarantees the first terminal popped is optimal. Here h_θ is a learned estimate with no such guarantee. SGSPI's search is therefore a strong best-first heuristic search, not a proof of the optimal pivot sequence. The early-stopping rule below (§5.5) is correspondingly a budget heuristic, not an optimality certificate.

5.2 Choosing what to search: root and child proposals

Branching factors at a simplex vertex can be large, so SGSPI never expands all candidates. It assembles a small, deduplicated proposal set from several complementary sources, so that the search budget is spent on actions that are plausible under any reasonable lens — the current policy, the current Q estimates, classical pricing rules, or sheer coverage:

root action set A_root = GumbelTopK(π_θ, k_prior) ∪ TopK(Q_θ, k_q) ∪ anchors ∪ incumbent ∪ randomTail(k_rand)

Gumbel-Top-k of the policy draws k_prior actions without replacement from the policy prior — a principled stochastic sample of "what the policy likes" (the same Gumbel machinery SGTS uses at its root).
Top-k of Q adds the actions the Q head rates highest, which may differ from the policy's favorites.
Classical anchors force in the picks of one or more reference rules (Dantzig / Devex / steepest), so the search always at least considers the classical move and can discover whether a different first pivot beats it.
Incumbent includes the Dantzig argmax marker; the random tail adds k_rand uniformly-sampled legal actions for coverage.

At interior (child) nodes the proposal is leaner — GumbelTopK(π_θ, k_child) ∪ TopK(Q_θ, k_child·q) — because most of the search budget should go to comparing root actions, not to deepening any single branch.

5.3 Transpositions: the graph, not a tree

As the frontier expands, the planner maintains a transposition table: a map from a canonical basis identity (the state key, derived from the basic-variable index) to the best g with which that basis has been reached. When a newly generated child lands on a known basis:

if the new path is worse or equal in g, it is pruned — no point re-expanding a basis we already reached at least as cheaply;
if the new path is strictly cheaper, the entry is reopened / replaced with the better g, because everything reachable beyond it just got cheaper too.

This is the mechanism that distinguishes SGSPI from a tree search and is "load-bearing": on the cyclic pivot graph, where shared descendants are common, it prevents the budget from being burned re-deriving the same sub-search down many root actions.

A schematic root search. The frontier expands by smallest f=g+αh, so a₂ (the most promising root action under the heuristic) is deepened first. Basis t is reached from both a₂ and a₃; the second arrival is a transposition and is pruned because it is not cheaper. When a terminal is found through t, the evidence must be attributed to every root action that can reach t — both a₂ and a₃ — not only the path that happened to discover it.

5.4 Evidence attribution and depth

Because transpositions let several root actions share descendants, SGSPI tracks, for each discovered state, which root actions can reach it. When a terminal or a well-evaluated leaf is found, its value-space evidence is propagated back to all qualifying root actions, and each piece of evidence records its provenance — its depth, whether it arrived via a transposition, and its source kind:

terminal — a genuine optimal basis was reached; the evidence is exact (g pivots from the root).
completion — a non-terminal leaf was finished by a classical rule (see below); the suffix is real pivots, so the total is also a grounded count.
estimate / bootstrap — a leaf was scored only by g + α_h·h_θ, i.e. partly by the learned value. Useful for ranking, but not trusted as a supervised label.

Depth is bounded explicitly. With max_depth = 0 the search expands the root actions and evaluates their depth-1 children (root-only search); max_depth = 1 expands depth-1 children and evaluates depth-2 leaves; in general the deepest evaluated leaf sits at about max_depth + 1. Root-only search is the common, cheap configuration; deeper search is available when the budget allows.

5.5 Budgets and classical completion

Two budgets bound the search: a node budget (root_node_budget, e.g. 32 expansions) and a wall-clock budget (wallclock_budget_ms). Optionally, frontier-bound stopping halts early once the smallest f on the frontier is no better than the best terminal already found — a useful pruning rule, though (per the warning above) only a heuristic one.

The most important optional component is classical completion. Early in training the value head is weak, so leaf estimates are unreliable and almost all evidence is the untrusted "estimate" kind. Completion fixes this by finishing a searched non-terminal leaf with a classical pivot rule (steepest / devex / dantzig) and counting the real suffix length. Critically, it completes after taking a searched first action:

take searched root action a, then let a classical rule finish → grounded pivot count for a

This can reveal a first pivot that leads to a shorter classical-rule suffix than the classical rule would choose at the root — i.e. a genuine improvement over the baseline — while grounding the target in a real terminal rather than a value estimate. Completion is a leaf evaluator / evidence source, never an actor override; its use is probabilistic and decays over training as the learned heuristic matures. A small LRU cache and per-search completion caps keep its cost bounded.

6From search evidence to training targets

The search returns, per root action, the best value-space evidence it found. SGSPI's default target builder — searched_q_softmax — converts this into candidate-aligned supervised labels. Let S be the set of searched root actions and q(a) the best searched value for action a∈S (in negative-pivot units).

Q target

The Q target is simply the searched values placed in their candidate slots, with a mask that is true only for searched actions. The Q head is supervised only where search actually produced evidence:

q_tgt[a] = q(a) for a ∈ S; mask[a] = 1 iff a ∈ S

Policy target

The policy target is a temperature-controlled softmax over the searched values — the search's own preference ordering among the actions it examined:

π_tgt[a] = softmax_{a ∈ S}( q(a) / τ )

By default this puts zero mass on unsearched actions — SGSPI treats their values as undefined, not neutral. An optional uniform floor mixes a small uniform distribution over all legal actions back in, to prevent the prior from collapsing support onto only the searched set:

π_tgt ← (1 − ε) · π_tgt + ε / |legal|

Value target

The scalar value target is the expectation of the searched values under the (pre-floor) policy target — the value the network should predict if it were to act according to the search's preferences:

v_tgt = Σ_{a ∈ S} π_tgt[a] · q(a)

Evidence-fidelity gating

The policy and value targets are built only from verified evidence — terminal and completion sources, whose pivot counts are grounded in reality. Pure value-bootstrap ("estimate") evidence may still populate the Q target and mask (it is a reasonable regression signal) but is excluded from the softmax that defines the policy and value targets, so the network is never taught to prefer an action whose advantage came only from its own value head. Each sample also carries quality labels (sample-level and per-action) that downstream loss weighting uses to down-weight weakly-grounded targets.

The searched-Q softmax target builder. Per-action searched values become a policy target (softmax over searched actions), a value target (their expectation), and a masked Q target. Verified (terminal/completion) evidence defines the policy and value targets; the bundle plus its identity and provenance metadata becomes one replay sample.

7Replay and reanalyse

7.1 The replay buffer

SGSPI's replay buffer is deliberately richer than a plain step buffer, because it must support reanalysis — re-running search on stored states later. Each SearchReplaySample stores not only the targets but everything needed to reconstruct the exact root state and to reason about a target's trustworthiness and age:

Group	Fields (representative)
Identity	`problem_index`, `state_key`, `phase2_prefix_var_ids`, the stored dynamic observation.
Action	`action_slot` (local) and `action_var_id` (stable).
Targets	`policy_target`(+mask), `q_target`(+mask), `value_target`; plus, when available, `mc_return_target` and exact `astar_*` labels.
Provenance & quality	`target_source`, `sample_quality`, per-action `action_quality`, transposition flags.
Age / version	`target_network_version`, `created_update_step`, `reanalyse_count`, `target_version`.
Consistency	optional successor metadata for a Bellman-consistency loss term.

The buffer is a FIFO ring with stable replay ids and handles, an index by (problem_index, state_key), optional deduplication by state, and several sampling modes: uniform, quality-stratified, and priority sampling that scores a sample by a configurable mix of its target quality, its age, the gap between its target network version and the current one, and policy entropy. Priority sampling is what lets training spend its compute where the targets are weakest or stalest.

7.2 Reanalyse — the policy-iteration engine

Search produces a fixed target at collection time, reflecting whatever the network knew then. Reanalyse keeps the buffer from going stale by regenerating targets with the improved network:

Sample replay entries (often by priority — prefer stale or low-quality ones).
Reconstruct the root state by replaying (problem_index, phase2_prefix_var_ids) through the phase-2 oracle.
Verify the reconstructed state's key and dynamic observation against what was stored — if the feature schema or state disagrees, fail loudly rather than silently overwrite.
Run a fresh search (with the current network, and possibly a larger reanalyse budget).
Rebuild targets and replace them in place, preserving the sample's replay identity, while keeping episode-derived fields like MC returns and exact A* labels.

This is what makes SGSPI policy iteration rather than one-shot target generation: a state collected early — when the heuristic was poor and the search shallow — can have its target sharpened repeatedly as the network strengthens, without ever re-running the original episode. The strict verification in step 3 is essential; a silent target replacement under a changed observation schema would corrupt the buffer far more insidiously than a crash.

8The loss

A single shared SearchPILoss regresses the network's heads toward the stored targets. It is a sum of masked, modular terms — each term applies only where its target is present, so a sample that lacks (say) a consistency target simply does not contribute to that term, and no method is forced to fabricate labels for another method's loss:

Term	Supervises	SGSPI default
Policy (CE / KL)	policy logits → π_tgt	on
Value (SmoothL1)	`V`_θ → search value target (priority order favors the search target, then A*, then MC return)	on
Q (masked SmoothL1)	`Q`_θ → searched `q`_tgt on the searched mask	on
Rank (pairwise)	relative ordering of candidate Qs	on
Bellman consistency	`Q`(`s`,`a`) ≈ −1 + `V`(`s`′) across replay-linked successors	optional / after warm start
BC anchor	keep the policy near a behavior-cloning reference	optional (often a decaying coefficient)
Reference V	auxiliary value regularization	optional

Two points are worth stressing. First, the loss never infers behavior from the planner name — it reads batch fields and config, so the same loss code serves both SGSPI and SGTS. Second, value supervision priority is configurable rather than hard-coded: SGSPI prefers its search-derived value target (refreshed by reanalyse), whereas SGTS prefers the realized Monte-Carlo return; the order is a config list (e.g. ["search", "astar", "mc_return"]) so neither method's value semantics is bent into the other's. Quality-aware weighting scales each sample's contribution by its target quality, so weakly grounded targets pull on the network less than terminal-grounded ones.

9The training loop and evaluation

The shared trainer ties the pieces together into the loop sketched in §2:

for cycle in training:
    collect fresh search-improved samples        # run SGSPI search at visited states
    push samples to replay

    if reanalyse scheduled:
        refresh old replay targets               # re-search stored roots with current θ

    if replay is warm:
        run gradient updates from replay          # SearchPILoss minibatches

    periodically:
        evaluate raw policy / search policy / classical baselines
        checkpoint

Collection. An episode collector walks a problem with the network, and at chosen states runs the SGSPI search to produce a sample; the acting move can be the search's selected action, or the raw policy, possibly annealed from search-driven to greedy over training. Collection can run single-process (the default, correctness-first path) or, where supported in the shared framework, as an asynchronous multi-worker pool with a learner-side batched inference server — the collector is a swappable service, independent of the planner.

Evaluation is paired and reported in phase-2 pivot counts on the same held-out problems, so the comparison is apples-to-apples:

Raw policy — the deployed network's greedy pivots, with no search. This is the number that ultimately matters.
Search-amplified policy — the network plus SGSPI search at decision time. This is an upper bound on what the raw policy could distil to, and a check that search is actually improving on the prior.
Classical baselines — Dantzig / Devex / steepest on the same problems, the bar to beat.

A healthy run shows the search-amplified policy comfortably ahead of the raw policy (search is finding real improvements) and the raw policy closing the gap over time (those improvements are being distilled), ideally past the classical baselines.

10Positioning, relatives, and caveats

10.1 Lineage

SGSPI sits at the intersection of two traditions. From heuristic search it takes the A*-style best-first frontier, the transposition table, and the shortest-path framing of pivot minimization. From search-based RL (AlphaZero / MuZero, and specifically the expert-iteration view) it takes the structure of using a learned prior and value to guide a bounded search, distilling the search's improved decision back into the network, and amplifying through reanalyse. What it deliberately drops from the MuZero side is the learned dynamics model: the simplex environment hands us exact, cheap transitions, so SGSPI plans against the real solver and spends its modeling capacity entirely on guidance (policy, value, Q) rather than on predicting the next state.

10.2 SGSPI vs. SGTS

The two SearchPI planners answer slightly different questions and are tuned for different regimes. They share all infrastructure but keep separate search loops.

SGTS — Sampled Gumbel Tree Search

Root simple-regret over a Gumbel-sampled subset, raced by Sequential Halving.
Sampled tree with local cycle checks; no global transposition table.
Completed-Q targets: unsearched actions completed to a neutral v_mix baseline, preserving policy support.
Value target prioritizes the realized MC return.
Natural fit for high-throughput async collection with batched leaf inference.

SGSPI — Simplex Graph Search Policy Iteration

Best-first graph search with f=g+αh_θ; budget by frontier expansion.
Mandatory transposition pruning / reopen; evidence shared across root actions.
Searched-Q softmax targets: unsearched actions get zero mass by default (optional uniform floor).
Value target prioritizes the search-derived expectation, refreshed by reanalyse.
Rich replay + reanalyse as a first-class policy-iteration mechanism; classical completion for early grounding.

As a rough rule of thumb: SGTS is the better fit when the branching factor is high and full enumeration is hopeless, so sampled root racing with conservative completed-Q targets shines; SGSPI is the better fit when repeated bases are common enough that transposition reuse pays for itself and the shortest-path structure can be exploited directly.

10.3 Caveats and failure modes

The heuristic is not admissible. Frontier-bound stopping is a budget heuristic, not a proof of the optimal pivot path; SGSPI finds good first pivots, not certified-optimal ones.
Evidence attribution through transpositions is delicate. A terminal reached via a transposition must credit every root action that can reach it; a single state_key → evidence map without root-arrival metadata silently loses credit. This is the easiest correctness property to break.
Reanalyse depends on strict identity. Replay reconstruction must verify the state key and observation; relaxing that risks refreshing a target under a state that is not the one the sample describes.
Early targets can be weak. Before the value head is trained, most leaf evidence is bootstrap-only and is gated out of the policy/value targets; classical completion and exact A* labels are the main sources of grounding until the heuristic catches up.
More moving parts than model-free RL. The oracle, transposition table, proposal mixture, completion operator, rich replay, and reanalyse scheduler are all load-bearing. The payoff is sample-efficiency and a principled improvement operator; the cost is a larger surface to get right.

Source of record: src/simplexrl/searchpi/ — planner planners/sgspi.py, oracle oracle.py, proposals proposals.py, targets targets.py, replay replay.py, reanalyse reanalyse.py, loss loss.py, trainer trainer.py, eval eval.py; design notes under docs/design/unified_search_policy_iteration/; reference configs under configs/2026-05-29/sgspi-*.toml and configs/debug/sgspi.toml. This note describes SGSPI as of the current searchpi branch; if the code and this document disagree, the code wins.