SimplexRL · Method Note
A training method that treats phase-2 simplex pivoting as a deterministic shortest-path problem, uses a learned-heuristic graph search over the exact solver to produce improved decisions, and distills those decisions back into the network through a replay-and-reanalyse loop.
The primal simplex algorithm solves a linear program by walking from vertex to vertex of the feasible polytope. At each iteration it must choose an entering variable: a non-basic variable to bring into the basis, which fixes the edge the algorithm traverses next. The choice does not change the optimum the algorithm eventually reaches, but it can change the number of iterations needed to reach it by a large factor. Classical pivot rules make this choice with a fixed heuristic — Dantzig's rule takes the most negative reduced cost; Devex and steepest-edge normalize that by an approximate column norm. SimplexRL asks instead whether a learned policy can choose entering variables so that the solver finishes in fewer pivots, on a target distribution of problems.
The project reduces this to a reinforcement-learning problem: the agent observes the solver's internal state at a pivot and selects an index into the list of eligible entering candidates; the environment wraps the HiGHS LP solver and returns a reward of -1 per pivot, so fewer pivots is better. Because only phase 2 (optimization, once a feasible basis exists) is under the agent's control, all of the methods below concern phase-2 pivoting and measure themselves in phase-2 pivot counts.
Most of the project's other work trains this policy with model-free RL — PPO, A2C, GRPO — or with imitation learning from a classical rule (BC, IQL, A*-supervised). Those methods treat the solver as an opaque environment: they roll out trajectories and adjust the policy from the rewards they happen to observe. SGSPI takes a different stance. It exploits a property that makes this environment unusual for RL:
The transition dynamics are exact, deterministic, and cheap to query locally — a pivot is a single HiGHS phase-2 step — while neural-network inference is comparatively expensive. We do not need to learn a world model; we already have a perfect one. The bottleneck is deciding which transitions are worth simulating.
This is exactly the regime in which search-based policy improvement — the family that includes AlphaZero and MuZero — pays off. Such methods use the network only as a guide (a prior over actions and an estimate of value), spend a search budget probing the real dynamics to compute a better local decision than the raw network would make, and then train the network to imitate that improved decision. Iterating this is a form of expert iteration: search is the expert; the network is the apprentice; over time the apprentice's prior makes the expert's search cheaper and sharper.
SimplexRL has two methods in this family, developed as siblings under a shared package simplexrl.searchpi ("search policy iteration"):
The two share their infrastructure — the solver oracle, observation construction, the network interface, root-action proposals, the replay buffer, the reanalyse machinery, the loss components, and evaluation — but they keep separate search loops, because a Sequential-Halving tree race and a transposition-aware graph search have genuinely different mechanics and correctness concerns. This note focuses on SGSPI; §10 returns to how it differs from SGTS.
SGSPI is one outer algorithm with four repeating parts. The network is never replaced by search at deployment — the policy you ship is the raw network's argmax. Search is purely a training-time operator that manufactures better-than-policy targets.
From a state encountered during collection, run a bounded best-first search over the exact solver, guided by the network's policy/value/Q heads. The search yields value-space evidence about which entering variables lead to short completions.
Convert the per-action search evidence into a policy target (a softmax over searched action values), a scalar value target, and per-candidate Q targets — each tagged with provenance and a quality label.
Push a rich replay sample keyed by problem and basis identity. Sample minibatches and take gradient steps so the network's heads move toward the search-derived targets.
Periodically re-run search from stored replay roots using the current (stronger) network or a larger budget, and replace the targets in place. This is what turns one-shot target generation into genuine policy iteration.
The rest of this note unpacks each part, but the spine is worth holding onto: exact-dynamics search policy iteration with replay and reanalyse. Everything specific to SGSPI lives in how the search is structured (a transposition-aware graph search, §5) and how its evidence becomes targets (searched-Q softmax, §6).
To see why a graph search is the natural planner, it helps to write the phase-2 decision problem out as a Markov decision process and then notice its special structure.
| Element | In phase-2 simplex |
|---|---|
| State s | The current simplex basis (and the derived solver state) for a fixed LP. |
| Action a | Choose one legal entering variable from the current candidate set. |
| Transition T(s,a) | The exact, deterministic HiGHS pivot that results. No stochasticity, no learned model. |
| Reward | -1 per pivot; 0 at an optimal terminal. (Status mismatches are penalized, but on-policy phase-2 search stays on the legal manifold.) |
| Objective | Minimize the number of remaining phase-2 pivots to reach optimality. |
With a per-step reward of -1 and a terminal reward of 0, the optimal value function has a direct combinatorial meaning. The value of a state is the negative of the fewest pivots needed to finish from it, and the action-value is one step worse than the value of the state it leads to:
Two conventions follow and are used everywhere in the code, so they are worth fixing now:
-2. A target of -2 means "two pivots remain," never "+2."The graph is cyclic: pivoting can return to a basis already visited (degeneracy and cycling are real possibilities), and — crucially — different action sequences can reach the same basis. The latter are transpositions. A tree search would explore each path to a shared basis independently, redoing work; a graph search can recognize that two frontier paths have converged and keep only the cheaper one. SGSPI is built around exploiting exactly this.
One implementation invariant pervades the search and replay code. The observation presents candidates in candidate slots — local tensor positions that change every pivot. Search paths, replay keys, transposition keys, and per-action evidence are instead keyed by the internal HiGHS entering-variable id, which is stable across pivots. A phase-2 prefix — a tuple of entering var-ids replayed from the phase-2 start — uniquely identifies a state and is how any state is reconstructed for search or reanalyse. Mixing the two identities is the source of most subtle target bugs, so they are kept strictly separate and remapped explicitly at every node.
SGSPI is architecture-agnostic: it works with any of the project's four pivot-selection models (transformer-GNN, column-attention, tableau, solver-feature-attention). What it requires of the model is a fused three-head interface, exposed through a shared SearchModel wrapper and computed in a single encoder pass per state:
| Head | Output | Role in SGSPI |
|---|---|---|
| Policy | masked logits over candidates → prior πθ(·|s) | Proposes which actions are worth searching (via Gumbel-Top-k); also a distillation target. |
| Value | scalar Vθ(s) | The search heuristic: an estimate of remaining pivots, used to rank the frontier and to evaluate leaves. |
| Q | per-candidate Qθ(s,·) | Proposes high-value actions to search; supervised by searched evidence. Required for SGSPI ([model.q_head].enabled = true). |
The "fused" requirement matters because search calls the model heavily — once per expanded node — so the cores expose a single policy_value_q path that produces all three heads from one encoder pass rather than three separate forwards. The wrapper also owns evaluation/train-mode boundaries and stamps every inference with a model version, so that replay samples can record which network produced their targets (used later by reanalyse and by priority sampling).
A deliberate design point: the value head learns a heuristic for a shortest-path search, not just a critic baseline. The search reads Vθ as "how many pivots remain," so it is most useful when negated and clamped into a non-negative cost-to-go estimate, which is how it enters the frontier priority next.
This is the heart of SGSPI. Given a single root state encountered during collection, the planner runs a bounded, best-first expansion over the real solver to discover, for each candidate root action, the best completion it can cheaply find. The output is not a policy — it is root-action evidence: value-space numbers attached to root var-ids, which §6 turns into targets.
The search is an A*-style best-first expansion. Each node n on the frontier is ordered by a priority that combines the exact cost paid to reach it with a learned estimate of the cost still to go:
The node with smallest f is expanded first (a min-heap frontier). g is genuine — it counts real pivots taken from the root. hθ is the negated value head, clamped to [0, hmax] so a single bad value estimate cannot dominate the queue (h_max defaults to 200). The weight αh (default 1.0) trades exploration breadth against heuristic trust. The policy prior does not appear in the priority; it is used only to propose which actions get pushed.
In classical A*, an admissible (never-overestimating) heuristic guarantees the first terminal popped is optimal. Here hθ is a learned estimate with no such guarantee. SGSPI's search is therefore a strong best-first heuristic search, not a proof of the optimal pivot sequence. The early-stopping rule below (§5.5) is correspondingly a budget heuristic, not an optimality certificate.
Branching factors at a simplex vertex can be large, so SGSPI never expands all candidates. It assembles a small, deduplicated proposal set from several complementary sources, so that the search budget is spent on actions that are plausible under any reasonable lens — the current policy, the current Q estimates, classical pricing rules, or sheer coverage:
At interior (child) nodes the proposal is leaner — GumbelTopK(πθ, kchild) ∪ TopK(Qθ, kchild·q) — because most of the search budget should go to comparing root actions, not to deepening any single branch.
As the frontier expands, the planner maintains a transposition table: a map from a canonical basis identity (the state key, derived from the basic-variable index) to the best g with which that basis has been reached. When a newly generated child lands on a known basis:
This is the mechanism that distinguishes SGSPI from a tree search and is "load-bearing": on the cyclic pivot graph, where shared descendants are common, it prevents the budget from being burned re-deriving the same sub-search down many root actions.
Because transpositions let several root actions share descendants, SGSPI tracks, for each discovered state, which root actions can reach it. When a terminal or a well-evaluated leaf is found, its value-space evidence is propagated back to all qualifying root actions, and each piece of evidence records its provenance — its depth, whether it arrived via a transposition, and its source kind:
Depth is bounded explicitly. With max_depth = 0 the search expands the root actions and evaluates their depth-1 children (root-only search); max_depth = 1 expands depth-1 children and evaluates depth-2 leaves; in general the deepest evaluated leaf sits at about max_depth + 1. Root-only search is the common, cheap configuration; deeper search is available when the budget allows.
Two budgets bound the search: a node budget (root_node_budget, e.g. 32 expansions) and a wall-clock budget (wallclock_budget_ms). Optionally, frontier-bound stopping halts early once the smallest f on the frontier is no better than the best terminal already found — a useful pruning rule, though (per the warning above) only a heuristic one.
The most important optional component is classical completion. Early in training the value head is weak, so leaf estimates are unreliable and almost all evidence is the untrusted "estimate" kind. Completion fixes this by finishing a searched non-terminal leaf with a classical pivot rule (steepest / devex / dantzig) and counting the real suffix length. Critically, it completes after taking a searched first action:
This can reveal a first pivot that leads to a shorter classical-rule suffix than the classical rule would choose at the root — i.e. a genuine improvement over the baseline — while grounding the target in a real terminal rather than a value estimate. Completion is a leaf evaluator / evidence source, never an actor override; its use is probabilistic and decays over training as the learned heuristic matures. A small LRU cache and per-search completion caps keep its cost bounded.
The search returns, per root action, the best value-space evidence it found. SGSPI's default target builder — searched_q_softmax — converts this into candidate-aligned supervised labels. Let S be the set of searched root actions and q(a) the best searched value for action a∈S (in negative-pivot units).
The Q target is simply the searched values placed in their candidate slots, with a mask that is true only for searched actions. The Q head is supervised only where search actually produced evidence:
The policy target is a temperature-controlled softmax over the searched values — the search's own preference ordering among the actions it examined:
By default this puts zero mass on unsearched actions — SGSPI treats their values as undefined, not neutral. An optional uniform floor mixes a small uniform distribution over all legal actions back in, to prevent the prior from collapsing support onto only the searched set:
The scalar value target is the expectation of the searched values under the (pre-floor) policy target — the value the network should predict if it were to act according to the search's preferences:
The policy and value targets are built only from verified evidence — terminal and completion sources, whose pivot counts are grounded in reality. Pure value-bootstrap ("estimate") evidence may still populate the Q target and mask (it is a reasonable regression signal) but is excluded from the softmax that defines the policy and value targets, so the network is never taught to prefer an action whose advantage came only from its own value head. Each sample also carries quality labels (sample-level and per-action) that downstream loss weighting uses to down-weight weakly-grounded targets.
SGSPI's replay buffer is deliberately richer than a plain step buffer, because it must support reanalysis — re-running search on stored states later. Each SearchReplaySample stores not only the targets but everything needed to reconstruct the exact root state and to reason about a target's trustworthiness and age:
| Group | Fields (representative) |
|---|---|
| Identity | problem_index, state_key, phase2_prefix_var_ids, the stored dynamic observation. |
| Action | action_slot (local) and action_var_id (stable). |
| Targets | policy_target(+mask), q_target(+mask), value_target; plus, when available, mc_return_target and exact astar_* labels. |
| Provenance & quality | target_source, sample_quality, per-action action_quality, transposition flags. |
| Age / version | target_network_version, created_update_step, reanalyse_count, target_version. |
| Consistency | optional successor metadata for a Bellman-consistency loss term. |
The buffer is a FIFO ring with stable replay ids and handles, an index by (problem_index, state_key), optional deduplication by state, and several sampling modes: uniform, quality-stratified, and priority sampling that scores a sample by a configurable mix of its target quality, its age, the gap between its target network version and the current one, and policy entropy. Priority sampling is what lets training spend its compute where the targets are weakest or stalest.
Search produces a fixed target at collection time, reflecting whatever the network knew then. Reanalyse keeps the buffer from going stale by regenerating targets with the improved network:
This is what makes SGSPI policy iteration rather than one-shot target generation: a state collected early — when the heuristic was poor and the search shallow — can have its target sharpened repeatedly as the network strengthens, without ever re-running the original episode. The strict verification in step 3 is essential; a silent target replacement under a changed observation schema would corrupt the buffer far more insidiously than a crash.
A single shared SearchPILoss regresses the network's heads toward the stored targets. It is a sum of masked, modular terms — each term applies only where its target is present, so a sample that lacks (say) a consistency target simply does not contribute to that term, and no method is forced to fabricate labels for another method's loss:
| Term | Supervises | SGSPI default |
|---|---|---|
| Policy (CE / KL) | policy logits → πtgt | on |
| Value (SmoothL1) | Vθ → search value target (priority order favors the search target, then A*, then MC return) | on |
| Q (masked SmoothL1) | Qθ → searched qtgt on the searched mask | on |
| Rank (pairwise) | relative ordering of candidate Qs | on |
| Bellman consistency | Q(s,a) ≈ −1 + V(s′) across replay-linked successors | optional / after warm start |
| BC anchor | keep the policy near a behavior-cloning reference | optional (often a decaying coefficient) |
| Reference V | auxiliary value regularization | optional |
Two points are worth stressing. First, the loss never infers behavior from the planner name — it reads batch fields and config, so the same loss code serves both SGSPI and SGTS. Second, value supervision priority is configurable rather than hard-coded: SGSPI prefers its search-derived value target (refreshed by reanalyse), whereas SGTS prefers the realized Monte-Carlo return; the order is a config list (e.g. ["search", "astar", "mc_return"]) so neither method's value semantics is bent into the other's. Quality-aware weighting scales each sample's contribution by its target quality, so weakly grounded targets pull on the network less than terminal-grounded ones.
The shared trainer ties the pieces together into the loop sketched in §2:
for cycle in training:
collect fresh search-improved samples # run SGSPI search at visited states
push samples to replay
if reanalyse scheduled:
refresh old replay targets # re-search stored roots with current θ
if replay is warm:
run gradient updates from replay # SearchPILoss minibatches
periodically:
evaluate raw policy / search policy / classical baselines
checkpoint
Collection. An episode collector walks a problem with the network, and at chosen states runs the SGSPI search to produce a sample; the acting move can be the search's selected action, or the raw policy, possibly annealed from search-driven to greedy over training. Collection can run single-process (the default, correctness-first path) or, where supported in the shared framework, as an asynchronous multi-worker pool with a learner-side batched inference server — the collector is a swappable service, independent of the planner.
Evaluation is paired and reported in phase-2 pivot counts on the same held-out problems, so the comparison is apples-to-apples:
A healthy run shows the search-amplified policy comfortably ahead of the raw policy (search is finding real improvements) and the raw policy closing the gap over time (those improvements are being distilled), ideally past the classical baselines.
SGSPI sits at the intersection of two traditions. From heuristic search it takes the A*-style best-first frontier, the transposition table, and the shortest-path framing of pivot minimization. From search-based RL (AlphaZero / MuZero, and specifically the expert-iteration view) it takes the structure of using a learned prior and value to guide a bounded search, distilling the search's improved decision back into the network, and amplifying through reanalyse. What it deliberately drops from the MuZero side is the learned dynamics model: the simplex environment hands us exact, cheap transitions, so SGSPI plans against the real solver and spends its modeling capacity entirely on guidance (policy, value, Q) rather than on predicting the next state.
The two SearchPI planners answer slightly different questions and are tuned for different regimes. They share all infrastructure but keep separate search loops.
As a rough rule of thumb: SGTS is the better fit when the branching factor is high and full enumeration is hopeless, so sampled root racing with conservative completed-Q targets shines; SGSPI is the better fit when repeated bases are common enough that transposition reuse pays for itself and the shortest-path structure can be exploited directly.
state_key → evidence map without root-arrival metadata silently loses credit. This is the easiest correctness property to break.
Source of record: src/simplexrl/searchpi/ — planner planners/sgspi.py, oracle oracle.py, proposals proposals.py,
targets targets.py, replay replay.py, reanalyse reanalyse.py, loss loss.py, trainer trainer.py, eval eval.py;
design notes under docs/design/unified_search_policy_iteration/; reference configs under configs/2026-05-29/sgspi-*.toml and configs/debug/sgspi.toml.
This note describes SGSPI as of the current searchpi branch; if the code and this document disagree, the code wins.