New ask Hacker News story: CPU-only PPO solving TSPLIB lin318 in 20 mins (0.08% gap)
2 by jivaprime | 0 comments on Hacker News.
Hi all, I’ve put together a repo demonstrating how to train PPO directly on a single TSPLIB instance (lin318) from scratch, without pre-training or GPUs.

Repo: https://ift.tt/zLPYs4f

1. Experiment Setup

Problem: TSPLIB lin318 (Opt: 42,029) & rd400
Hardware: Google Colab (CPU only)
Model: Single-instance PPO policy + value network, starting from random initialization.
Local Search: Light 2-opt during training, Numba-accelerated 3-opt for evaluation (see the sketch at the end of this post).
Core Concept: Instead of a "stable average-error minimizer," this policy is designed as a high-variance explorer. The goal isn't to keep the average gap low, but to occasionally "spike" very low-error tours that local search can polish.

2. Results: lin318

Best Shot: 42,064 (Gap ≈ +0.08%)
Time: Reached within ~20 minutes on Colab CPU. According to the logs (included in the repo), the sub-0.1% shot appeared around elapsed=0:19:49.
While the average error oscillates around 3–4%, the policy successfully locates a deep basin that 3-opt can exploit.

3. Extended Experiment: Smart ILS & rd400

I extended the pipeline with "Smart ILS" (Iterated Local Search) post-processing to see if we could hit the exact optimum (see the skeleton at the end of this post).

A. lin318 + ILS
Took the PPO-generated tour (0.08% gap) as a seed. Ran Smart ILS for ~20 mins.
Result: Reached the exact optimum (42,029).

B. rd400 + ILS
PPO Phase: ~2 hours on CPU, producing tours with ~1.9% gap.
ILS Phase: Used PPO tours as seeds. Ran for ~40 mins.
Result: Reached a 0.079% gap (cost 15,293 vs. opt 15,281).

Summary

The workflow separates concerns effectively:
PPO: Drives the search into a high-quality basin (1–2% gap).
ILS: Digs deep within that basin to find the optimum.

If you are interested in instance-wise RL, CPU-based optimization, or comparing against ML-TSP baselines (POMO, AM, NeuroLKH), feel free to check out the code. Constructive feedback is welcome!
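To give a feel for the local-search half of the pipeline, here is a minimal, hypothetical 2-opt polish in Python with Numba. It is a sketch of the general technique under assumed names (`tour_length`, `two_opt`) and a precomputed symmetric distance matrix; the repo's actual 2-opt/3-opt code may differ.

```python
# Hypothetical sketch of a "light 2-opt" polish (not the repo's actual code);
# assumes a symmetric, precomputed distance matrix `dist` and `tour` as a permutation array.
import numpy as np
from numba import njit  # Numba is assumed here, mirroring the post's mention of it


@njit(cache=True)
def tour_length(tour, dist):
    """Length of the closed tour `tour` under distance matrix `dist`."""
    n = tour.shape[0]
    total = 0.0
    for i in range(n):
        total += dist[tour[i], tour[(i + 1) % n]]
    return total


@njit(cache=True)
def two_opt(tour, dist):
    """First-improvement 2-opt: keep reversing segments while doing so shortens the tour."""
    n = tour.shape[0]
    improved = True
    while improved:
        improved = False
        for i in range(1, n - 1):
            for j in range(i + 1, n):
                # Gain of replacing edges (i-1, i) and (j, j+1) with (i-1, j) and (i, j+1).
                delta = (dist[tour[i - 1], tour[j]] + dist[tour[i], tour[(j + 1) % n]]
                         - dist[tour[i - 1], tour[i]] - dist[tour[j], tour[(j + 1) % n]])
                if delta < -1e-9:
                    lo, hi = i, j
                    while lo < hi:  # reverse tour[i..j] in place
                        tour[lo], tour[hi] = tour[hi], tour[lo]
                        lo += 1
                        hi -= 1
                    improved = True
    return tour
```

For example, with `coords` as an (n, 2) array of city coordinates, `dist = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))` and `two_opt(np.random.permutation(len(coords)), dist)` polishes a random tour. The reported gaps follow the usual formula: (42,064 − 42,029) / 42,029 ≈ 0.083% for the lin318 best shot.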
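And here is a generic ILS skeleton in the same spirit as the post's description (seed tour → perturb → re-polish → keep improvements). The repo's "Smart ILS" presumably uses a more informed perturbation/acceptance scheme; the double-bridge kick and accept-only-improvements rule below are standard textbook choices, not the repo's, and the code reuses `two_opt` and `tour_length` from the sketch above.

```python
# Generic ILS skeleton (standard double-bridge kick + 2-opt), not the repo's "Smart ILS".
# Reuses two_opt() and tour_length() from the sketch above.
import numpy as np


def double_bridge(tour, rng):
    """Classic 4-opt double-bridge kick: cut the tour into A|B|C|D and reorder to A|C|B|D."""
    n = tour.shape[0]
    i, j, k = np.sort(rng.choice(np.arange(1, n), size=3, replace=False))
    return np.concatenate((tour[:i], tour[j:k], tour[i:j], tour[k:]))


def iterated_local_search(seed_tour, dist, n_iters=1000, seed=0):
    """Kick the incumbent, re-polish with 2-opt, and keep the result only if it improves."""
    rng = np.random.default_rng(seed)
    best = two_opt(seed_tour.copy(), dist)
    best_len = tour_length(best, dist)
    for _ in range(n_iters):
        cand = two_opt(double_bridge(best, rng), dist)
        cand_len = tour_length(cand, dist)
        if cand_len < best_len:  # accept only strict improvements
            best, best_len = cand, cand_len
    return best, best_len
```

In the post's workflow, `seed_tour` would be the best PPO-generated tour rather than a random permutation, and a wall-clock budget (~20–40 min) would replace the fixed `n_iters`.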