PPO vs SAC: the two default RL algorithms for continuous control

For continuous control, PPO and SAC are the two algorithms most people reach for by default. Both ultimately aim to maximize return (SAC adds an entropy bonus on top), but they make opposite design choices about how data is collected and reused.

PPO: on-policy. Stable, easy to tune, parallelizes well, but discards each batch of rollouts after a few gradient epochs, so sample efficiency is low.
SAC: off-policy, maximum-entropy. Reuses data from a replay buffer, so it is far more sample-efficient and explores well, but it is more sensitive to hyperparameters.
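
To make the data-reuse difference concrete, here is a minimal sketch of the kind of replay buffer an off-policy method like SAC samples from. The class name `ReplayBuffer` and its methods are illustrative, not taken from any particular library; a real implementation would store arrays of states, actions, rewards, and next states rather than opaque Python objects.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions (hypothetical, for illustration)."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement; old and new data mix freely,
        # which is what lets an off-policy learner reuse each transition many times.
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)
```

An on-policy method like PPO would instead fill a rollout list with fresh data from the current policy, run its update, and then clear the list.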

PPO keeps each update small by clipping how far the new policy can move from the old one. With the probability ratio \(r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)},\) it maximizes

\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t\,\hat{A}_t,\; \text{clip}(r_t,\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],\]

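where \(\hat{A}_t\) is the advantage estimate (how much better an action was than expected) and the clip range \(\epsilon\), typically around 0.2, removes any incentive to push \(r_t\) outside \([1-\epsilon,\,1+\epsilon]\), which caps the effective step size.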
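
The clipped objective is easy to check by hand. The following is a minimal sketch in plain Python, assuming the per-sample log-probabilities and advantages have already been computed; the function name `ppo_clip_objective` and the default `eps=0.2` are illustrative choices, not part of any library.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A); to be maximized."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r_t(theta), computed from log-probs for numerical stability.
        ratio = math.exp(lp_new - lp_old)
        # clip(r_t, 1 - eps, 1 + eps)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # The min keeps the more pessimistic of the two surrogate values.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For example, with a ratio of 1.5 and an advantage of +1 the objective is capped at 1.2, so there is no gain from moving the policy further. With a ratio of 1.5 and an advantage of -1 the unclipped value -1.5 is kept, so a move in the wrong direction is still fully penalized. In practice the same expression would be written in an autodiff framework and its negative minimized.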

SAC instead maximizes return plus policy entropy \(\mathcal{H}\), so it keeps exploring rather than committing too early:

\[J(\pi) = \sum_t \mathbb{E}\big[\,r(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\,\big],\]

where \(r(s_t,a_t)\) is the reward (not PPO's probability ratio \(r_t\)) and the temperature \(\alpha\) controls how much the policy is rewarded for staying random.
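
As a rough illustration of the entropy bonus, the sketch below evaluates the bracketed term of \(J(\pi)\) along a single trajectory. It assumes a plain diagonal Gaussian policy, whose entropy has a closed form. Practical SAC implementations typically use a tanh-squashed Gaussian and estimate entropy from sampled \(-\log \pi(a\mid s)\) instead, so treat this as a simplification. The function names and `alpha=0.2` are illustrative.

```python
import math


def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian with the given per-dimension stds."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)


def soft_return(rewards, policy_stds, alpha=0.2):
    """Sum over t of r_t + alpha * H(pi(.|s_t)) for one trajectory (no discounting)."""
    return sum(
        r + alpha * gaussian_entropy(stds)
        for r, stds in zip(rewards, policy_stds)
    )
```

With identical rewards, a policy that keeps a wider action distribution scores a higher soft return, which is the pressure that stops SAC from collapsing to a near-deterministic policy too early. Setting `alpha=0` recovers the ordinary return.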

How I choose:

Fast parallel sim where steps are cheap → PPO
Expensive steps, such as real hardware, where sample efficiency matters → SAC
Discrete or hybrid action spaces → PPO (standard SAC assumes continuous actions)