Reinforcement Learning & PPO Fine-Tuning
Once the "Brain" has acquired basic market common sense through Supervised Fine-Tuning (SFT), it undergoes Reinforcement Learning (RL) via Proximal Policy Optimization (PPO).
While SFT teaches the model to recognize patterns (e.g., "this looks like a bottom"), the PPO phase teaches the agent execution logic and risk management. It transitions the model from a passive predictor to an active agent that optimizes for long-term Sharpe ratio rather than just next-candle accuracy.
1. The PPO Training Loop
The RL training occurs in a vectorized environment, simulating thousands of trades across historical Nifty 500 data to refine the agent's decision-making.
Main Script: src/train_ppo_optimized.py
How to Run
To start the RL fine-tuning process, ensure you have a pre-trained SFT model checkpoint available:
```shell
# Execute the optimized PPO training loop
python -m src.train_ppo_optimized
```
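The core update inside this training loop is PPO's clipped surrogate objective. As a minimal sketch (not the project's actual implementation, and the function name is illustrative), the loss clips the policy ratio so a single batch cannot push the policy too far from the SFT-initialized behavior:

```python
import numpy as np

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (negated, so it can be minimized)."""
    # Probability ratio between the current and the data-collecting policy.
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    # Clipping keeps the update within a trust region of +/- clip_eps.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

With identical old and new policies the ratio is 1 and the loss reduces to the negative mean advantage; when the ratio exceeds `1 + clip_eps` on a positive-advantage sample, the clipped term caps the gradient incentive.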
2. Training Philosophy: SFT to PPO
The agent follows a two-stage evolution:
- Stage 1 (SFT): The model learns from "Golden Labels" (ZigZag hindsight). This establishes a 77% baseline accuracy in trend identification.
- Stage 2 (PPO): The model is "pushed" into a live-market simulator. It receives rewards for profitable trades and penalties for drawdowns or excessive volatility. This stage "fine-tunes" the LSTM weights to prioritize capital preservation.
3. Key RL Components
| Component | Description |
| :--- | :--- |
| Action Space | Discrete (3): 0: SELL/SHORT, 1: HOLD/NEUTRAL, 2: BUY/LONG. |
| Observation Space | A $(50, 4)$ tensor representing the last 50 candles (OHLC + Indicators). |
| Vectorized Env | High-performance GPU-bound environment that simulates multiple tickers simultaneously. |
| Reward Function | A composite score based on Realized PnL, Unrealized Drawdown, and Win Rate. |
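The table above can be sketched as a minimal single-ticker environment. This is an illustrative skeleton, not the project's vectorized GPU implementation: the class name, the placeholder observation builder, and the per-step PnL reward are assumptions; only the `Discrete(3)` action encoding and the $(50, 4)$ observation shape come from the table.

```python
import numpy as np

class TradingEnvSketch:
    """Single-ticker sketch of the env described above (assumed details)."""
    N_ACTIONS = 3            # 0: SELL/SHORT, 1: HOLD/NEUTRAL, 2: BUY/LONG
    WINDOW, FEATURES = 50, 4  # last 50 candles x 4 features

    def __init__(self, prices):
        self.prices = np.asarray(prices, dtype=np.float64)
        self.t = self.WINDOW

    def _observation(self):
        # Placeholder features: tile the price window into a (50, 4) tensor.
        window = self.prices[self.t - self.WINDOW : self.t]
        return np.tile(window[:, None], (1, self.FEATURES))

    def step(self, action):
        direction = {0: -1.0, 1: 0.0, 2: 1.0}[action]
        # Reward sketch: PnL of holding `direction` over one candle.
        pnl = direction * (self.prices[self.t] - self.prices[self.t - 1])
        self.t += 1
        done = self.t >= len(self.prices)
        return (self._observation() if not done else None), pnl, done
```

A vectorized version would batch this `step` across many tickers at once, which is what makes the GPU-bound environment fast.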
4. Reward Shaping (Risk Modeling)
The PPO agent is not just optimized for raw profit. The internal reward mechanism includes:
- Directional Accuracy: Positive reinforcement for entering trades that move in the predicted direction.
- Risk-Adjusted Return: Scaling rewards by the volatility of the asset to prevent "gambling" on high-beta stocks.
- Time Penalty: A small negative bias for holding positions too long without price movement, encouraging capital efficiency.
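The three shaping terms above might combine as follows. This is a hedged sketch: the function name, coefficients, and exact formula are illustrative assumptions, not the project's tuned reward.

```python
def shaped_reward(realized_pnl, asset_vol, holding_bars, price_moved,
                  time_penalty=0.001, min_vol=1e-6):
    """Composite reward sketch: risk-adjusted PnL minus an idle-time penalty."""
    # Risk-adjusted return: divide PnL by volatility so wins on
    # high-beta stocks are not over-rewarded ("gambling" deterrent).
    reward = realized_pnl / max(asset_vol, min_vol)
    # Time penalty: small negative bias while a position idles
    # without price movement, encouraging capital efficiency.
    if holding_bars > 0 and not price_moved:
        reward -= time_penalty * holding_bars
    return reward
```

Directional accuracy is captured implicitly here through the sign of `realized_pnl`; a trade against the predicted direction yields a negative numerator.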
5. Configuration & Artifacts
Input Weights
The RL loop automatically looks for the latest SFT weights to initialize the policy:
- Path: `checkpoints_sft/final_sft_model.pth`
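The warm-start logic amounts to copying every parameter that exists in both the SFT checkpoint and the PPO policy, and skipping the rest (e.g. an SFT classification head the policy does not use). A dictionary-level sketch of that logic (the real code would use `torch.load` and `load_state_dict`; all key names below are hypothetical):

```python
def warm_start(policy_state, sft_state):
    """Copy overlapping parameters from an SFT checkpoint into the PPO policy."""
    loaded, skipped = [], []
    for name, tensor in sft_state.items():
        if name in policy_state:
            policy_state[name] = tensor   # shared layer: reuse SFT weights
            loaded.append(name)
        else:
            skipped.append(name)          # SFT-only layer: not in the policy
    return loaded, skipped
```

This mirrors `load_state_dict(..., strict=False)` in PyTorch, which likewise reports missing and unexpected keys instead of failing.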
Output Checkpoints
During training, the system monitors the validation reward and saves the best-performing iteration:
- Best RL Model: `checkpoints/best_ppo.ckpt` (this is the file used by `trader_alpaca.py` for live trading).
6. Monitoring Training
You can track the agent's learning progress (Policy Loss, Value Loss, and Mean Reward) using the integrated logging:
- Logs: Saved in `logs/ppo_tensorboard/`
- Key Metric: Look for the Mean Episode Reward to trend upward; this indicates the agent is successfully learning to navigate market volatility without blowing up the simulated account.
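Because single-episode rewards are noisy, "trending upward" is usually judged on a rolling mean rather than raw values. A minimal tracker one could use alongside the TensorBoard logs (class name and window size are assumptions, not part of the project):

```python
from collections import deque

class RewardMonitor:
    """Rolling mean episode reward over a fixed window (sketch)."""
    def __init__(self, window=100):
        self.rewards = deque(maxlen=window)

    def add(self, episode_reward):
        self.rewards.append(episode_reward)

    def mean(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    def improving(self, earlier_mean):
        # Upward trend: current rolling mean beats an earlier snapshot.
        return self.mean() > earlier_mean
```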