Reinforcement Learning & PPO Fine-Tuning
Once the "Brain" has acquired basic market common sense through Supervised Fine-Tuning (SFT), it undergoes Reinforcement Learning (RL) via Proximal Policy Optimization (PPO).
While SFT teaches the model to recognize patterns (e.g., "this looks like a bottom"), the PPO phase teaches the agent execution logic and risk management. It transitions the model from a passive predictor to an active agent that optimizes for long-term Sharpe ratio rather than just next-candle accuracy.
1. The PPO Training Loop
The RL training occurs in a vectorized environment, simulating thousands of trades across historical Nifty 500 data to refine the agent's decision-making.
Main Script: src/train_ppo_optimized.py
How to Run
To start the RL fine-tuning process, ensure you have a pre-trained SFT model checkpoint available:
```shell
# Execute the optimized PPO training loop
python -m src.train_ppo_optimized
```
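The core update inside this training loop is PPO's clipped surrogate objective. As a minimal sketch (not the project's actual implementation, and the function name is illustrative), the loss clips the policy ratio so a single batch cannot push the policy too far from the SFT-initialized behavior:

```python
import numpy as np

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (negated, so it can be minimized)."""
    # Probability ratio between the current and the data-collecting policy.
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    # Clipping keeps the update within a trust region of +/- clip_eps.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

With identical old and new policies the ratio is 1 and the loss reduces to the negative mean advantage; when the ratio exceeds `1 + clip_eps` on a positive-advantage sample, the clipped term caps the gradient incentive.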
2. Training Philosophy: SFT to PPO
The agent follows a two-stage evolution:
- Stage 1 (SFT): The model learns from "Golden Labels" (ZigZag hindsight). This establishes a 77% baseline accuracy in trend identification.
- Stage 2 (PPO): The model is "pushed" into a live-market simulator. It receives rewards for profitable trades and penalties for drawdowns or excessive volatility. This stage "fine-tunes" the LSTM weights to prioritize capital preservation.
3. Key RL Components
| Component | Description |
| :--- | :--- |
| Action Space | Discrete (3): 0: SELL/SHORT, 1: HOLD/NEUTRAL, 2: BUY/LONG. |
| Observation Space | A $(50, 4)$ tensor representing the last 50 candles (OHLC + Indicators). |
| Vectorized Env | High-performance GPU-bound environment that simulates multiple tickers simultaneously. |
| Reward Function | A composite score based on Realized PnL, Unrealized Drawdown, and Win Rate. |
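The table above can be sketched as a minimal single-ticker environment. This is an illustrative skeleton, not the project's vectorized GPU implementation: the class name, the placeholder observation builder, and the per-step PnL reward are assumptions; only the `Discrete(3)` action encoding and the $(50, 4)$ observation shape come from the table.

```python
import numpy as np

class TradingEnvSketch:
    """Single-ticker sketch of the env described above (assumed details)."""
    N_ACTIONS = 3            # 0: SELL/SHORT, 1: HOLD/NEUTRAL, 2: BUY/LONG
    WINDOW, FEATURES = 50, 4  # last 50 candles x 4 features

    def __init__(self, prices):
        self.prices = np.asarray(prices, dtype=np.float64)
        self.t = self.WINDOW

    def _observation(self):
        # Placeholder features: tile the price window into a (50, 4) tensor.
        window = self.prices[self.t - self.WINDOW : self.t]
        return np.tile(window[:, None], (1, self.FEATURES))

    def step(self, action):
        direction = {0: -1.0, 1: 0.0, 2: 1.0}[action]
        # Reward sketch: PnL of holding `direction` over one candle.
        pnl = direction * (self.prices[self.t] - self.prices[self.t - 1])
        self.t += 1
        done = self.t >= len(self.prices)
        return (self._observation() if not done else None), pnl, done
```

A vectorized version would batch this `step` across many tickers at once, which is what makes the GPU-bound environment fast.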
4. Reward Shaping (Risk Modeling)
The PPO agent is not just optimized for raw profit. The internal reward mechanism includes:
- Directional Accuracy: Positive reinforcement for entering trades that move in the predicted direction.
- Risk-Adjusted Return: Scaling rewards by the volatility of the asset to prevent "gambling" on high-beta stocks.
- Time Penalty: A small negative bias for holding positions too long without price movement, encouraging capital efficiency.
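The three shaping terms above might combine as follows. This is a hedged sketch: the function name, coefficients, and exact formula are illustrative assumptions, not the project's tuned reward.

```python
def shaped_reward(realized_pnl, asset_vol, holding_bars, price_moved,
                  time_penalty=0.001, min_vol=1e-6):
    """Composite reward sketch: risk-adjusted PnL minus an idle-time penalty."""
    # Risk-adjusted return: divide PnL by volatility so wins on
    # high-beta stocks are not over-rewarded ("gambling" deterrent).
    reward = realized_pnl / max(asset_vol, min_vol)
    # Time penalty: small negative bias while a position idles
    # without price movement, encouraging capital efficiency.
    if holding_bars > 0 and not price_moved:
        reward -= time_penalty * holding_bars
    return reward
```

Directional accuracy is captured implicitly here through the sign of `realized_pnl`; a trade against the predicted direction yields a negative numerator.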
5. Configuration & Artifacts
Input Weights
The RL loop automatically looks for the latest SFT weights to initialize the policy:
- Path: `checkpoints_sft/final_sft_model.pth`
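The warm-start logic amounts to copying every parameter that exists in both the SFT checkpoint and the PPO policy, and skipping the rest (e.g. an SFT classification head the policy does not use). A dictionary-level sketch of that logic (the real code would use `torch.load` and `load_state_dict`; all key names below are hypothetical):

```python
def warm_start(policy_state, sft_state):
    """Copy overlapping parameters from an SFT checkpoint into the PPO policy."""
    loaded, skipped = [], []
    for name, tensor in sft_state.items():
        if name in policy_state:
            policy_state[name] = tensor   # shared layer: reuse SFT weights
            loaded.append(name)
        else:
            skipped.append(name)          # SFT-only layer: not in the policy
    return loaded, skipped
```

This mirrors `load_state_dict(..., strict=False)` in PyTorch, which likewise reports missing and unexpected keys instead of failing.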
Output Checkpoints
During training, the system monitors the validation reward and saves the best-performing iteration:
- Best RL Model: `checkpoints/best_ppo.ckpt` (this is the file used by `trader_alpaca.py` for live trading).
6. Monitoring Training
You can track the agent's learning progress (Policy Loss, Value Loss, and Mean Reward) using the integrated logging:
- Logs: Saved in `logs/ppo_tensorboard/`
- Key Metric: Look for the Mean Episode Reward to trend upward; this indicates the agent is successfully learning to navigate market volatility without blowing up the simulated account.
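Because single-episode rewards are noisy, "trending upward" is usually judged on a rolling mean rather than raw values. A minimal tracker one could use alongside the TensorBoard logs (class name and window size are assumptions, not part of the project):

```python
from collections import deque

class RewardMonitor:
    """Rolling mean episode reward over a fixed window (sketch)."""
    def __init__(self, window=100):
        self.rewards = deque(maxlen=window)

    def add(self, episode_reward):
        self.rewards.append(episode_reward)

    def mean(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    def improving(self, earlier_mean):
        # Upward trend: current rolling mean beats an earlier snapshot.
        return self.mean() > earlier_mean
```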