Introduction
Can an AI learn to beat the stock market? This project explores that question by building a Deep Reinforcement Learning (DRL) agent that learns portfolio allocation strategies through trial and error - much as humans learn, but at machine speed.
Starting with $100,000 in virtual capital, I trained an AI agent to manage a diversified portfolio of 10 assets including tech giants (AAPL, MSFT, GOOGL, AMZN, NVDA), financial institutions (JPM, BAC), and safe havens (Gold, Bonds, S&P 500 ETF). The goal: beat the market benchmark (S&P 500) while managing risk.
The result? The best-performing agent achieved an 86.94% return over the 2024-2025 test period, outperforming the S&P 500 by 38.35 percentage points with a Sharpe ratio of 1.617 - significantly better than the market's risk-adjusted returns.
Understanding the Challenge
Before diving into the technical details, let me explain what makes this problem difficult and why traditional approaches often fall short.
The Complexity of Financial Markets
Traditional portfolio management relies on human intuition, fundamental analysis, and rule-based strategies. But markets are complex, non-linear systems where patterns shift constantly. Reinforcement Learning offers a different approach: let the agent discover optimal strategies by interacting with historical market data.
1. High-Dimensional State Space (121 Features)
What does this mean? Imagine trying to make a decision while keeping track of 121 different pieces of information simultaneously. That's what the AI agent faces every single day.
In simple terms: Every trading day, the agent needs to consider:
- How much cash it currently has
- How many shares it owns of each of the 10 stocks
- The current price of each stock
- 10 different technical indicators for each stock (like momentum, volatility, trends)
That's 1 (cash) + 10 (holdings) + 10 (prices) + 100 (10 indicators x 10 stocks) = 121 numbers to process before making any decision.
Why is this hard? Humans struggle to process more than about seven pieces of information at once. The AI needs to find patterns in 121 dimensions - infeasible for a human, but tractable for a neural network.
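The 121-number breakdown above can be sketched directly. A minimal example (the `build_state` function and its layout are illustrative, not FinRL's internal API):

```python
import numpy as np

N_ASSETS = 10       # stocks/ETFs in the portfolio
N_INDICATORS = 10   # technical indicators tracked per asset

def build_state(cash, holdings, prices, indicators):
    """Flatten one trading day's observation into the 121-dim state vector."""
    return np.concatenate([
        [cash],              # 1 value:   current cash balance
        holdings,            # 10 values: shares held per asset
        prices,              # 10 values: current price per asset
        indicators.ravel(),  # 100 values: 10 indicators x 10 assets
    ])

state = build_state(
    cash=100_000.0,
    holdings=np.zeros(N_ASSETS),
    prices=np.full(N_ASSETS, 150.0),
    indicators=np.zeros((N_ASSETS, N_INDICATORS)),
)
print(state.shape)  # (121,)
```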
2. Continuous Action Space
What does this mean? The agent doesn't just decide "buy" or "sell" - it decides how much to buy or sell for each of the 10 stocks.
In simple terms: Traditional trading bots might say "Buy Apple" or "Sell Google." But our agent outputs something like:
- Apple: +0.7 (buy 70% of maximum allowed)
- Google: -0.3 (sell 30% of holdings)
- NVIDIA: +0.9 (buy aggressively)
- Gold: +0.1 (buy a little)
- ...and so on for all 10 assets
This is much more nuanced than binary decisions. The agent can express confidence levels - a strong buy signal vs. a weak one.
Why is this hard? There are infinite possible combinations of actions. The agent must learn which combinations work best in different market conditions.
3. Non-Stationary Environment
What does this mean? The rules of the game keep changing. What worked yesterday might not work tomorrow.
In simple terms: Markets evolve constantly. A strategy that made money during the 2020 COVID crash might lose money during the 2024 bull run. The agent trained on data from 2019-2023 needs to perform well on 2024-2025 data it has never seen.
Why is this hard? Unlike chess or video games where rules are fixed, financial markets are influenced by news, politics, human psychology, and countless other factors. The agent must learn generalizable patterns, not just memorize past data.
4. Risk Management: The Sharpe Ratio
What does this mean? Making money isn't enough - we need to make money consistently, without wild swings.
In simple terms: Imagine two traders:
- Trader A: Makes 50% return but had moments where they lost 40% of their portfolio
- Trader B: Makes 40% return but never lost more than 15%
Which is better? The Sharpe Ratio helps answer this. It measures return per unit of risk. A higher Sharpe ratio means better risk-adjusted returns.
The formula (simplified): Sharpe Ratio = (Your Returns - Risk-Free Rate) / Volatility of Returns
Target: A Sharpe ratio above 1.0 is considered good. Above 1.5 is excellent. Our best agent achieved 1.617.
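The formula above, as a sketch in code (annualized over 252 trading days, a standard convention; the sample return series are made up to show the contrast between the two traders):

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a series of daily portfolio returns."""
    excess = np.asarray(daily_returns) - risk_free_rate / periods_per_year
    # mean excess return per unit of volatility, scaled to a yearly figure
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

# steady small gains -> high Sharpe; similar mean with wild swings -> low Sharpe
steady = [0.004, 0.005, 0.006, 0.005, 0.004, 0.006]
wild = [0.05, -0.04, 0.06, -0.05, 0.04, -0.04]
print(sharpe_ratio(steady) > sharpe_ratio(wild))  # True
```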
5. Maximum Drawdown
What does this mean? The worst peak-to-trough decline in portfolio value.
In simple terms: If your portfolio grows from $100,000 to $150,000, then drops to $120,000, your drawdown is:
- Peak: $150,000
- Trough: $120,000
- Drawdown: ($150,000 - $120,000) / $150,000 = 20%
Why it matters: Even if you end up profitable, a 50% drawdown means you lost half your money at some point. Most investors can't stomach that. Our target was to keep drawdown below 25%.
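The worked example above, in code (a minimal sketch using a running peak):

```python
import numpy as np

def max_drawdown(portfolio_values):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    values = np.asarray(portfolio_values, dtype=float)
    running_peak = np.maximum.accumulate(values)      # highest value seen so far
    drawdowns = (running_peak - values) / running_peak
    return drawdowns.max()

# the $100,000 -> $150,000 -> $120,000 example from above
print(max_drawdown([100_000, 150_000, 120_000]))  # 0.2
```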
The Data: Building a Robust Foundation
The project uses 6+ years of market data (2019-2025), including the COVID-19 crash - a crucial stress test for any trading strategy.
Features Engineered (17 Total)
I didn't just feed raw stock prices to the AI. I engineered meaningful features that capture different aspects of market behavior:
Base OHLCV Data (7 features):
- Open: Price at market open
- High: Highest price of the day
- Low: Lowest price of the day
- Close: Price at market close
- Volume: Number of shares traded
- Date: When the trading happened
- Ticker: Which stock (AAPL, MSFT, etc.)
Technical Indicators (8 features):
These are mathematical formulas that traders use to identify patterns:
- MACD (Moving Average Convergence Divergence): Shows momentum - is the stock gaining or losing steam?
- RSI (Relative Strength Index): Measures if a stock is "overbought" (too expensive) or "oversold" (potentially cheap). Ranges from 0-100.
- Bollinger Bands: Creates upper and lower "bands" around the price (counted as two features, one per band). When price touches a band, it might reverse direction.
- CCI (Commodity Channel Index): Identifies cyclical trends - is the stock in an uptrend or downtrend?
- DX (Directional Index): Measures trend strength - how strong is the current trend?
- SMA-30 (30-day Simple Moving Average): Average price over last 30 days - smooths out noise
- SMA-60 (60-day Simple Moving Average): Average price over last 60 days - shows longer-term trend
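Several of these indicators boil down to a few lines of arithmetic. In the project they come from FinRL's feature-engineering pipeline; the hand-rolled pandas sketch below (column names assumed) just shows the math behind SMA, MACD, and a 14-day RSI:

```python
import pandas as pd

def add_indicators(prices: pd.Series) -> pd.DataFrame:
    """Compute SMA-30/60, MACD, and a 14-day RSI from daily closing prices."""
    df = pd.DataFrame({"close": prices})
    df["sma_30"] = df["close"].rolling(30).mean()  # smooths short-term noise
    df["sma_60"] = df["close"].rolling(60).mean()  # longer-term trend
    # MACD: fast EMA minus slow EMA (12/26 days are the classic settings)
    ema_fast = df["close"].ewm(span=12, adjust=False).mean()
    ema_slow = df["close"].ewm(span=26, adjust=False).mean()
    df["macd"] = ema_fast - ema_slow
    # RSI: average gain vs average loss over 14 days, mapped onto 0-100
    delta = df["close"].diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return df

# a strictly rising price: SMA trails the price, MACD is positive, RSI pins at 100
features = add_indicators(pd.Series(range(1, 101), dtype=float))
```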
Market Stress Indicators (2 features):
- VIX (Volatility Index): Known as the "fear gauge." When VIX is high, markets are scared. When low, markets are calm. Range in our data: 11.54 (calm) to 82.69 (COVID panic).
- Turbulence Index: Measures how unusual current market behavior is compared to historical norms. High turbulence = unusual market conditions.
Why these features matter: Raw prices alone don't tell the full story. These indicators help the AI understand context - is the market trending up? Is it volatile? Is a stock overbought? This context helps make better decisions.
The Reinforcement Learning Environment
Now let's look at how the AI actually learns.
Framework: FinRL with Stable-Baselines3
FinRL is a specialized library for financial reinforcement learning. Stable-Baselines3 provides battle-tested RL algorithms. Together, they handle the complex infrastructure so I could focus on the trading logic.
Algorithm: Proximal Policy Optimization (PPO)
What is PPO? It's a state-of-the-art reinforcement learning algorithm developed by OpenAI. Think of it as a careful learner that makes small, stable improvements rather than wild changes.
In simple terms: Imagine learning to ride a bike. You could:
- Option A: Make huge adjustments each time you wobble (might fall)
- Option B: Make small, careful adjustments (slower but safer)
PPO is like Option B. It limits how much the agent can change its behavior in each learning step, preventing catastrophic forgetting of good strategies.
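The "small, careful adjustments" idea is literally a clipping operation in PPO's loss. A toy version of the clipped surrogate objective (pure numpy, not Stable-Baselines3's actual implementation):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO's per-step objective: the new-to-old policy probability ratio
    is clipped to [1 - eps, 1 + eps], capping how far one update can move."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)  # take the more pessimistic value

# a proposed 2x jump in action probability only earns credit as if it were 1.2x
print(clipped_surrogate(ratio=2.0, advantage=1.0))  # 1.2
```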
State Space: 121 Dimensions
Every day, the agent observes:
State = [Cash] + [Holdings for 10 stocks] + [Prices for 10 stocks] + [10 indicators x 10 stocks]
= 1 + 10 + 10 + 100 = 121 numbers
This is the agent's "view" of the world - all the information it uses to make decisions.
Action Space: 10 Continuous Values [-1, 1]
The agent outputs 10 numbers, one for each stock:
- -1: Sell maximum allowed shares
- 0: Hold (do nothing)
- +1: Buy maximum allowed shares
- Values in between: Partial buy/sell
Example output:
[AAPL: +0.8, MSFT: +0.3, GOOGL: -0.5, AMZN: +0.1, NVDA: +0.9,
JPM: -0.2, BAC: 0.0, GLD: +0.4, TLT: +0.6, SPY: -0.1]
This means: Buy Apple and NVIDIA aggressively, sell some Google, hold Bank of America, etc.
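In FinRL-style environments, each raw action is scaled by a per-day share cap before execution. A sketch of that mapping (the `hmax=100` cap is an assumed value, not the project's actual setting):

```python
import numpy as np

HMAX = 100  # max shares tradable per asset per day (assumed for illustration)

def actions_to_orders(actions, hmax=HMAX):
    """Map raw policy outputs in [-1, 1] to integer share orders:
    +1 -> buy hmax shares, -1 -> sell hmax shares, 0 -> hold."""
    actions = np.clip(np.asarray(actions, dtype=float), -1.0, 1.0)
    return (actions * hmax).astype(int)

# e.g. AAPL strong buy, GOOGL partial sell, BAC hold
print(actions_to_orders([0.8, -0.5, 0.0]))  # [ 80 -50   0]
```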
Reward: Portfolio Value Change
How does the agent learn what's good? Through rewards.
After each action, the agent receives a reward based on how much the portfolio value changed. If the portfolio went up, positive reward. If it went down, negative reward.
In simple terms: It's like training a dog. Good behavior (making money) earns a treat (positive reward). Bad behavior (losing money) earns a penalty (negative reward). Over millions of iterations, the agent learns which actions lead to treats.
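In code the reward is one line. FinRL-style environments scale the raw dollar change down by a small constant so the numbers stay in a range neural networks handle well; the `1e-4` value below is illustrative:

```python
def step_reward(prev_value: float, new_value: float,
                reward_scaling: float = 1e-4) -> float:
    """Reward = scaled change in total portfolio value after one trading day."""
    return (new_value - prev_value) * reward_scaling

# portfolio grew $1,000 -> small positive reward; a loss would be negative
print(step_reward(100_000, 101_000) > 0)  # True
print(step_reward(100_000, 99_000) < 0)   # True
```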
The Experimental Journey: From Baseline to Best
This is where it gets interesting. I didn't just train one model - I conducted 10 systematic experiments, each teaching me something new. Here's the journey:
Experiment 1: The Baseline (200K Timesteps)
The Setup: I started with default settings. 200,000 timesteps of training (about 5.5 minutes), standard PPO hyperparameters.
The Question: Can a basic RL agent beat the market?
The Results:
- Total Return: 48.28%
- Sharpe Ratio: 1.016
- Max Drawdown: -26.15%
- vs SPY: -0.31%

What I Learned: The agent matched the market almost exactly but didn't beat it. The Sharpe ratio just barely crossed 1.0, and the drawdown exceeded my 25% target. Not bad for a first attempt, but clearly room for improvement.
The Insight: The agent was learning something, but maybe it needed more training time to discover better strategies.
Experiment 2: More Training (500K Timesteps)
The Hypothesis: If 200K timesteps got us to market-level performance, maybe 500K would push us ahead.
The Results:
- Total Return: 60.93% (up)
- Sharpe Ratio: 1.07 (up)
- Max Drawdown: -30.30% (worse)
- vs SPY: +12.34% (up)

What I Learned: More training definitely helped returns (+12% vs SPY!), but the drawdown got worse. The agent was making more money but taking bigger risks to do it.
The Insight: Simply training longer isn't enough. The agent might be learning aggressive strategies that work most of the time but occasionally blow up.
Experiment 3: Lower Learning Rate
The Hypothesis: Maybe the agent is learning too fast and missing subtle patterns. A lower learning rate (0.0001 instead of 0.0003) might help it learn more carefully.
The Results:
- Total Return: 48.02% (down)
- Sharpe Ratio: 0.982 (down)
- Max Drawdown: -27.70%
- vs SPY: -0.56%

What I Learned: This was a step backward. The agent learned too slowly and ended up at baseline-level performance despite the same 500K timesteps.
The Insight: The original learning rate was already good. Don't fix what isn't broken. The problem wasn't learning speed - it was something else.
Experiment 4: Different Algorithm (A2C)
The Hypothesis: Maybe PPO isn't the best algorithm for this problem. Let me try A2C (Advantage Actor-Critic), which is often more stable.
The Results:
- Total Return: 54.84%
- Sharpe Ratio: 1.018
- Max Drawdown: -28.89%
- vs SPY: +6.25%

What I Learned: A2C was faster to train (27% less time) and beat the market by 6%, but PPO with 500K timesteps still performed better. A2C had slightly lower drawdown though.
The Insight: PPO seems better suited for this environment, but A2C isn't bad. The algorithm choice matters less than I thought - hyperparameters might be more important.
Experiment 5: The Breakthrough - High Entropy
The Hypothesis: I noticed the agent was often putting all its money into just 1-2 stocks. This is risky - if those stocks crash, everything crashes. What if I forced the agent to explore more diverse strategies?
In RL, the "entropy coefficient" controls how much the agent explores vs. exploits known strategies. Higher entropy = more exploration = more diverse actions.
The Change: Increased entropy coefficient from 0.01 (default) to 0.05 (5x higher).
The Results:
- Total Return: 62.91% (up)
- Sharpe Ratio: 1.333 (significantly up)
- Max Drawdown: -23.06% (better!)
- vs SPY: +14.32% (up)

What I Learned: This was the breakthrough! Higher entropy forced the agent to maintain a diversified portfolio instead of betting everything on one stock. The result: better returns AND lower risk.
The Insight: The agent was being too greedy, concentrating in winners. By forcing exploration, it learned that diversification actually leads to better risk-adjusted returns. This matches traditional finance wisdom!
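In Stable-Baselines3 this experiment is a single constructor argument (e.g. `PPO(..., ent_coef=0.05)`). The toy calculation below shows why a larger entropy bonus favors spread-out allocations, using Shannon entropy as the measure of diversification:

```python
import numpy as np

def entropy(allocation):
    """Shannon entropy of a portfolio allocation; higher = more diversified."""
    p = np.asarray(allocation, dtype=float)
    p = p[p > 0]                       # ignore empty positions
    return -(p * np.log(p)).sum()

concentrated = [0.97, 0.01, 0.01, 0.01]   # nearly all-in on one asset
diversified  = [0.25, 0.25, 0.25, 0.25]   # even spread across four assets

# a higher ent_coef pays a larger bonus for high-entropy (diverse) behavior
print(entropy(diversified) > entropy(concentrated))  # True
```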
Experiment 6: The Best Configuration (1M Timesteps + High Entropy)
The Hypothesis: If high entropy works so well, what happens if I combine it with even more training?
The Results:
- Total Return: 86.94% (excellent)
- Sharpe Ratio: 1.617 (excellent)
- Max Drawdown: -21.62% (best!)
- vs SPY: +38.35% (excellent)

What I Learned: This was the winning combination. The agent nearly doubled the initial investment, crushed the S&P 500 benchmark, and did it with the lowest drawdown of any experiment.
The Insight: The combination of sufficient training time (1M timesteps) and forced diversification (high entropy) was the key. Neither alone was enough - they worked synergistically.
Experiment 7: Seed Sensitivity Test
The Question: Was the success of Experiment 6 just luck? RL algorithms use random number generators, and different "seeds" can lead to very different results.
The Test: I trained the same configuration with a different random seed (456 instead of 42).
The Results:
- Total Return: 66.62% (vs 86.94%)
- Sharpe Ratio: 1.406 (vs 1.617)
- Max Drawdown: -21.75%
- vs SPY: +18.03%

What I Learned: Still excellent performance, but noticeably different from seed 42. The same configuration can produce returns ranging from 66% to 87% depending on random initialization.
The Insight: Never trust a single training run. Always test multiple seeds and report the range of results. The "best" result might just be lucky.
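The takeaway as a loop. `train_and_backtest` below is a hypothetical stand-in for the full train/backtest pipeline; the lambda simply replays the two observed runs from Experiments 6-7:

```python
def return_range(train_and_backtest, seeds):
    """Run one configuration under several seeds and report the spread."""
    results = {s: train_and_backtest(seed=s) for s in seeds}
    return min(results.values()), max(results.values())

# replaying the two observed runs: seed 42 -> 86.94%, seed 456 -> 66.62%
observed = {42: 86.94, 456: 66.62}
lo, hi = return_range(lambda seed: observed[seed], seeds=(42, 456))
print(f"{lo}% to {hi}%")  # 66.62% to 86.94%
```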
Experiments 8-9: Ensemble Methods
The Hypothesis: If different seeds produce different results, maybe combining multiple agents would be more robust. An "ensemble" averages the predictions of several agents.
Experiment 8: 3-Agent Ensemble
- Total Return: 54.10%
- Sharpe Ratio: 0.932
- Max Drawdown: -33.02%
- vs SPY: +5.51%
Experiment 9: 5-Agent Ensemble
- Total Return: 48.53%
- Sharpe Ratio: 0.828
- Max Drawdown: -33.34%
- vs SPY: -0.06%

What I Learned: Surprisingly, the ensembles performed WORSE than single agents! The 5-agent ensemble barely matched the market.
Why did this happen? When agents disagree, their signals cancel out:
Strong buy opportunity (NVDA):
Agent 456: BUY (+0.9) <- Best agent sees it
Agent 42: BUY (+0.6)
Agent 789: HOLD (+0.1)
Agent 123: SELL (-0.4)
Agent 1011: SELL (-0.3)
─────────────────────────
Ensemble: WEAK BUY (+0.18) -> Missed the big move!
The Insight: Simple averaging doesn't work when one agent is much better than others. The best agent's strong signals get diluted by weaker agents' noise. More sophisticated ensemble techniques (weighted voting, meta-learning) might work better.
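The dilution above is literally just a mean, and the weighted-voting alternative is a one-line change. A sketch (the weights, standing in for each agent's validation Sharpe, are illustrative values):

```python
import numpy as np

# the NVDA disagreement from the table above, one value per agent
agent_actions = np.array([0.9, 0.6, 0.1, -0.4, -0.3])

# simple averaging: the strong buy gets diluted to a weak one
equal_vote = agent_actions.mean()
print(round(equal_vote, 2))  # 0.18

# weighted voting: trust better agents more (weights are illustrative)
validation_sharpe = np.array([1.62, 1.41, 0.95, 0.85, 0.80])
weighted_vote = np.average(agent_actions, weights=validation_sharpe)
print(weighted_vote > equal_vote)  # True
```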
Experiment 10: Investigating the Star Performer
The Discovery: During ensemble testing, I noticed Agent 456 (seed=456) achieved an incredible 90.7% return with just 500K timesteps - even better than my "best" 1M timestep model!
The Investigation: I trained seed 456 with 1M timesteps to see if it would improve further.
The Results:
- 500K timesteps: 90.7% return
- 1M timesteps: 66.6% return (worse!)
What I Learned: More training actually hurt this particular seed. It was overfitting - memorizing the training data instead of learning generalizable patterns.
The Insight: There's no universal "best" configuration. Different seeds have different optimal training durations. This is why systematic experimentation is crucial.
Summary of All Experiments
| # | Experiment | Return | Sharpe | Max DD | vs SPY | Key Learning |
|---|---|---|---|---|---|---|
| 1 | Baseline 200K | 48.28% | 1.016 | -26.15% | -0.31% | Starting point |
| 2 | Longer 500K | 60.93% | 1.07 | -30.30% | +12.34% | More training helps returns |
| 3 | Lower LR | 48.02% | 0.98 | -27.70% | -0.56% | Original LR was fine |
| 4 | A2C Algorithm | 54.84% | 1.02 | -28.89% | +6.25% | PPO > A2C here |
| 5 | High Entropy | 62.91% | 1.33 | -23.06% | +14.32% | Breakthrough! |
| 6 | 1M + High Entropy | 86.94% | 1.617 | -21.62% | +38.35% | Best overall |
| 7 | Seed 456 (1M) | 66.62% | 1.41 | -21.75% | +18.03% | Seed matters |
| 8 | Ensemble (3) | 54.10% | 0.93 | -33.02% | +5.51% | Averaging hurts |
| 9 | Ensemble (5) | 48.53% | 0.83 | -33.34% | -0.06% | More agents = worse |
| 10 | Seed 456 (500K) | 90.7%* | 1.41 | -25.5% | +42%* | Overfitting risk |
*From ensemble individual agent results
Key Takeaways
1. Entropy Coefficient is Critical
The most important discovery: higher entropy (0.05 vs 0.01) dramatically improved performance by forcing diversification.
2. More Training Isn't Always Better
1M timesteps with the right config worked well, but some agents performed worse with more training (overfitting).
3. Simple Ensemble Averaging Fails
When one agent is much better, averaging dilutes its strong signals with weaker agents' noise.
4. Seed Sensitivity is Real
The same configuration with different seeds produced returns ranging from 47% to 90%. Never trust a single run.
5. Domain Knowledge Helps
VIX integration and technical indicators improved performance. The AI benefits from human-engineered features.
Final Results
| Metric | Target | Best Agent | Status |
|---|---|---|---|
| Total Return | Beat SPY | 86.94% vs 48.59% | +38.35% |
| Sharpe Ratio | > 1.0 | 1.617 | Exceeded by 62% |
| Max Drawdown | < 25% | -21.62% | Beat target |
Technologies Used
- FinRL: Financial Reinforcement Learning framework
- Stable-Baselines3: State-of-the-art RL algorithms (PPO, A2C)
- yfinance: Yahoo Finance data downloader
- Docker: Reproducible environment
- Alpaca API: Paper trading deployment
Conclusion
This project demonstrates that Deep Reinforcement Learning can discover profitable trading strategies that outperform traditional benchmarks. The key insight - that higher entropy leads to better diversification and risk management - has implications beyond just trading.
However, it's important to remember that markets are adversarial environments. Strategies that worked in the past may not work in the future. The goal isn't to find a "money printer" but to understand how AI can assist in complex decision-making under uncertainty.
"The market can remain irrational longer than you can remain solvent." - John Maynard Keynes
Always paper trade extensively before using real money. Past backtest performance does not guarantee future results.