Introduction
Can an AI learn to beat the stock market? This project explores that question by building a Deep Reinforcement Learning (DRL) agent that learns portfolio allocation strategies through trial and error - much as humans learn, but at machine speed.
Starting with $100,000 in virtual capital, I trained an AI agent to manage a diversified portfolio of 10 assets including tech giants (AAPL, MSFT, GOOGL, AMZN, NVDA), financial institutions (JPM, BAC), and safe havens (Gold, Bonds, S&P 500 ETF). The goal: beat the market benchmark (S&P 500) while managing risk.
The result? The best-performing agent achieved an 86.94% return over the 2024-2025 test period, outperforming the S&P 500 by 38.35 percentage points with a Sharpe ratio of 1.617 - significantly better than the market's risk-adjusted returns.
Understanding the Challenge
Before diving into the technical details, let me explain what makes this problem difficult and why traditional approaches often fall short.
The Complexity of Financial Markets
Traditional portfolio management relies on human intuition, fundamental analysis, and rule-based strategies. But markets are complex, non-linear systems where patterns shift constantly. Reinforcement Learning offers a different approach: let the agent discover optimal strategies by interacting with historical market data.
1. High-Dimensional State Space (121 Features)
What does this mean? Imagine trying to make a decision while keeping track of 121 different pieces of information simultaneously. That's what the AI agent faces every single day.
In simple terms: Every trading day, the agent needs to consider:
- How much cash it currently has
- How many shares it owns of each of the 10 stocks
- The current price of each stock
- 10 different technical indicators for each stock (like momentum, volatility, trends)
That's 1 (cash) + 10 (holdings) + 10 (prices) + 100 (10 indicators x 10 stocks) = 121 numbers to process before making any decision.
Why is this hard? Humans struggle to process more than about seven pieces of information at once. The AI needs to find patterns in 121 dimensions - infeasible for a human, but tractable for a neural network.
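The 121-number breakdown above can be sketched directly. A minimal example (the `build_state` function and its layout are illustrative, not FinRL's internal API):

```python
import numpy as np

N_ASSETS = 10       # stocks/ETFs in the portfolio
N_INDICATORS = 10   # technical indicators tracked per asset

def build_state(cash, holdings, prices, indicators):
    """Flatten one trading day's observation into the 121-dim state vector."""
    return np.concatenate([
        [cash],              # 1 value:   current cash balance
        holdings,            # 10 values: shares held per asset
        prices,              # 10 values: current price per asset
        indicators.ravel(),  # 100 values: 10 indicators x 10 assets
    ])

state = build_state(
    cash=100_000.0,
    holdings=np.zeros(N_ASSETS),
    prices=np.full(N_ASSETS, 150.0),
    indicators=np.zeros((N_ASSETS, N_INDICATORS)),
)
print(state.shape)  # (121,)
```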
2. Continuous Action Space
What does this mean? The agent doesn't just decide "buy" or "sell" - it decides how much to buy or sell for each of the 10 stocks.
In simple terms: Traditional trading bots might say "Buy Apple" or "Sell Google." But our agent outputs something like:
- Apple: +0.7 (buy 70% of maximum allowed)
- Google: -0.3 (sell 30% of holdings)
- NVIDIA: +0.9 (buy aggressively)
- Gold: +0.1 (buy a little)
- ...and so on for all 10 assets
This is much more nuanced than binary decisions. The agent can express confidence levels - a strong buy signal vs. a weak one.
Why is this hard? There are infinite possible combinations of actions. The agent must learn which combinations work best in different market conditions.
3. Non-Stationary Environment
What does this mean? The rules of the game keep changing. What worked yesterday might not work tomorrow.
In simple terms: Markets evolve constantly. A strategy that made money during the 2020 COVID crash might lose money during the 2024 bull run. The agent trained on data from 2019-2023 needs to perform well on 2024-2025 data it has never seen.
Why is this hard? Unlike chess or video games where rules are fixed, financial markets are influenced by news, politics, human psychology, and countless other factors. The agent must learn generalizable patterns, not just memorize past data.
4. Risk Management: The Sharpe Ratio
What does this mean? Making money isn't enough - we need to make money consistently, without wild swings.
In simple terms: Imagine two traders:
- Trader A: Makes 50% return but had moments where they lost 40% of their portfolio
- Trader B: Makes 40% return but never lost more than 15%
Which is better? The Sharpe Ratio helps answer this. It measures return per unit of risk. A higher Sharpe ratio means better risk-adjusted returns.
The formula (simplified): Sharpe Ratio = (Your Returns - Risk-Free Rate) / Volatility of Returns
Target: A Sharpe ratio above 1.0 is considered good. Above 1.5 is excellent. Our best agent achieved 1.617.
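The formula above, as a sketch in code (annualized over 252 trading days, a standard convention; the sample return series are made up to show the contrast between the two traders):

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a series of daily portfolio returns."""
    excess = np.asarray(daily_returns) - risk_free_rate / periods_per_year
    # mean excess return per unit of volatility, scaled to a yearly figure
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

# steady small gains -> high Sharpe; similar mean with wild swings -> low Sharpe
steady = [0.004, 0.005, 0.006, 0.005, 0.004, 0.006]
wild = [0.05, -0.04, 0.06, -0.05, 0.04, -0.04]
print(sharpe_ratio(steady) > sharpe_ratio(wild))  # True
```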
5. Maximum Drawdown
What does this mean? The worst peak-to-trough decline in portfolio value.
In simple terms: If your portfolio grows from $100,000 to $150,000, then drops to $120,000, your drawdown is:
- Peak: $150,000
- Trough: $120,000
- Drawdown: ($150,000 - $120,000) / $150,000 = 20%
Why it matters: Even if you end up profitable, a 50% drawdown means you lost half your money at some point. Most investors can't stomach that. Our target was to keep drawdown below 25%.
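The worked example above, in code (a minimal sketch using a running peak):

```python
import numpy as np

def max_drawdown(portfolio_values):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    values = np.asarray(portfolio_values, dtype=float)
    running_peak = np.maximum.accumulate(values)      # highest value seen so far
    drawdowns = (running_peak - values) / running_peak
    return drawdowns.max()

# the $100,000 -> $150,000 -> $120,000 example from above
print(max_drawdown([100_000, 150_000, 120_000]))  # 0.2
```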
The Data: Building a Robust Foundation
The project uses 6+ years of market data (2019-2025), including the COVID-19 crash - a crucial stress test for any trading strategy.
Features Engineered (17 Total)
I didn't just feed raw stock prices to the AI. I engineered meaningful features that capture different aspects of market behavior:
Base OHLCV Data (7 features):
- Open: Price at market open
- High: Highest price of the day
- Low: Lowest price of the day
- Close: Price at market close
- Volume: Number of shares traded
- Date: When the trading happened
- Ticker: Which stock (AAPL, MSFT, etc.)
Technical Indicators (8 features):
These are mathematical formulas that traders use to identify patterns:
- MACD (Moving Average Convergence Divergence): Shows momentum - is the stock gaining or losing steam?
- RSI (Relative Strength Index): Measures if a stock is "overbought" (too expensive) or "oversold" (potentially cheap). Ranges from 0-100.
- Bollinger Bands: Creates upper and lower "bands" around the price (counted as two features, one per band). When price touches a band, it might reverse direction.
- CCI (Commodity Channel Index): Identifies cyclical trends - is the stock in an uptrend or downtrend?
- DX (Directional Index): Measures trend strength - how strong is the current trend?
- SMA-30 (30-day Simple Moving Average): Average price over last 30 days - smooths out noise
- SMA-60 (60-day Simple Moving Average): Average price over last 60 days - shows longer-term trend
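Several of these indicators boil down to a few lines of arithmetic. In the project they come from FinRL's feature-engineering pipeline; the hand-rolled pandas sketch below (column names assumed) just shows the math behind SMA, MACD, and a 14-day RSI:

```python
import pandas as pd

def add_indicators(prices: pd.Series) -> pd.DataFrame:
    """Compute SMA-30/60, MACD, and a 14-day RSI from daily closing prices."""
    df = pd.DataFrame({"close": prices})
    df["sma_30"] = df["close"].rolling(30).mean()  # smooths short-term noise
    df["sma_60"] = df["close"].rolling(60).mean()  # longer-term trend
    # MACD: fast EMA minus slow EMA (12/26 days are the classic settings)
    ema_fast = df["close"].ewm(span=12, adjust=False).mean()
    ema_slow = df["close"].ewm(span=26, adjust=False).mean()
    df["macd"] = ema_fast - ema_slow
    # RSI: average gain vs average loss over 14 days, mapped onto 0-100
    delta = df["close"].diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return df

# a strictly rising price: SMA trails the price, MACD is positive, RSI pins at 100
features = add_indicators(pd.Series(range(1, 101), dtype=float))
```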
Market Stress Indicators (2 features):
- VIX (Volatility Index): Known as the "fear gauge." When VIX is high, markets are scared. When low, markets are calm. Range in our data: 11.54 (calm) to 82.69 (COVID panic).
- Turbulence Index: Measures how unusual current market behavior is compared to historical norms. High turbulence = unusual market conditions.
Why these features matter: Raw prices alone don't tell the full story. These indicators help the AI understand context - is the market trending up? Is it volatile? Is a stock overbought? This context helps make better decisions.
The Reinforcement Learning Environment
Now let's look at how the AI actually learns.
Framework: FinRL with Stable-Baselines3
FinRL is a specialized library for financial reinforcement learning. Stable-Baselines3 provides battle-tested RL algorithms. Together, they handle the complex infrastructure so I could focus on the trading logic.
Algorithm: Proximal Policy Optimization (PPO)
What is PPO? It's a state-of-the-art reinforcement learning algorithm developed by OpenAI. Think of it as a careful learner that makes small, stable improvements rather than wild changes.
In simple terms: Imagine learning to ride a bike. You could:
- Option A: Make huge adjustments each time you wobble (might fall)
- Option B: Make small, careful adjustments (slower but safer)
PPO is like Option B. It limits how much the agent can change its behavior in each learning step, preventing catastrophic forgetting of good strategies.
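The "small, careful adjustments" idea is literally a clipping operation in PPO's loss. A toy version of the clipped surrogate objective (pure numpy, not Stable-Baselines3's actual implementation):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO's per-step objective: the new-to-old policy probability ratio
    is clipped to [1 - eps, 1 + eps], capping how far one update can move."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)  # take the more pessimistic value

# a proposed 2x jump in action probability only earns credit as if it were 1.2x
print(clipped_surrogate(ratio=2.0, advantage=1.0))  # 1.2
```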
State Space: 121 Dimensions
Every day, the agent observes:
State = [Cash] + [Holdings for 10 stocks] + [Prices for 10 stocks] + [10 indicators x 10 stocks]
= 1 + 10 + 10 + 100 = 121 numbers
This is the agent's "view" of the world - all the information it uses to make decisions.
Action Space: 10 Continuous Values [-1, 1]
The agent outputs 10 numbers, one for each stock:
- -1: Sell maximum allowed shares
- 0: Hold (do nothing)
- +1: Buy maximum allowed shares
- Values in between: Partial buy/sell
Example output:
[AAPL: +0.8, MSFT: +0.3, GOOGL: -0.5, AMZN: +0.1, NVDA: +0.9,
JPM: -0.2, BAC: 0.0, GLD: +0.4, TLT: +0.6, SPY: -0.1]
This means: Buy Apple and NVIDIA aggressively, sell some Google, hold Bank of America, etc.
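In FinRL-style environments, each raw action is scaled by a per-day share cap before execution. A sketch of that mapping (the `hmax=100` cap is an assumed value, not the project's actual setting):

```python
import numpy as np

HMAX = 100  # max shares tradable per asset per day (assumed for illustration)

def actions_to_orders(actions, hmax=HMAX):
    """Map raw policy outputs in [-1, 1] to integer share orders:
    +1 -> buy hmax shares, -1 -> sell hmax shares, 0 -> hold."""
    actions = np.clip(np.asarray(actions, dtype=float), -1.0, 1.0)
    return (actions * hmax).astype(int)

# e.g. AAPL strong buy, GOOGL partial sell, BAC hold
print(actions_to_orders([0.8, -0.5, 0.0]))  # [ 80 -50   0]
```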
Reward: Portfolio Value Change
How does the agent learn what's good? Through rewards.
After each action, the agent receives a reward based on how much the portfolio value changed. If the portfolio went up, positive reward. If it went down, negative reward.
In simple terms: It's like training a dog. Good behavior (making money) earns a treat (positive reward). Bad behavior (losing money) earns a penalty (negative reward). Over millions of iterations, the agent learns which actions lead to treats.
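In code the reward is one line. FinRL-style environments scale the raw dollar change down by a small constant so the numbers stay in a range neural networks handle well; the `1e-4` value below is illustrative:

```python
def step_reward(prev_value: float, new_value: float,
                reward_scaling: float = 1e-4) -> float:
    """Reward = scaled change in total portfolio value after one trading day."""
    return (new_value - prev_value) * reward_scaling

# portfolio grew $1,000 -> small positive reward; a loss would be negative
print(step_reward(100_000, 101_000) > 0)  # True
print(step_reward(100_000, 99_000) < 0)   # True
```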
The Experimental Journey: From Baseline to Best
This is where it gets interesting. I didn't just train one model - I conducted 10 systematic experiments, each teaching me something new. Here's the journey:
Experiment 1: The Baseline (200K Timesteps)
The Setup: I started with default settings. 200,000 timesteps of training (about 5.5 minutes), standard PPO hyperparameters.
The Question: Can a basic RL agent beat the market?
The Results:
- Total Return: 48.28%
- Sharpe Ratio: 1.016
- Max Drawdown: -26.15%
- vs SPY: -0.31%

What I Learned: The agent matched the market almost exactly but didn't beat it. The Sharpe ratio just barely crossed 1.0, and the drawdown exceeded my 25% target. Not bad for a first attempt, but clearly room for improvement.
The Insight: The agent was learning something, but maybe it needed more training time to discover better strategies.
Experiment 2: More Training (500K Timesteps)
The Hypothesis: If 200K timesteps got us to market-level performance, maybe 500K would push us ahead.
The Results:
- Total Return: 60.93% (up)
- Sharpe Ratio: 1.07 (up)
- Max Drawdown: -30.30% (worse)
- vs SPY: +12.34% (up)

What I Learned: More training definitely helped returns (+12% vs SPY!), but the drawdown got worse. The agent was making more money but taking bigger risks to do it.
The Insight: Simply training longer isn't enough. The agent might be learning aggressive strategies that work most of the time but occasionally blow up.
Experiment 3: Lower Learning Rate
The Hypothesis: Maybe the agent is learning too fast and missing subtle patterns. A lower learning rate (0.0001 instead of 0.0003) might help it learn more carefully.
The Results:
- Total Return: 48.02% (down)
- Sharpe Ratio: 0.982 (down)
- Max Drawdown: -27.70%
- vs SPY: -0.56%

What I Learned: This was a step backward. The agent learned too slowly and ended up at baseline-level performance despite the same 500K timesteps.
The Insight: The original learning rate was already good. Don't fix what isn't broken. The problem wasn't learning speed - it was something else.
Experiment 4: Different Algorithm (A2C)
The Hypothesis: Maybe PPO isn't the best algorithm for this problem. Let me try A2C (Advantage Actor-Critic), which is often more stable.
The Results:
- Total Return: 54.84%
- Sharpe Ratio: 1.018
- Max Drawdown: -28.89%
- vs SPY: +6.25%

What I Learned: A2C was faster to train (27% less time) and beat the market by 6%, but PPO with 500K timesteps still performed better. A2C had slightly lower drawdown though.
The Insight: PPO seems better suited for this environment, but A2C isn't bad. The algorithm choice matters less than I thought - hyperparameters might be more important.
Experiment 5: The Breakthrough - High Entropy
The Hypothesis: I noticed the agent was often putting all its money into just 1-2 stocks. This is risky - if those stocks crash, everything crashes. What if I forced the agent to explore more diverse strategies?
In RL, the "entropy coefficient" controls how much the agent explores vs. exploits known strategies. Higher entropy = more exploration = more diverse actions.
The Change: Increased entropy coefficient from 0.01 (default) to 0.05 (5x higher).
The Results:
- Total Return: 62.91% (up)
- Sharpe Ratio: 1.333 (significantly up)
- Max Drawdown: -23.06% (better!)
- vs SPY: +14.32% (up)

What I Learned: This was the breakthrough! Higher entropy forced the agent to maintain a diversified portfolio instead of betting everything on one stock. The result: better returns AND lower risk.
The Insight: The agent was being too greedy, concentrating in winners. By forcing exploration, it learned that diversification actually leads to better risk-adjusted returns. This matches traditional finance wisdom!
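In Stable-Baselines3 this experiment is a single constructor argument (e.g. `PPO(..., ent_coef=0.05)`). The toy calculation below shows why a larger entropy bonus favors spread-out allocations, using Shannon entropy as the measure of diversification:

```python
import numpy as np

def entropy(allocation):
    """Shannon entropy of a portfolio allocation; higher = more diversified."""
    p = np.asarray(allocation, dtype=float)
    p = p[p > 0]                       # ignore empty positions
    return -(p * np.log(p)).sum()

concentrated = [0.97, 0.01, 0.01, 0.01]   # nearly all-in on one asset
diversified  = [0.25, 0.25, 0.25, 0.25]   # even spread across four assets

# a higher ent_coef pays a larger bonus for high-entropy (diverse) behavior
print(entropy(diversified) > entropy(concentrated))  # True
```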
Experiment 6: The Best Configuration (1M Timesteps + High Entropy)
The Hypothesis: If high entropy works so well, what happens if I combine it with even more training?
The Results:
- Total Return: 86.94% (excellent)
- Sharpe Ratio: 1.617 (excellent)
- Max Drawdown: -21.62% (best!)
- vs SPY: +38.35% (excellent)

What I Learned: This was the winning combination. The agent nearly doubled the initial investment, crushed the S&P 500 benchmark, and did it with the lowest drawdown of any experiment.
The Insight: The combination of sufficient training time (1M timesteps) and forced diversification (high entropy) was the key. Neither alone was enough - they worked synergistically.
Experiment 7: Seed Sensitivity Test
The Question: Was the success of Experiment 6 just luck? RL algorithms use random number generators, and different "seeds" can lead to very different results.
The Test: I trained the same configuration with a different random seed (456 instead of 42).
The Results:
- Total Return: 66.62% (vs 86.94%)
- Sharpe Ratio: 1.406 (vs 1.617)
- Max Drawdown: -21.75%
- vs SPY: +18.03%

What I Learned: Still excellent performance, but noticeably different from seed 42. The same configuration can produce returns ranging from 66% to 87% depending on random initialization.
The Insight: Never trust a single training run. Always test multiple seeds and report the range of results. The "best" result might just be lucky.
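The takeaway as a loop. `train_and_backtest` below is a hypothetical stand-in for the full train/backtest pipeline; the lambda simply replays the two observed runs from Experiments 6-7:

```python
def return_range(train_and_backtest, seeds):
    """Run one configuration under several seeds and report the spread."""
    results = {s: train_and_backtest(seed=s) for s in seeds}
    return min(results.values()), max(results.values())

# replaying the two observed runs: seed 42 -> 86.94%, seed 456 -> 66.62%
observed = {42: 86.94, 456: 66.62}
lo, hi = return_range(lambda seed: observed[seed], seeds=(42, 456))
print(f"{lo}% to {hi}%")  # 66.62% to 86.94%
```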
Experiments 8-9: Ensemble Methods
The Hypothesis: If different seeds produce different results, maybe combining multiple agents would be more robust. An "ensemble" averages the predictions of several agents.
Experiment 8: 3-Agent Ensemble
- Total Return: 54.10%
- Sharpe Ratio: 0.932
- Max Drawdown: -33.02%
- vs SPY: +5.51%
Experiment 9: 5-Agent Ensemble
- Total Return: 48.53%
- Sharpe Ratio: 0.828
- Max Drawdown: -33.34%
- vs SPY: -0.06%

What I Learned: Surprisingly, the ensembles performed WORSE than single agents! The 5-agent ensemble barely matched the market.
Why did this happen? When agents disagree, their signals cancel out:
Strong buy opportunity (NVDA):
Agent 456: BUY (+0.9) <- Best agent sees it
Agent 42: BUY (+0.6)
Agent 789: HOLD (+0.1)
Agent 123: SELL (-0.4)
Agent 1011: SELL (-0.3)
─────────────────────────
Ensemble: WEAK BUY (+0.18) -> Missed the big move!
The Insight: Simple averaging doesn't work when one agent is much better than others. The best agent's strong signals get diluted by weaker agents' noise. More sophisticated ensemble techniques (weighted voting, meta-learning) might work better.
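The dilution above is literally just a mean, and the weighted-voting alternative is a one-line change. A sketch (the weights, standing in for each agent's validation Sharpe, are illustrative values):

```python
import numpy as np

# the NVDA disagreement from the table above, one value per agent
agent_actions = np.array([0.9, 0.6, 0.1, -0.4, -0.3])

# simple averaging: the strong buy gets diluted to a weak one
equal_vote = agent_actions.mean()
print(round(equal_vote, 2))  # 0.18

# weighted voting: trust better agents more (weights are illustrative)
validation_sharpe = np.array([1.62, 1.41, 0.95, 0.85, 0.80])
weighted_vote = np.average(agent_actions, weights=validation_sharpe)
print(weighted_vote > equal_vote)  # True
```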
Experiment 10: Investigating the Star Performer
The Discovery: During ensemble testing, I noticed Agent 456 (seed=456) achieved an incredible 90.7% return with just 500K timesteps - even better than my "best" 1M timestep model!
The Investigation: I trained seed 456 with 1M timesteps to see if it would improve further.
The Results:
- 500K timesteps: 90.7% return
- 1M timesteps: 66.6% return (worse!)
What I Learned: More training actually hurt this particular seed. It was overfitting - memorizing the training data instead of learning generalizable patterns.
The Insight: There's no universal "best" configuration. Different seeds have different optimal training durations. This is why systematic experimentation is crucial.
Summary of All Experiments
| # | Experiment | Return | Sharpe | Max DD | vs SPY | Key Learning |
|---|---|---|---|---|---|---|
| 1 | Baseline 200K | 48.28% | 1.016 | -26.15% | -0.31% | Starting point |
| 2 | Longer 500K | 60.93% | 1.07 | -30.30% | +12.34% | More training helps returns |
| 3 | Lower LR | 48.02% | 0.98 | -27.70% | -0.56% | Original LR was fine |
| 4 | A2C Algorithm | 54.84% | 1.02 | -28.89% | +6.25% | PPO > A2C here |
| 5 | High Entropy | 62.91% | 1.33 | -23.06% | +14.32% | Breakthrough! |
| 6 | 1M + High Entropy | 86.94% | 1.617 | -21.62% | +38.35% | Best overall |
| 7 | Seed 456 (1M) | 66.62% | 1.41 | -21.75% | +18.03% | Seed matters |
| 8 | Ensemble (3) | 54.10% | 0.93 | -33.02% | +5.51% | Averaging hurts |
| 9 | Ensemble (5) | 48.53% | 0.83 | -33.34% | -0.06% | More agents = worse |
| 10 | Seed 456 (500K) | 90.7%* | 1.41 | -25.5% | +42%* | Overfitting risk |
*From ensemble individual agent results
Key Takeaways
1. Entropy Coefficient is Critical
The most important discovery: higher entropy (0.05 vs 0.01) dramatically improved performance by forcing diversification.
2. More Training Isn't Always Better
1M timesteps with the right config worked well, but some agents performed worse with more training (overfitting).
3. Simple Ensemble Averaging Fails
When one agent is much better, averaging dilutes its strong signals with weaker agents' noise.
4. Seed Sensitivity is Real
The same configuration with different seeds produced returns ranging from 47% to 90%. Never trust a single run.
5. Domain Knowledge Helps
VIX integration and technical indicators improved performance. The AI benefits from human-engineered features.
Final Results
| Metric | Target | Best Agent | Status |
|---|---|---|---|
| Total Return | Beat SPY | 86.94% vs 48.59% | +38.35% |
| Sharpe Ratio | > 1.0 | 1.617 | Exceeded by 62% |
| Max Drawdown | < 25% | -21.62% | Beat target |
Technologies Used
- FinRL: Financial Reinforcement Learning framework
- Stable-Baselines3: State-of-the-art RL algorithms (PPO, A2C)
- yfinance: Yahoo Finance data downloader
- Docker: Reproducible environment
- Alpaca API: Paper trading deployment
Conclusion
This project demonstrates that Deep Reinforcement Learning can discover profitable trading strategies that outperform traditional benchmarks. The key insight - that higher entropy leads to better diversification and risk management - has implications beyond just trading.
However, it's important to remember that markets are adversarial environments. Strategies that worked in the past may not work in the future. The goal isn't to find a "money printer" but to understand how AI can assist in complex decision-making under uncertainty.
"The market can remain irrational longer than you can remain solvent." - John Maynard Keynes
Always paper trade extensively before using real money. Past backtest performance does not guarantee future results.