Introduction
Can a model trained specifically on financial text actually outperform general-purpose language models? And if so, can we turn those predictions into real trading signals?
These questions drove this project. I took four transformer architectures - BERT, RoBERTa, DistilBERT, and FinBERT - and put them through a head-to-head comparison on financial sentiment classification. But I didn't stop there. I built a complete pipeline that takes raw news headlines and converts them into backtested trading signals.
The result? FinBERT achieved 87.2% accuracy, beating general-purpose BERT by 3.1 percentage points - and the proof-of-concept pipeline shows those predictions can indeed be turned into backtested signals, with some important caveats.
Understanding the Challenge
Before diving into the models, let me explain why financial sentiment is surprisingly difficult - even for state-of-the-art NLP.
Financial Text is Deceptively Hard
Consider this sentence:
"Only Lannen Tehtaat showed a loss, but it has only recently started streamlining..."
Is this negative (mentions "loss"), neutral (factual reporting), or positive (implies improvement ahead)? Even humans disagree. In my experiments, models disagreed on 17.9% of test samples - these represent genuinely ambiguous financial statements.
Why General Models Struggle
Standard language models like BERT are trained on Wikipedia and books. They understand general English but miss financial nuances:
- "Bearish" isn't about animals
- "Correction" isn't fixing a mistake
- "Exposure" isn't about photography
This is where domain-specific models like FinBERT shine - they've seen millions of financial documents during pre-training.
The Models: A Fair Comparison
I compared four transformer architectures, each representing a different approach:
BERT (bert-base-uncased)
The original transformer that started it all. Google released BERT in 2018, and it revolutionized NLP by learning to understand context from both directions (left and right) simultaneously. With 110M parameters, it's trained on Wikipedia and books - great for general text, but it has never seen a financial document. I used it as my baseline.
RoBERTa (roberta-base)
Facebook's "Robustly Optimized BERT" - same architecture as BERT, but trained smarter. They removed the next-sentence prediction task, trained on 10x more data, and used dynamic masking instead of static. The result? Better performance without changing the model itself. I included it to see if training procedure alone could close the gap with domain-specific models.
DistilBERT (distilbert-base-uncased)
A smaller, faster version of BERT created through knowledge distillation - essentially, a student model that learns to mimic BERT's behavior. With only 66M parameters (40% smaller), it runs 2x faster while retaining 97% of BERT's language understanding. Perfect for when you need speed over marginal accuracy gains.
FinBERT (ProsusAI/finbert)
The domain specialist. FinBERT starts with BERT's architecture but is further pre-trained on millions of financial documents - SEC filings, earnings calls, financial news. It knows that "bearish" means pessimistic, not related to bears. This is the model I expected to win, and I wanted to quantify exactly how much domain knowledge helps.
| Model | Parameters | Key Feature | Why Include |
|---|---|---|---|
| BERT | 110M | The original transformer | Baseline everyone knows |
| RoBERTa | 125M | Better training procedure | Shows training matters |
| DistilBERT | 66M | 40% smaller, 2x faster | Production efficiency |
| FinBERT | 110M | Pre-trained on financial text | Domain adaptation |
The Critical Design Decision: Fair Comparison
To ensure a fair comparison, all models used identical hyperparameters:
```python
TrainingConfig(
    learning_rate=2e-5,
    batch_size=16,
    epochs=3,
    warmup_ratio=0.1,           # 10% of steps for warmup
    weight_decay=0.01,
    max_grad_norm=1.0,          # Gradient clipping
    early_stopping_patience=2,
)
```

This is crucial. Without standardization, you can't tell if Model A beats Model B because it's better, or because you accidentally gave it better hyperparameters.
The Data: Financial PhraseBank
I used the Financial PhraseBank dataset - 4,845 financial news sentences labeled as negative, neutral, or positive.
The Class Imbalance Problem
| Class | Count | Percentage |
|---|---|---|
| Neutral | 2,879 | 59.4% |
| Positive | 1,363 | 28.1% |
| Negative | 603 | 12.4% |

Bar chart showing the class distribution. Neutral sentences dominate at nearly 60%, while negative samples are the minority at just 12%.

Pie chart visualization of the same distribution. The imbalance is clear - any model could achieve 59% accuracy by always predicting "neutral."
Neutral dominates. A naive model could achieve 59% accuracy by always predicting "neutral." To handle this, I used:
- Stratified splits: Maintain class ratios in train/val/test
- Weighted loss function: Penalize mistakes on minority classes more heavily
- Per-class metrics: Track precision/recall for each class, not just overall accuracy

Stratified split preserving class ratios: 80% training, 10% validation, 10% test. Each split maintains the same proportion of negative/neutral/positive samples.
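To make the stratified split and weighted loss concrete, here's a minimal sketch using scikit-learn and PyTorch. It is not the project's exact code; the label counts are toy stand-ins matching the table above.

```python
import numpy as np
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Toy labels standing in for the Financial PhraseBank annotations
# (0 = negative, 1 = neutral, 2 = positive).
labels = np.array([0] * 603 + [1] * 2879 + [2] * 1363)

# Stratified 80/10/10 split: every split keeps the same class ratios.
train_idx, temp_idx = train_test_split(
    np.arange(len(labels)), test_size=0.2, stratify=labels, random_state=42)
val_idx, test_idx = train_test_split(
    temp_idx, test_size=0.5, stratify=labels[temp_idx], random_state=42)

# Inverse-frequency class weights: mistakes on the rare negative class cost
# roughly five times as much as mistakes on the dominant neutral class.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1, 2]), y=labels[train_idx])
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
```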
Dataset Characteristics
Financial sentences tend to be concise - most are under 40 words:

Histogram of sentence lengths. Most financial headlines are between 15-35 words, which is well within BERT's 512 token limit.

Box plot comparing word counts across sentiment classes. Interestingly, all three classes have similar length distributions - sentiment isn't correlated with verbosity.
Results: Domain Adaptation Wins
After training all four models on a Google Colab T4 GPU (total time: ~15 minutes), here are the results:
| Model | Accuracy | F1 Score | Inference (ms) | Training Time |
|---|---|---|---|---|
| FinBERT | 87.2% | 0.873 | 0.7ms | 4.2min |
| RoBERTa | 84.5% | 0.846 | 0.5ms | 4.3min |
| BERT | 84.1% | 0.842 | 0.5ms | 4.1min |
| DistilBERT | 81.4% | 0.816 | 0.2ms | 2.1min |

Confusion matrices for all four models. FinBERT (bottom right) shows the darkest diagonal - meaning more correct predictions. Notice how all models struggle most with the negative class (smallest sample size).
Key Finding 1: Domain Adaptation Provides 3-6% Improvement
FinBERT outperforms all general-purpose models. The financial pre-training gives it an understanding of domain-specific language that BERT simply doesn't have.
Key Finding 2: DistilBERT Offers Compelling Trade-offs
DistilBERT achieves 93% of FinBERT's accuracy with:
- 40% fewer parameters (66M vs 110M)
- 3.5x faster inference (0.2ms vs 0.7ms)
- Half the training time (2.1min vs 4.2min)
For cost-sensitive production deployments, DistilBERT is often the right choice.
Key Finding 3: Training Procedure Matters
RoBERTa slightly outperforms BERT despite having the same architecture. The difference? RoBERTa was trained with more data and better masking strategies. Architecture isn't everything.
Error Analysis: Where Models Disagree
I analyzed the 87 samples (17.9%) where models disagreed to understand their failure modes.

Venn diagram showing where models agree and disagree. 82.1% of samples have unanimous agreement across all four models. The remaining 17.9% represent genuinely ambiguous cases.
FinBERT vs BERT: Head-to-Head
| Scenario | Count |
|---|---|
| FinBERT correct, BERT wrong | 30 |
| BERT correct, FinBERT wrong | 11 |
| Both correct | 393 |
| Both wrong | 51 |
FinBERT correctly classified 30 samples that BERT missed, while only failing on 11 that BERT got right. This is strong evidence that domain-specific pre-training provides real value.
Example Disagreement
"Key shareholders of Finnish IT services provider TietoEnator Oyj on Friday rejected..."
- True label: Positive
- BERT: Negative (triggered by "rejected")
- FinBERT: Positive (understands shareholder context)
FinBERT understands that shareholders rejecting something can be positive for the company - a nuance that general-purpose BERT misses.
From Sentiment to Trading Signals
With a working sentiment model, I built a complete pipeline to generate trading signals from real news:
Step 1: News Ingestion
Fetch news headlines from Yahoo Finance for 10 tickers:
```python
STOCK_TICKERS = ['AAPL', 'AMZN', 'BAC', 'GLD', 'GOOGL',
                 'JPM', 'MSFT', 'NVDA', 'SPY', 'TLT']
```

Step 2: Sentiment Prediction
Run each headline through FinBERT to get sentiment scores (-1 to +1).
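The exact scoring rule isn't shown in the write-up; a common convention is P(positive) minus P(negative), which maps naturally onto a -1 to +1 range. A minimal sketch with the ProsusAI/finbert checkpoint:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def sentiment_score(headline: str) -> float:
    """Signed sentiment in [-1, +1]: P(positive) - P(negative)."""
    inputs = tokenizer(headline, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    # id2label maps class indices to 'positive' / 'negative' / 'neutral'.
    by_label = {model.config.id2label[i].lower(): p.item()
                for i, p in enumerate(probs)}
    return by_label["positive"] - by_label["negative"]

print(sentiment_score("Company X raises full-year guidance after record quarter"))
```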
Step 3: Signal Aggregation
Aggregate daily sentiment with rolling averages:
- 3-day rolling average: Short-term sentiment trend
- 7-day rolling average: Medium-term sentiment trend
- Sentiment momentum: Difference between short and long MA
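A pandas sketch of this aggregation step - the column names and sample values are illustrative, not the project's exact schema:

```python
import pandas as pd

# Hypothetical per-headline scores: one row per (date, ticker, headline).
scores = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02", "2024-05-03"]),
    "ticker": ["AAPL", "AAPL", "AAPL", "AAPL"],
    "sentiment": [0.6, 0.1, -0.2, 0.4],
})

# Average sentiment per ticker per day.
daily = (scores.groupby(["ticker", "date"])["sentiment"]
               .mean()
               .rename("daily_sentiment")
               .reset_index())

# Rolling averages and momentum (short MA minus long MA), computed per ticker.
daily["sent_3d"] = daily.groupby("ticker")["daily_sentiment"].transform(
    lambda s: s.rolling(3, min_periods=1).mean())
daily["sent_7d"] = daily.groupby("ticker")["daily_sentiment"].transform(
    lambda s: s.rolling(7, min_periods=1).mean())
daily["momentum"] = daily["sent_3d"] - daily["sent_7d"]
```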
Step 4: Backtesting
Test four trading strategies:
| Strategy | Rule | Return | Sharpe Ratio |
|---|---|---|---|
| Sentiment Threshold | Long if > 0.2, Short if < -0.2 | +0.03% | 0.59 |
| Long Only | Long if > 0.1, else flat | +0.14% | 2.75 |
| Momentum | Trade on sentiment momentum | +0.18% | 1.87 |
| Rolling 3d | Long if 3d rolling > 0.15 | +0.32% | 9.07 |
Best Strategy: Rolling 3-Day Sentiment
The rolling 3-day strategy achieved the best risk-adjusted returns:
- Total Return: +0.32%
- Sharpe Ratio: 9.07
- Beat Market By: +0.25%
The smoothing effect of the rolling average filters out noise from individual headlines, capturing genuine sentiment shifts.
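A simplified sketch of how such a threshold backtest can be evaluated. The column names are hypothetical, positions are lagged by a day to avoid lookahead, and transaction costs are ignored - which matters for the caveats below:

```python
import numpy as np
import pandas as pd

def backtest_threshold(daily: pd.DataFrame, threshold: float = 0.15) -> dict:
    """Long when 3-day rolling sentiment exceeds the threshold, otherwise flat.

    Assumes `daily` has 'sent_3d' and 'close' columns, one row per trading day.
    """
    # Trade on yesterday's signal to avoid lookahead bias.
    position = (daily["sent_3d"] > threshold).astype(int).shift(1).fillna(0)
    strategy_returns = daily["close"].pct_change().fillna(0) * position

    total_return = (1 + strategy_returns).prod() - 1
    sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std(ddof=1)
    return {"total_return": total_return, "sharpe": sharpe}
```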
Important Caveat: Why the 9.07 Sharpe Ratio is Misleading
A Sharpe ratio of 9.07 is extraordinarily high and warrants scrutiny. The Sharpe ratio measures risk-adjusted return: (Return - Risk-Free Rate) / Standard Deviation. A 9.07 means the strategy's excess return is ~9x its volatility.
Typical Benchmarks
| Sharpe | Interpretation |
|---|---|
| < 1.0 | Suboptimal |
| 1.0 - 2.0 | Good |
| 2.0 - 3.0 | Very good |
| > 3.0 | Excellent / Suspicious |
Why This Sharpe is Likely Inflated
A Sharpe of 9+ in a real trading context is almost certainly due to one or more of these issues:
- Short backtest period: Only 7 days of data - far too small for reliable statistics
- Low trade frequency: Only 2 trades executed - cannot establish statistical significance
- 100% win rate: Lucky streak, not demonstrable skill
- Zero drawdown: Never experienced a loss (unrealistic in real trading)
- No transaction costs: Slippage, commissions, and spreads ignored
The Math Problem
The Sharpe calculation itself is correct (annualized volatility = std * sqrt(252), annualized return = mean * 252). However, with only 7 days and 2 trades, the standard deviation is artificially low, and the annualization amplifies a tiny edge into an absurd number.
Essentially, the 9.07 Sharpe is saying: "If this 7-day pattern repeated perfectly for a full year, it would be amazing." But that's a massive extrapolation from almost no data.
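A tiny numeric illustration of that annualization effect - the daily returns below are made up purely to show the mechanics, not taken from the actual backtest:

```python
import numpy as np

# Seven hypothetical daily returns: two small wins, five flat days, no losses.
daily_returns = np.array([0.0, 0.0012, 0.0, 0.0, 0.0020, 0.0, 0.0])

ann_return = daily_returns.mean() * 252                # annualized return
ann_vol = daily_returns.std(ddof=1) * np.sqrt(252)     # annualized volatility
print(ann_return / ann_vol)                            # Sharpe of roughly 9
```

Two tiny wins and five quiet days produce a near-double-digit Sharpe simply because the volatility in the denominator is minuscule.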
Honest Assessment
The sentiment model itself (87.2% accuracy) is validated on a proper test set. The trading strategy results, however, should be treated as a proof-of-concept demonstration rather than evidence of a profitable trading system. A proper validation would require:
- At least 1 year of backtest data
- Hundreds of trades for statistical significance
- Realistic transaction costs and slippage
- Out-of-sample testing on unseen time periods
The Dashboard: Making It Interactive
I built a Streamlit dashboard with three pages:
1. Model Comparison
Interactive charts comparing accuracy, speed, and the accuracy-speed trade-off across all four models.
2. Live Sentiment
Select a ticker and see:
- Average sentiment score
- Sentiment distribution (pie chart)
- Recent headlines with sentiment labels
3. Backtest Results
Strategy comparison with returns, Sharpe ratios, and detailed metrics.
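Under the hood the app is a standard Streamlit script. Here's a stripped-down sketch of the page structure - not the actual app code, with the comparison numbers hardcoded from the results table:

```python
import streamlit as st

st.set_page_config(page_title="Financial Sentiment", layout="wide")

page = st.sidebar.radio(
    "Page", ["Model Comparison", "Live Sentiment", "Backtest Results"])

if page == "Model Comparison":
    st.title("Model Comparison")
    st.bar_chart({"accuracy": {"FinBERT": 0.872, "RoBERTa": 0.845,
                               "BERT": 0.841, "DistilBERT": 0.814}})
elif page == "Live Sentiment":
    st.title("Live Sentiment")
    ticker = st.selectbox("Ticker", ["AAPL", "MSFT", "NVDA", "SPY"])
    st.write(f"Headlines and sentiment for {ticker} would be rendered here.")
else:
    st.title("Backtest Results")
    st.write("Strategy returns, Sharpe ratios, and detailed metrics.")
```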
Try it yourself: Live Demo on Hugging Face Spaces
Technical Implementation
Project Structure
```
financial_sentiment_analysis/
├── src/
│   ├── data/            # Data loading, tokenization
│   ├── models/          # Classifier, trainer
│   ├── analysis/        # Error analysis
│   ├── news/            # News fetching, processing
│   ├── signals/         # Sentiment aggregation
│   └── backtesting/     # Trading strategies, metrics
├── scripts/             # CLI scripts for each pipeline
├── tests/               # Comprehensive test suite
├── app/                 # Streamlit dashboard
└── outputs/             # Results, charts, models
```

Key Design Decisions
1. Modular Architecture
Each component is independent and testable:
- SentimentClassifier: Works with any HuggingFace transformer
- Trainer: Handles scheduler, early stopping, metrics
- SentimentPredictor: Inference wrapper for production
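The repo's actual classes aren't reproduced here, but the wrapper idea looks roughly like this - a hypothetical sketch of a checkpoint-agnostic classifier:

```python
from dataclasses import dataclass

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

@dataclass
class SentimentClassifier:
    """Thin wrapper: any HuggingFace checkpoint, three sentiment classes."""
    model_name: str
    num_labels: int = 3

    def __post_init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            self.model_name, num_labels=self.num_labels)

    def predict(self, texts: list[str]) -> torch.Tensor:
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt")
        with torch.no_grad():
            return self.model(**batch).logits.argmax(dim=-1)

# Works the same way for "bert-base-uncased", "roberta-base",
# "distilbert-base-uncased", or "ProsusAI/finbert".
```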
2. Hybrid Compute Strategy
| Task | Environment | Reason |
|---|---|---|
| Development | Local (CPU) | Fast iteration |
| Model training | Google Colab (GPU) | Free T4 GPU |
| Inference | Local (CPU) | Fast enough |
| Dashboard | Hugging Face Spaces | Free hosting |
The Journey: From Baseline to Production
This project evolved through four distinct phases. Here's how I approached it:
Phase 1: Getting the Baseline Right
I started by loading the Financial PhraseBank dataset and building a tokenization pipeline that works with any HuggingFace model. My first experiment was training BERT with a frozen encoder - only updating the classification head. Result: 62.4% accuracy. Not great, but it confirmed the pipeline worked.
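Freezing the encoder amounts to switching off gradients for everything except the classification head; a minimal sketch for the BERT baseline:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

# Freeze every encoder parameter; only the classification head stays trainable.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # just the classifier head
```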
Phase 2: Fine-tuning All Four Models
Next, I unfroze the encoders and fine-tuned all layers. This is where the magic happens - the pre-trained language understanding adapts to our specific task. I set up class weights to handle the imbalanced data, added a learning rate scheduler with warmup, and implemented early stopping to prevent overfitting. Training all four models took about 15 minutes on a Google Colab T4 GPU.
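The warmup and clipping pieces map directly onto the config shown earlier. A schematic sketch using transformers' built-in scheduler - the stand-in model and step counts are illustrative:

```python
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

model = nn.Linear(10, 3)  # stand-in for the transformer classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

steps_per_epoch, epochs = 243, 3        # ~3,876 training samples / batch size 16
num_training_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_training_steps)  # warmup_ratio = 0.1

scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps)

# Per training step (after loss.backward()):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max_grad_norm
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```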
Phase 3: Error Analysis and Model Selection
With trained models in hand, I dug into where they disagreed. I built confusion matrices, analyzed the 87 samples where models gave different predictions, and compared FinBERT vs BERT head-to-head. This analysis confirmed that FinBERT's domain knowledge translates to real performance gains - not just on average, but on the hardest examples.
Phase 4: Building the Trading Pipeline
The final phase extended sentiment classification into a complete trading system. I built a news fetcher for Yahoo Finance, aggregated daily sentiment scores with rolling averages, and backtested four different trading strategies. The rolling 3-day strategy emerged as the winner, with the heavily caveated 9.07 Sharpe ratio discussed above.
Lessons Learned
1. Domain Adaptation is Worth It
FinBERT's 3-6% improvement over general models is significant. For financial applications, always consider domain-specific models.
2. Fair Comparison Requires Discipline
Without identical hyperparameters, you can't draw valid conclusions. This seems obvious but is often overlooked.
3. Efficiency Matters in Production
DistilBERT's 3.5x speed advantage makes it viable for real-time applications where FinBERT might be too slow.
4. Sentiment Alone Isn't Enough
Raw sentiment scores are noisy. Rolling averages and momentum indicators significantly improve signal quality.
Conclusion
This project demonstrates that:
- Domain-specific pre-training provides measurable improvements - FinBERT's 87.2% accuracy vs BERT's 84.1%
- Efficiency and accuracy are trade-offs - DistilBERT offers 93% of the accuracy at 3.5x the speed
- Sentiment can generate trading signals - the rolling 3-day strategy achieved a 9.07 Sharpe ratio, though only on a short proof-of-concept backtest
- End-to-end pipelines are achievable - From raw text to trading signals with a live dashboard
The code is open source. The dashboard is live. Try it yourself and let me know what you find.
Resources
- GitHub: github.com/nimeshk03/financial-sentiment-transformers
- Live Demo: Hugging Face Spaces
- Dataset: Financial PhraseBank on Kaggle
Disclaimer: This project is for educational purposes. Past backtest performance does not guarantee future results. Always do your own research before making investment decisions.