
From 59% to 72%: How We Rebuilt Our Bitcoin Signal Model

Walk-forward validation revealed uncomfortable truths about our model. Two classifiers were performing worse than a coin flip. Here's how we fixed it.

  • Two-stage directional accuracy: 71.9% (walk-forward validated on 230+ out-of-sample trades)
  • v3.4.0 accuracy: 59.2%
  • v3.5.1 accuracy: 71.9%
  • Improvement: +12.7 percentage points

The Problem We Discovered

In our previous blog post, we described our Wild Bootstrap OLS regression model achieving 59.2% directional accuracy. We were proud of that number. It represented a statistically significant edge over random chance.

Then we ran walk-forward validation.

Walk-forward validation is the gold standard for testing trading models. Unlike backtesting (which can overfit to historical data), walk-forward validation simulates real trading: train on past data, predict the future, record the outcome, roll forward, repeat. No peeking at future information.
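
Here's the shape of that loop in Python. This is a minimal sketch, not our production harness: `model` stands in for any fit/predict estimator, and `y` holds each bar's forward return, which is only realized after the bar closes.

```python
import numpy as np
import pandas as pd

def walk_forward_accuracy(X: pd.DataFrame, y: pd.Series, model,
                          train_window: int = 365) -> float:
    """Rolling walk-forward: train only on the past, predict the next bar,
    record the hit or miss, roll one step forward. No future data in training."""
    hits = []
    for t in range(train_window, len(X)):
        # Labels for rows [t - train_window, t) are all realized by time t,
        # so this fit uses nothing a live trader wouldn't already know.
        model.fit(X.iloc[t - train_window:t], y.iloc[t - train_window:t])
        pred = model.predict(X.iloc[[t]])[0]
        hits.append(np.sign(pred) == np.sign(y.iloc[t]))  # directional hit
    return float(np.mean(hits))
```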

The results were sobering.

What Walk-Forward Validation Revealed

Two of our six classifiers were performing worse than a coin flip. The Weekend Effect classifier, which we believed had 72% accuracy based on in-sample testing, actually achieved only 29.9% in live conditions. The Tail Risk classifier managed 39.1%.

These weren't minor underperformers. These classifiers were actively hurting the signal: every time they cast a vote, they were more likely to be wrong than right.

Why Did This Happen?

The answer lies in the difference between in-sample and out-of-sample performance.

When we built v3.4.0, we validated using traditional cross-validation: randomly split the data, train on some, test on others, aggregate results. This approach has a fatal flaw for time-series data: it leaks future information into the training process.

The Weekend Effect pattern, for example, appeared highly significant when we could "see" both Friday and the subsequent Monday in our training data. But in real trading, you only know Friday's close. You don't know what Monday will bring. The pattern that seemed crystal clear in hindsight became noise in real-time.
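
In scikit-learn terms, the difference comes down to how you split. This is a simplified illustration (our production harness is custom), but it captures the leak:

```python
from sklearn.model_selection import KFold, TimeSeriesSplit

# v3.4.0-style validation: shuffled folds mean test samples can sit *before*
# training samples in time, so the model is quietly trained on the future.
leaky_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Time-ordered validation: every test fold comes strictly after the data it
# was trained on, which is all the information a live trader actually has.
honest_cv = TimeSeriesSplit(n_splits=5)
```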

The Classifiers That Failed

Classifier             Expected Accuracy   Actual Accuracy   Gap       Action
WeekendEffect          72.0%               29.9%             -42.1%    DISABLED
TailRiskContrarian     62.0%               39.1%             -22.9%    DISABLED
SentimentContrarian    67.0%               51.1%             -15.9%    REDUCED

The Weekend Effect classifier was our worst offender. A 29.9% accuracy means it was wrong 70% of the time. If we had simply inverted its signal, we would have achieved 70% accuracy from that classifier alone. But that would be curve-fitting. The honest conclusion: the pattern doesn't work in live trading.

The Solution: Two-Stage Prediction + Calibration

We didn't just disable the failing classifiers and call it a day. We rebuilt the model architecture from scratch, implementing four major changes:

1. Two-Stage Prediction

Stage 1: Direction Classification — A 4-classifier ensemble votes on whether BTC will go up or down. Each classifier specializes in a different signal type.

Stage 2: Magnitude Regression — Once direction is established with high confidence, we estimate how much BTC will move using OLS regression.
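
Put together, the pipeline looks roughly like the sketch below. The classifier interface, the 0.6 confidence threshold, and the names are illustrative, not our exact production code.

```python
import numpy as np

def two_stage_predict(features, classifiers, weights, magnitude_model,
                      confidence_threshold: float = 0.6):
    """Stage 1: weighted ensemble vote on direction. Stage 2: OLS magnitude
    estimate, produced only when the directional vote is confident enough."""
    votes = np.array([clf.vote(features) for clf in classifiers])   # each +1/-1
    score = float(np.dot(weights, votes)) / float(np.sum(weights))  # in [-1, 1]
    confidence = abs(score)
    if confidence < confidence_threshold:
        return None  # ensemble is split: no trade rather than a weak guess
    direction = 1 if score > 0 else -1
    expected_move = magnitude_model.predict(features)  # Stage 2: OLS regression
    return direction, expected_move, confidence
```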

2. Calibrated Weights

Instead of using theoretical weights from in-sample analysis, we calibrated each classifier's weight based on its actual walk-forward performance.

High performers (64%+ validated accuracy) had their weights boosted. Near-random performers (around 50%) were cut back. Below-random performers were disabled.
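
A deliberately simplified version of the idea: weight each classifier in proportion to its validated edge over a coin flip, and renormalize. The shipped weights in the table below were tuned beyond this pure-edge rule, so treat this as a sketch of the principle rather than the production calibration.

```python
def calibrate_weights(wf_accuracy: dict[str, float]) -> dict[str, float]:
    """Weight each classifier by its walk-forward edge over random (50%).
    Classifiers at or below a coin flip get zero weight, i.e. disabled."""
    edges = {name: max(0.0, acc - 0.50) for name, acc in wf_accuracy.items()}
    total = sum(edges.values()) or 1.0   # guard against division by zero
    return {name: edge / total for name, edge in edges.items()}

weights = calibrate_weights({
    "TechnicalMomentum": 0.643, "OnChainValuation": 0.687,
    "FlowDynamics": 0.592, "SentimentContrarian": 0.511,
    "TailRiskContrarian": 0.391, "WeekendEffect": 0.299,
})
# TailRiskContrarian and WeekendEffect drop to 0.0 automatically.
```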

3. On-Chain Valuation

We added a new classifier built on SOPR (Spent Output Profit Ratio) and the MVRV Z-Score. These on-chain metrics reached 68.7% accuracy in walk-forward validation, making this our second-best performer.
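
To make the signal concrete, here is an illustrative voting rule; the thresholds are placeholders, not our calibrated values. SOPR below 1 means coins are being spent at a loss, and a low MVRV Z-Score means price sits near realized value.

```python
def onchain_valuation_vote(sopr: float, mvrv_z: float) -> int:
    """Directional vote from on-chain valuation. Thresholds are illustrative
    placeholders, not the calibrated production values."""
    if sopr < 1.0 and mvrv_z < 0.5:
        return +1   # coins moving at a loss near realized value: favors upside
    if sopr > 1.05 and mvrv_z > 3.0:
        return -1   # heavy profit-taking at stretched valuations: favors downside
    return 0        # no clear valuation edge: abstain
```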

4. Live Monitoring

The model now tracks its own performance in real-time. If accuracy drops below thresholds, automated alerts fire. We'll know immediately if the model starts degrading.

The New Classifier Weights

After calibration, here's how the weights changed:

Classifier             Validated Accuracy   Old Weight   New Weight   Change
TechnicalMomentum      64.3%                0.25         0.36         +44%
OnChainValuation       68.7%                0.22         0.317        +44%
FlowDynamics           59.2%                0.15         0.18         +20%
SentimentContrarian    51.1%                0.20         0.12         -40%
TailRiskContrarian     39.1%                0.10         0.012        DISABLED
WeekendEffect          29.9%                0.08         0.012        DISABLED

The top two classifiers—TechnicalMomentum and OnChainValuation—now control 67.7% of the signal weight. These are the patterns that actually work in live trading.

Regime-Specific Calibration

We also discovered that the model performs very differently across market regimes:

  • Bull Market: 78.4%
  • Accumulation: 72.1%
  • Bear Market: 58.3%
  • Capitulation: 16.0%

The model excels in bull markets and accumulation phases. It struggles during capitulation. This makes intuitive sense: capitulation events are characterized by extreme, unpredictable moves that defy historical patterns.

In v3.5.1, we apply regime-specific position multipliers (a code sketch follows the list):

  • Bull/Accumulation: 1.1x position sizing (high confidence)
  • Neutral: 1.0x position sizing (standard)
  • Bear: 0.7x position sizing (reduced confidence)
  • Capitulation/Correction: 0.3x position sizing (defensive)
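
In code, this is just a lookup applied at position-sizing time. The regime labels here are illustrative; regime detection itself is a separate component.

```python
# Mirrors the multipliers listed above.
REGIME_MULTIPLIER = {
    "bull": 1.1,
    "accumulation": 1.1,
    "neutral": 1.0,
    "bear": 0.7,
    "capitulation": 0.3,
    "correction": 0.3,
}

def position_size(base_size: float, regime: str) -> float:
    # Unknown or unclassified regimes fall back to the defensive multiplier.
    return base_size * REGIME_MULTIPLIER.get(regime, 0.3)
```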

Why This Should Work Better

The improvements in v3.5.1 aren't based on finding new alpha. They're based on removing sources of negative alpha and concentrating weight on what actually works.

The Key Insight

A model that correctly identifies and removes broken components will outperform a model that averages across both working and broken components. We didn't make the good classifiers better—we stopped letting the bad ones vote.

The 71.9% accuracy isn't a theoretical number from in-sample testing. It's the actual hit rate from walk-forward validation across 230+ trades spanning multiple market regimes. The model predicted the direction correctly 166 times out of 230.

What We're Monitoring

v3.5.1 includes a live monitoring system (sketched in code after the list) that tracks:

  • 7-day rolling accuracy — Alert if drops below 55%
  • Classifier drift — Alert if any classifier's accuracy shifts 15%+ from baseline
  • Losing streaks — Alert after 5 consecutive incorrect predictions
  • Calibration error — Alert if predicted confidence doesn't match actual win rate
  • Regime-specific degradation — Alert if performance drops in specific market conditions
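
The alert logic itself is simple. Here is a minimal sketch with thresholds mirroring the list above; note that this rolling window counts predictions rather than literal calendar days.

```python
from collections import deque

class ModelHealthMonitor:
    """Minimal sketch of the alert logic; thresholds mirror the list above."""

    def __init__(self, window: int = 7, accuracy_floor: float = 0.55,
                 max_streak: int = 5):
        self.outcomes = deque(maxlen=window)  # recent hit/miss record
        self.accuracy_floor = accuracy_floor
        self.max_streak = max_streak
        self.streak = 0                       # current run of misses

    def record(self, correct: bool) -> list[str]:
        self.outcomes.append(correct)
        self.streak = 0 if correct else self.streak + 1
        alerts = []
        if len(self.outcomes) == self.outcomes.maxlen:
            acc = sum(self.outcomes) / len(self.outcomes)
            if acc < self.accuracy_floor:
                alerts.append(f"rolling accuracy {acc:.1%} below 55% floor")
        if self.streak >= self.max_streak:
            alerts.append(f"{self.streak} consecutive incorrect predictions")
        return alerts
```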

If the model starts failing, we'll know within days, not months. This is the difference between quantitative trading and blind faith.

Explore the Updated Platform

The new model is live at mischa0x.com/bitvault2. You can see:

  • The two-stage prediction output (direction + magnitude)
  • Classifier votes and confidence levels
  • Real-time model health status
  • Historical accuracy tracking

The original dashboard at /bitvault remains available for comparison.


Try the New Model

See how v3.5.1's two-stage prediction compares to the original. The model health badge in the header shows real-time accuracy.


Lessons Learned

Building this model taught us several hard lessons:

  1. In-sample results lie. A pattern that looks perfect in backtesting may not survive contact with reality. Always use walk-forward validation.
  2. Popular indicators can be noise. The Weekend Effect and derivatives data (funding rates, open interest) showed zero or negative predictive power. Don't trade on narratives—trade on validated signals.
  3. Subtraction beats addition. Removing two broken classifiers improved accuracy more than any single new feature could have.
  4. Monitor relentlessly. Markets change. Patterns decay. A model that works today may fail tomorrow. Build monitoring from day one.

The 71.9% accuracy in v3.5.1 isn't the end of this journey. It's a checkpoint. We'll continue validating, calibrating, and improving. When the model starts degrading—and it will eventually—we'll catch it early and adapt.

That's the difference between quantitative trading and gambling.


Model version: v3.5.1 | Calibration date: January 28, 2026 | Validation trades: 230+ | Walk-forward accuracy: 71.9%