An in-depth look at building SabiScore from scratch—the architecture, ML models, production challenges, and lessons learned from serving 350+ real users.
Here's the thing about sports prediction: everyone thinks they can do it, but very few can prove it consistently.
When I started building SabiScore in early 2023, I had a simple goal: create an AI system that could predict sports outcomes with enough accuracy that real people would trust it with their betting decisions. Not a toy project. Not a Kaggle competition. A production system that actually ships and drives ROI.
Fast forward to today: SabiScore serves 350+ active users, predicts at 71% accuracy with 99.9% uptime, and runs on about $48/month of infrastructure.
This isn't just another "I built an ML model" post. This is the complete story of taking a prediction model from notebook to production—including the mistakes, architectural decisions, and hard-won lessons that actually matter.
Let's dive in.
Before we get to the solution, let's talk about why this problem is genuinely challenging (and why most sports prediction models fail).
Sports data is messy. Really messy.
The lesson: You can't just scrape ESPN and expect good results. Data engineering is 60% of the work.
Raw stats like "points scored" or "win-loss record" are table stakes. Everyone has those. The edge comes from engineered features:
Here's where 90% of data scientists stop. They build features, train a model, and wonder why accuracy plateaus at 55-60%.
The secret? It's not about having 100 features. It's about having the right 20 features that capture non-linear relationships.
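One practical way to land on those 20 (a generic approach, not necessarily SabiScore's exact pipeline) is to generate a wide candidate set and let importance scores do the pruning, for example with a quick gain-based XGBoost probe:

```python
# Illustrative sketch: prune a wide feature set down to the ~20 most useful columns
import pandas as pd
from xgboost import XGBClassifier

def select_top_features(X: pd.DataFrame, y: pd.Series, k: int = 20) -> list:
    """Rank features by gain-based importance from a quick XGBoost probe fit."""
    probe = XGBClassifier(
        max_depth=4,
        n_estimators=100,
        learning_rate=0.1,
        objective='binary:logistic',
        importance_type='gain',
    )
    probe.fit(X, y)
    importances = pd.Series(probe.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(k).index.tolist()

# e.g. X_train = X_train[select_top_features(X_train, y_train, k=20)]
```

Permutation importance or mutual information works just as well; the point is to cut noisy features before you start tuning the ensemble.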
Sports outcomes have inherent randomness. A referee's bad call, a lucky bounce, a player having a career night: this is variance no model can predict.
If your model is 95% accurate on historical data, you've overfit. The real world will punish you. Hard.
My target was always 70-75% accuracy on out-of-sample data. Anything higher is a red flag.
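How you measure that out-of-sample number matters as much as the number itself. With time-ordered sports data, random K-fold cross-validation leaks future games into the training set, so the evaluation should walk forward in time. A minimal sketch of the idea (illustrative, not the exact harness behind SabiScore's numbers):

```python
# Sketch: walk-forward evaluation on chronologically sorted game data
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBClassifier

def walk_forward_accuracy(X, y, n_splits: int = 5) -> float:
    """Train on past games only; test on the games that come after them."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = XGBClassifier(max_depth=6, n_estimators=200, learning_rate=0.05)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = model.predict(X.iloc[test_idx])
        scores.append(accuracy_score(y.iloc[test_idx], preds))
    return float(np.mean(scores))

# X and y are pandas objects sorted by game date before calling this.
```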
Here's the high-level architecture:
```text
[Frontend: Next.js/React]
            ↓
[API Gateway: FastAPI]
            ↓
[Prediction Service: XGBoost Ensemble]
            ↓
[Data Layer: PostgreSQL + Redis]
            ↓
[ETL Pipeline: Airflow + Python]
```
Frontend: Next.js + React + Tailwind CSS
Backend: FastAPI + Python
ML Model: XGBoost Ensemble
Database: PostgreSQL for storage, Redis for caching
Deployment: Docker + DigitalOcean
Here's the truth: No single model gets you to 71%.
I started with XGBoost alone. Topped out at 64% accuracy.
The breakthrough came from model ensembling—combining multiple models to leverage their diverse strengths.
```python
# Simplified version of the ensemble approach
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Base models
xgb_model = XGBClassifier(
    max_depth=6,
    learning_rate=0.05,
    n_estimators=200,
    objective='binary:logistic',
)

lgbm_model = LGBMClassifier(
    max_depth=6,
    learning_rate=0.05,
    n_estimators=200,
)

# Ensemble using soft voting (averages predicted probabilities)
ensemble = VotingClassifier(
    estimators=[
        ('xgb', xgb_model),
        ('lgbm', lgbm_model),
    ],
    voting='soft',
    weights=[0.6, 0.4],  # XGBoost gets more weight
)

# Meta-learner: sigmoid (Platt) calibration fits a logistic regression
# on the ensemble's outputs so the predicted probabilities are well calibrated
calibrated_ensemble = CalibratedClassifierCV(ensemble, method='sigmoid', cv=5)
calibrated_ensemble.fit(X_train, y_train)
predictions = calibrated_ensemble.predict_proba(X_test)
```
Result: a 7-percentage-point accuracy improvement (64% → 71%) over the single XGBoost model.
These 5 feature categories drive 80% of prediction power:
Rolling momentum metrics (weighted by recency)
```python
import numpy as np

# Recency-weighted rolling average of points over the team's last 5 games
df['points_last_5_weighted'] = (
    df.groupby('team')['points']
    .transform(lambda x: x.rolling(5).apply(
        lambda y: np.average(y, weights=range(1, len(y) + 1))
    ))
)
```
Head-to-head historical performance
Contextual game state
Player impact metrics (aggregated to team level)
Market sentiment (betting lines as a feature)
Pro tip: Use betting lines as a feature, not a target. The wisdom of the crowd is powerful.
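For reference, the standard way to turn betting lines into a model feature is to invert the decimal odds and normalize away the bookmaker's margin. A small sketch (the column name at the end is illustrative, not SabiScore's actual schema):

```python
# Sketch: turn decimal betting odds into implied-probability features
def implied_probabilities(home_odds: float, draw_odds: float, away_odds: float):
    """Invert decimal odds and strip out the bookmaker's overround."""
    raw = [1 / home_odds, 1 / draw_odds, 1 / away_odds]
    overround = sum(raw)  # typically ~1.05 for a 3-way market
    return [p / overround for p in raw]

# Example: odds of 2.10 / 3.40 / 3.60
home_p, draw_p, away_p = implied_probabilities(2.10, 3.40, 3.60)
# home_p ≈ 0.45, which becomes something like a 'market_home_prob' column
```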
Building the model is 40% of the work. Deploying it reliably is the other 60%.
Problem: Initial model took 800ms to generate a prediction. Unacceptable for a responsive UI.
Solution:
```python
import asyncio
import json

# FastAPI async prediction endpoint
# (app, redis_client, model, and PredictionRequest are defined elsewhere)
@app.post("/predict")
async def predict(request: PredictionRequest):
    # Check Redis cache first
    cache_key = generate_cache_key(request.features)
    cached_result = await redis_client.get(cache_key)
    if cached_result:
        return json.loads(cached_result)

    # Model inference in a worker thread (async, so the event loop isn't blocked)
    prediction = await asyncio.to_thread(
        model.predict_proba,
        request.features,
    )
    result = prediction.tolist()  # numpy arrays aren't JSON-serializable

    # Cache result for 5 minutes
    await redis_client.setex(
        cache_key,
        300,
        json.dumps(result),
    )

    return result
```
Result: 87ms average latency (p95: 140ms)
Problem: Need to retrain models weekly as new data comes in. Can't afford downtime.
Solution: Blue-green deployment for ML models
```python
# Model version management
class ModelRegistry:
    def __init__(self):
        self.active_model = load_model("v1")
        self.shadow_model = None

    def update_model(self, new_model_path):
        # Load new model as shadow alongside the active one
        self.shadow_model = load_model(new_model_path)

        # A/B test the shadow model on ~1,000 predictions
        if self.validate_shadow_model():
            # Atomic swap: promote shadow to active
            self.active_model = self.shadow_model
            self.shadow_model = None

    def predict(self, features):
        return self.active_model.predict(features)
```
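The validation step above is doing a lot of work in one line. One reasonable implementation compares the shadow model against the active one on recent games whose outcomes are already known, and only promotes the shadow if it holds up. A sketch of that method (the helper functions are hypothetical):

```python
# Hypothetical sketch of ModelRegistry.validate_shadow_model
def validate_shadow_model(self, n_games: int = 1000, margin: float = 0.01) -> bool:
    """Promote the shadow model only if it at least matches the active one."""
    games = get_recent_labeled_games(limit=n_games)       # hypothetical helper
    X, y = games.features, games.outcomes                 # hypothetical attributes

    active_acc = accuracy(self.active_model.predict(X), y)   # hypothetical helper
    shadow_acc = accuracy(self.shadow_model.predict(X), y)

    # Accept the shadow model if it's no worse than the active one (within a margin)
    return shadow_acc >= active_acc - margin
```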
Problem: Model accuracy degrades over time as team dynamics change.
Solution: Real-time accuracy tracking + automated alerts
```python
# Simplified monitoring logic
def check_model_health():
    recent_predictions = get_predictions_last_7_days()
    accuracy = calculate_accuracy(recent_predictions)

    if accuracy < 0.68:
        trigger_retraining_pipeline()
        send_alert(f"Model accuracy dropped to {accuracy * 100:.1f}%")
```
Result: 99.9% uptime (only 43 minutes downtime in 12 months, all planned maintenance)
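A health check is only useful if something actually runs it. One way to wire it up, assuming a long-running worker process (APScheduler here is an illustrative choice, not part of the core requirements listed below):

```python
# Sketch: run the health check automatically every hour
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
scheduler.add_job(check_model_health, trigger="interval", hours=1)
scheduler.start()

# A cron job calling a small CLI wrapper around check_model_health works just as well.
```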
| Metric | Value | Industry Benchmark |
|--------|-------|--------------------|
| Prediction Accuracy | 71% | 55-60% (most models) |
| High-Confidence Rate | 76.5% (predictions >70% confidence) | N/A |
| Simulated ROI | +12.8% | -5% (typical bettor) |
| Active Users | 350+ | N/A |
| System Uptime | 99.9% | 99% (industry standard) |
| API Latency | 87ms (avg), 140ms (p95) | <200ms acceptable |
| Monthly Infrastructure Cost | $48 | N/A |
Want to build something similar? Here's the exact stack I use, with cost breakdown:
| Service | Purpose | Cost |
|---------|---------|------|
| DigitalOcean Droplet | Main application server (4GB RAM, 2 vCPUs) | $24/month |
| PostgreSQL Managed DB | Primary data store | $15/month |
| Redis Cloud | Prediction caching | $5/month |
| CloudFlare CDN | Static assets + DDoS protection | $0 (free tier) |
| Vercel | Frontend hosting (Next.js) | $0 (hobby tier) |
| Uptime Robot | Monitoring & alerts | $0 (free tier) |
| Sentry | Error tracking | $0 (free tier, <10K events/month) |
Scaling costs: At 1,000+ users, expect ~$150-200/month (mainly database and compute scaling)
```text
# requirements.txt (core dependencies)
fastapi==0.104.1
uvicorn[standard]==0.24.0
xgboost==2.0.2
lightgbm==4.1.0
scikit-learn==1.3.2
pandas==2.1.3
numpy==1.26.2
redis==5.0.1
psycopg2-binary==2.9.9
python-dotenv==1.0.0
pydantic==2.5.0
```
Why these versions matter: ML library versions MUST be pinned. A model serialized under one version of XGBoost or scikit-learn can fail to load, or quietly behave differently, under another.
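One habit that makes version drift visible early: store the library versions next to the serialized model and verify them at load time. A minimal sketch using joblib (the actual persistence layer in SabiScore may differ):

```python
# Sketch: persist a model together with the library versions it was trained under
import json
import joblib
import sklearn
import xgboost

def save_model(model, path: str) -> None:
    joblib.dump(model, path)
    meta = {"xgboost": xgboost.__version__, "sklearn": sklearn.__version__}
    with open(path + ".meta.json", "w") as f:
        json.dump(meta, f)

def load_model_checked(path: str):
    with open(path + ".meta.json") as f:
        meta = json.load(f)
    if meta["xgboost"] != xgboost.__version__:
        raise RuntimeError(
            f"Model trained with xgboost {meta['xgboost']}, "
            f"but {xgboost.__version__} is installed"
        )
    return joblib.load(path)
```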
For Aspiring ML Engineers:
For Business-Minded Developers:
The biggest lesson from building SabiScore: Done is better than perfect.
I spent 3 months perfecting the model before launching. In hindsight, I should've shipped at 65% accuracy and iterated in public.
Why?
My advice if you're building something similar:
Building production ML systems? I'd love to hear what you're working on.
Reach me at:
Looking for consulting or technical partnerships? Schedule a call | View my portfolio
Found this helpful? Share it with someone building production ML systems.