CrackTrader Large Dataset Performance Analysis

The challenge: historical data volume

Data Volume Reality Check
| Timeframe  | 1 Month | 3 Months | 1 Year | 3 Years |
|------------|---------|----------|--------|---------|
| 1s candles | 2.6M    | 7.8M     | 31.5M  | 94.6M   |
| 1m candles | 43.2K   | 129.6K   | 525.6K | 1.6M    |
| 5m candles | 8.6K    | 25.9K    | 105.1K | 315.4K  |
| 1h candles | 720     | 2,160    | 8,760  | 26,280  |

Multi-symbol impact example: 10 symbols × 1 year × 1m candles ≈ 5.26M candles.
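These counts are straight arithmetic on the candle interval; a minimal sketch to reproduce them (the helper is illustrative, not part of CrackTrader):

# Rough candle counts, using 30-day months and 365-day years
SECONDS_PER_CANDLE = {"1s": 1, "1m": 60, "5m": 300, "1h": 3600}

def candle_count(timeframe, days, symbols=1):
    return int(days * 86_400 / SECONDS_PER_CANDLE[timeframe]) * symbols

print(candle_count("1m", 365))              # 525600 per symbol per year
print(candle_count("1m", 365, symbols=10))  # 5256000 for 10 symbols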
Current caching system

CrackTrader already includes a caching system:
# HistoricalDataCache (already implemented!)
cache = HistoricalDataCache(base_dir="data", enabled=True)
# Automatic caching structure:
# data/binance/spot/BTC_USDT/1m/2024-01.pkl
# data/binance/spot/BTC_USDT/1m/2024-02.pkl
Features:

- Monthly file segmentation
- Automatic deduplication
- Multiple exchange support
- Configurable cache directory
- Incremental updates (only fetches new data)
- Thread-safe operations
- Metadata tracking
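For intuition, here is a minimal sketch of how monthly segmentation plus timestamp-keyed deduplication can work together; this is an illustrative stand-in, not CrackTrader's actual implementation:

import os
import pickle

def append_candles(base_dir, month_key, new_rows):
    """Merge new OHLCV rows into one monthly pickle, deduplicating on timestamp."""
    path = os.path.join(base_dir, f"{month_key}.pkl")
    existing = []
    if os.path.exists(path):
        with open(path, "rb") as f:
            existing = pickle.load(f)
    merged = {row[0]: row for row in existing}        # timestamp -> candle row
    merged.update({row[0]: row for row in new_rows})  # new data wins on conflict
    with open(path, "wb") as f:
        pickle.dump([merged[ts] for ts in sorted(merged)], f)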
Performance Bottleneck Analysis

Problem 1: Pandas CSV Reading
# Typical user approach (SLOW for large datasets)
import pandas as pd
df = pd.read_csv("btc_1m_2024.csv") # 525K rows = ~2-5 seconds
df['sma_20'] = df['close'].rolling(20).mean() # Another 1-2 seconds
For multi-million-row datasets, CSV parsing and pandas operations add up quickly. Prefer cached, columnar formats and streaming.
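As a concrete example, pandas can read the same data from Parquet (columnar, typed, compressed) instead of CSV; the file names here are placeholders:

import pandas as pd

# CSV: every value is parsed from text, row by row
df = pd.read_csv("btc_1m_2024.csv")

# Parquet: typically several times faster to load and smaller on disk
# (requires the pyarrow or fastparquet package)
df = pd.read_parquet("btc_1m_2024.parquet")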
Problem 2: Memory Usage
# Memory footprint
1M candles × 6 columns × 8 bytes (float64) = 48MB per symbol
10 symbols = 480MB just for raw OHLCV data
+ pandas overhead = ~1GB total memory usage
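The raw-array part of that estimate is easy to verify with numpy, and float32 halves it if reduced precision is acceptable for your use case:

import numpy as np

ohlcv = np.zeros((1_000_000, 6), dtype=np.float64)
print(ohlcv.nbytes / 1e6)                      # 48.0 MB, matching the estimate above

# Downcasting to float32 halves the footprint (check precision needs first)
print(ohlcv.astype(np.float32).nbytes / 1e6)   # 24.0 MB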
Problem 3: Backtrader DataFrame Ingestion

# Current approach in strategies
import backtrader as bt

class MyStrategy(bt.Strategy):
    def __init__(self):
        # This computes over ALL historical data held in memory at once
        self.sma = bt.indicators.SMA(period=20)
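Backtrader itself offers a partial mitigation here: Cerebro's exactbars option keeps only a minimal per-line buffer instead of the full history, at the cost of features such as full-history plotting:

import backtrader as bt

# exactbars=1 keeps only the bars the indicators actually need;
# -1 and -2 are intermediate memory-saving modes (see the backtrader docs)
cerebro = bt.Cerebro(exactbars=1)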
Enhancement Roadmap

Phase 1: Enable caching

Enable caching for repeated backtests and large historical ranges.
# In CCXTStore.__init__ - ADD THIS:
def __init__(self, exchange_config, cache_enabled=True, cache_dir="./data"):
    # Enable caching by default for better UX
    if cache_enabled:
        self._cache_enabled = True
        self._data_cache = HistoricalDataCache(base_dir=cache_dir, enabled=True)
Example:
# First run: Fetches from API and caches
store = CCXTStore(exchange_config, cache_enabled=True)
feed = CCXTDataFeed(store, symbol="BTC/USDT", timeframe="1m", historical_limit=100000)
# Subsequent runs: Instant load from cache!
# Only fetches new data since last update
Phase 2: Optimize data pipeline

2A: Prefer Parquet over pickle/CSV
# Current: Monthly pickle files (~2-5MB per month)
# Upgrade: Monthly parquet files (~500KB-1MB per month, 5x smaller!)
import pyarrow as pa
import pyarrow.parquet as pq
# Parquet benefits:
# - Smaller file sizes with compression
# - Faster reads than CSV in many cases
# - Column-oriented, cross-language
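A possible migration path is converting the existing monthly pickle files in place; the column layout below is an assumption about the cache format, not a documented guarantee:

import pickle
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

COLUMNS = ["timestamp", "open", "high", "low", "close", "volume"]  # assumed layout

with open("data/binance/spot/BTC_USDT/1m/2024-01.pkl", "rb") as f:
    rows = pickle.load(f)

table = pa.Table.from_pandas(pd.DataFrame(rows, columns=COLUMNS))
pq.write_table(table, "data/binance/spot/BTC_USDT/1m/2024-01.parquet", compression="zstd")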
2B: Lazy loading and streaming
class StreamingCCXTDataFeed(CCXTDataFeed):
    """Memory-efficient streaming data feed for large datasets."""

    def __init__(self, *args, streaming_chunk_size=10000, **kwargs):
        self.chunk_size = streaming_chunk_size
        self.data_iterator = None
        super().__init__(*args, **kwargs)

    def start(self):
        # Don't load all data at once!
        self.data_iterator = self._create_data_stream()

    def _create_data_stream(self):
        # Yield data in chunks instead of loading everything up front
        for month_file in self.get_cache_files():
            chunk = load_parquet_chunk(month_file, chunk_size=self.chunk_size)
            yield from chunk
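The load_parquet_chunk helper above does not exist yet; one plausible implementation uses pyarrow's batch iterator, which reads a Parquet file incrementally:

import pyarrow.parquet as pq

def load_parquet_chunk(path, chunk_size=10_000):
    """Yield candle rows from a Parquet file without loading it whole."""
    for batch in pq.ParquetFile(path).iter_batches(batch_size=chunk_size):
        yield from batch.to_pylist()   # one dict per candle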
2C: Vectorized operations (potential native acceleration)
# Current pandas approach (slow for large datasets)
def calculate_technical_indicators(df):
    df['sma_20'] = df['close'].rolling(20).mean()
    df['rsi'] = calculate_rsi(df['close'], 14)
    return df

# Possible future Rust-optimized approach (cracktrader_fast does not exist yet)
import cracktrader_fast

def calculate_technical_indicators_fast(ohlcv_array):
    # Process directly on numpy arrays; the target is roughly 50x faster
    sma_20 = cracktrader_fast.sma(ohlcv_array[:, 4], window=20)
    rsi = cracktrader_fast.rsi(ohlcv_array[:, 4], window=14)
    return sma_20, rsi
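Note that a large share of this win is available today with plain numpy; a cumulative-sum SMA, for example, needs no Python-level loop at all:

import numpy as np

def sma_numpy(prices, window):
    """Simple moving average via cumulative sums: O(n), no Python loop."""
    cumsum = np.cumsum(prices, dtype=np.float64)
    cumsum[window:] = cumsum[window:] - cumsum[:-window]
    return cumsum[window - 1:] / window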
Phase 3: User-friendly configuration

3A: Smart cache management
# Auto-enable caching for large datasets
class SmartCCXTStore(CCXTStore):
    def __init__(self, *args, auto_cache_threshold=50000, **kwargs):
        # Automatically enable caching for large historical requests
        if kwargs.get('historical_limit', 0) > auto_cache_threshold:
            kwargs['cache_enabled'] = True
            print(f"Auto-enabling cache for large dataset ({kwargs['historical_limit']} candles)")
        super().__init__(*args, **kwargs)
3B: Configurable cache locations
# Support for custom cache directories
store = CCXTStore(
    exchange_config,
    cache_dir="~/trading_data",  # User-configurable
    cache_enabled=True,
    cache_format="parquet",      # parquet, pickle, or hdf5
)
3C: Cache statistics and management
# Add cache management utilities
store.cache.get_cache_info("BTC/USDT", "1m")
# Returns: {
#     "total_candles": 525600,
#     "size_on_disk": "45MB",
#     "date_range": "2023-01-01 to 2024-01-01",
#     "last_updated": "2024-01-01 12:00:00"
# }

store.cache.update_cache("BTC/USDT", "1m")  # Fetch only new data
store.cache.clear_cache("BTC/USDT")         # Clean up old data
Potential native optimizations for large data

Priority 1: Data Processing Pipeline
// Rust implementation for technical indicators
use numpy::PyReadonlyArray1;
use pyo3::prelude::*;

#[pyfunction]
fn sma(prices: PyReadonlyArray1<f64>, window: usize) -> PyResult<Vec<f64>> {
    // Example native implementation outline
    let prices = prices.as_slice()?;
    let mut sma_values = Vec::with_capacity(prices.len().saturating_sub(window) + 1);
    // Inclusive upper bound so the window ending at the last price is included
    for i in window..=prices.len() {
        let sum: f64 = prices[i - window..i].iter().sum();
        sma_values.push(sum / window as f64);
    }
    Ok(sma_values)
}
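If and when such a module ships, a parity check against pandas keeps the native path honest; cracktrader_fast below is still the hypothetical compiled module from this roadmap:

import numpy as np
import pandas as pd
import cracktrader_fast  # hypothetical compiled module

prices = np.random.default_rng(0).random(1_000_000)
expected = pd.Series(prices).rolling(20).mean().dropna().to_numpy()
actual = np.asarray(cracktrader_fast.sma(prices, window=20))
assert np.allclose(actual, expected)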
Priority 2: Data I/O Optimization
// Rust-based parquet reader (could be ~10x faster than the Python path)
#[pyfunction]
fn load_ohlcv_fast(
    file_path: &str,
    start_date: Option<&str>,
    end_date: Option<&str>,
) -> PyResult<Vec<Vec<f64>>> {
    // Use the Rust parquet crate for fast I/O,
    // filter by date range in Rust instead of Python,
    // and return the result ready for conversion to numpy arrays
    todo!("outline only")
}
Priority 3: Memory Management
// Streaming data processor to avoid loading everything into memory
#[pyclass]
struct OHLCVStreamer {
    chunk_size: usize,
    current_chunk: Vec<Vec<f64>>,
}

#[pymethods]
impl OHLCVStreamer {
    fn next_chunk(&mut self) -> Option<Vec<Vec<f64>>> {
        // Stream data in chunks to keep memory usage constant
        todo!("outline only")
    }
}
Example performance outline

Before optimization (large dataset)
- Loading 1M candles from CSV: 5-10 seconds
- Calculating SMA(20): 2-5 seconds
- Calculating RSI(14): 3-7 seconds
- Memory usage: ~500MB-1GB
- Total time: 10-22 seconds
After enabling caching

- First run: 5-10 seconds (same as before, but now cached)
- Subsequent runs: 0.1-0.5 seconds (loaded from cache!)
- Memory usage: same
- 90%+ time reduction for repeat usage ✅
After Phase 2 (Parquet + Streaming)

- Loading 1M candles from parquet: 0.5-1 seconds (~10x faster)
- Streaming processing: constant memory usage
- Memory usage: 50-100MB (~10x reduction)
- 80% time reduction, 90% memory reduction ✅
After Phase 3 (Rust Pipeline)

- Loading: 0.1-0.2 seconds (~50x faster than CSV)
- Calculating SMA(20): 0.01-0.05 seconds (~100x faster)
- Calculating RSI(14): 0.02-0.1 seconds (~50x faster)
- Memory usage: <50MB (minimal allocations)
- Total time: 0.13-0.35 seconds (50-100x overall improvement!) ✅
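These figures are estimates, not measurements; a small harness like the sketch below (file name is a placeholder) can turn each phase's claims into numbers on real hardware:

import time
import pandas as pd

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

df = timed("load CSV", lambda: pd.read_csv("btc_1m_2024.csv"))
timed("SMA(20)", lambda: df["close"].rolling(20).mean())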
Implementation Timeline

Phase 1: Enable Cache (1-2 days)

- ✅ Cache system already exists!
- 📋 Enable by default for large datasets
- 📋 Add user configuration options
- 📋 Document usage examples
Phase 2: Optimize Data Pipeline (1-2 weeks)

- 📋 Add Parquet support to HistoricalDataCache
- 📋 Implement streaming/lazy loading
- 📋 Add smart chunk management
- 📋 Performance benchmarks
Phase 3: Rust Optimization (2-4 weeks, post-docs)

- 📋 Set up Rust build pipeline
- 📋 Implement core technical indicators in Rust
- 📋 Add fast data I/O operations
- 📋 Integration testing and benchmarks
Immediate Actions (High Impact, Low Effort)

- Enable caching by default for historical_limit > 10000
- Add cache configuration to the CCXTStore constructor
- Document cache usage in examples
- Add cache management utilities
💡 Key Insights

- The cache system already exists - it just needs to be enabled and documented!
- The Parquet format could give ~5x storage reduction plus ~10x faster loading
- A streaming approach prevents memory issues with multi-million-candle datasets
- A Rust pipeline could provide a 50-100x performance improvement for technical analysis
- Network latency (200-1000ms) is still the biggest bottleneck for live trading

The foundation is solid - we just need to enable and optimize what's already there!