
CrackTrader Large Dataset Performance Analysis

The challenge: historical data volume

Data Volume Reality Check

Timeframe     1 Month    3 Months    1 Year     3 Years
1s candles    2.6M       7.8M        31.5M      94.6M
1m candles    43.2K      129.6K      525.6K     1.6M
5m candles    8.6K       25.9K       105.1K     315.4K
1h candles    720        2,160       8,760      26,280

Multi-symbol impact example: 10 symbols × 1 year × 1m ≈ 5.25M candles
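
These counts are just seconds-per-period arithmetic; a quick back-of-the-envelope check (standalone snippet, not part of CrackTrader):

# Rough candle counts, assuming 24/7 crypto markets (30-day months, 365-day years)
TIMEFRAME_SECONDS = {"1s": 1, "1m": 60, "5m": 300, "1h": 3600}

def candle_count(timeframe: str, days: float, symbols: int = 1) -> int:
    return int(days * 86_400 / TIMEFRAME_SECONDS[timeframe]) * symbols

print(candle_count("1m", 365))              # 525,600 candles for one symbol, one year
print(candle_count("1m", 365, symbols=10))  # 5,256,000 candles across 10 symbols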

Current caching system

CrackTrader already ships with a caching system:

# HistoricalDataCache (already implemented!)
cache = HistoricalDataCache(base_dir="data", enabled=True)

# Automatic caching structure:
# data/binance/spot/BTC_USDT/1m/2024-01.pkl
# data/binance/spot/BTC_USDT/1m/2024-02.pkl

Features:

  • Monthly file segmentation
  • Automatic deduplication
  • Multiple exchange support
  • Configurable cache directory
  • Incremental updates (only fetches new data)
  • Thread-safe operations
  • Metadata tracking
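
The exact internals live in HistoricalDataCache, but the monthly segmentation and deduplication boil down to something like the following sketch (hypothetical helper, not the library's actual API):

# Hypothetical sketch of monthly segmentation + dedup, not the real HistoricalDataCache code
from pathlib import Path

import pandas as pd

def merge_month(cache_dir, exchange, market, symbol, timeframe, new_rows: pd.DataFrame) -> Path:
    # Assumes new_rows has a datetime "timestamp" column and covers a single month
    month = new_rows["timestamp"].iloc[0].strftime("%Y-%m")
    # e.g. data/binance/spot/BTC_USDT/1m/2024-01.pkl
    path = Path(cache_dir, exchange, market, symbol.replace("/", "_"), timeframe, f"{month}.pkl")
    path.parent.mkdir(parents=True, exist_ok=True)

    if path.exists():
        merged = pd.concat([pd.read_pickle(path), new_rows])
        # Deduplicate on candle timestamp so overlapping fetches don't create duplicates
        merged = merged.drop_duplicates(subset="timestamp").sort_values("timestamp")
    else:
        merged = new_rows

    merged.to_pickle(path)
    return path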

Performance Bottleneck Analysis

Problem 1: Pandas CSV Reading

# Typical user approach (SLOW for large datasets)
import pandas as pd
df = pd.read_csv("btc_1m_2024.csv")  # 525K rows = ~2-5 seconds
df['sma_20'] = df['close'].rolling(20).mean()  # Another 1-2 seconds

For multi-million-row datasets, repeated CSV parsing and pandas recalculation add up quickly. Prefer cached, columnar formats and streaming.
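
To see where the time actually goes on your own data, a quick timing check (using the hypothetical file from the example above):

import time

import pandas as pd

t0 = time.perf_counter()
df = pd.read_csv("btc_1m_2024.csv")            # hypothetical file from the example above
t1 = time.perf_counter()
df["sma_20"] = df["close"].rolling(20).mean()
t2 = time.perf_counter()

print(f"{len(df):,} rows | read_csv: {t1 - t0:.2f}s | rolling SMA: {t2 - t1:.2f}s")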

Problem 2: Memory Usage

# Memory footprint
1M candles × 6 columns × 8 bytes (float64) = 48MB per symbol
10 symbols = 480MB just for raw OHLCV data
+ pandas overhead = ~1GB total memory usage
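
The same arithmetic, checked with numpy; downcasting to float32 roughly halves the raw footprint where the lost precision is acceptable:

import numpy as np

candles, columns = 1_000_000, 6                      # timestamp + OHLCV as one float64 matrix
raw = np.zeros((candles, columns), dtype=np.float64)
print(raw.nbytes / 1e6, "MB")                        # 48.0 MB per symbol, matching the estimate above

# float32 halves the raw footprint at the cost of precision
print(raw.astype(np.float32).nbytes / 1e6, "MB")     # 24.0 MB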

Problem 3: Backtrader DataFrame Ingestion

# Current approach in strategies
class MyStrategy(bt.Strategy):
    def __init__(self):
        # With preloaded feeds the entire history sits in memory,
        # and indicators like this SMA are computed across all of it up front
        self.sma = bt.indicators.SMA(period=20)

Enhancement Roadmap

Phase 1: Enable caching

Enable caching for repeated backtests and large ranges.

# In CCXTStore.__init__ - ADD THIS:
def __init__(self, exchange_config, cache_enabled=True, cache_dir="./data"):
    # Enable caching by default for better UX
    if cache_enabled:
        self._cache_enabled = True
        self._data_cache = HistoricalDataCache(base_dir=cache_dir, enabled=True)

Example

# First run: Fetches from API and caches
store = CCXTStore(exchange_config, cache_enabled=True)
feed = CCXTDataFeed(store, symbol="BTC/USDT", timeframe="1m", historical_limit=100000)

# Subsequent runs: Instant load from cache!
# Only fetches new data since last update

Phase 2: Optimize data pipeline

2A: Prefer Parquet over pickle/CSV

# Current: Monthly pickle files (~2-5MB per month)
# Upgrade: Monthly parquet files (~500KB-1MB per month, 5x smaller!)

import pyarrow as pa
import pyarrow.parquet as pq

# Parquet benefits:
# - Smaller file sizes with compression
# - Faster reads than CSV in many cases
# - Column-oriented, cross-language
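
As a sketch of the migration, one cached month can be rewritten from pickle to compressed Parquet and read back with pyarrow (the paths are illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Convert one cached month from pickle to compressed parquet (illustrative paths)
df = pd.read_pickle("data/binance/spot/BTC_USDT/1m/2024-01.pkl")
pq.write_table(
    pa.Table.from_pandas(df),
    "data/binance/spot/BTC_USDT/1m/2024-01.parquet",
    compression="zstd",
)

# Reading it back is a single call and avoids re-parsing text
df_back = pq.read_table("data/binance/spot/BTC_USDT/1m/2024-01.parquet").to_pandas()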

2B: Lazy loading and streaming

class StreamingCCXTDataFeed(CCXTDataFeed):
    """Memory-efficient streaming data feed for large datasets."""

    def __init__(self, *args, streaming_chunk_size=10000, **kwargs):
        self.chunk_size = streaming_chunk_size
        self.data_iterator = None
        super().__init__(*args, **kwargs)

    def start(self):
        # Don't load all data at once!
        self.data_iterator = self._create_data_stream()

    def _create_data_stream(self):
        # Yield data in chunks instead of loading all at once
        for month_file in self.get_cache_files():
            chunk = load_parquet_chunk(month_file, chunk_size=self.chunk_size)
            yield from chunk
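
load_parquet_chunk above is not an existing helper; with pyarrow it could be sketched as a generator over record batches, roughly:

import pyarrow.parquet as pq

def load_parquet_chunk(file_path, chunk_size=10_000):
    """Hypothetical helper: yield OHLCV rows batch by batch instead of loading the whole file."""
    parquet_file = pq.ParquetFile(file_path)
    for batch in parquet_file.iter_batches(batch_size=chunk_size):
        # Each RecordBatch holds at most chunk_size rows; only this slice is materialized
        yield from batch.to_pylist()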

2C: Vectorized operations (potential native acceleration)

# Current pandas approach (slow for large datasets)
def calculate_technical_indicators(df):
    df['sma_20'] = df['close'].rolling(20).mean()
    df['rsi'] = calculate_rsi(df['close'], 14)
    return df

# Future Rust-optimized approach
import cracktrader_fast

def calculate_technical_indicators_fast(ohlcv_array):
    # Process directly on numpy arrays, 50x faster
    sma_20 = cracktrader_fast.sma(ohlcv_array[:, 4], window=20)
    rsi = cracktrader_fast.rsi(ohlcv_array[:, 4], window=14)
    return sma_20, rsi
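
Even before any Rust, the same idea applies with plain numpy: operate on contiguous arrays rather than row-by-row pandas objects. A minimal sketch (this RSI uses simple moving averages, not Wilder's smoothing):

import numpy as np

def sma_np(values: np.ndarray, window: int = 20) -> np.ndarray:
    # One value per full window, computed over a contiguous float array
    return np.convolve(values, np.ones(window) / window, mode="valid")

def rsi_np(prices: np.ndarray, window: int = 14) -> np.ndarray:
    deltas = np.diff(prices)
    gains = np.where(deltas > 0, deltas, 0.0)
    losses = np.where(deltas < 0, -deltas, 0.0)
    avg_gain = sma_np(gains, window)
    avg_loss = sma_np(losses, window)
    rs = avg_gain / np.where(avg_loss == 0, np.nan, avg_loss)
    return 100.0 - 100.0 / (1.0 + rs)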

Phase 3: User-friendly configuration

3A: Smart cache management

# Auto-enable caching for large datasets
class SmartCCXTStore(CCXTStore):
    def __init__(self, *args, auto_cache_threshold=50000, **kwargs):
        # Automatically enable caching for large historical requests
        if kwargs.get('historical_limit', 0) > auto_cache_threshold:
            kwargs['cache_enabled'] = True
            print(f"Auto-enabling cache for large dataset ({kwargs['historical_limit']} candles)")
        super().__init__(*args, **kwargs)
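
Hypothetical usage, matching the sketch above: a one-year 1m request crosses the 50K threshold, so caching switches itself on:

# ~525K candles > auto_cache_threshold, so cache_enabled is set automatically
store = SmartCCXTStore(exchange_config, historical_limit=525_600)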

3B: Configurable cache locations

# Support for custom cache directories
store = CCXTStore(
    exchange_config,
    cache_dir="~/trading_data",  # User-configurable
    cache_enabled=True,
    cache_format="parquet"  # parquet, pickle, or hdf5
)

3C: Cache statistics and management

# Add cache management utilities
store.cache.get_cache_info("BTC/USDT", "1m")
# Returns: {
#   "total_candles": 525600,
#   "size_on_disk": "45MB",
#   "date_range": "2023-01-01 to 2024-01-01",
#   "last_updated": "2024-01-01 12:00:00"
# }

store.cache.update_cache("BTC/USDT", "1m")  # Fetch only new data
store.cache.clear_cache("BTC/USDT")  # Clean up old data

Potential native optimizations for large data

Priority 1: Data Processing Pipeline

// Rust implementation for technical indicators
use numpy::PyArray1;
use pyo3::prelude::*;

#[pyfunction]
fn sma(prices: &PyArray1<f64>, window: usize) -> PyResult<Vec<f64>> {
    // Example native implementation outline
    let prices = prices.as_slice()?;
    // One value per full window; windows() walks overlapping slices of length `window`
    let mut sma_values = Vec::with_capacity(prices.len().saturating_sub(window) + 1);

    for w in prices.windows(window) {
        let sum: f64 = w.iter().sum();
        sma_values.push(sum / window as f64);
    }

    Ok(sma_values)
}
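
cracktrader_fast does not exist yet; once it does, a quick sanity check against pandas would keep the native path honest (sketch, assuming the sma signature above):

import numpy as np
import pandas as pd

import cracktrader_fast  # proposed extension module, not yet available

closes = np.random.default_rng(0).random(1_000_000)
fast = np.asarray(cracktrader_fast.sma(closes, window=20))
reference = pd.Series(closes).rolling(20).mean().dropna().to_numpy()

# Same values per full window, minus the leading NaNs pandas keeps
assert np.allclose(fast, reference)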

Priority 2: Data I/O Optimization

// Rust-based parquet reader (could be 10x faster than Python)
#[pyfunction]
fn load_ohlcv_fast(file_path: &str, start_date: Option<&str>, end_date: Option<&str>) -> PyResult<Vec<Vec<f64>>> {
    // Outline only: use the Rust parquet/arrow crates for fast I/O,
    // filter by date range in Rust instead of Python,
    // and hand rows back to Python ready for numpy conversion
    todo!()
}

Priority 3: Memory Management

// Streaming data processor to avoid loading everything into memory
#[pyclass]
struct OHLCVStreamer {
    chunk_size: usize,
    current_chunk: Vec<Vec<f64>>,
}

#[pymethods]
impl OHLCVStreamer {
    fn next_chunk(&mut self) -> Option<Vec<Vec<f64>>> {
        // Stream data in chunks to keep memory usage constant;
        // outline only: refill current_chunk from disk and return None when exhausted
        todo!()
    }
}
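
From Python, the streamer would be consumed as a plain loop, keeping peak memory near one chunk (hypothetical API mirroring the struct above; the constructor shown is assumed):

import cracktrader_fast  # proposed extension module, not yet available

streamer = cracktrader_fast.OHLCVStreamer(chunk_size=10_000)   # hypothetical constructor
total_rows = 0
while (chunk := streamer.next_chunk()) is not None:
    total_rows += len(chunk)   # process the chunk here; memory stays ~one chunk at a time
print(f"processed {total_rows:,} rows")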

Example performance outline

Before optimization (large dataset)

Loading 1M candles from CSV: 5-10 seconds
Calculating SMA(20): 2-5 seconds
Calculating RSI(14): 3-7 seconds
Memory usage: ~500MB-1GB
Total time: 10-22 seconds

After enabling caching

First run: 5-10 seconds (same as before, but cached)
Subsequent runs: 0.1-0.5 seconds (loaded from cache!)
Memory usage: Same
90%+ time reduction for repeat usage ✅

After Phase 2 (Parquet + Streaming)

Loading 1M candles from parquet: 0.5-1 seconds (10x faster)
Streaming processing: Constant memory usage
Memory usage: 50-100MB (10x reduction)
80% time reduction, 90% memory reduction ✅

After Phase 3 (Rust Pipeline)

Loading: 0.1-0.2 seconds (50x faster than CSV)
Calculating SMA(20): 0.01-0.05 seconds (100x faster)
Calculating RSI(14): 0.02-0.1 seconds (50x faster)
Memory usage: <50MB (minimal allocations)
Total time: 0.13-0.35 seconds (50-100x overall improvement!) ✅

Implementation Timeline

Phase 1: Enable Cache (1-2 days)

  • ✅ Cache system already exists!
  • 🔄 Enable by default for large datasets
  • 🔄 Add user configuration options
  • 🔄 Document usage examples

Phase 2: Optimize Data Pipeline (1-2 weeks)

  • 🔄 Add Parquet support to HistoricalDataCache
  • 🔄 Implement streaming/lazy loading
  • 🔄 Add smart chunk management
  • 🔄 Performance benchmarks

Phase 3: Rust Optimization (2-4 weeks, post-docs)

  • 🔄 Set up Rust build pipeline
  • 🔄 Implement core technical indicators in Rust
  • 🔄 Add fast data I/O operations
  • 🔄 Integration testing and benchmarks

Immediate Actions (High Impact, Low Effort)

  1. Enable caching by default for historical_limit > 10000
  2. Add cache configuration to CCXTStore constructor
  3. Document cache usage in examples
  4. Add cache management utilities

💡 Key Insights

  1. Cache system already exists - just needs to be enabled and documented!
  2. Parquet format could give 5x storage reduction + 10x loading speed
  3. Streaming approach prevents memory issues with multi-million candle datasets
  4. Rust pipeline could provide 50-100x performance improvement for technical analysis
  5. Network latency (200-1000ms) is still the biggest bottleneck for live trading

The foundation is solid - we just need to enable and optimize what's already there!