Historical Data Caching Guide¶
How to use the built-in caching system to speed up repeated backtests.
Overview¶
CrackTrader includes a caching system that is disabled by default. Enabling it can significantly reduce repeated backtest load times by avoiding redundant downloads.
Quick Enable¶
from cracktrader import CCXTStore
# Enable caching
store = CCXTStore(
    exchange='binance',
    cache_enabled=True,         # Enable caching
    cache_dir="./trading_data"  # Custom location (optional)
)
How It Works¶
File Structure¶
trading_data/
└── binance/
    └── spot/
        └── BTC_USDT/
            └── 1m/
                ├── 2024-01.pkl    # January 2024 data
                ├── 2024-02.pkl    # February 2024 data
                ├── 2024-03.pkl    # March 2024 data
                └── metadata.json  # Cache statistics
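Each (exchange, market, symbol, timeframe, month) combination maps to a single file. Here is a minimal sketch of how such a path could be built; cache_path is an illustrative helper, not CrackTrader's actual internal API:

import os

def cache_path(base_dir, exchange, market, symbol, timeframe, month):
    """Build a monthly cache file path matching the layout above (illustrative)."""
    return os.path.join(
        base_dir, exchange, market,
        symbol.replace("/", "_"),  # BTC/USDT -> BTC_USDT
        timeframe, f"{month}.pkl",
    )

# cache_path("trading_data", "binance", "spot", "BTC/USDT", "1m", "2024-01")
# -> 'trading_data/binance/spot/BTC_USDT/1m/2024-01.pkl'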
Automatic Features¶
- Monthly segmentation: Efficient storage and retrieval
- Deduplication: Candles sharing a timestamp are merged automatically (see the sketch after this list)
- Incremental updates: Only fetches new data since the last update
- Multi-exchange: Separate caches per exchange
- Thread-safe: Concurrent strategy support
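To make segmentation and deduplication concrete, here is a rough sketch of the idea, assuming OHLCV rows keyed by their open timestamp in epoch milliseconds; these helpers are illustrative, not the library's actual code:

from datetime import datetime, timezone

def month_key(timestamp_ms: int) -> str:
    """Map a candle's open time (epoch ms) to its monthly segment, e.g. '2024-01'."""
    dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{dt.year}-{dt.month:02d}"

def merge_candles(cached: list, fresh: list) -> list:
    """Deduplicate by open time: fresher rows replace cached rows with the same timestamp."""
    by_ts = {row[0]: row for row in cached}
    by_ts.update({row[0]: row for row in fresh})
    return [by_ts[ts] for ts in sorted(by_ts)]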
Cache Behavior¶
First Run (Cache Miss)¶
feed = CCXTDataFeed(
    store=store,
    symbol="BTC/USDT",
    ccxt_timeframe="1m",
    historical_limit=100000  # 100K candles
)
# Fetches from API and writes cache on first run
Subsequent Runs (Cache Hit)¶
# Same code, but now cached
feed = CCXTDataFeed(store=store, symbol="BTC/USDT", ccxt_timeframe="1m", historical_limit=100000)
# Loads from cache on subsequent runs
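To see the speedup yourself, time the feed construction on a cold cache versus a warm one; exact numbers depend on your network, disk, and data volume:

import time

start = time.perf_counter()
feed = CCXTDataFeed(store=store, symbol="BTC/USDT", ccxt_timeframe="1m", historical_limit=100000)
print(f"Feed loaded in {time.perf_counter() - start:.2f}s")  # far lower on a warm cache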
Incremental Updates¶
# Later, when you run again:
feed = CCXTDataFeed(store=store, symbol="BTC/USDT", ccxt_timeframe="1m", historical_limit=100000)
# Only fetches data since last update
# Cache grows incrementally
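Conceptually, an incremental update only needs candles newer than the latest cached timestamp. A rough sketch of the idea using get_cache_info (introduced in the next section); the store performs the equivalent step internally, so this is illustration rather than required usage:

cache_info = store._data_cache.get_cache_info(
    exchange="binance", symbol="BTC/USDT", timeframe="1m"
)
since = cache_info["latest_timestamp"] + 1  # everything after the newest cached candle
# The store fetches only candles from `since` onward and appends them to the
# current monthly segment, so the cache grows without re-downloading history.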
Cache Management (advanced)¶
Get Cache Info¶
cache_info = store._data_cache.get_cache_info(
    exchange="binance",
    symbol="BTC/USDT",
    timeframe="1m"
)
print(f"Total candles: {cache_info['total_candles']}")
print(f"Date range: {cache_info['earliest_timestamp']} to {cache_info['latest_timestamp']}")
print(f"Files: {cache_info['files']}")
Clear Cache¶
# Clear entire cache or specific subsets
store._data_cache.clear_cache() # all
store._data_cache.clear_cache(exchange="binance")
store._data_cache.clear_cache(exchange="binance", symbol="BTC/USDT")
store._data_cache.clear_cache(exchange="binance", symbol="BTC/USDT", timeframe="1m")
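If you need finer-grained cleanup than clear_cache offers, the monthly layout shown earlier makes it possible to delete old segments directly. A sketch, assuming that layout and YYYY-MM.pkl file names; drop_months_before is a hypothetical helper, not part of the library:

import os

def drop_months_before(base_dir: str, cutoff: str) -> None:
    """Delete monthly .pkl segments older than cutoff, e.g. '2024-01' (lexical compare works for YYYY-MM)."""
    for root, _, files in os.walk(base_dir):
        for name in files:
            if name.endswith(".pkl") and name[:-4] < cutoff:
                os.remove(os.path.join(root, name))

# drop_months_before(store._data_cache.base_dir, "2024-01")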
Notes on performance¶
Cache effectiveness depends on data volume, disk speed, and how often you re-run the same ranges. It pays off most when you iterate repeatedly on the same symbols and timeframes; a one-off backtest is always a cache miss and gains little.
Large Dataset Strategy¶
For multi-year, multi-symbol backtests:
1. Enable Caching by Default¶
# Enable by default for larger ranges
store = CCXTStore(exchange='binance', cache_enabled=True, cache_dir='./trading_data')
2. Batch Data Loading¶
symbols = ["BTC/USDT", "ETH/USDT", "ADA/USDT", "SOL/USDT"]
timeframes = ["1m", "5m", "1h"]
# Pre-load all data (runs once, then cached)
for symbol in symbols:
    for timeframe in timeframes:
        feed = CCXTDataFeed(
            store=store,
            symbol=symbol,
            ccxt_timeframe=timeframe,
            historical_limit=500000  # ~1 year of 1m data
        )
# First run: slow (fetches and caches)
# Subsequent runs: instant
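To check the progress of a long pre-load run, you can query the cache for each combination, using get_cache_info from the cache-management section above:

for symbol in symbols:
    for timeframe in timeframes:
        info = store._data_cache.get_cache_info("binance", symbol, timeframe)
        status = f"{info['total_candles']} candles cached" if info["total_candles"] else "not cached yet"
        print(f"{symbol} {timeframe}: {status}")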
3. Monitor Cache Size¶
# Check cache directory size
du -sh trading_data/
# Typical sizes:
# 100K candles: ~5-10MB per symbol/timeframe
# 1M candles: ~50-100MB per symbol/timeframe
Cache Configuration¶
Custom Directories¶
import os

# User-specific cache location
store = CCXTStore(
    exchange_config,
    cache_enabled=True,
    cache_dir=os.path.expanduser("~/trading_data")  # resolves to the user's home directory
)
# Project-specific cache
store = CCXTStore(
    exchange_config,
    cache_enabled=True,
    cache_dir="./data/backtest_cache"  # Relative to project
)
Multiple Cache Locations¶
# Separate caches for different strategies
backtest_store = CCXTStore(config, cache_enabled=True, cache_dir="./data/backtesting")
live_store = CCXTStore(config, cache_enabled=True, cache_dir="./data/live_trading")
Troubleshooting¶
Cache Not Working¶
import os

# Check if caching is enabled
print(f"Cache enabled: {store._cache_enabled}")
print(f"Cache object: {hasattr(store, '_data_cache')}")

# Verify cache directory exists
print(f"Cache dir exists: {os.path.exists(store._data_cache.base_dir)}")
Performance Still Slow¶
# Check if data is actually cached
cache_info = store._data_cache.get_cache_info("binance", "BTC/USDT", "1m")
if cache_info["total_candles"] == 0:
print("No data cached yet - first run will be slow")
else:
print(f"Cached: {cache_info['total_candles']} candles")
Disk Space Issues¶
# Check cache size (sum file sizes; shutil.disk_usage would report the whole disk,
# not the cache directory)
import os

cache_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, files in os.walk(store._data_cache.base_dir)
    for name in files
)
print(f"Cache size: {cache_bytes / (1024**3):.2f} GB")

# Clear old data if needed
store._data_cache.clear_cache(exchange="binance", symbol="OLD/SYMBOL")
Best Practices¶
- Enable caching for any backtesting with >10K candles
- Use consistent timeframes to maximize cache hits
- Monitor disk usage for large multi-symbol strategies
- Clear old caches periodically to save disk space
- Use monthly segmentation (already default) for efficient storage
Future Enhancements¶
Planned improvements:
- Parquet format: 5x smaller files, 10x faster loading
- Compression: Automatic data compression
- Cloud storage: S3/GCS backend for shared caches
- Streaming: Memory-efficient processing of large datasets
See: Large Datasets | Benchmarking