# The Aleatoric Engine: Features
## Core Differentiators
### 1. Exchange-Accurate Cadence Simulation
Unlike generic synthetic data generators that emit perfect metronome ticks, The Aleatoric Engine replicates real exchange behavior:
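One way to picture a non-metronomic cadence is a plain Poisson arrival process, where inter-event gaps are exponentially distributed rather than fixed. The sketch below is illustrative only and uses none of the engine's internals:

```python
import random

def poisson_arrivals(rate_per_sec, duration_s, seed=42):
    """Event timestamps from a Poisson process: exponential inter-arrival gaps."""
    rng = random.Random(seed)
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_per_sec)   # mean gap = 1 / rate
        if t >= duration_s:
            return times
        times.append(t)

ticks = poisson_arrivals(rate_per_sec=2.0, duration_s=60.0)
# ~120 irregularly spaced trades over 60s, instead of a perfect 0.5s metronome
```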
#### Rate Limit Simulation
```python
market = HyperSynthReactor(
    symbol="SOL",
    book_update_interval_ms=100,  # Match Binance 100ms snapshot rate
    trade_intensity_base=2.0,     # ~2 trades/sec baseline
)
```

#### Burst Mode
Simulate exchange message avalanches during high volatility:
```python
config = SimulationManifest(
    burst_probability=0.05,       # 5% chance of entering burst
    burst_intensity_factor=10.0,  # 10x message rate during burst
)
```

Real-world scenario: During a liquidation cascade, message rates spike 10-100x. The Aleatoric Engine captures this.
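A minimal way to model burst mode is a two-state (normal/burst) regime switch on the message rate. The `exit_prob` knob below is a hypothetical parameter for leaving a burst, not an engine setting:

```python
import random

def burst_intensities(base_rate, burst_prob, burst_factor, exit_prob=0.5,
                      steps=10_000, seed=7):
    """Per-step message rate under a two-state (normal / burst) regime switch."""
    rng = random.Random(seed)
    in_burst = False
    rates = []
    for _ in range(steps):
        if not in_burst:
            in_burst = rng.random() < burst_prob    # enter burst
        else:
            in_burst = rng.random() >= exit_prob    # stay in / leave burst
        rates.append(base_rate * (burst_factor if in_burst else 1.0))
    return rates

rates = burst_intensities(base_rate=2.0, burst_prob=0.05, burst_factor=10.0)
```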
#### Staleness Simulation
Network jitter and exchange processing delays:
```python
config = SimulationManifest(
    staleness_ms=50.0,  # Mean 50ms lag with log-normal distribution
)
```

Includes:
- Log-normal jitter distribution (realistic network behavior)
- Occasional massive lag spikes (0.5% probability of 100-1000ms delays)
- Separate `timestamp_ms` (exchange time) and `capture_time_ms` (receive time)
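The staleness model above can be sketched with the standard library. The log-normal shape parameter and the spike range below are assumptions chosen to match the stated behavior, not the engine's actual values:

```python
import math
import random

def sample_latency_ms(rng, mean_ms=50.0, spike_prob=0.005):
    """Log-normal latency plus rare massive spikes (shape parameters are assumed)."""
    if rng.random() < spike_prob:
        return rng.uniform(100.0, 1000.0)       # occasional massive lag spike
    sigma = 0.5                                 # log-normal shape (assumed)
    mu = math.log(mean_ms) - sigma ** 2 / 2     # chosen so the mean is ~mean_ms
    return rng.lognormvariate(mu, sigma)

rng = random.Random(1)
lags = [sample_latency_ms(rng) for _ in range(10_000)]
# Each event then carries timestamp_ms (exchange) and capture_time_ms = timestamp_ms + lag
```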
### 2. Microstructure-Correct L2/L3 Behavior
#### Order Book Replenishment
Liquidity providers react to aggressive flow:
```python
# After a large sell hits the bid, bid-side depth depletes
# and slowly replenishes over ~5-10 seconds
market = HyperSynthReactor(
    adverse_selection_strength=0.15,  # 15% depth depletion
    impact_decay_halflife_ms=5000.0,  # 5s recovery
)
```

#### Queue Dynamics
Section titled “Queue Dynamics”- Exponential depth profile: Size decreases exponentially away from best bid/ask
- Volatility-dependent liquidity withdrawal: Higher vol → lower depth
- Toxicity-aware spread widening: Informed flow detection → spreads widen
```python
# Spreads widen under high toxicity (informed trading)
spread_vol_sensitivity=0.15,       # Volatility impact
spread_toxicity_sensitivity=0.25,  # Informed flow impact
```

#### Book Shape Distributions
Real orderbooks don’t have uniform depth. The Aleatoric Engine models:
- Power-law depth decay (`depth_decay_rate=0.85`)
- Volume-weighted microprice calculation
- Realistic bid/ask imbalances
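One simple reading of `depth_decay_rate=0.85` is a multiplicative (geometric) size decay per price level. This sketch is illustrative, not the engine's actual book builder:

```python
def depth_profile(best_size, decay_rate=0.85, n_levels=10):
    """Geometric depth decay away from top of book: size_k = best_size * decay_rate**k."""
    return [best_size * decay_rate ** k for k in range(n_levels)]

bid_sizes = depth_profile(best_size=10.0)
# Sizes shrink multiplicatively away from the best bid: 10.0, 8.5, 7.225, ...
```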
#### Trade Size Heteroskedasticity
```python
trade_size_alpha=2.5,  # Pareto exponent (lower = fatter tail)
min_trade_size=0.1,
max_trade_size=50.0,
```

Result: Many small trades (~0.1-1.0 size), occasional whale trades (10-50 size).
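The heavy-tailed size distribution can be sketched as a clipped Pareto draw. The scheme below is an assumption consistent with the parameters above, not the engine's actual sampler:

```python
import random

rng = random.Random(42)

def pareto_trade_size(alpha=2.5, min_size=0.1, max_size=50.0):
    """Pareto draw with scale min_size and tail index alpha, capped at max_size."""
    return min(min_size * rng.paretovariate(alpha), max_size)

sizes = [pareto_trade_size() for _ in range(10_000)]
share_small = sum(s < 1.0 for s in sizes) / len(sizes)
# Mostly sub-1.0 prints, with a heavy right tail of occasional large trades
```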
### 3. Spot-Perp-Funding Triangular Modeling
#### Complete Crypto Derivatives Framework

Spot Price Process:
- Geometric Brownian Motion (GBM) with configurable drift/volatility
- Jump diffusion for tail events (liquidations, news shocks)
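A minimal jump-diffusion step can be written as GBM in log space plus an occasional Gaussian jump. The per-step jump probability and jump size below are illustrative assumptions:

```python
import math
import random

def gbm_jump_step(price, dt_years, mu, sigma, jump_prob, jump_sigma, rng):
    """One Euler step of GBM in log space, plus an occasional Gaussian jump."""
    log_ret = (mu - 0.5 * sigma ** 2) * dt_years \
              + sigma * math.sqrt(dt_years) * rng.gauss(0.0, 1.0)
    if rng.random() < jump_prob:             # rare tail event (liquidation, news)
        log_ret += rng.gauss(0.0, jump_sigma)
    return price * math.exp(log_ret)

rng = random.Random(0)
price = 100.0
for _ in range(1000):                        # 1000 one-minute steps
    price = gbm_jump_step(price, dt_years=1 / 525_600, mu=0.0, sigma=0.80,
                          jump_prob=0.001, jump_sigma=0.02, rng=rng)
```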
Funding Rate Dynamics:
- Ornstein-Uhlenbeck (OU) mean-reverting process
- Hard bounds to prevent unrealistic rates
- White noise component for micro-jitter
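The bounded OU dynamics can be sketched as a clamped Euler–Maruyama step; the step size, starting rate, and seed below are arbitrary choices for illustration:

```python
import math
import random

def ou_step(rate_bps, mean_bps, kappa, sigma, dt_hours, bounds, rng):
    """One Euler-Maruyama step of an OU process, clamped to hard bounds."""
    drift = kappa * (mean_bps - rate_bps) * dt_hours
    shock = sigma * math.sqrt(dt_hours) * rng.gauss(0.0, 1.0)
    lo, hi = bounds
    return min(max(rate_bps + drift + shock, lo), hi)

rng = random.Random(3)
rate, path = 6.0, []                 # start well away from the mean
for _ in range(500):
    rate = ou_step(rate, mean_bps=0.0, kappa=1.5, sigma=2.0,
                   dt_hours=0.1, bounds=(-8.0, 8.0), rng=rng)
    path.append(rate)
# The rate mean-reverts toward 0 and never leaves [-8, 8]
```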
Perpetual Pricing:
```
Perp Price = Spot × (1 + Funding Rate + Basis Deviation)
```

Example configuration:
```python
from aleatoric.gen.feed import SyntheticTelemetryUplink

feed = SyntheticTelemetryUplink(
    clock=clock,
    gbm_drift_annual=0.0,
    gbm_vol_annual=0.80,
    funding_mean_bps_hr=0.0,
    funding_kappa=1.5,      # Mean reversion speed
    funding_sigma=2.0,      # Volatility in bps/√hour
    funding_bounds_bps_hr=(-8.0, 8.0),
)
```

#### Funding Settlement
Section titled “Funding Settlement”- Configurable settlement intervals (1h, 8h, etc.)
- Price convergence simulation near settlement
- TWAP funding rate calculation
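With equally spaced rate samples, TWAP settlement reduces to a plain mean of the sampled rates. The hourly samples below are hypothetical:

```python
def twap_funding(rate_samples_bps):
    """Time-weighted average funding over a settlement window (equal spacing assumed)."""
    return sum(rate_samples_bps) / len(rate_samples_bps)

# Hypothetical hourly samples over an 8h settlement window
settled = twap_funding([1.0, 1.5, 0.5, -0.5, 0.0, 2.0, 1.0, 0.5])
# → 0.75 bps/hr applied at settlement
```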
### 4. Built-in Multi-Exchange Normalizer
#### Problem: The N+1 Integration Nightmare
Building a trading system that works across Binance, HyperLiquid, OKX, and Bybit requires:
- N different WebSocket parsers
- N different data models
- N different edge cases
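The canonical-schema idea can be sketched with a small dataclass. The field names and mapping below are illustrative assumptions, not the library's actual `NormalizedBookEvent` definition:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CanonicalBook:
    """Illustrative canonical event; field names are assumptions, not the library's schema."""
    exchange: str
    symbol: str
    timestamp_ms: int       # exchange event time
    capture_time_ms: int    # local receive time
    bids: tuple             # ((price, size), ...), best first
    asks: tuple

def normalize_hyperliquid(raw, symbol, capture_time_ms):
    """Map HyperLiquid-style {'px', 'sz'} levels into canonical (price, size) tuples."""
    def levels(side):
        return tuple((float(l["px"]), float(l["sz"])) for l in side)
    bids_raw, asks_raw = raw["levels"]
    return CanonicalBook("hyperliquid", symbol, raw["time"], capture_time_ms,
                         levels(bids_raw), levels(asks_raw))

raw = {"time": 1700000000000,
       "levels": [[{"px": "100.50", "sz": "10.5000", "n": 3}],
                  [{"px": "100.55", "sz": "4.0000", "n": 1}]]}
book = normalize_hyperliquid(raw, "SOL", capture_time_ms=1700000000042)
```

One canonical model means downstream strategy code never touches exchange-specific field names.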
#### Solution: Emit Exchange-Specific + Normalize
Step 1: Generate exchange-accurate raw data
```python
from aleatoric.gen.hyperliquid_format import stream_hyperliquid_format

for channel, data in stream_hyperliquid_format(market, duration_seconds=60):
    if channel == 'l2Book':
        # Exact HyperLiquid WsBook format
        bids, asks = data['levels']
        print(bids[0])  # {'px': '100.50', 'sz': '10.5000', 'n': 3}
```

Step 2: Normalize to canonical schema
```python
from aleatoric.process.normalizer import CanonicalizationEngine

normalizer = CanonicalizationEngine(enable_cache=True)
norm_event = normalizer.normalize_synthetic(event_type, event)

# Result: NormalizedBookEvent with standardized structure
# Works identically for HyperLiquid, Binance, synthetic data
```

Step 3: Cache for reusability
```python
df = normalizer.normalize_and_cache(
    source="synthetic",
    symbol="SOL",
    start_date="2025-08-01",
    end_date="2025-08-07",
    seed=42,
)
# Cached to ~/.hft_cache/ as LZ4-compressed Parquet
```

## Use Cases
### For AI/ML Training
```python
# Generate 1 year of high-fidelity data in minutes
config = SimulationManifest(
    symbol="BTC",
    volatility_annual=0.8,
    burst_probability=0.10,  # More frequent bursts for stress testing
    seed=42,                 # Reproducible
)

market = HyperSynthReactor.from_config(config)
events = market.stream(duration_seconds=365*24*3600)

# Train your model on realistic microstructure
```

### For Strategy Backtesting
```python
# Test your orderbook-based strategy
for event_type, event in market.stream(duration_seconds=3600):
    if event_type == 'book':
        # Your strategy logic
        imbalance = calculate_imbalance(event.bids, event.asks)
        if imbalance > threshold:
            send_order()
```

### For Infrastructure Testing
```python
# Stress test with burst mode
config = SimulationManifest(
    burst_probability=0.20,       # 20% chance
    burst_intensity_factor=50.0,  # 50x message rate
    staleness_ms=100.0,           # 100ms mean lag
)

# Validate your WebSocket client handles:
# - Message bursts
# - Out-of-order timestamps
# - Stale data detection
```

### For Data Vendor Development
```python
# Generate historical-like datasets
normalizer = CanonicalizationEngine(enable_cache=True)

for symbol in ["BTC", "ETH", "SOL"]:
    df = normalizer.normalize_and_cache(
        source="synthetic",
        symbol=symbol,
        start_date="2024-01-01",
        end_date="2024-12-31",
        force_regenerate=True,
    )
    # Sell to customers as "backtesting dataset"
```

## 📊 Validation & Quality
### Statistical Properties
Section titled “Statistical Properties”The Aleatoric Engine matches real market signatures:
- Volatility clustering: GARCH-like behavior via realized vol feedback
- Heavy tails: Jump diffusion + power-law trade sizes
- Autocorrelation: Impulse responses create realistic serial correlation
- Volume clustering: Temporal trade intensity bursts
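These signatures are easy to check empirically: raw returns show near-zero lag-1 autocorrelation while absolute returns do not. The toy GARCH(1,1) recursion below is a stand-in for the engine's realized-vol feedback, not its actual mechanism:

```python
import math
import random

def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

rng = random.Random(0)
sig2, rets = 1.0, []
for _ in range(20_000):
    r = math.sqrt(sig2) * rng.gauss(0.0, 1.0)
    rets.append(r)
    sig2 = 0.05 + 0.10 * r * r + 0.85 * sig2   # GARCH(1,1)-style vol feedback

raw_ac = autocorr(rets, 1)                      # near 0: returns are unpredictable
abs_ac = autocorr([abs(x) for x in rets], 1)    # positive: volatility clusters
```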
### Validation Tools
```python
from aleatoric.gen.feed_eda import validate_bid_ask_consistency

df = build_eda_dataframe(steps=5000)
results = validate_bid_ask_consistency(df)

# Automated checks:
# ✓ Bid/ask ordering (bids descending, asks ascending)
# ✓ Spread consistency
# ✓ Quantity distributions
# ✓ Funding skew correlation
```

## Performance
### Caching System
Section titled “Caching System”- LZ4 compression: 10-20x reduction, 500+ MB/s decompression
- Parquet columnar storage: Efficient time-series queries
- Metadata tracking: Full reproducibility
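Reproducible caching typically hashes the generation parameters into a deterministic key, so identical requests hit the cache and any change forces regeneration. The scheme below is a sketch, not the library's actual key format:

```python
import hashlib
import json

def cache_key(**params):
    """Deterministic 16-hex-char key from generation parameters (illustrative scheme)."""
    blob = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

key = cache_key(source="synthetic", symbol="SOL",
                start_date="2025-08-01", end_date="2025-08-07", seed=42)
# Identical parameters reproduce the key (cache hit); any change forces regeneration
```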
Example:
```
✅ Cache hit: a3f2b9c8d1e4f5a6
   Loaded 1,000,000 events (800,000 books, 200,000 trades)
   Compression: 15.2x, Size: 12.4 MB
```

### Generation Speed
Section titled “Generation Speed”- Book updates: ~100,000/sec (single-threaded)
- Trade generation: ~50,000/sec
- Full day (86,400s) of 100ms books: Generated in ~10 seconds