Core Specifications

Dual-Driver Architecture

The Aleatoric Engine uses a Dual-Driver Architecture so that data generation does not depend on how the data is consumed. Given the same configuration and seed, datasets generated for training ML models (Batch) are identical, event for event, to the data streams used for live testing (Stream).
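One way to picture this is a single seeded event core with two thin drivers on top. The sketch below is illustrative only: `generate_events`, `batch_driver`, and `stream_driver` are hypothetical names, and the random-walk price model is an assumption, not the engine's actual generator.

```python
import asyncio
import random
from typing import AsyncIterator, Iterator, List, Tuple

Event = Tuple[str, dict]


def generate_events(seed: int, n_events: int) -> Iterator[Event]:
    """Hypothetical event core: every value is derived from the seed alone."""
    rng = random.Random(seed)
    price = 100.0
    for i in range(n_events):
        price += rng.gauss(0.0, 0.05)
        yield "trade", {"seq": i, "price": round(price, 6)}


def batch_driver(seed: int, n_events: int) -> List[Event]:
    """Drain the core as fast as possible (stand-in for run_batch)."""
    return list(generate_events(seed, n_events))


async def stream_driver(seed: int, n_events: int) -> AsyncIterator[Event]:
    """Yield from the same core one event at a time (stand-in for run_stream)."""
    for event in generate_events(seed, n_events):
        await asyncio.sleep(0)  # hand control back to the event loop; no pacing here
        yield event


async def collect(seed: int, n_events: int) -> List[Event]:
    return [event async for event in stream_driver(seed, n_events)]


batch_events = batch_driver(seed=42, n_events=1_000)
stream_events = asyncio.run(collect(seed=42, n_events=1_000))
```

Because neither driver touches the random state, the two lists compare equal; the drivers differ only in how they hand events to the consumer.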

Driver Functions

Located in src/aleatoric/drivers.py:

`run_batch()`

High-throughput batch generation with optional multiprocessing.

def run_batch(
    config: SimulationManifest,
    duration_seconds: float,
    chunk_size: int = 1000,
    multiprocess: bool = False,
    workers: Optional[int] = None,
    window_seconds: Optional[float] = None,
    max_retries: int = 3,
    backoff_seconds: float = 0.5
) -> Tuple[str, int]:
    """Returns: (file_path, row_count)"""

`run_stream()`

Real-time streaming with optional wall-clock timing.

async def run_stream(
    config: SimulationManifest,
    duration_seconds: Optional[float] = None,
    real_time: bool = True
) -> AsyncGenerator[Tuple[str, dict], None]:
    """Yields: (event_type, event_data) tuples"""

Driver Modes

1. Batch Mode

Purpose: High-throughput generation for historical analysis and ML training
Mechanism: Iterates through market generator as fast as CPU allows, writes chunks to Parquet
Performance: Millions of events per second (CPU bound)
Output: File path to generated Parquet file and row count
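The chunk-and-flush loop described above can be sketched as follows. This is a minimal illustration under assumptions: `iter_chunks` and `write_in_chunks` are hypothetical helpers, and the `flush` callable stands in for whatever appends a chunk to the Parquet file.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_chunks(events: Iterable[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Group an event iterator into lists of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(events)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def write_in_chunks(
    events: Iterable[dict],
    chunk_size: int,
    flush: Callable[[List[dict]], None],
) -> int:
    """Flush each chunk as soon as it is full and return the total row count."""
    rows = 0
    for chunk in iter_chunks(events, chunk_size):
        flush(chunk)  # stand-in for appending a row group to the output file
        rows += len(chunk)
    return rows


# Usage: 2,500 events with the default chunk size of 1,000.
flushed: List[List[dict]] = []
row_count = write_in_chunks(({"seq": i} for i in range(2500)), 1000, flushed.append)
```

Only one chunk is held in memory at a time, so memory use is bounded by `chunk_size` rather than by the total duration.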

2. Batch Mode with Multiprocessing

Purpose: Scale batch generation across multiple CPU cores
Mechanism: Divides duration into time windows, spawns ProcessPool workers
Configuration:
- multiprocess=True: Enable multiprocessing
- workers: Number of parallel workers (default: min(4, CPU count))
- window_seconds: Duration per window (auto-calculated if not specified)
Determinism: Preserved via per-window seed derivation from base seed
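Window splitting and per-window seed derivation could look roughly like the sketch below. The SHA-256 derivation and the helper names `split_windows` and `derive_window_seed` are assumptions chosen for illustration; the engine's actual derivation scheme is not specified here.

```python
import hashlib
from typing import List, Tuple


def split_windows(duration_seconds: float, window_seconds: float) -> List[Tuple[float, float]]:
    """Cover [0, duration) with consecutive windows; the last one may be shorter."""
    if duration_seconds <= 0 or window_seconds <= 0:
        raise ValueError("duration_seconds and window_seconds must be positive")
    windows = []
    index = 0
    while index * window_seconds < duration_seconds:
        start = index * window_seconds
        end = min(start + window_seconds, duration_seconds)
        windows.append((start, end))
        index += 1
    return windows


def derive_window_seed(base_seed: int, window_index: int) -> int:
    """Derive a stable 64-bit seed from the base seed and the window index."""
    digest = hashlib.sha256(f"{base_seed}:{window_index}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


windows = split_windows(duration_seconds=100.0, window_seconds=30.0)
seeds = [derive_window_seed(42, i) for i in range(len(windows))]
```

Each seed depends only on the base seed and the window index, so the output does not change with the number of workers or the order in which windows complete.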

3. Stream Mode

Purpose: Real-time simulation for bot testing, UI development, system integration
Mechanism: Inserts asyncio.sleep() calls between events so that simulated time advances at wall-clock speed (when real_time=True)
Behavior: “Plays back” the simulation in real-time
Output: AsyncGenerator yielding event tuples
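A minimal sketch of wall-clock pacing, assuming each event carries a simulated timestamp in seconds. The `paced` helper and the `(sim_time, event_type, data)` tuple layout are hypothetical and only illustrate the idea.

```python
import asyncio
import time
from typing import AsyncIterator, Iterable, List, Tuple

TimedEvent = Tuple[float, str, dict]  # (simulated_time_seconds, event_type, event_data)


async def paced(
    events: Iterable[TimedEvent], real_time: bool = True
) -> AsyncIterator[Tuple[str, dict]]:
    """Yield events, sleeping until each one's simulated time has elapsed."""
    started = time.monotonic()
    for sim_time, event_type, data in events:
        if real_time:
            # Sleep against an absolute deadline so small delays do not accumulate.
            delay = sim_time - (time.monotonic() - started)
            if delay > 0:
                await asyncio.sleep(delay)
        yield event_type, data


async def demo(real_time: bool) -> Tuple[List[Tuple[str, dict]], float]:
    events = [(0.00, "trade", {"seq": 0}), (0.02, "trade", {"seq": 1}), (0.04, "trade", {"seq": 2})]
    started = time.monotonic()
    received = [event async for event in paced(events, real_time=real_time)]
    return received, time.monotonic() - started


paced_events, paced_elapsed = asyncio.run(demo(real_time=True))
fast_events, fast_elapsed = asyncio.run(demo(real_time=False))
```

With `real_time=False` the same events arrive in the same order, only without the waits, which is what makes the stream usable in fast tests.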

Environment Configuration

Variable	Default	Description
`ALEATORIC_DRIVER_ENABLE_MULTIPROCESS`	`false`	Enable multiprocessing by default
`ALEATORIC_DRIVER_MAX_WORKERS`	auto	Maximum worker processes
`ALEATORIC_DRIVER_WINDOW_SECONDS`	auto	Window duration for multiprocess
`ALEATORIC_DRIVER_MAX_RETRIES`	`3`	Retries for failed windows
`ALEATORIC_DRIVER_BACKOFF_SECONDS`	`0.5`	Backoff between retries
`ALEATORIC_BATCH_CHUNK_SIZE`	`1000`	Events per chunk before flush
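For illustration, these variables could be read into a typed settings object as sketched below. The variable names and defaults come from the table above; the `DriverSettings` class, the accepted boolean spellings, and the treatment of `auto` as `None` are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}  # assumption: accepted spellings for "enabled"


@dataclass(frozen=True)
class DriverSettings:
    enable_multiprocess: bool = False
    max_workers: Optional[int] = None       # None means "auto"
    window_seconds: Optional[float] = None  # None means "auto"
    max_retries: int = 3
    backoff_seconds: float = 0.5
    batch_chunk_size: int = 1000


def read_var(env: Mapping[str, str], name: str, cast: Callable[[str], object], default):
    """Return cast(value) if the variable is set, else the default."""
    raw = env.get(name, "").strip()
    if not raw or raw.lower() == "auto":
        return default
    return cast(raw)


def parse_bool(raw: str) -> bool:
    return raw.lower() in TRUTHY


def load_driver_settings(env: Mapping[str, str] = os.environ) -> DriverSettings:
    d = DriverSettings()
    return DriverSettings(
        enable_multiprocess=read_var(env, "ALEATORIC_DRIVER_ENABLE_MULTIPROCESS", parse_bool, d.enable_multiprocess),
        max_workers=read_var(env, "ALEATORIC_DRIVER_MAX_WORKERS", int, d.max_workers),
        window_seconds=read_var(env, "ALEATORIC_DRIVER_WINDOW_SECONDS", float, d.window_seconds),
        max_retries=read_var(env, "ALEATORIC_DRIVER_MAX_RETRIES", int, d.max_retries),
        backoff_seconds=read_var(env, "ALEATORIC_DRIVER_BACKOFF_SECONDS", float, d.backoff_seconds),
        batch_chunk_size=read_var(env, "ALEATORIC_BATCH_CHUNK_SIZE", int, d.batch_chunk_size),
    )


# Usage with an explicit mapping instead of the real process environment.
settings = load_driver_settings({
    "ALEATORIC_DRIVER_ENABLE_MULTIPROCESS": "true",
    "ALEATORIC_DRIVER_MAX_WORKERS": "8",
    "ALEATORIC_DRIVER_WINDOW_SECONDS": "auto",
    "ALEATORIC_BATCH_CHUNK_SIZE": "500",
})
```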

Determinism Guarantee

Both batch and stream modes guarantee bit-for-bit reproducibility given the same seed:

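# These produce identical event sequences:
file_path, row_count = run_batch(SimulationManifest(seed=42), duration_seconds=100)

# run_stream() is an async generator, so consume it with `async for` inside a coroutine:
async def collect_stream():
    stream = run_stream(SimulationManifest(seed=42), duration_seconds=100, real_time=False)
    return [event async for event in stream]

# Multiprocess also preserves determinism:
config = SimulationManifest(seed=42)
run_batch(config, duration_seconds=100, multiprocess=False)
run_batch(config, duration_seconds=100, multiprocess=True, workers=4)
# ^ Both produce identical output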
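One way to check a reproducibility claim like this in a test suite is to fingerprint each event sequence and compare the digests. The helper below is a sketch under assumptions: `sequence_digest` is a hypothetical name, and canonical JSON plus SHA-256 is just one reasonable encoding for `(event_type, event_data)` tuples.

```python
import hashlib
import json
from typing import Iterable, Tuple


def sequence_digest(events: Iterable[Tuple[str, dict]]) -> str:
    """Hash an event sequence in order, using a canonical JSON form per event."""
    hasher = hashlib.sha256()
    for event_type, data in events:
        line = json.dumps([event_type, data], sort_keys=True, separators=(",", ":"))
        hasher.update(line.encode("utf-8"))
        hasher.update(b"\n")  # delimiter so event boundaries affect the digest
    return hasher.hexdigest()


run_a = [("trade", {"seq": 0, "price": 100.5}), ("trade", {"seq": 1, "price": 100.7})]
run_b = [("trade", {"price": 100.5, "seq": 0}), ("trade", {"price": 100.7, "seq": 1})]
run_c = [("trade", {"seq": 0, "price": 100.5}), ("trade", {"seq": 1, "price": 100.70000001})]

digest_a = sequence_digest(run_a)
digest_b = sequence_digest(run_b)
digest_c = sequence_digest(run_c)
```

Key order inside an event does not affect the digest, but any change to a value or to the order of events does, so equal digests are strong evidence that two runs produced the same sequence.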

L2 Order Book

The engine maintains a full double-sided order book.

Feature	Specification
Depth	Unlimited by default (configurable)
Matching Engine	FIFO (First-In-First-Out)
Order Types	Limit, Market, IOC, FOK, Post-Only
Precision	18 decimal places (floating-point safe)
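To make the FIFO matching rule concrete, here is a small price-time priority sketch for limit orders only. It is not the engine's implementation: the `Order` and `OrderBook` classes are hypothetical, and using `decimal.Decimal` to avoid binary floating-point error is an assumption about how the stated precision might be achieved.

```python
from collections import deque
from dataclasses import dataclass
from decimal import Decimal
from typing import Deque, Dict, List, Tuple

Fill = Tuple[str, str, Decimal, Decimal]  # (resting_id, incoming_id, price, quantity)


@dataclass
class Order:
    order_id: str
    side: str  # "buy" or "sell"
    price: Decimal
    quantity: Decimal


class OrderBook:
    def __init__(self) -> None:
        # Each price level holds a FIFO queue: the earliest order fills first.
        self.bids: Dict[Decimal, Deque[Order]] = {}
        self.asks: Dict[Decimal, Deque[Order]] = {}

    def submit_limit(self, order: Order) -> List[Fill]:
        """Match against the opposite side, then rest any unfilled remainder."""
        if order.side == "buy":
            own, opposite, best = self.bids, self.asks, min
        else:
            own, opposite, best = self.asks, self.bids, max
        fills: List[Fill] = []
        while order.quantity > 0 and opposite:
            level = best(opposite)  # linear scan; a sorted structure would scale better
            crosses = level <= order.price if order.side == "buy" else level >= order.price
            if not crosses:
                break
            queue = opposite[level]
            resting = queue[0]
            traded = min(order.quantity, resting.quantity)
            fills.append((resting.order_id, order.order_id, level, traded))
            order.quantity -= traded
            resting.quantity -= traded
            if resting.quantity == 0:
                queue.popleft()
            if not queue:
                del opposite[level]
        if order.quantity > 0:
            own.setdefault(order.price, deque()).append(order)
        return fills


book = OrderBook()
book.submit_limit(Order("s1", "sell", Decimal("100.10"), Decimal("1")))
book.submit_limit(Order("s2", "sell", Decimal("100.10"), Decimal("1")))
fills = book.submit_limit(Order("b1", "buy", Decimal("100.10"), Decimal("1.5")))
```

The buy order fills `s1` completely before touching `s2`, because both rest at the same price and `s1` arrived first.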

Data Retention

Local: Parquet artifacts written to artifact_storage_dir
Object Storage: S3-compatible backend via ALEATORIC_ARTIFACT_* env vars
Provenance: Cache manifests include seed, preset, and manifest hashes for deterministic replay
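A provenance record of this kind could be built as sketched below. The field names, the `manifest_hash` and `build_cache_record` helpers, and the example preset name are all hypothetical; the real cache manifest layout may differ.

```python
import hashlib
import json
from typing import Any, Dict


def manifest_hash(manifest: Dict[str, Any]) -> str:
    """Hash a manifest via canonical JSON so key order does not change the result."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_cache_record(seed: int, preset: str, manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the values needed to decide whether a cached artifact can be replayed."""
    return {"seed": seed, "preset": preset, "manifest_hash": manifest_hash(manifest)}


def is_replayable(record: Dict[str, Any], seed: int, preset: str, manifest: Dict[str, Any]) -> bool:
    """A cached artifact matches only if seed, preset, and manifest hash all agree."""
    return record == build_cache_record(seed, preset, manifest)


manifest = {"symbol": "BTC-USD", "duration_seconds": 100, "seed": 42}
record = build_cache_record(seed=42, preset="example_preset", manifest=manifest)
same = is_replayable(record, 42, "example_preset", dict(reversed(list(manifest.items()))))
changed = is_replayable(record, 42, "example_preset", {**manifest, "duration_seconds": 200})
```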