Batch Processing Use Cases
While streaming handles real-time updates, batch processing serves three use cases: historical context extraction, periodic full syncs, and large-scale transformations. The ECM Protocol standardizes how batch jobs are specified, partitioned, checkpointed, validated, and scheduled.
Job Specification
Batch Job Definition
{
  "job_id": "historical-sync-001",
  "job_type": "full_extraction",
  "source": {
    "connector": "data_warehouse",
    "query": "SELECT * FROM customer_history WHERE updated_at >= :last_sync"
  },
  "transform": {
    "processor": "customer_transformer",
    "config": {...}
  },
  "sink": {
    "store": "context_store",
    "mode": "upsert"
  },
  "schedule": "0 2 * * *",
  "parameters": {
    "last_sync": "${previous_job.end_time}"
  }
}
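The job definition binds the query parameter `:last_sync` to the placeholder `${previous_job.end_time}`. The protocol text does not say how placeholders are resolved, so the following is only a minimal Python sketch under the assumption that they are dotted paths looked up in a per-run context. The `resolve_parameters` helper and the `run_context` structure are hypothetical names introduced for illustration.

```python
import re
from typing import Any, Dict, Mapping

# Matches placeholders such as ${previous_job.end_time}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def resolve_parameters(parameters: Mapping[str, str],
                       context: Mapping[str, Any]) -> Dict[str, str]:
    """Replace each ${a.b} placeholder with context["a"]["b"] (hypothetical helper)."""

    def lookup(match: "re.Match[str]") -> str:
        value: Any = context
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError for an unknown path
        return str(value)

    return {name: _PLACEHOLDER.sub(lookup, template)
            for name, template in parameters.items()}


# Parameters come from the job definition; the run context is made up for illustration.
job_parameters = {"last_sync": "${previous_job.end_time}"}
run_context = {"previous_job": {"end_time": "2024-01-14T02:31:07Z"}}
print(resolve_parameters(job_parameters, run_context))
```

The resolved values would then be passed to the connector as bound query parameters rather than spliced into the SQL string, which keeps the query safe from injection.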
Partitioning Strategy
Partition Specification
{
  "partitioning": {
    "strategy": "hash",
    "key": "customer_id",
    "partitions": 16
  }
}
Partition Types
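Hash partitioning distributes records evenly across partitions. Range partitioning splits records by ordered key ranges, which supports ordered processing. List partitioning assigns records to partitions by categorical value.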
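The three partition types can be sketched as small assignment functions. This is an illustrative Python sketch, not the protocol's reference implementation: the function names, the use of SHA-256 as the hash, and the shape of the range bounds and list groups are all assumptions.

```python
import bisect
import hashlib
from typing import Dict, List


def hash_partition(key: str, partitions: int) -> int:
    """Map a key to a partition with a stable digest.

    Python's built-in hash() is salted per process for strings, so it would
    assign the same key to different partitions across runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def range_partition(value: int, upper_bounds: List[int]) -> int:
    """Partition i holds values below upper_bounds[i]; the last holds the rest."""
    return bisect.bisect_right(upper_bounds, value)


def list_partition(value: str, groups: Dict[str, int], default: int) -> int:
    """Look up the partition for a categorical value, with a catch-all default."""
    return groups.get(value, default)


# "customer_id" hashed over 16 partitions, as in the partition specification.
print(hash_partition("customer-42", 16))
# Bounds [100, 200] give three ranges: below 100, 100 to 199, 200 and above.
print(range_partition(150, [100, 200]))
# Hypothetical region groups; unknown regions go to partition 2.
print(list_partition("EU", {"EU": 0, "US": 1}, default=2))
```

A stable hash matters for resumable jobs: if a key moved between partitions from one run to the next, skipping a completed partition could silently drop records.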
Checkpoint Protocol
Checkpoint Management
Batch jobs checkpoint progress for resumption:
{
  "checkpoint": {
    "job_id": "historical-sync-001",
    "run_id": "run-abc",
    "partition": 5,
    "offset": 150000,
    "timestamp": "2024-01-15T03:45:00Z",
    "status": "in_progress"
  }
}
Recovery Behavior
On failure, a job resumes from its last checkpoint: partitions that already completed are skipped, and partially processed partitions are replayed from their checkpointed offset.
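The recovery rules can be sketched as a function that turns stored checkpoints into a resume plan. This Python sketch rests on several assumptions: the `plan_recovery` name is hypothetical, a finished partition is assumed to carry the status `"completed"` (the source only shows `"in_progress"`), and a partition with no checkpoint is assumed to start at offset 0.

```python
from typing import Dict, List, Tuple


def plan_recovery(partitions: int, checkpoints: List[dict]) -> List[Tuple[int, int]]:
    """Return (partition, start_offset) pairs for the work that remains."""
    latest: Dict[int, dict] = {}
    for cp in checkpoints:
        current = latest.get(cp["partition"])
        # UTC timestamps in the same ISO 8601 layout sort correctly as strings.
        if current is None or cp["timestamp"] > current["timestamp"]:
            latest[cp["partition"]] = cp

    plan = []
    for partition in range(partitions):
        cp = latest.get(partition)
        if cp is None:
            plan.append((partition, 0))             # never started
        elif cp["status"] != "completed":           # assumed status value
            plan.append((partition, cp["offset"]))  # replay from checkpoint
        # completed partitions are skipped
    return plan


checkpoints = [
    {"partition": 0, "offset": 200000,
     "timestamp": "2024-01-15T03:10:00Z", "status": "completed"},
    {"partition": 1, "offset": 150000,
     "timestamp": "2024-01-15T03:45:00Z", "status": "in_progress"},
]
print(plan_recovery(3, checkpoints))
```

Because a partial partition is replayed from its checkpointed offset, records written after the last checkpoint may be processed twice. The `upsert` sink mode makes such replays idempotent.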
Quality Validation
Validation Rules
{
  "validation": {
    "rules": [
      {"type": "row_count", "min": 100000, "max": 200000},
      {"type": "null_check", "fields": ["customer_id"], "threshold": 0},
      {"type": "freshness", "field": "updated_at", "max_age": "24 hours"}
    ],
    "on_failure": "abort_and_alert"
  }
}
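One way to evaluate these rules against an extracted batch is sketched below in Python. The rule semantics are assumptions, since the source gives only the configuration: `threshold` is read as the maximum allowed fraction of null values, `max_age` is read as `"<n> hours"`, and the freshness rule is taken to mean that the newest value of the field must be no older than `max_age`. The `validate` function and its helpers are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def _parse_ts(text: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def _parse_hours(text: str) -> timedelta:
    amount, unit = text.split()
    if unit not in ("hour", "hours"):
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(hours=int(amount))


def validate(rows: List[dict], rules: List[dict], now: datetime) -> List[str]:
    """Return the names of the failed rules; an empty list means the batch passed."""
    failures = []
    for rule in rules:
        kind = rule["type"]
        if kind == "row_count":
            if not rule["min"] <= len(rows) <= rule["max"]:
                failures.append("row_count")
        elif kind == "null_check":
            for field in rule["fields"]:
                nulls = sum(1 for row in rows if row.get(field) is None)
                fraction = nulls / len(rows) if rows else 0.0
                if fraction > rule["threshold"]:
                    failures.append(f"null_check:{field}")
        elif kind == "freshness":
            field = rule["field"]
            stamps = [_parse_ts(row[field]) for row in rows if row.get(field)]
            if not stamps or now - max(stamps) > _parse_hours(rule["max_age"]):
                failures.append(f"freshness:{field}")
        else:
            raise ValueError(f"unknown rule type: {kind!r}")
    return failures


rules = [
    {"type": "row_count", "min": 1, "max": 5},  # small bounds for the demo
    {"type": "null_check", "fields": ["customer_id"], "threshold": 0},
    {"type": "freshness", "field": "updated_at", "max_age": "24 hours"},
]
rows = [{"customer_id": "c1", "updated_at": "2024-01-15T03:45:00Z"}]
print(validate(rows, rules, now=datetime(2024, 1, 15, 4, 0, tzinfo=timezone.utc)))
```

With `"on_failure": "abort_and_alert"`, a non-empty result would stop the run before anything is written to the sink and raise an alert.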
Scheduling Protocol
Schedule Types
Cron schedules run jobs periodically. Event-triggered schedules start a job when its dependencies complete. Manual schedules cover ad-hoc runs. Backfill schedules process historical ranges.
Dependency Management
{
  "dependencies": [
    {"job": "reference-data-sync", "window": "24 hours"},
    {"job": "customer-extract", "status": "success"}
  ]
}
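A scheduler could check these dependencies before starting a run, as in the Python sketch below. The semantics are assumptions: `window` is taken to mean that the upstream job's latest run must have ended within that many hours, a dependency without an explicit `status` is assumed to require `"success"`, and `last_runs` is a hypothetical lookup of each job's most recent run.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def dependencies_met(dependencies: List[dict],
                     last_runs: Dict[str, dict],
                     now: datetime) -> bool:
    """Return True only if every upstream job satisfies its dependency entry."""
    for dep in dependencies:
        run = last_runs.get(dep["job"])
        if run is None:
            return False  # upstream job has never run
        if run["status"] != dep.get("status", "success"):
            return False
        if "window" in dep:
            amount, unit = dep["window"].split()
            if unit not in ("hour", "hours"):
                raise ValueError(f"unsupported window: {dep['window']!r}")
            if now - run["end_time"] > timedelta(hours=int(amount)):
                return False  # upstream result is too old
    return True


dependencies = [
    {"job": "reference-data-sync", "window": "24 hours"},
    {"job": "customer-extract", "status": "success"},
]
# Hypothetical run history for illustration.
last_runs = {
    "reference-data-sync": {
        "status": "success",
        "end_time": datetime(2024, 1, 14, 12, 0, tzinfo=timezone.utc),
    },
    "customer-extract": {
        "status": "success",
        "end_time": datetime(2024, 1, 15, 1, 0, tzinfo=timezone.utc),
    },
}
print(dependencies_met(dependencies, last_runs,
                       now=datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc)))
```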
Conclusion
The Batch Processing Protocol enables large-scale context extraction and transformation. Implement partitioning for parallelism, checkpointing for reliability, and validation for quality assurance.