AWS-native Spark Control Plane

Govern Spark runs on AWS without building a control plane from scratch.

SparkPilot is built for platform teams running Spark on AWS. You get preflight safety checks, run diagnostics, and cost visibility in one place while keeping infrastructure and data inside your AWS account.

20+ preflight checks before every dispatch
3 Spark runtimes in one control plane
A 5-step guided pilot to production
Pilot Evaluation

What you can review before rollout

Start with a live demo, then use pilot artifacts to align technical, security, and buyer stakeholders.

Live guided demo · Available now

Walk through submission, preflight, run tracking, and diagnostics with your team and workload shape.

Request pilot
Pilot screenshot pack · In beta

Redacted screenshots and run summaries are shared during active pilot evaluations.

Request pilot evidence pack
The Problem

What teams replace in week one

Platform teams running Spark on shared EKS clusters hit the same operational bottlenecks. SparkPilot replaces manual run prep with a governed workflow.

Before: Manually validate IAM trust policy and OIDC association before every job
After: SparkPilot checks 20+ conditions automatically and fails fast with remediation steps

Before: Hunt through 6 different CloudWatch log groups to find your job output
After: Each run links directly to the correct log stream

Before: Estimate cost in a spreadsheet after the job finishes
After: Estimated cost is recorded per run, and actuals can be reconciled against CUR billing data when configured

Before: Engineers share a cluster namespace with no isolation or quota enforcement
After: Each team gets scoped access, resource quotas, and budget guardrails
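The fail-fast preflight pattern described above can be sketched in a few lines. This is a minimal illustration only; the check names, lambdas, and remediation strings are hypothetical, not SparkPilot's actual API.

```python
# Sketch of a fail-fast preflight gate. Check names and remediation
# strings are illustrative assumptions, not SparkPilot internals.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    passed: bool
    remediation: str = ""

def run_preflight(checks: list[tuple[str, Callable[[], bool], str]]) -> list[CheckResult]:
    """Run every check and attach remediation guidance to each failure."""
    results = []
    for name, check, remediation in checks:
        ok = check()
        results.append(CheckResult(name, ok, "" if ok else remediation))
    return results

# Hypothetical checks: the second one fails and blocks dispatch.
checks = [
    ("iam_trust_policy", lambda: True, "Add sts:AssumeRoleWithWebIdentity to the trust policy"),
    ("oidc_association", lambda: False, "Associate the cluster's OIDC provider with IAM"),
]
results = run_preflight(checks)
blocked = [r for r in results if not r.passed]
for r in blocked:
    print(f"BLOCKED {r.name}: {r.remediation}")
```

The point is the shape of the workflow: every check runs, every failure carries its own fix, and a non-empty `blocked` list prevents dispatch.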
Capabilities

Core capabilities for pilot and rollout

These are the capabilities teams use first. Each one carries an availability label so you can plan rollout clearly.

Preflight Safety Gates

IAM, IRSA, OIDC, resource quota, and Spot capacity checks run before a single byte moves; Lake Formation permission checks, currently in beta, run when enabled. Bad configs are blocked with clear remediation steps so teams can fix issues before dispatch.

CUR-Aligned Cost Attribution

SparkPilot provides per-run cost estimates before dispatch and can reconcile against AWS Cost and Usage Report data in Athena when CUR integration is configured.
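As a rough illustration of estimate-then-reconcile, the sketch below prices a run from vCPU-hours and memory-GB-hours and flags drift against a billed actual. The rates and the 15% tolerance are assumptions for the example, not SparkPilot's pricing logic.

```python
# Illustrative per-run cost estimate and CUR reconciliation check.
# Rates and tolerance are assumed values for this sketch only.
VCPU_RATE = 0.01012      # assumed USD per vCPU-hour
MEM_RATE = 0.00111125    # assumed USD per memory-GB-hour

def estimate_run_cost(vcpu_hours: float, mem_gb_hours: float) -> float:
    """Pre-dispatch estimate from requested capacity."""
    return round(vcpu_hours * VCPU_RATE + mem_gb_hours * MEM_RATE, 4)

def reconcile(estimate: float, cur_actual: float, tolerance: float = 0.15) -> bool:
    """True if the billed actual is within `tolerance` of the estimate."""
    return abs(cur_actual - estimate) <= tolerance * estimate

# A run requesting 40 vCPU-hours and 160 GB-hours of memory:
est = estimate_run_cost(vcpu_hours=40.0, mem_gb_hours=160.0)
```

In practice the `cur_actual` side would come from a CUR query in Athena, keyed by run tags; the reconciliation itself reduces to a tolerance comparison like the one above.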

Multi-Tenant Isolation

Tenants, teams, environments, and runs are fully scoped. Each environment gets its own namespace, IRSA bindings, and resource quotas. Teams share a cluster without interference.

Governance and Audit

Role-based access is enforced across SparkPilot APIs, with team-environment scopes, budget guardrails, and audit events for key control-plane actions.
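A toy model of team-environment scoped access with a budget guardrail might look like this; the grant table, role names, and hard-block policy are illustrative assumptions, not SparkPilot's data model.

```python
# Toy team-environment scope table with a budget hard-block.
# All names and the guardrail policy are assumptions for this sketch.
GRANTS = {
    ("analytics", "prod"): {"role": "operator", "budget_usd": 500.0},
    ("analytics", "dev"):  {"role": "admin",    "budget_usd": 100.0},
}

def authorize(team: str, env: str, spent_usd: float) -> tuple[bool, str]:
    """Deny outside-scope requests and hard-block once the budget is spent."""
    grant = GRANTS.get((team, env))
    if grant is None:
        return False, "no team-environment scope"
    if spent_usd >= grant["budget_usd"]:
        return False, "budget guardrail: hard-block"
    return True, grant["role"]
```

The useful property is that scope and budget are checked in one place, so every API call and audit event sees the same decision.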

Bring Your Own Cloud

SparkPilot runs inside your AWS account. Your VPC, S3 buckets, and IAM policies stay under your control. BYOC-Lite connects to an existing EKS cluster quickly, provided IAM and OIDC prerequisites are in place.

Runtime Management

Three background workers (Scheduler, Reconciler, and Provisioner) manage the run lifecycle. SparkPilot dispatches queued jobs to AWS, tracks state transitions, and links each run to its exact logs. You track runs in a dashboard, not a CloudWatch stream.

Structured Diagnostics

When a run fails, SparkPilot classifies the cause: OOM kill, Spot interruption, S3 access denied, timeout, or user error. Engineers get a clear starting point for remediation.
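Pattern-matching over the driver log tail is one simple way to build such a classifier. The category names below mirror the causes listed above, but the regex rules are illustrative assumptions, not SparkPilot's actual heuristics.

```python
import re

# Sketch of failure classification from a driver log tail.
# Patterns are illustrative; real heuristics would be richer.
RULES = [
    (re.compile(r"OOMKilled|exit code 137"), "oom_kill"),
    (re.compile(r"[Ss]pot [Ii]nterruption|instance-terminated-no-capacity"), "spot_interruption"),
    (re.compile(r"AccessDenied|403 Forbidden"), "s3_access_denied"),
    (re.compile(r"timed? ?out", re.I), "timeout"),
]

def classify_failure(log_tail: str) -> str:
    """Return the first matching cause, defaulting to user_error."""
    for pattern, cause in RULES:
        if pattern.search(log_tail):
            return cause
    return "user_error"
```

Rule order matters: infrastructure causes are tried first so that, say, an OOM-killed executor is not misfiled as a generic user error.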

Guided Onboarding

A step-by-step wizard validates cross-account trust, OIDC federation, namespace prerequisites, and execution role bindings, with actionable guidance for misconfigurations.

Run Lifecycle

How SparkPilot handles run operations

Teams submit through API, CLI, Airflow, or Dagster. SparkPilot manages dispatch, state reconciliation, and diagnostics so operators do not stitch together raw AWS calls.

queued → dispatching → accepted → running → succeeded · failed · cancelled · timed_out
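The lifecycle above can be sketched as a small transition map. The exact edge set (for example, whether dispatching can fail directly) is an assumption for illustration.

```python
# Run-state machine sketched from the lifecycle diagram above.
# Edge details beyond the happy path are assumptions.
TRANSITIONS = {
    "queued": {"dispatching"},
    "dispatching": {"accepted", "failed"},
    "accepted": {"running", "failed"},
    "running": {"succeeded", "failed", "cancelled", "timed_out"},
}
TERMINAL = {"succeeded", "failed", "cancelled", "timed_out"}

def advance(state: str, nxt: str) -> str:
    """Apply one transition, rejecting anything outside the map."""
    if nxt not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {nxt}")
    return nxt
```

Encoding legal transitions explicitly is what lets a reconciler detect stalled or out-of-order states instead of silently overwriting them.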
Scheduler

Polls for queued runs and dispatches them to EMR on EKS, EMR Serverless, or EMR on EC2. Manages concurrency limits and environment-level queueing.

Reconciler

Continuously polls EMR for job state changes and writes structured transitions from accepted to running to succeeded or failed. Detects stalled runs and triggers timeout handling.

Provisioner

Manages environment lifecycle, including BYOC-Lite and Full BYOC (in beta) provisioning, checkpoint recovery across Terraform stages, and environment teardown.

Supported Engines

One control plane, three Spark runtimes today

SparkPilot routes submissions to EMR on EKS today, with beta coverage for EMR Serverless and EMR on EC2. Databricks routing is planned as a coming-soon extension.

EMR on EKS · Available now

Native EMR virtual cluster on your EKS cluster for production Spark workloads.

EMR Serverless · In beta

Submit to an EMR Serverless application for fully managed capacity. No EKS cluster required.

EMR on EC2 · In beta

Dispatch to existing EMR on EC2 clusters via step submission. Integrates with your current EC2-based Spark estate.

Databricks on AWS · Coming soon

Planned support for Databricks Jobs API routing from the SparkPilot control plane.

How It Works

From pilot kickoff to rollout in five steps

1

Define pilot scope

Align on one workload family, success criteria, and owner roles before setup starts. This keeps pilot scope clear and measurable.

Open the pilot guide
2

Connect your AWS account

Create the cross-account IAM role and OIDC association. SparkPilot validates trust, permissions, and namespace prerequisites with clear remediation steps.

3

Choose deployment model

BYOC-Lite connects to your existing EKS cluster quickly. Full BYOC is in beta for teams that need VPC, EKS, and EMR provisioning from Terraform modules.

4

Submit your first governed run

Encode submission patterns as versioned templates, including Spot configurations, Graviton instance preferences, S3 Express paths, container images, and Spark configuration baselines.
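A versioned template of this kind could be as simple as the structure below. Every key, the bucket name, and the ECR path are hypothetical examples covering the knobs mentioned above (Spot, Graviton, S3 Express, container image, Spark conf), not a real SparkPilot schema.

```python
# Hypothetical versioned run template; all names and paths are made up.
TEMPLATE = {
    "name": "nightly-etl",
    "version": 3,
    "capacity": {"spot": True, "instance_families": ["m7g", "r7g"]},  # Graviton preference
    "paths": {"output": "s3://example-bucket--usw2-az1--x-s3/etl/"},  # S3 Express style path
    "image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/spark:3.5",  # example image
    "spark_conf": {"spark.sql.shuffle.partitions": "400"},
}

def render_submission(template: dict, overrides: dict) -> dict:
    """Merge per-run overrides over the versioned baseline."""
    return {**template, **overrides}
```

Versioning the baseline means a per-run override changes only that run, while the template itself stays auditable.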

5

Review outcomes and decide rollout

Compare pilot results against your success criteria, including reliability, diagnostics, and cost visibility. Then move to production rollout with the same control plane.

Integrations and Interfaces

Use SparkPilot from orchestrators, terminal, or API

SparkPilot supports workflow engines and engineer-first interfaces, so teams can adopt it through existing DAGs, CI pipelines, and terminal-driven operations.

Apache Airflow

SparkPilotSubmitRunOperator with full deferrable trigger support. Drop into any existing DAG, sync or async.

Operator | Hook | Sensor | Async Trigger
Dagster

Native @asset definitions and ops for run submission, polling, and cancellation. Works with Dagster Cloud and OSS.

Assets | Ops | Config Schema
SparkPilot CLI

Engineers can submit, inspect, cancel, and tail runs from terminal workflows without opening the dashboard.

run-submit | run-list | run-logs | usage-get
SparkPilot API

Teams can integrate SparkPilot into internal portals and automation jobs through authenticated REST endpoints.

REST API | RBAC | Audit Trail
Airflow and Dagster providers are installable from source today. API and CLI interfaces are provided for active pilot workflows and automation.
Why SparkPilot

What you don't get with DIY or EMR Serverless

DIY gives you primitives. EMR Serverless removes cluster management. Neither gives you a multi-tenant control plane with built-in governance.

Capability | DIY on AWS | EMR Serverless | SparkPilot
Preflight IAM/OIDC validation
Multi-tenant namespace isolation on EKS
CUR-aligned cost attribution per team
Budget guardrails with hard-block
Spot diversification validation at preflight
Airflow and Dagster native integrations
Kubernetes-native control plane
No infra management required
This table shows what SparkPilot adds beyond base AWS primitives. DIY rows reflect capabilities you can build yourself, while SparkPilot ships them configured and enforced.
Still evaluating?

Common questions, honest answers

We outline tradeoffs so your team can choose the right path.

Start with a guided Spark pilot

Share your workload profile and we will map a practical pilot plan with clear success criteria, owner responsibilities, and rollout options.