Spark version upgrades are deceptively difficult. The Spark 3.0 migration guide alone documents hundreds of behavioral changes, deprecated APIs, and removed features, and many of these changes are semantic. That means the same code compiles and runs but produces different results across different Spark versions: for example, a date parsing expression that worked correctly in Spark 2.4 may silently return different values in Spark 3.x due to the switch from the Julian calendar to the Gregorian calendar. Version upgrades are also made difficult due to the scale of typical enterprise Spark codebases. When you have dozens of jobs across ETL, reporting, and ML pipelines, each with its own combination of DataFrame operations, UDFs, and configuration, manual migration stops scaling well and becomes prone to subtle regressions. Spark migration requires careful analysis, targeted code changes, and thorough validation to ensure that migrated pipelines produce identical results. The migration needs to be driven by an experienced data engineering team, but even that isn’t sufficient to ensure the job is done quickly or without regressions. This is where OpenHands comes in. Such migrations need to be driven by experienced data engineering teams that understand how your Spark pipelines interact, but even that isn’t sufficient to ensure the job is done quickly or without regression. This is where OpenHands comes in. OpenHands assists in migrating Spark applications along every step of the process:
  1. Understanding: Analyze the existing codebase to identify what needs to change and why
  2. Migration: Apply targeted code transformations that address API changes and behavioral differences
  3. Validation: Verify that migrated pipelines produce identical results to the originals
In this document, we will explore how OpenHands contributes to Spark migrations, with example prompts and techniques to use in your own efforts. While the examples focus on Spark 2.x to 3.x upgrades, the same principles apply to cloud platform migrations, framework conversions (MapReduce, Hive, Pig to Spark), and upgrades between Spark 3.x minor versions.

Understanding

Before changing any code, it helps to build a clear picture of what is affected and where the risk is concentrated. Spark migrations touch a large surface area, spanning API deprecations, behavioral changes, configuration defaults, and dependency versions, and the interactions between them are hard to reason about manually. Apache publishes detailed lists of changes between each major and minor version of Spark. OpenHands can use these lists while scanning your codebase to produce a structured inventory of everything that needs attention. This inventory becomes the foundation for the migration itself, helping you prioritize work and track progress. If your Spark project is in /src and you’re migrating from 2.4 to 3.0, the following prompt will generate this inventory:
Analyze the Spark application in `/src` for a migration from Spark 2.4 to Spark 3.0.

Examine the migration guidelines at https://spark.apache.org/docs/latest/migration-guide.html.

Then, for each source file, identify:

1. Deprecated or removed API usages (e.g., `registerTempTable`, `unionAll`, `SQLContext`)
2. Behavioral changes that could affect output (e.g., date/time parsing, CSV parsing, CAST semantics)
3. Configuration properties that have changed defaults or been renamed
4. Dependencies that need version updates

Save the results in `migration_inventory.json` in the following format:

{
  ...,
  "src/main/scala/etl/TransformJob.scala": {
    "deprecated_apis": [
      {"line": 42, "current": "df.registerTempTable(\"temp\")", "replacement": "df.createOrReplaceTempView(\"temp\")"}
    ],
    "behavioral_changes": [
      {"line": 78, "description": "to_date() uses proleptic Gregorian calendar in Spark 3.x; verify date handling with test data"}
    ],
    "config_changes": [],
    "risk": "medium"
  },
  ...
}
Tools like grep and find (both used by OpenHands) are helpful for identifying where APIs are used, but the real value comes from OpenHands’ ability to understand the context around each usage. A simple registerTempTable call can be migrated with a straightforward rename, but a date parsing expression requires understanding how the surrounding pipeline uses the result. This contextual analysis helps developers distinguish between mechanical fixes and changes that need careful testing.
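To make the mechanical half of this concrete, here is a minimal sketch of the kind of scan an agent might run under the hood. The source path and the list of APIs are illustrative assumptions, not output from OpenHands or the full migration guide:

```shell
#!/bin/sh
# Sketch: scan Scala sources for a few deprecated Spark 2.x APIs.
# SRC_DIR and the API list are assumptions for illustration only.
SRC_DIR="${1:-/src}"
for api in registerTempTable unionAll SQLContext HiveContext; do
  echo "== $api =="
  # -r: recurse, -n: print line numbers, --include: Scala sources only
  grep -rn --include='*.scala' "$api" "$SRC_DIR" || echo "(no usages)"
done
```

A scan like this finds every call site with its line number, which maps directly onto the "line" fields in the inventory format above; the contextual judgment about risk still belongs to the agent and the reviewing engineer.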

Migration

With a clear inventory of what needs to change, the next step is applying the transformations. Spark migrations involve a mix of straightforward API renames and subtler behavioral adjustments, and it’s important to handle them differently. For simple renames, we prompt OpenHands to use tools like grep and ast-grep instead of editing source files by hand. This saves tokens and also simplifies future migrations, since agents can reliably re-run the tools from a script.

The main risk in migration is that many Spark 3.x behavioral changes are silent: the migrated code compiles and runs without errors but may produce different results. Date and timestamp handling is the most common source of these silent failures, since Spark 3.x switched to the proleptic Gregorian calendar by default, which changes how dates before 1582-10-15 are interpreted. CSV and JSON parsing also became stricter in Spark 3.x, rejecting malformed inputs that Spark 2.x would silently accept. An example prompt is below:
Migrate the Spark application in `/src` from Spark 2.4 to Spark 3.0.

Use `migration_inventory.json` to guide the changes.

For all low-risk changes (minor syntax changes, updated APIs, etc.), use tools like `grep` or `ast-grep`. Make sure you write the invocations to a `migration.sh` script for future use.

Requirements:
1. Replace all deprecated APIs with their Spark 3.0 equivalents
2. For behavioral changes (especially date handling and CSV parsing), add explicit configuration to preserve Spark 2.4 behavior where needed (e.g., spark.sql.legacy.timeParserPolicy=LEGACY)
3. Update build.sbt / pom.xml dependencies to Spark 3.0 compatible versions
4. Replace RDD-based operations with DataFrame/Dataset equivalents where practical
5. Replace UDFs with built-in Spark SQL functions where a direct equivalent exists
6. Update import statements for any relocated classes
7. Preserve all existing business logic and output schemas
Note the inclusion of the known problem areas in requirement 2. We plan to catch the silent failures associated with them in the validation step, but naming them explicitly during migration helps avoid them altogether.
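A migration.sh produced for the low-risk renames might look like the following sketch. The specific renames shown are safe textual swaps in Spark 3.0 (registerTempTable became createOrReplaceTempView, unionAll became union); the path and the use of GNU sed are assumptions:

```shell
#!/bin/sh
# migration.sh -- sketch of the mechanical, low-risk renames only.
# Assumes GNU sed (sed -i) and a /src layout; adapt to your project.
SRC_DIR="${1:-/src}"
find "$SRC_DIR" -name '*.scala' | while read -r f; do
  # Pure renames: identical semantics in Spark 3.0, so a textual swap is safe.
  sed -i \
    -e 's/\.registerTempTable(/.createOrReplaceTempView(/g' \
    -e 's/\.unionAll(/.union(/g' \
    "$f"
done
```

Because the script is checked in, a future upgrade (or a rollback rehearsal) can re-run the mechanical portion deterministically, leaving only the behavioral changes for careful review.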

Validation

Spark migrations are particularly prone to silent regressions: jobs appear to run successfully but produce subtly different output. Jobs that perform date arithmetic, parse CSVs, or rely on CAST semantics are all vulnerable, especially when migrating between major versions of Spark. The most reliable way to rule out silent regressions is data-level comparison: run both the old and new pipelines on the same input data and compare their outputs directly. This catches subtle errors that unit tests might miss, especially in complex pipelines where a behavioral change in one stage propagates through downstream transformations. An example prompt for data-level comparison:
Validate the migrated Spark application in `/src` against the original.

1. For each job, run both the Spark 2.4 and 3.0 versions on the test data in `/test_data`
2. Compare outputs:
   - Row counts must match exactly
   - Perform column-level comparison using checksums for numeric columns and exact match for string/date columns
   - Flag any NULL handling differences
3. For any discrepancies, trace them back to specific migration changes using the MIGRATION comments
4. Generate a performance comparison: job duration, shuffle bytes, and peak executor memory

Save the results in `validation_report.json` in the following format:

{
  "jobs": [
    {
      "name": "daily_etl",
      "data_match": true,
      "row_count": {"v2": 1000000, "v3": 1000000},
      "column_diffs": [],
      "performance": {
        "duration_seconds": {"v2": 340, "v3": 285},
        "shuffle_bytes": {"v2": "2.1GB", "v3": "1.8GB"}
      }
    },
    ...
  ]
}
Note that this prompt relies on existing data in /test_data. Such data can be generated by standard fuzzing tools, but in a pinch OpenHands can also help construct synthetic data that stresses the relevant corner cases. Every migration is unique, and developer experience is crucial to ensure the testing strategy covers your organization’s requirements. Pay particular attention to jobs that involve date arithmetic, decimal precision in financial calculations, or custom UDFs that may depend on Spark internals. A solid validation suite not only ensures the migrated code works as expected, but also builds the organizational confidence needed to deploy the new version to production.

Beyond Version Upgrades

While this document focuses on Spark version upgrades, the same Understanding → Migration → Validation workflow applies to other Spark migration scenarios:
  • Cloud platform migrations (e.g., EMR to Databricks, on-premises to Dataproc): The “understanding” step inventories platform-specific code (S3 paths, IAM roles, EMR bootstrap scripts), the migration step converts them to the target platform’s equivalents, and validation confirms that jobs produce identical output in the new environment.
  • Framework migrations (MapReduce, Hive, or Pig to Spark): The “understanding” step maps the existing framework’s operations to Spark equivalents, the migration step performs the conversion, and validation compares outputs between the old and new frameworks.
In each case, the key principle is the same: build a structured inventory of what needs to change, apply targeted transformations, and validate rigorously before deploying.