ANSI Migration Guide - Pandas API on Spark#

ANSI mode is now enabled by default for Pandas API on Spark. This guide walks through the key behavior differences you will see. In short, with ANSI mode on, Pandas API on Spark matches native pandas in cases where it previously (with ANSI off) did not.
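
You can inspect or toggle the setting at runtime through the Spark SQL configuration, just as the examples in this guide do. A minimal sketch, assuming spark is your active SparkSession:

>>> spark.conf.get("spark.sql.ansi.enabled")          # inspect the current value
>>> spark.conf.set("spark.sql.ansi.enabled", False)   # temporarily restore the old behavior
>>> spark.conf.set("spark.sql.ansi.enabled", True)    # switch back to the new default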

Behavior Changes#

String Number Comparison#

ANSI off: Spark implicitly casts numbers and strings, so 1 and '1' are considered equal.

ANSI on: behaves like pandas; 1 == '1' is False.

Examples are shown below:

>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> pdf = pd.DataFrame({"int": [1, 2], "str": ["1", "2"]})
>>> psdf = ps.from_pandas(pdf)

# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> psdf["int"] == psdf["str"]
0    False
1    False
dtype: bool

# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> psdf["int"] == psdf["str"]
0    True
1    True
dtype: bool

# Pandas
>>> pdf["int"] == pdf["str"]
0    False
1    False
dtype: bool
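
If your code relied on the old implicit coercion, the usual migration path is to make the cast explicit before comparing. A minimal sketch, reusing psdf from above; casting the string column to int and then comparing should yield True for both rows:

# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> psdf["str"].astype(int) == psdf["int"]
0    True
1    True
dtype: bool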

Strict Casting#

ANSI off: invalid casts (e.g., casting 'a' to int) quietly become NULL.

ANSI on: the same casts raise errors.

Examples are shown below:

>>> pdf = pd.DataFrame({"str": ["a"]})
>>> psdf = ps.from_pandas(pdf)

# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> psdf["str"].astype(int)
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.NumberFormatException: [CAST_INVALID_INPUT] ...

# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> psdf["str"].astype(int)
0   NaN
Name: str, dtype: float64

# Pandas
>>> pdf["str"].astype(int)
Traceback (most recent call last):
...
ValueError: invalid literal for int() with base 10: 'a'
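
If you relied on invalid values quietly becoming NULL, make that handling explicit instead. A minimal sketch with a new small Series, assuming the values are plain non-negative integer strings so that Series.str.isnumeric() is a sufficient validity check:

# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> psser = ps.Series(["1", "a", "2"])
>>> psser.loc[psser.str.isnumeric()].astype(int)  # drop invalid rows before casting
0    1
2    2
dtype: int64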

MultiIndex.to_series Return#

ANSI off: Each row is returned as an ArrayType value, e.g. [1, red].

ANSI on: Each row is returned as a StructType value, which appears as a tuple (e.g., (1, red)) if the Runtime SQL Configuration spark.sql.execution.pandas.structHandlingMode is set to 'row'. Otherwise, the result may vary depending on whether Arrow is used. See more in the Spark Runtime SQL Configuration docs.

Examples are shown below:

>>> arrays = [[1, 2], ["red", "blue"]]
>>> pidx = pd.MultiIndex.from_arrays(arrays, names=("number", "color"))
>>> psidx = ps.from_pandas(pidx)

# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> spark.conf.set("spark.sql.execution.pandas.structHandlingMode", "row")
>>> psidx.to_series()
number  color
1       red       (1, red)
2       blue     (2, blue)
dtype: object

# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> psidx.to_series()
number  color
1       red       [1, red]
2       blue     [2, blue]
dtype: object

# Pandas
>>> pidx.to_series()
number  color
1       red       (1, red)
2       blue     (2, blue)
dtype: object
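
If downstream code depends on the exact pandas element type regardless of the Spark configuration, and the index is small enough to collect to the driver, one option is to convert to pandas first. A minimal sketch:

>>> psidx.to_pandas().to_series()  # materializes the index on the driver
number  color
1       red       (1, red)
2       blue     (2, blue)
dtype: object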

Invalid Mixed-Type Operations#

ANSI off: Spark implicitly coerces the operands, so these operations succeed.

ANSI on: Behaves like pandas; such operations are disallowed and raise errors.

Operation types that show behavior changes under ANSI mode:

  • Decimal–Float Arithmetic: /, //, *, %

  • Boolean vs. None: |, &, ^

Example: Decimal–Float Arithmetic

>>> import decimal
>>> pser = pd.Series([decimal.Decimal(1), decimal.Decimal(2)])
>>> psser = ps.from_pandas(pser)

# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> psser * 0.1
Traceback (most recent call last):
...
TypeError: Multiplication can not be applied to given types.

# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> psser * 0.1
0    0.1
1    0.2
dtype: float64

# Pandas
>>> pser * 0.1
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for *: 'decimal.Decimal' and 'float'
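
If you need this multiplication to succeed under ANSI mode, cast the Decimal series to float explicitly first, mirroring what you would do in pandas. A minimal sketch; the result matches the ANSI-off output above:

# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> psser.astype(float) * 0.1
0    0.1
1    0.2
dtype: float64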

Example: Boolean vs. None

# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> ps.Series([True, False]) | None
Traceback (most recent call last):
...
TypeError: OR can not be applied to given types.

# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> ps.Series([True, False]) | None
0    False
1    False
dtype: bool

# Pandas
>>> pd.Series([True, False]) | None
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for |: 'bool' and 'NoneType'
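
Similarly, if you need the OR to succeed under ANSI mode, replace None with an explicit boolean (or a boolean Series) before applying the operator. A minimal sketch; it assumes OR between a boolean Series and a Python bool is supported in your version:

# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> ps.Series([True, False]) | False
0     True
1    False
dtype: bool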