pyspark.pandas.DataFrame.interpolate

DataFrame.interpolate(method: str = 'linear', limit: Optional[int] = None, limit_direction: Optional[str] = None, limit_area: Optional[str] = None) → pyspark.pandas.frame.DataFrame

Fill NaN values using an interpolation method.

Note

The current implementation of interpolate uses Spark’s Window without a partition specification. This moves all data into a single partition on a single machine and can cause serious performance degradation. Avoid this method on very large datasets.

Parameters
method : str, default ‘linear’

Interpolation technique to use. One of:

  • ‘linear’: Ignore the index and treat the values as equally spaced.

limit : int, optional

Maximum number of consecutive NaNs to fill. Must be greater than 0.

limit_direction : str, default None

Consecutive NaNs will be filled in this direction. One of {‘forward’, ‘backward’, ‘both’}.

limit_area : str, default None

If limit is specified, consecutive NaNs will be filled with this restriction. One of:

  • None: No fill restriction.

  • ‘inside’: Only fill NaNs surrounded by valid values (interpolate).

  • ‘outside’: Only fill NaNs outside valid values (extrapolate).
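
How limit, limit_direction and limit_area combine can be easier to see on a small input. The following is a minimal pure-Python sketch of which NaN positions would be eligible for filling. It is not the pyspark implementation: the helper name `fillable` is hypothetical, the rules are modeled on the parameter descriptions above and on pandas' behavior, and it assumes that limit_direction=None behaves like ‘forward’.

```python
import math


def fillable(values, limit=None, limit_direction="forward", limit_area=None):
    """Return booleans marking which NaN positions would be filled.

    Hypothetical helper for illustration only; assumes that
    limit_direction=None behaves like 'forward'.
    """
    n = len(values)
    is_nan = [math.isnan(v) for v in values]
    valid = [i for i in range(n) if not is_nan[i]]
    if not valid:
        return [False] * n
    first, last = valid[0], valid[-1]

    def distances(indices):
        # Distance of each NaN from the nearest valid value met earlier
        # in the scan order (None if no valid value has been seen yet).
        dist = [None] * n
        run = None
        for i in indices:
            if not is_nan[i]:
                run = 0
            elif run is not None:
                run += 1
                dist[i] = run
        return dist

    dist_fwd = distances(range(n))            # scan top to bottom
    dist_bwd = distances(reversed(range(n)))  # scan bottom to top

    def within(d):
        return d is not None and (limit is None or d <= limit)

    mask = []
    for i in range(n):
        ok = False
        if is_nan[i]:
            if limit_direction in ("forward", "both"):
                ok = ok or within(dist_fwd[i])
            if limit_direction in ("backward", "both"):
                ok = ok or within(dist_bwd[i])
            inside = first < i < last  # surrounded by valid values
            if limit_area == "inside" and not inside:
                ok = False
            if limit_area == "outside" and inside:
                ok = False
        mask.append(ok)
    return mask


nan = float("nan")
values = [nan, 1.0, nan, nan, nan, 5.0, nan]

print(fillable(values, limit=1))
# [False, False, True, False, False, False, True]
print(fillable(values, limit=1, limit_direction="backward"))
# [True, False, False, False, True, False, False]
print(fillable(values, limit=1, limit_area="inside"))
# [False, False, True, False, False, False, False]
```

With ‘forward’, only the first NaN after each valid value is eligible when limit=1, and the leading NaN is never eligible because nothing precedes it. Adding limit_area=‘inside’ also excludes the trailing NaN.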

Returns
Series or DataFrame or None

Returns the same object type as the caller, interpolated at some or all NA values.

See also

fillna

Fill missing values using different methods.

Examples

Filling in NA via linear interpolation.

>>> s = ps.Series([0, 1, np.nan, 3])
>>> s
0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64
>>> s.interpolate()  
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
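
The filled value 2.0 above follows from treating positions as equally spaced: the gap between 1.0 (position 1) and 3.0 (position 3) is divided into equal steps. A minimal pure-Python sketch of that arithmetic is shown below. The helper name `linear_fill` is hypothetical, it is not how pyspark computes the result, and it only covers NaNs that lie between two valid values (leading and trailing NaNs are left untouched).

```python
import math


def linear_fill(values):
    """Linearly interpolate interior NaN runs, ignoring the index.

    Hypothetical helper for illustration; positions are treated as
    equally spaced, as described for method='linear'.
    """
    out = list(values)
    prev = None  # position of the most recent valid (non-NaN) value
    for i, v in enumerate(values):
        if math.isnan(v):
            continue
        if prev is not None and i - prev > 1:
            # Equal step between the two valid values that bracket the gap.
            step = (v - values[prev]) / (i - prev)
            for k in range(prev + 1, i):
                out[k] = values[prev] + step * (k - prev)
        prev = i
    return out


print(linear_fill([0.0, 1.0, float("nan"), 3.0]))
# [0.0, 1.0, 2.0, 3.0]
print(linear_fill([-1.0, float("nan"), float("nan"), -4.0]))
# [-1.0, -2.0, -3.0, -4.0]
```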

Fill the DataFrame forward (that is, going down) along each column using linear interpolation.

Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use for interpolation. Note how the first entry in column ‘b’ remains NA, because there is no entry before it to use for interpolation.

>>> df = ps.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0
>>> df.interpolate(method='linear')  
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0