pyspark.pandas.DataFrame.interpolate
DataFrame.interpolate(method: str = 'linear', limit: Optional[int] = None, limit_direction: Optional[str] = None, limit_area: Optional[str] = None) → pyspark.pandas.frame.DataFrame
Fill NaN values using an interpolation method.
Note
The current implementation of interpolate uses Spark’s Window without a partition specification. This moves all data into a single partition on a single machine and can cause serious performance degradation. Avoid this method on very large datasets.
- Parameters
- method : str, default ‘linear’
Interpolation technique to use. One of:
‘linear’: Ignore the index and treat the values as equally spaced.
- limit : int, optional
Maximum number of consecutive NaNs to fill. Must be greater than 0.
- limit_direction : str, default None
Consecutive NaNs will be filled in this direction. One of {‘forward’, ‘backward’, ‘both’}.
- limit_area : str, default None
If limit is specified, consecutive NaNs will be filled with this restriction. One of:
None: No fill restriction.
‘inside’: Only fill NaNs surrounded by valid values (interpolate).
‘outside’: Only fill NaNs outside valid values (extrapolate).
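How limit, limit_direction and limit_area combine can be sketched in plain Python, without Spark. The helper below, fillable_positions, is hypothetical and not part of pyspark.pandas. It only reports which NaN positions would be eligible for filling. It assumes the semantics match those of pandas, that limit_direction=None behaves like ‘forward’, and that limit_area applies whether or not limit is given.

```python
import math
from typing import List, Optional


def fillable_positions(
    values: List[float],
    limit: Optional[int] = None,
    limit_direction: Optional[str] = None,
    limit_area: Optional[str] = None,
) -> List[int]:
    """Hypothetical helper: indices of NaNs that would be filled.

    Assumption: mirrors pandas semantics; None direction means 'forward'.
    """
    direction = limit_direction or "forward"
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not valid:
        return []  # nothing to interpolate from
    first, last = valid[0], valid[-1]

    def reachable(i: int, step: int) -> bool:
        # Walk from i toward the nearest valid value in direction `step`
        # and check that it lies within `limit` positions.
        j, dist = i + step, 1
        while 0 <= j < len(values) and math.isnan(values[j]):
            j += step
            dist += 1
        if not 0 <= j < len(values):
            return False  # no valid value on that side
        return limit is None or dist <= limit

    result = []
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        inside = first < i < last
        if limit_area == "inside" and not inside:
            continue
        if limit_area == "outside" and inside:
            continue
        # Forward filling looks back to the previous valid value,
        # backward filling looks ahead to the next one.
        forward_ok = direction in ("forward", "both") and reachable(i, -1)
        backward_ok = direction in ("backward", "both") and reachable(i, +1)
        if forward_ok or backward_ok:
            result.append(i)
    return result


nan = float("nan")
vals = [nan, nan, 5.0, nan, nan, nan, 13.0, nan, nan]
print(fillable_positions(vals, limit=1))                            # [3, 7]
print(fillable_positions(vals, limit=1, limit_direction="both"))    # [1, 3, 5, 7]
print(fillable_positions(vals, limit=1, limit_direction="both",
                         limit_area="inside"))                      # [3, 5]
```

In this sketch a leading NaN is never filled in the ‘forward’ direction, because there is no earlier valid value to fill from.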
- Returns
- Series or DataFrame or None
Returns the same object type as the caller, interpolated at some or all NA values.
See also
fillna
Fill missing values using different methods.
Examples
Filling in NA via linear interpolation.
>>> s = ps.Series([0, 1, np.nan, 3])
>>> s
0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use for interpolation. Note how the first entry in column ‘b’ remains NA, because there is no entry before it to use for interpolation.
>>> df = ps.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0
>>> df.interpolate(method='linear')
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0
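The per-column behaviour in the example above can be reproduced with a minimal pure-Python sketch. It is an illustration of the ‘linear’ method in the default forward direction, not the Spark implementation, and linear_fill is a hypothetical helper name. Values are treated as equally spaced, interior gaps are filled along a straight line, trailing NaNs repeat the last valid value, and leading NaNs are left untouched.

```python
import math
from typing import List


def linear_fill(values: List[float]) -> List[float]:
    """Hypothetical helper: forward linear interpolation on equally spaced values."""
    out = list(values)
    last = None  # position of the most recent valid value
    for i, v in enumerate(values):
        if math.isnan(v):
            continue
        if last is not None and i - last > 1:
            # Interior gap: spread the difference evenly over the gap.
            step = (v - values[last]) / (i - last)
            for j in range(last + 1, i):
                out[j] = values[last] + step * (j - last)
        last = i
    if last is not None:
        # Trailing NaNs have no later value, so the last valid value is repeated.
        for j in range(last + 1, len(values)):
            out[j] = values[last]
    return out


nan = float("nan")
print(linear_fill([0.0, nan, 2.0, nan]))    # column 'a': [0.0, 1.0, 2.0, 2.0]
print(linear_fill([-1.0, nan, nan, -4.0]))  # column 'c': [-1.0, -2.0, -3.0, -4.0]
print(linear_fill([nan, 2.0, 3.0, 4.0]))    # column 'b': [nan, 2.0, 3.0, 4.0]
```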