DataFrame.agg (*exprs)
|
Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg() ). |
DataFrame.alias (alias)
|
Returns a new DataFrame with an alias set. |
DataFrame.approxQuantile (col, probabilities, …)
|
Calculates the approximate quantiles of numerical columns of a DataFrame . |
DataFrame.cache ()
|
Persists the DataFrame with the default storage level (MEMORY_AND_DISK). |
DataFrame.checkpoint ([eager])
|
Returns a checkpointed version of this DataFrame . |
DataFrame.coalesce (numPartitions)
|
Returns a new DataFrame that has exactly numPartitions partitions. |
DataFrame.colRegex (colName)
|
Selects column based on the column name specified as a regex and returns it as Column . |
DataFrame.collect ()
|
Returns all the records as a list of Row . |
DataFrame.columns
|
Returns all column names as a list. |
DataFrame.corr (col1, col2[, method])
|
Calculates the correlation of two columns of a DataFrame as a double value. |
DataFrame.count ()
|
Returns the number of rows in this DataFrame . |
DataFrame.cov (col1, col2)
|
Calculate the sample covariance for the given columns, specified by their names, as a double value. |
DataFrame.createGlobalTempView (name)
|
Creates a global temporary view with this DataFrame . |
DataFrame.createOrReplaceGlobalTempView (name)
|
Creates or replaces a global temporary view using the given name. |
DataFrame.createOrReplaceTempView (name)
|
Creates or replaces a local temporary view with this DataFrame . |
DataFrame.createTempView (name)
|
Creates a local temporary view with this DataFrame . |
DataFrame.crossJoin (other)
|
Returns the cartesian product with another DataFrame . |
DataFrame.crosstab (col1, col2)
|
Computes a pair-wise frequency table of the given columns. |
DataFrame.cube (*cols)
|
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them. |
DataFrame.describe (*cols)
|
Computes basic statistics for numeric and string columns. |
DataFrame.distinct ()
|
Returns a new DataFrame containing the distinct rows in this DataFrame . |
DataFrame.drop (*cols)
|
Returns a new DataFrame that drops the specified column. |
DataFrame.dropDuplicates ([subset])
|
Return a new DataFrame with duplicate rows removed, optionally only considering certain columns. |
DataFrame.drop_duplicates ([subset])
|
drop_duplicates() is an alias for dropDuplicates() . |
DataFrame.dropna ([how, thresh, subset])
|
Returns a new DataFrame omitting rows with null values. |
DataFrame.dtypes
|
Returns all column names and their data types as a list. |
DataFrame.exceptAll (other)
|
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
DataFrame.explain ([extended, mode])
|
Prints the (logical and physical) plans to the console for debugging purposes. |
DataFrame.fillna (value[, subset])
|
Replace null values, alias for na.fill() . |
DataFrame.filter (condition)
|
Filters rows using the given condition. |
DataFrame.first ()
|
Returns the first row as a Row . |
DataFrame.foreach (f)
|
Applies the f function to every Row of this DataFrame . |
DataFrame.foreachPartition (f)
|
Applies the f function to each partition of this DataFrame . |
DataFrame.freqItems (cols[, support])
|
Finds frequent items for columns, possibly with false positives. |
DataFrame.groupBy (*cols)
|
Groups the DataFrame using the specified columns, so we can run aggregation on them. |
DataFrame.head ([n])
|
Returns the first n rows. |
DataFrame.hint (name, *parameters)
|
Specifies some hint on the current DataFrame . |
DataFrame.inputFiles ()
|
Returns a best-effort snapshot of the files that compose this DataFrame . |
DataFrame.intersect (other)
|
Return a new DataFrame containing rows only in both this DataFrame and another DataFrame . |
DataFrame.intersectAll (other)
|
Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
DataFrame.isEmpty ()
|
Returns True if this DataFrame is empty. |
DataFrame.isLocal ()
|
Returns True if the collect() and take() methods can be run locally (without any Spark executors). |
DataFrame.isStreaming
|
Returns True if this DataFrame contains one or more sources that continuously return data as it arrives. |
DataFrame.join (other[, on, how])
|
Joins with another DataFrame , using the given join expression. |
DataFrame.limit (num)
|
Limits the result count to the number specified. |
DataFrame.localCheckpoint ([eager])
|
Returns a locally checkpointed version of this DataFrame . |
DataFrame.mapInPandas (func, schema)
|
Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame . |
DataFrame.mapInArrow (func, schema)
|
Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow RecordBatch, and returns the result as a DataFrame . |
DataFrame.na
|
Returns a DataFrameNaFunctions for handling missing values. |
DataFrame.observe (observation, *exprs)
|
Define (named) metrics to observe on the DataFrame. |
DataFrame.orderBy (*cols, **kwargs)
|
Returns a new DataFrame sorted by the specified column(s). |
DataFrame.persist ([storageLevel])
|
Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed. |
DataFrame.printSchema ()
|
Prints out the schema in the tree format. |
DataFrame.randomSplit (weights[, seed])
|
Randomly splits this DataFrame with the provided weights. |
DataFrame.rdd
|
Returns the content as a pyspark.RDD of Row . |
DataFrame.registerTempTable (name)
|
Registers this DataFrame as a temporary table using the given name (deprecated; use createOrReplaceTempView() instead). |
DataFrame.repartition (numPartitions, *cols)
|
Returns a new DataFrame partitioned by the given partitioning expressions. |
DataFrame.repartitionByRange (numPartitions, …)
|
Returns a new DataFrame range-partitioned by the given partitioning expressions. |
DataFrame.replace (to_replace[, value, subset])
|
Returns a new DataFrame replacing a value with another value. |
DataFrame.rollup (*cols)
|
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. |
DataFrame.sameSemantics (other)
|
Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results. |
DataFrame.sample ([withReplacement, …])
|
Returns a sampled subset of this DataFrame . |
DataFrame.sampleBy (col, fractions[, seed])
|
Returns a stratified sample without replacement based on the fraction given on each stratum. |
DataFrame.schema
|
Returns the schema of this DataFrame as a pyspark.sql.types.StructType . |
DataFrame.select (*cols)
|
Projects a set of expressions and returns a new DataFrame . |
DataFrame.selectExpr (*expr)
|
Projects a set of SQL expressions and returns a new DataFrame . |
DataFrame.semanticHash ()
|
Returns a hash code of the logical query plan against this DataFrame . |
DataFrame.show ([n, truncate, vertical])
|
Prints the first n rows to the console. |
DataFrame.sort (*cols, **kwargs)
|
Returns a new DataFrame sorted by the specified column(s). |
DataFrame.sortWithinPartitions (*cols, **kwargs)
|
Returns a new DataFrame with each partition sorted by the specified column(s). |
DataFrame.sparkSession
|
Returns Spark session that created this DataFrame . |
DataFrame.stat
|
Returns a DataFrameStatFunctions for statistic functions. |
DataFrame.storageLevel
|
Gets the DataFrame's current storage level. |
DataFrame.subtract (other)
|
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame . |
DataFrame.summary (*statistics)
|
Computes specified statistics for numeric and string columns. |
DataFrame.tail (num)
|
Returns the last num rows as a list of Row . |
DataFrame.take (num)
|
Returns the first num rows as a list of Row . |
DataFrame.toDF (*cols)
|
Returns a new DataFrame with the specified new column names. |
DataFrame.toJSON ([use_unicode])
|
Converts a DataFrame into a RDD of string. |
DataFrame.toLocalIterator ([prefetchPartitions])
|
Returns an iterator that contains all of the rows in this DataFrame . |
DataFrame.toPandas ()
|
Returns the contents of this DataFrame as a pandas.DataFrame . |
DataFrame.to_pandas_on_spark ([index_col])
|
Converts the existing DataFrame into a pandas-on-Spark DataFrame; see DataFrame.pandas_api . |
DataFrame.transform (func, *args, **kwargs)
|
Returns a new DataFrame produced by applying func to this one; a concise syntax for chaining custom transformations. |
DataFrame.union (other)
|
Return a new DataFrame containing the union of rows in this and another DataFrame , matching columns by position. |
DataFrame.unionAll (other)
|
Return a new DataFrame containing the union of rows in this and another DataFrame ; equivalent to union() . |
DataFrame.unionByName (other[, …])
|
Returns a new DataFrame containing the union of rows in this and another DataFrame , matching columns by name rather than by position. |
DataFrame.unpersist ([blocking])
|
Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
DataFrame.where (condition)
|
where() is an alias for filter() . |
DataFrame.withColumn (colName, col)
|
Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
DataFrame.withColumns (*colsMap)
|
Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names. |
DataFrame.withColumnRenamed (existing, new)
|
Returns a new DataFrame by renaming an existing column. |
DataFrame.withMetadata (columnName, metadata)
|
Returns a new DataFrame by updating an existing column with metadata. |
DataFrame.withWatermark (eventTime, …)
|
Defines an event time watermark for this DataFrame . |
DataFrame.write
|
Interface for saving the content of the non-streaming DataFrame out into external storage. |
DataFrame.writeStream
|
Interface for saving the content of the streaming DataFrame out into external storage. |
DataFrame.writeTo (table)
|
Create a write configuration builder for v2 sources. |
DataFrame.pandas_api ([index_col])
|
Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
DataFrameNaFunctions.drop ([how, thresh, subset])
|
Returns a new DataFrame omitting rows with null values. |
DataFrameNaFunctions.fill (value[, subset])
|
Replaces null values; DataFrame.fillna() is an alias for this method. |
DataFrameNaFunctions.replace (to_replace[, …])
|
Returns a new DataFrame replacing a value with another value. |
DataFrameStatFunctions.approxQuantile (col, …)
|
Calculates the approximate quantiles of numerical columns of a DataFrame . |
DataFrameStatFunctions.corr (col1, col2[, method])
|
Calculates the correlation of two columns of a DataFrame as a double value. |
DataFrameStatFunctions.cov (col1, col2)
|
Calculate the sample covariance for the given columns, specified by their names, as a double value. |
DataFrameStatFunctions.crosstab (col1, col2)
|
Computes a pair-wise frequency table of the given columns. |
DataFrameStatFunctions.freqItems (cols[, support])
|
Finds frequent items for columns, possibly with false positives. |
DataFrameStatFunctions.sampleBy (col, fractions)
|
Returns a stratified sample without replacement based on the fraction given on each stratum. |