pyspark.sql.DataFrame¶
-
class
pyspark.sql.
DataFrame
(jdf: py4j.java_gateway.JavaObject, sql_ctx: Union[SQLContext, SparkSession])¶ A distributed collection of data grouped into named columns.
A
DataFrame
is equivalent to a relational table in Spark SQL, and can be created using various functions inSparkSession
:people = spark.read.parquet("...")
Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in:
DataFrame
,Column
.To select a column from the
DataFrame
, use the apply method:ageCol = people.age
A more concrete example:
# To create DataFrame using SparkSession people = spark.read.parquet("...") department = spark.read.parquet("...") people.filter(people.age > 30).join(department, people.deptId == department.id) \ .groupBy(department.name, "gender").agg({"salary": "avg", "age": "max"})
Methods
agg
(*exprs)Aggregate on the entire
DataFrame
without groups (shorthand fordf.groupBy().agg()
).alias
(alias)Returns a new
DataFrame
with an alias set.approxQuantile
(col, probabilities, relativeError)Calculates the approximate quantiles of numerical columns of a
DataFrame
.cache
()Persists the
DataFrame
with the default storage level (MEMORY_AND_DISK).checkpoint
([eager])Returns a checkpointed version of this
DataFrame
.coalesce
(numPartitions)Returns a new
DataFrame
that has exactly numPartitions partitions.colRegex
(colName)Selects column based on the column name specified as a regex and returns it as
Column
.collect
()Returns all the records as a list of
Row
.corr
(col1, col2[, method])Calculates the correlation of two columns of a
DataFrame
as a double value.count
()Returns the number of rows in this
DataFrame
.cov
(col1, col2)Calculate the sample covariance for the given columns, specified by their names, as a double value.
createGlobalTempView
(name)Creates a global temporary view with this
DataFrame
.Creates or replaces a global temporary view using the given name.
createOrReplaceTempView
(name)Creates or replaces a local temporary view with this
DataFrame
.createTempView
(name)Creates a local temporary view with this
DataFrame
.crossJoin
(other)Returns the cartesian product with another
DataFrame
.crosstab
(col1, col2)Computes a pair-wise frequency table of the given columns.
cube
(*cols)Create a multi-dimensional cube for the current
DataFrame
using the specified columns, so we can run aggregations on them.describe
(*cols)Computes basic statistics for numeric and string columns.
distinct
()Returns a new
DataFrame
containing the distinct rows in thisDataFrame
.drop
(*cols)Returns a new
DataFrame
that drops the specified column.dropDuplicates
([subset])Return a new
DataFrame
with duplicate rows removed, optionally only considering certain columns.drop_duplicates
([subset])drop_duplicates()
is an alias fordropDuplicates()
.dropna
([how, thresh, subset])Returns a new
DataFrame
omitting rows with null values.exceptAll
(other)Return a new
DataFrame
containing rows in thisDataFrame
but not in anotherDataFrame
while preserving duplicates.explain
([extended, mode])Prints the (logical and physical) plans to the console for debugging purpose.
fillna
(value[, subset])Replace null values, alias for
na.fill()
.filter
(condition)Filters rows using the given condition.
first
()Returns the first row as a
Row
.foreach
(f)Applies the
f
function to each partition of thisDataFrame
.freqItems
(cols[, support])Finding frequent items for columns, possibly with false positives.
groupBy
(*cols)Groups the
DataFrame
using the specified columns, so we can run aggregation on them.groupby
(*cols)groupby()
is an alias forgroupBy()
.head
([n])Returns the first
n
rows.hint
(name, *parameters)Specifies some hint on the current
DataFrame
.Returns a best-effort snapshot of the files that compose this
DataFrame
.intersect
(other)Return a new
DataFrame
containing rows only in both thisDataFrame
and anotherDataFrame
.intersectAll
(other)Return a new
DataFrame
containing rows in both thisDataFrame
and anotherDataFrame
while preserving duplicates.isEmpty
()Returns
True
if thisDataFrame
is empty.isLocal
()Returns
True
if thecollect()
andtake()
methods can be run locally (without any Spark executors).join
(other[, on, how])Joins with another
DataFrame
, using the given join expression.limit
(num)Limits the result count to the number specified.
localCheckpoint
([eager])Returns a locally checkpointed version of this
DataFrame
.mapInArrow
(func, schema)Maps an iterator of batches in the current
DataFrame
using a Python native function that takes and outputs a PyArrow’s RecordBatch, and returns the result as aDataFrame
.mapInPandas
(func, schema)Maps an iterator of batches in the current
DataFrame
using a Python native function that takes and outputs a pandas DataFrame, and returns the result as aDataFrame
.observe
(observation, *exprs)Define (named) metrics to observe on the DataFrame.
orderBy
(*cols, **kwargs)Returns a new
DataFrame
sorted by the specified column(s).pandas_api
([index_col])Converts the existing DataFrame into a pandas-on-Spark DataFrame.
persist
([storageLevel])Sets the storage level to persist the contents of the
DataFrame
across operations after the first time it is computed.Prints out the schema in the tree format.
randomSplit
(weights[, seed])Randomly splits this
DataFrame
with the provided weights.registerTempTable
(name)Registers this
DataFrame
as a temporary table using the given name.repartition
(numPartitions, *cols)Returns a new
DataFrame
partitioned by the given partitioning expressions.repartitionByRange
(numPartitions, *cols)Returns a new
DataFrame
partitioned by the given partitioning expressions.replace
(to_replace[, value, subset])Returns a new
DataFrame
replacing a value with another value.rollup
(*cols)Create a multi-dimensional rollup for the current
DataFrame
using the specified columns, so we can run aggregation on them.sameSemantics
(other)Returns True when the logical query plans inside both
DataFrame
s are equal and therefore return same results.sample
([withReplacement, fraction, seed])Returns a sampled subset of this
DataFrame
.sampleBy
(col, fractions[, seed])Returns a stratified sample without replacement based on the fraction given on each stratum.
select
(*cols)Projects a set of expressions and returns a new
DataFrame
.selectExpr
(*expr)Projects a set of SQL expressions and returns a new
DataFrame
.Returns a hash code of the logical query plan against this
DataFrame
.show
([n, truncate, vertical])Prints the first
n
rows to the console.sort
(*cols, **kwargs)Returns a new
DataFrame
sorted by the specified column(s).sortWithinPartitions
(*cols, **kwargs)Returns a new
DataFrame
with each partition sorted by the specified column(s).subtract
(other)Return a new
DataFrame
containing rows in thisDataFrame
but not in anotherDataFrame
.summary
(*statistics)Computes specified statistics for numeric and string columns.
tail
(num)Returns the last
num
rows as alist
ofRow
.take
(num)Returns the first
num
rows as alist
ofRow
.toDF
(*cols)Returns a new
DataFrame
that with new specified column namestoJSON
([use_unicode])Converts a
DataFrame
into aRDD
of string.toLocalIterator
([prefetchPartitions])Returns an iterator that contains all of the rows in this
DataFrame
.toPandas
()Returns the contents of this
DataFrame
as Pandaspandas.DataFrame
.to_koalas
([index_col])to_pandas_on_spark
([index_col])transform
(func, *args, **kwargs)Returns a new
DataFrame
.union
(other)Return a new
DataFrame
containing union of rows in this and anotherDataFrame
.unionAll
(other)Return a new
DataFrame
containing union of rows in this and anotherDataFrame
.unionByName
(other[, allowMissingColumns])Returns a new
DataFrame
containing union of rows in this and anotherDataFrame
.unpersist
([blocking])Marks the
DataFrame
as non-persistent, and remove all blocks for it from memory and disk.where
(condition)withColumn
(colName, col)Returns a new
DataFrame
by adding a column or replacing the existing column that has the same name.withColumnRenamed
(existing, new)Returns a new
DataFrame
by renaming an existing column.withColumns
(*colsMap)Returns a new
DataFrame
by adding multiple columns or replacing the existing columns that has the same names.withMetadata
(columnName, metadata)Returns a new
DataFrame
by updating an existing column with metadata.withWatermark
(eventTime, delayThreshold)Defines an event time watermark for this
DataFrame
.writeTo
(table)Create a write configuration builder for v2 sources.
Attributes
Returns all column names as a list.
Returns all column names and their data types as a list.
Returns
True
if thisDataFrame
contains one or more sources that continuously return data as it arrives.Returns a
DataFrameNaFunctions
for handling missing values.Returns the content as an
pyspark.RDD
ofRow
.Returns the schema of this
DataFrame
as apyspark.sql.types.StructType
.Returns Spark session that created this
DataFrame
.sql_ctx
Returns a
DataFrameStatFunctions
for statistic functions.Get the
DataFrame
’s current storage level.Interface for saving the content of the non-streaming
DataFrame
out into external storage.Interface for saving the content of the streaming
DataFrame
out into external storage.