pyspark.RDD¶

class pyspark.RDD(jrdd: JavaObject, ctx: SparkContext, jrdd_deserializer: pyspark.serializers.Serializer = AutoBatchedSerializer(CloudPickleSerializer()))¶

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Methods

`aggregate`(zeroValue, seqOp, combOp)	Aggregate the elements of each partition, and then the results for all the partitions, using a given combine functions and a neutral “zero value.”
`aggregateByKey`(zeroValue, seqFunc, combFunc)	Aggregate the values of each key, using given combine functions and a neutral “zero value”.
`barrier`()	Marks the current stage as a barrier stage, where Spark must launch all tasks together.
`cache`()	Persist this RDD with the default storage level (MEMORY_ONLY).
`cartesian`(other)	Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements `(a, b)` where `a` is in self and `b` is in other.
`checkpoint`()	Mark this RDD for checkpointing.
`cleanShuffleDependencies`([blocking])	Removes an RDD’s shuffles and it’s non-persisted ancestors.
`coalesce`(numPartitions[, shuffle])	Return a new RDD that is reduced into numPartitions partitions.
`cogroup`(other[, numPartitions])	For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
`collect`()	Return a list that contains all of the elements in this RDD.
`collectAsMap`()	Return the key-value pairs in this RDD to the master as a dictionary.
`collectWithJobGroup`(groupId, description[, …])	When collect rdd, use this method to specify job group.
`combineByKey`(createCombiner, mergeValue, …)	Generic function to combine the elements for each key using a custom set of aggregation functions.
`count`()	Return the number of elements in this RDD.
`countApprox`(timeout[, confidence])	Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
`countApproxDistinct`([relativeSD])	Return approximate number of distinct elements in the RDD.
`countByKey`()	Count the number of elements for each key, and return the result to the master as a dictionary.
`countByValue`()	Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
`distinct`([numPartitions])	Return a new RDD containing the distinct elements in this RDD.
`filter`(f)	Return a new RDD containing only the elements that satisfy a predicate.
`first`()	Return the first element in this RDD.
`flatMap`(f[, preservesPartitioning])	Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
`flatMapValues`(f)	Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD’s partitioning.
`fold`(zeroValue, op)	Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral “zero value.”
`foldByKey`(zeroValue, func[, numPartitions, …])	Merge the values for each key using an associative function “func” and a neutral “zeroValue” which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication.).
`foreach`(f)	Applies a function to all elements of this RDD.
`foreachPartition`(f)	Applies a function to each partition of this RDD.
`fullOuterJoin`(other[, numPartitions])	Perform a right outer join of self and other.
`getCheckpointFile`()	Gets the name of the file to which this RDD was checkpointed
`getNumPartitions`()	Returns the number of partitions in RDD
`getResourceProfile`()	Get the `pyspark.resource.ResourceProfile` specified with this RDD or None if it wasn’t specified.
`getStorageLevel`()	Get the RDD’s current storage level.
`glom`()	Return an RDD created by coalescing all elements within each partition into a list.
`groupBy`(f[, numPartitions, partitionFunc])	Return an RDD of grouped items.
`groupByKey`([numPartitions, partitionFunc])	Group the values for each key in the RDD into a single sequence.
`groupWith`(other, *others)	Alias for cogroup but with support for multiple RDDs.
`histogram`(buckets)	Compute a histogram using the provided buckets.
`id`()	A unique ID for this RDD (within its SparkContext).
`intersection`(other)	Return the intersection of this RDD and another one.
`isCheckpointed`()	Return whether this RDD is checkpointed and materialized, either reliably or locally.
`isEmpty`()	Returns true if and only if the RDD contains no elements at all.
`isLocallyCheckpointed`()	Return whether this RDD is marked for local checkpointing.
`join`(other[, numPartitions])	Return an RDD containing all pairs of elements with matching keys in self and other.
`keyBy`(f)	Creates tuples of the elements in this RDD by applying f.
`keys`()	Return an RDD with the keys of each tuple.
`leftOuterJoin`(other[, numPartitions])	Perform a left outer join of self and other.
`localCheckpoint`()	Mark this RDD for local checkpointing using Spark’s existing caching layer.
`lookup`(key)	Return the list of values in the RDD for key key.
`map`(f[, preservesPartitioning])	Return a new RDD by applying a function to each element of this RDD.
`mapPartitions`(f[, preservesPartitioning])	Return a new RDD by applying a function to each partition of this RDD.
`mapPartitionsWithIndex`(f[, …])	Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
`mapPartitionsWithSplit`(f[, …])	Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
`mapValues`(f)	Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.
`max`([key])	Find the maximum item in this RDD.
`mean`()	Compute the mean of this RDD’s elements.
`meanApprox`(timeout[, confidence])	Approximate operation to return the mean within a timeout or meet the confidence.
`min`([key])	Find the minimum item in this RDD.
`name`()	Return the name of this RDD.
`partitionBy`(numPartitions[, partitionFunc])	Return a copy of the RDD partitioned using the specified partitioner.
`persist`([storageLevel])	Set this RDD’s storage level to persist its values across operations after the first time it is computed.
`pipe`(command[, env, checkCode])	Return an RDD created by piping elements to a forked external process.
`randomSplit`(weights[, seed])	Randomly splits this RDD with the provided weights.
`reduce`(f)	Reduces the elements of this RDD using the specified commutative and associative binary operator.
`reduceByKey`(func[, numPartitions, partitionFunc])	Merge the values for each key using an associative and commutative reduce function.
`reduceByKeyLocally`(func)	Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.
`repartition`(numPartitions)	Return a new RDD that has exactly numPartitions partitions.
`repartitionAndSortWithinPartitions`([…])	Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.
`rightOuterJoin`(other[, numPartitions])	Perform a right outer join of self and other.
`sample`(withReplacement, fraction[, seed])	Return a sampled subset of this RDD.
`sampleByKey`(withReplacement, fractions[, seed])	Return a subset of this RDD sampled by key (via stratified sampling).
`sampleStdev`()	Compute the sample standard deviation of this RDD’s elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N).
`sampleVariance`()	Compute the sample variance of this RDD’s elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N).
`saveAsHadoopDataset`(conf[, keyConverter, …])	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).
`saveAsHadoopFile`(path, outputFormatClass[, …])	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).
`saveAsNewAPIHadoopDataset`(conf[, …])	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).
`saveAsNewAPIHadoopFile`(path, outputFormatClass)	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).
`saveAsPickleFile`(path[, batchSize])	Save this RDD as a SequenceFile of serialized objects.
`saveAsSequenceFile`(path[, compressionCodecClass])	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the “org.apache.hadoop.io.Writable” types that we convert from the RDD’s key and value types.
`saveAsTextFile`(path[, compressionCodecClass])	Save this RDD as a text file, using string representations of elements.
`setName`(name)	Assign a name to this RDD.
`sortBy`(keyfunc[, ascending, numPartitions])	Sorts this RDD by the given keyfunc
`sortByKey`([ascending, numPartitions, keyfunc])	Sorts this RDD, which is assumed to consist of (key, value) pairs.
`stats`()	Return a `StatCounter` object that captures the mean, variance and count of the RDD’s elements in one operation.
`stdev`()	Compute the standard deviation of this RDD’s elements.
`subtract`(other[, numPartitions])	Return each value in self that is not contained in other.
`subtractByKey`(other[, numPartitions])	Return each (key, value) pair in self that has no pair with matching key in other.
`sum`()	Add up the elements in this RDD.
`sumApprox`(timeout[, confidence])	Approximate operation to return the sum within a timeout or meet the confidence.
`take`(num)	Take the first num elements of the RDD.
`takeOrdered`(num[, key])	Get the N elements from an RDD ordered in ascending order or as specified by the optional key function.
`takeSample`(withReplacement, num[, seed])	Return a fixed-size sampled subset of this RDD.
`toDF`([schema, sampleRatio])
`toDebugString`()	A description of this RDD and its recursive dependencies for debugging.
`toLocalIterator`([prefetchPartitions])	Return an iterator that contains all of the elements in this RDD.
`top`(num[, key])	Get the top N elements from an RDD.
`treeAggregate`(zeroValue, seqOp, combOp[, depth])	Aggregates the elements of this RDD in a multi-level tree pattern.
`treeReduce`(f[, depth])	Reduces the elements of this RDD in a multi-level tree pattern.
`union`(other)	Return the union of this RDD and another one.
`unpersist`([blocking])	Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
`values`()	Return an RDD with the values of each tuple.
`variance`()	Compute the variance of this RDD’s elements.
`withResources`(profile)	Specify a `pyspark.resource.ResourceProfile` to use when calculating this RDD.
`zip`(other)	Zips this RDD with another one, returning key-value pairs with the first element in each RDD second element in each RDD, etc.
`zipWithIndex`()	Zips this RDD with its element indices.
`zipWithUniqueId`()	Zips this RDD with generated unique Long ids.

Attributes

context

The SparkContext that this RDD was created on.

pyspark.SparkContext

pyspark.Broadcast