RandomRDDs

class pyspark.mllib.random.RandomRDDs

Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
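
For instance, a minimal sketch (assuming a live SparkContext bound to sc, as in the doctests below):

>>> from pyspark.mllib.random import RandomRDDs
>>> x = RandomRDDs.normalRDD(sc, 100, seed=42)  # 100 i.i.d. draws from N(0.0, 1.0)
>>> x.count()
100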

Methods

exponentialRDD(sc, mean, size[, …])

Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.

exponentialVectorRDD(sc, mean, numRows, numCols)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.

gammaRDD(sc, shape, scale, size[, …])

Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.

gammaVectorRDD(sc, shape, scale, numRows, …)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.

logNormalRDD(sc, mean, std, size[, …])

Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.

logNormalVectorRDD(sc, mean, std, numRows, …)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.

normalRDD(sc, size[, numPartitions, seed])

Generates an RDD comprised of i.i.d. samples from the standard normal distribution.

normalVectorRDD(sc, numRows, numCols[, …])

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.

poissonRDD(sc, mean, size[, numPartitions, seed])

Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.

poissonVectorRDD(sc, mean, numRows, numCols)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.

uniformRDD(sc, size[, numPartitions, seed])

Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).

uniformVectorRDD(sc, numRows, numCols[, …])

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).

Methods Documentation

static exponentialRDD(sc: pyspark.context.SparkContext, mean: float, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]

Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean, or 1 / lambda, for the Exponential distribution.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ Exp(mean).

Examples

>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
static exponentialVectorRDD(sc: pyspark.context.SparkContext, mean: float, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean, or 1 / lambda, for the Exponential distribution.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).

Examples

>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
static gammaRDD(sc: pyspark.context.SparkContext, shape: float, scale: float, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]

Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
shape : float
    Shape (> 0) parameter for the Gamma distribution.
scale : float
    Scale (> 0) parameter for the Gamma distribution.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).

Examples

>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
static gammaVectorRDD(sc: pyspark.context.SparkContext, shape: float, scale: float, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
shape : float
    Shape (> 0) of the Gamma distribution.
scale : float
    Scale (> 0) of the Gamma distribution.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).

Examples

>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
static logNormalRDD(sc: pyspark.context.SparkContext, mean: float, std: float, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]

Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean for the log normal distribution.
std : float
    Standard deviation for the log normal distribution.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ log N(mean, std).

Examples

>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
static logNormalVectorRDD(sc: pyspark.context.SparkContext, mean: float, std: float, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean of the log normal distribution.
std : float
    Standard deviation of the log normal distribution.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).

Examples

>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
static normalRDD(sc: pyspark.context.SparkContext, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]

Generates an RDD comprised of i.i.d. samples from the standard normal distribution.

To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v).
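
For example, a small sketch (assuming the same doctest SparkContext sc and loose tolerances, in the style of the examples below) that shifts and scales standard normal draws into N(1.0, 4.0):

>>> mean, sigma = 1.0, 2.0
>>> shifted = RandomRDDs.normalRDD(sc, 1000, seed=1).map(lambda v: mean + sigma * v)
>>> stats = shifted.stats()
>>> abs(stats.mean() - mean) < 0.5
True
>>> abs(stats.stdev() - sigma) < 0.5
True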

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).

Examples

>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - 0.0) < 0.1
True
>>> abs(stats.stdev() - 1.0) < 0.1
True
static normalVectorRDD(sc: pyspark.context.SparkContext, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).

Examples

>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - 0.0) < 0.1
True
>>> abs(mat.std() - 1.0) < 0.1
True
static poissonRDD(sc: pyspark.context.SparkContext, mean: float, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]

Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean, or lambda, for the Poisson distribution.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ Pois(mean).

Examples

>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
static poissonVectorRDD(sc: pyspark.context.SparkContext, mean: float, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean, or lambda, for the Poisson distribution.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).

Examples

>>> import numpy as np
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
static uniformRDD(sc: pyspark.context.SparkContext, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]

Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).

To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v).
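
For example, a small sketch (assuming the same doctest SparkContext sc) that rescales U(0.0, 1.0) draws into U(-1.0, 1.0):

>>> a, b = -1.0, 1.0
>>> u = RandomRDDs.uniformRDD(sc, 1000, seed=1).map(lambda v: a + (b - a) * v)
>>> vals = u.collect()
>>> min(vals) >= a and max(vals) <= b
True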

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).

Returns
pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).

Examples

>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True
static uniformVectorRDD(sc: pyspark.context.SparkContext, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).

Parameters
sc : pyspark.SparkContext
    SparkContext used to create the RDD.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Seed for the RNG that generates the seed for the generator in each partition (see the reproducibility sketch after the examples below).

Returns
pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).

Examples

>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> mat.max() <= 1.0 and mat.min() >= 0.0
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4