RandomRDDs¶

class pyspark.mllib.random.RandomRDDs¶

Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
Methods

exponentialRDD(sc, mean, size[, …])
    Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
exponentialVectorRDD(sc, mean, numRows, numCols)
    Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
gammaRDD(sc, shape, scale, size[, …])
    Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
gammaVectorRDD(sc, shape, scale, numRows, …)
    Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
logNormalRDD(sc, mean, std, size[, …])
    Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
logNormalVectorRDD(sc, mean, std, numRows, …)
    Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
normalRDD(sc, size[, numPartitions, seed])
    Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
normalVectorRDD(sc, numRows, numCols[, …])
    Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
poissonRDD(sc, mean, size[, numPartitions, seed])
    Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
poissonVectorRDD(sc, mean, numRows, numCols)
    Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
uniformRDD(sc, size[, numPartitions, seed])
    Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
uniformVectorRDD(sc, numRows, numCols[, …])
    Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
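The methods share one calling pattern: pass an active SparkContext together with the distribution parameters, and get back a lazily evaluated RDD that can be transformed like any other. A minimal sketch of that pattern (assuming a running SparkContext named sc, as in the examples below; the seed and shift-and-scale values are illustrative):

>>> from pyspark.mllib.random import RandomRDDs
>>> u = RandomRDDs.normalRDD(sc, 10000, numPartitions=2, seed=42)  # 10000 draws ~ N(0.0, 1.0)
>>> u.getNumPartitions()
2
>>> shifted = u.map(lambda v: 5.0 + 2.0 * v)  # shift and scale to N(5.0, 2.0^2)
>>> abs(shifted.mean() - 5.0) < 0.5
True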
Methods Documentation
static exponentialRDD(sc: pyspark.context.SparkContext, mean: float, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]¶

Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean, or 1 / lambda, for the Exponential distribution.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ Exp(mean).
Examples
>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
static exponentialVectorRDD(sc: pyspark.context.SparkContext, mean: float, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]¶

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean, or 1 / lambda, for the Exponential distribution.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).
Examples
>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
static gammaRDD(sc: pyspark.context.SparkContext, shape: float, scale: float, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]¶

Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
shape : float
    Shape (> 0) parameter for the Gamma distribution.
scale : float
    Scale (> 0) parameter for the Gamma distribution.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
static gammaVectorRDD(sc: pyspark.context.SparkContext, shape: float, scale: float, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]¶

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
shape : float
    Shape (> 0) of the Gamma distribution.
scale : float
    Scale (> 0) of the Gamma distribution.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
static logNormalRDD(sc: pyspark.context.SparkContext, mean: float, std: float, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]¶

Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean for the log Normal distribution.
std : float
    Standard deviation for the log Normal distribution.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ log N(mean, std).
Examples
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
static logNormalVectorRDD(sc: pyspark.context.SparkContext, mean: float, std: float, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]¶

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean of the log normal distribution.
std : float
    Standard deviation of the log normal distribution.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).
Examples
>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
static normalRDD(sc: pyspark.context.SparkContext, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]¶

Generates an RDD comprised of i.i.d. samples from the standard normal distribution.

To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use

RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v)
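For instance, a minimal sketch of that transformation (the target mean 3.0, sigma 2.0, and seed here are illustrative values, not part of the API):

>>> x = RandomRDDs.normalRDD(sc, 1000, seed=7).map(lambda v: 3.0 + 2.0 * v)  # ~ N(3.0, 2.0^2)
>>> stats = x.stats()
>>> abs(stats.mean() - 3.0) < 0.5
True
>>> abs(stats.stdev() - 2.0) < 0.5
True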
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - 0.0) < 0.1
True
>>> abs(stats.stdev() - 1.0) < 0.1
True
static normalVectorRDD(sc: pyspark.context.SparkContext, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]¶

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - 0.0) < 0.1
True
>>> abs(mat.std() - 1.0) < 0.1
True
static poissonRDD(sc: pyspark.context.SparkContext, mean: float, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]¶

Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean, or lambda, for the Poisson distribution.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ Pois(mean).
Examples
>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
static poissonVectorRDD(sc: pyspark.context.SparkContext, mean: float, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]¶

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
mean : float
    Mean, or lambda, for the Poisson distribution.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).
Examples
>>> import numpy as np
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
static uniformRDD(sc: pyspark.context.SparkContext, size: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[float]¶

Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).

To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use

RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v)
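As a concrete sketch, mapping to U(-1.0, 1.0) (the bounds a = -1.0, b = 1.0 and the seed are illustrative, not part of the API):

>>> y = RandomRDDs.uniformRDD(sc, 1000, seed=3).map(lambda v: -1.0 + 2.0 * v)
>>> y.min() >= -1.0 and y.max() <= 1.0
True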
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
size : int
    Size of the RDD.
numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
seed : int, optional
    Random seed (default: a random long integer).
Returns

pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).
Examples
>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True
static uniformVectorRDD(sc: pyspark.context.SparkContext, numRows: int, numCols: int, numPartitions: Optional[int] = None, seed: Optional[int] = None) → pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]¶

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
Parameters

sc : pyspark.SparkContext
    SparkContext used to create the RDD.
numRows : int
    Number of Vectors in the RDD.
numCols : int
    Number of elements in each Vector.
numPartitions : int, optional
    Number of partitions in the RDD.
seed : int, optional
    Seed for the RNG that generates the seed for the generator in each partition.
Returns

pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> mat.max() <= 1.0 and mat.min() >= 0.0
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4