# Packages

• package

Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.

In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.

Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.

Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.

Classes and methods marked with Developer API are intended for advanced users who want to extend Spark through lower-level interfaces. These are subject to change or removal in minor releases.

Definition Classes
apache
• package

RDD-based machine learning APIs (in maintenance mode).

The spark.mllib package is in maintenance mode as of the Spark 2.0.0 release to encourage migration to the DataFrame-based APIs under the org.apache.spark.ml package. While in maintenance mode,

• no new features in the RDD-based spark.mllib package will be accepted, unless they block implementing new features in the DataFrame-based spark.ml package;
• bug fixes in the RDD-based APIs will still be accepted.

The developers will continue adding more features to the DataFrame-based APIs in the 2.x series to reach feature parity with the RDD-based APIs. And once we reach feature parity, this package will be deprecated.

Definition Classes
spark

See also SPARK-4591, which tracks the progress of feature parity.

• package
Definition Classes
mllib
• L1Updater
• LBFGS
• Optimizer
• SimpleUpdater
• SquaredL2Updater
• Updater

Compute gradient and loss for a multinomial logistic loss function, as used in multi-class classification (it is also used in binary logistic regression).

In The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which can be downloaded from http://statweb.stanford.edu/~tibs/ElemStatLearn/ , Eq. (4.17) on page 119 gives the formula for the multinomial logistic regression model. A simple calculation shows that

$$P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))\\ P(y=1|x, w) = \exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))\\ ...\\ P(y=K-1|x, w) = \exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))$$

for a K-class multiclass classification problem.
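The probabilities above can be evaluated directly from the K-1 margins. The following is a minimal sketch in plain Scala, not Spark's implementation; the object name `ClassProbabilities` and the zero-based `margins` array (entry `i` holds the margin of class `i + 1`) are assumptions made for illustration.

```scala
object ClassProbabilities {
  /**
   * Hypothetical helper: margins(i) is the margin x . w of class i + 1.
   * Returns K probabilities, with the pivot class 0 first.
   */
  def probabilities(margins: Array[Double]): Array[Double] = {
    // Shared denominator 1 + sum_i exp(margin_i)
    val denom = 1.0 + margins.map(math.exp).sum
    // Class 0 gets 1 / denom; class i + 1 gets exp(margin_i) / denom
    (1.0 / denom) +: margins.map(m => math.exp(m) / denom)
  }
}
```

With all margins equal to zero, every class receives the same probability 1 / K. This naive form is only safe for moderate margins.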

The model weights $w = (w_1, w_2, ..., w_{K-1})^T$ become a matrix of dimension (K-1) * (N+1) if the intercepts are added, where N is the number of features. If the intercepts are not added, the dimension is (K-1) * N.
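As a quick check of the dimensions, the number of weights can be computed as follows. This is an illustrative sketch; `WeightDimensions` and `numWeights` are hypothetical names, not part of the Spark API.

```scala
object WeightDimensions {
  /** Hypothetical helper: total number of entries in the (K-1) x N or (K-1) x (N+1) weight matrix. */
  def numWeights(numClasses: Int, numFeatures: Int, addIntercept: Boolean): Int = {
    require(numClasses >= 2, "at least two classes are needed")
    // One extra column per class row when intercepts are included
    val columns = if (addIntercept) numFeatures + 1 else numFeatures
    (numClasses - 1) * columns
  }
}
```

For example, 3 classes and 4 features give 2 * 5 = 10 weights with intercepts and 2 * 4 = 8 without.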

As a result, the loss of the objective function for a single instance of data can be written as

\begin{align} l(w, x) &= -\log P(y|x, w) = -\alpha(y) \log P(y=0|x, w) - (1-\alpha(y)) \log P(y|x, w) \\ &= \log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1} \\ &= \log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \end{align}

where $\alpha(i) = 1$ if $i == 0$, and $\alpha(i) = 0$ if $i \ne 0$, and $margins_i = x w_i$.
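The single-instance loss is simply $-\log P(y|x, w)$: the log-normalizer, minus the margin of the true class when the label is not the pivot class 0. A minimal sketch, assuming a zero-based `margins` array and an integer `label` in 0 until K; `NaiveLoss` is a hypothetical name and the code is not taken from Spark.

```scala
object NaiveLoss {
  /** Hypothetical helper: negative log-likelihood of one instance, in the direct (unstabilized) form. */
  def loss(margins: Array[Double], label: Int): Double = {
    // log(1 + sum_i exp(margin_i)), shared by every class
    val logNormalizer = math.log(1.0 + margins.map(math.exp).sum)
    // The pivot class 0 has an implicit margin of zero, so nothing is subtracted for it
    if (label == 0) logNormalizer else logNormalizer - margins(label - 1)
  }
}
```

With all margins equal to zero the loss is log(K) for every label, as expected for a uniform prediction.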

For optimization, we have to calculate the first derivative of the loss function, and a simple calculation shows that

\begin{align} \frac{\partial l(w, x)}{\partial w_{ij}} &= (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y))\delta_{y, i+1}) * x_j \\ &= multiplier_i * x_j \end{align}

where $\delta_{i, j} = 1$ if $i == j$, $\delta_{i, j} = 0$ if $i \ne j$, and $multiplier_i = \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y))\delta_{y, i+1}$.
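In words, each multiplier is the predicted probability of class i + 1 minus an indicator that the label equals i + 1, and the gradient row for that class is the multiplier times the feature vector. The sketch below illustrates this with plain arrays; `NaiveGradient` and its methods are hypothetical names, and the nested-array gradient layout is an assumption chosen for readability.

```scala
object NaiveGradient {
  /** Hypothetical helper: multiplier_i = P(y = i + 1 | x, w) - [label == i + 1]. */
  def multipliers(margins: Array[Double], label: Int): Array[Double] = {
    val denom = 1.0 + margins.map(math.exp).sum
    margins.zipWithIndex.map { case (m, i) =>
      // The indicator is never set for label 0, the pivot class
      val indicator = if (label == i + 1) 1.0 else 0.0
      math.exp(m) / denom - indicator
    }
  }

  /** Row i of the result holds the partial derivatives with respect to w_{i,j}, for each feature j. */
  def gradient(x: Array[Double], margins: Array[Double], label: Int): Array[Array[Double]] =
    multipliers(margins, label).map(mult => x.map(_ * mult))
}
```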

If any of the margins is larger than 709.78, the numerical computation of the multiplier and the loss function will suffer from arithmetic overflow. This issue occurs when there are outliers in the data that lie far from the hyperplane, and it causes training to fail once infinity / infinity is introduced. Note that this is only a concern when max(margins) > 0.
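The threshold follows from the largest finite double, whose natural logarithm is roughly 709.78. A small sketch of the failure mode, assuming IEEE 754 doubles as used by Scala's Double on the JVM; `OverflowDemo` is a hypothetical name.

```scala
object OverflowDemo {
  // Just below the limit: the result is still a finite double
  val belowLimit: Double = math.exp(709.78)
  // Just above the limit: the result overflows to positive infinity
  val aboveLimit: Double = math.exp(710.0)
  // The naive ratio exp(m) / (1 + exp(m)) then evaluates infinity / infinity
  val naiveRatio: Double = aboveLimit / (1.0 + aboveLimit)
}
```

Here `naiveRatio` is NaN, which would propagate into every gradient entry computed from it.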

Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier can easily be rewritten into the following equivalent, numerically stable formulas.

\begin{align} l(w, x) &= \log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \\ &= \log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin - (1-\alpha(y)) margins_{y-1} \\ &= \log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1} \end{align}

where sum = $\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1$.

Note that each term $(margins_i - maxMargin)$ in $\exp$ is no larger than zero; as a result, overflow will not happen with this formula.
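The rewritten loss can be sketched as follows. This is an illustration under stated assumptions rather than Spark's code: `StableLoss` is a hypothetical name, `margins` is a non-empty zero-based array, and `label` is an integer in 0 until K. `math.log1p` is used for `log(1 + sum)`.

```scala
object StableLoss {
  /** Hypothetical helper: same value as the direct loss, but safe for very large margins. */
  def loss(margins: Array[Double], label: Int): Double = {
    val maxMargin = margins.max
    // Margin of the true class; the pivot class 0 contributes nothing
    val marginY = if (label == 0) 0.0 else margins(label - 1)
    if (maxMargin > 0) {
      // sum = exp(-maxMargin) + sum_i exp(margin_i - maxMargin) - 1; every exponent is <= 0
      val sum = math.exp(-maxMargin) + margins.map(m => math.exp(m - maxMargin)).sum - 1.0
      math.log1p(sum) + maxMargin - marginY
    } else {
      // No positive margin, so the direct form cannot overflow
      math.log1p(margins.map(math.exp).sum) - marginY
    }
  }
}
```

For margins of (1000.0, 0.0), the direct formula would produce infinity, while this form returns a finite loss for either label.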

For the multiplier, a similar trick can be applied as follows:

\begin{align} multiplier_i &= \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y))\delta_{y, i+1} \\ &= \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y))\delta_{y, i+1} \end{align}

where each term in $\exp$ is also no larger than zero, so overflow is not a concern.
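A sketch of the stabilized multiplier is given below. As a simplifying assumption that differs slightly from the case split in the text, it always shifts by max(maxMargin, 0): when no margin is positive the shift is zero and the expression reduces to the direct formula. `StableMultiplier` is a hypothetical name.

```scala
object StableMultiplier {
  /** Hypothetical helper: multiplier_i computed with all exponents shifted to be <= 0. */
  def multipliers(margins: Array[Double], label: Int): Array[Double] = {
    // A shift of zero leaves the direct formula unchanged
    val shift = math.max(margins.max, 0.0)
    // Equals 1 + sum in the notation of the text
    val onePlusSum = math.exp(-shift) + margins.map(m => math.exp(m - shift)).sum
    margins.zipWithIndex.map { case (m, i) =>
      val indicator = if (label == i + 1) 1.0 else 0.0
      math.exp(m - shift) / onePlusSum - indicator
    }
  }
}
```

For margins of (1000.0, 0.0) the result contains no NaN, whereas the direct formula would evaluate infinity / infinity.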

For the detailed mathematical derivation, see the reference at http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297

Linear Supertypes

Serializable, Serializable, AnyRef, Any

### Instance Constructors

numClasses

The number of possible outcomes for a k-class classification problem in multinomial logistic regression. By default, it is binary logistic regression, so numClasses is set to 2.

### Value Members

1. final def !=(arg0: Any): Boolean
Definition Classes
AnyRef → Any
2. final def ##(): Int
Definition Classes
AnyRef → Any
3. final def ==(arg0: Any): Boolean
Definition Classes
AnyRef → Any
4. final def asInstanceOf[T0]: T0
Definition Classes
Any
5. def clone()
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws( ... ) @native()
6. def compute(data: Vector, label: Double, weights: Vector, cumGradient: Vector): Double

Compute the gradient and loss given the features of a single data point, add the gradient to a provided vector to avoid creating new objects, and return loss.

data

features for one data point

label

label for this data point

weights

weights/coefficients corresponding to features

returns

loss

Definition Classes
7. def compute(data: Vector, label: Double, weights: Vector): (Vector, Double)

Compute the gradient and loss given the features of a single data point.

data

features for one data point

label

label for this data point

weights

weights/coefficients corresponding to features

returns

(gradient: Vector, loss: Double)

Definition Classes
8. final def eq(arg0: AnyRef): Boolean
Definition Classes
AnyRef
9. def equals(arg0: Any): Boolean
Definition Classes
AnyRef → Any
10. def finalize(): Unit
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
11. final def getClass(): Class[_]
Definition Classes
AnyRef → Any
Annotations
@native()
12. def hashCode(): Int
Definition Classes
AnyRef → Any
Annotations
@native()
13. final def isInstanceOf[T0]: Boolean
Definition Classes
Any
14. final def ne(arg0: AnyRef): Boolean
Definition Classes
AnyRef
15. final def notify(): Unit
Definition Classes
AnyRef
Annotations
@native()
16. final def notifyAll(): Unit
Definition Classes
AnyRef
Annotations
@native()
17. final def synchronized[T0](arg0: ⇒ T0): T0
Definition Classes
AnyRef
18. def toString(): String
Definition Classes
AnyRef → Any
19. final def wait(): Unit
Definition Classes
AnyRef
Annotations
@throws( ... )
20. final def wait(arg0: Long, arg1: Int): Unit
Definition Classes
AnyRef
Annotations
@throws( ... )
21. final def wait(arg0: Long): Unit
Definition Classes
AnyRef
Annotations
@throws( ... ) @native()
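The four-argument compute member adds the gradient into a caller-supplied vector and returns only the loss, so that one buffer can be reused across many data points. The following sketch mimics that contract for the binary case (numClasses = 2) using plain arrays in place of Vector. It is an illustration under assumptions, not Spark's implementation: `AccumulatingGradient` is a hypothetical name, the label is assumed to be 0.0 or 1.0, and the margin is taken as the dot product of data and weights.

```scala
object AccumulatingGradient {
  /** Hypothetical stand-in: adds the gradient into cumGradient in place and returns the loss. */
  def compute(data: Array[Double], label: Double, weights: Array[Double],
              cumGradient: Array[Double]): Double = {
    require(data.length == weights.length && data.length == cumGradient.length)
    // margin = x . w
    var margin = 0.0
    var j = 0
    while (j < data.length) { margin += data(j) * weights(j); j += 1 }
    // multiplier = P(y = 1 | x, w) - label
    val multiplier = 1.0 / (1.0 + math.exp(-margin)) - label
    j = 0
    while (j < data.length) { cumGradient(j) += multiplier * data(j); j += 1 }
    // log(1 + exp(margin)), evaluated without overflow for large positive margins
    val log1pExp =
      if (margin > 0) margin + math.log1p(math.exp(-margin))
      else math.log1p(math.exp(margin))
    log1pExp - label * margin
  }
}
```

Calling it twice with the same buffer accumulates both gradients, which is the behaviour the description of the four-argument member refers to when it mentions avoiding the creation of new objects.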