GradientBoostedTrees

class pyspark.mllib.tree.GradientBoostedTrees

Learning algorithm for a gradient boosted trees model for classification or regression.

Methods

trainClassifier(data, categoricalFeaturesInfo)

Train a gradient-boosted trees model for classification.

trainRegressor(data, categoricalFeaturesInfo)

Train a gradient-boosted trees model for regression.

Methods Documentation

classmethod trainClassifier(data: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint], categoricalFeaturesInfo: Dict[int, int], loss: str = 'logLoss', numIterations: int = 100, learningRate: float = 0.1, maxDepth: int = 3, maxBins: int = 32) → pyspark.mllib.tree.GradientBoostedTreesModel

Train a gradient-boosted trees model for classification.

Parameters
data : pyspark.RDD

Training dataset: RDD of LabeledPoint. Labels should take values {0, 1}.

categoricalFeaturesInfo : dict

Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, …, k-1}.

loss : str, optional

Loss function used for minimization during gradient boosting. Supported values: “logLoss”, “leastSquaresError”, “leastAbsoluteError”. (default: “logLoss”)

numIterations : int, optional

Number of iterations of boosting. (default: 100)

learningRate : float, optional

Learning rate for shrinking the contribution of each estimator. The learning rate should be in the interval (0, 1]. (default: 0.1)

maxDepth : int, optional

Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 3)

maxBins : int, optional

Maximum number of bins used for splitting features. DecisionTree requires maxBins to be at least the largest number of categories of any categorical feature. (default: 32)
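To make the loss parameter more concrete, the following is a minimal pure-Python sketch of the three named losses for a single example. It is not the library's implementation: the function names and the LOSSES lookup table are hypothetical, the formulas follow the definitions usually given for these losses in the MLlib ensembles guide (treat the exact scaling of the log loss as an assumption), and the log loss assumes that {0, 1} labels are first remapped to {-1, +1}.

```python
import math


def to_signed_label(label):
    """Map a {0, 1} class label to {-1, +1} (assumed convention for log loss)."""
    return 2.0 * label - 1.0


def log_loss(label, prediction):
    """Log loss for a signed label in {-1, +1} and a raw ensemble score."""
    margin = 2.0 * label * prediction
    return 2.0 * math.log1p(math.exp(-margin))


def least_squares_error(label, prediction):
    """Squared difference between the label and the prediction."""
    return (label - prediction) ** 2


def least_absolute_error(label, prediction):
    """Absolute difference between the label and the prediction."""
    return abs(label - prediction)


# Hypothetical lookup table keyed by the strings accepted for the loss parameter.
LOSSES = {
    "logLoss": log_loss,
    "leastSquaresError": least_squares_error,
    "leastAbsoluteError": least_absolute_error,
}
```

A score of 0.0 is maximally uncertain under the log loss, so LOSSES["logLoss"](1.0, 0.0) evaluates to 2 * ln(2), while a confident correct score drives the loss toward 0.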

Returns
GradientBoostedTreesModel

Trained model that can be used for prediction.

Examples

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import GradientBoostedTrees
>>>
>>> data = [
...     LabeledPoint(0.0, [0.0]),
...     LabeledPoint(0.0, [1.0]),
...     LabeledPoint(1.0, [2.0]),
...     LabeledPoint(1.0, [3.0])
... ]
>>>
>>> model = GradientBoostedTrees.trainClassifier(sc.parallelize(data), {}, numIterations=10)
>>> model.numTrees()
10
>>> model.totalNumNodes()
30
>>> print(model)  # it already has newline
TreeEnsembleModel classifier with 10 trees

>>> model.predict([2.0])
1.0
>>> model.predict([0.0])
0.0
>>> rdd = sc.parallelize([[2.0], [0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]
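The maxDepth description can be related to the node counts reported by totalNumNodes() with a little arithmetic. The sketch below assumes binary trees, as the "1 internal node + 2 leaf nodes" wording implies; the helper names are hypothetical and not part of pyspark.

```python
def max_nodes_per_tree(max_depth):
    """Upper bound on the node count of a binary tree of the given depth."""
    # Depth 0 is a single leaf; each extra level can at most double the leaves.
    return 2 ** (max_depth + 1) - 1


def max_total_nodes(num_trees, max_depth):
    """Upper bound on the total node count of an ensemble of binary trees."""
    return num_trees * max_nodes_per_tree(max_depth)
```

With the default maxDepth of 3, each tree can hold at most 15 nodes, so 10 trees can hold at most 150. The model in the example above reports 30 nodes in total, an average of 3 per tree, which is well under that bound.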

classmethod trainRegressor(data: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint], categoricalFeaturesInfo: Dict[int, int], loss: str = 'leastSquaresError', numIterations: int = 100, learningRate: float = 0.1, maxDepth: int = 3, maxBins: int = 32) → pyspark.mllib.tree.GradientBoostedTreesModel

Train a gradient-boosted trees model for regression.

Parameters
data : pyspark.RDD

Training dataset: RDD of LabeledPoint. Labels are real numbers.

categoricalFeaturesInfo : dict

Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, …, k-1}.

loss : str, optional

Loss function used for minimization during gradient boosting. Supported values: “logLoss”, “leastSquaresError”, “leastAbsoluteError”. (default: “leastSquaresError”)

numIterations : int, optional

Number of iterations of boosting. (default: 100)

learningRate : float, optional

Learning rate for shrinking the contribution of each estimator. The learning rate should be in the interval (0, 1]. (default: 0.1)

maxDepth : int, optional

Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 3)

maxBins : int, optional

Maximum number of bins used for splitting features. DecisionTree requires maxBins to be at least the largest number of categories of any categorical feature. (default: 32)
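The categoricalFeaturesInfo and maxBins descriptions interact: every arity in the map has to fit within maxBins. The following is a minimal sketch of how such a map might be built and checked before training. Both helpers are hypothetical and not part of pyspark, and they assume category values are already encoded as non-negative integers indexed from 0.

```python
def build_categorical_features_info(category_values):
    """Build a {feature index: arity} map from observed category ids.

    category_values maps a feature index to an iterable of category ids,
    which are assumed to be encoded as 0, 1, ..., k-1.
    """
    info = {}
    for index, values in category_values.items():
        ids = {int(v) for v in values}
        if not ids or min(ids) < 0:
            raise ValueError("feature %d needs non-negative category ids" % index)
        # Categories are indexed from 0, so the arity is the largest id plus one.
        info[index] = max(ids) + 1
    return info


def check_max_bins(categorical_features_info, max_bins=32):
    """Raise ValueError if any categorical feature has more categories than max_bins."""
    largest = max(categorical_features_info.values(), default=0)
    if largest > max_bins:
        raise ValueError(
            "maxBins=%d is smaller than the largest arity %d" % (max_bins, largest)
        )
```

For instance, build_categorical_features_info({0: [0, 1, 2, 1], 3: [0, 1]}) yields {0: 3, 3: 2}, which passes check_max_bins with the default of 32.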

Returns
GradientBoostedTreesModel

that can be used for prediction.

Examples

>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import GradientBoostedTrees
>>> from pyspark.mllib.linalg import SparseVector
>>>
>>> sparse_data = [
...     LabeledPoint(0.0, SparseVector(2, {0: 1.0})),
...     LabeledPoint(1.0, SparseVector(2, {1: 1.0})),
...     LabeledPoint(0.0, SparseVector(2, {0: 1.0})),
...     LabeledPoint(1.0, SparseVector(2, {1: 2.0}))
... ]
>>>
>>> data = sc.parallelize(sparse_data)
>>> model = GradientBoostedTrees.trainRegressor(data, {}, numIterations=10)
>>> model.numTrees()
10
>>> model.totalNumNodes()
12
>>> model.predict(SparseVector(2, {1: 1.0}))
1.0
>>> model.predict(SparseVector(2, {0: 1.0}))
0.0
>>> rdd = sc.parallelize([[0.0, 1.0], [1.0, 0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]
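The effect of learningRate and numIterations can be illustrated with a deliberately simplified boosting loop. This is a generic sketch of shrinkage under squared error, not the library's algorithm: each weak learner here is a constant (the mean residual) rather than a tree, and the function name is hypothetical.

```python
def boost_constant_model(labels, num_iterations, learning_rate):
    """Boost constant learners under squared error and return the final prediction."""
    if not 0.0 < learning_rate <= 1.0:
        raise ValueError("learning_rate should be in the interval (0, 1]")
    prediction = 0.0
    for _ in range(num_iterations):
        # Fit the weak learner: the constant that best matches the residuals.
        residual_mean = sum(y - prediction for y in labels) / len(labels)
        # Shrink its contribution before adding it to the ensemble.
        prediction += learning_rate * residual_mean
    return prediction
```

With a learning rate of 1.0 the loop reaches the label mean in one iteration. With a smaller rate each iteration closes only that fraction of the remaining gap, so more iterations are needed to reach the same fit.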