CoordinateMatrix

class pyspark.mllib.linalg.distributed.CoordinateMatrix(entries: pyspark.rdd.RDD[Union[Tuple[int, int, float], pyspark.mllib.linalg.distributed.MatrixEntry]], numRows: int = 0, numCols: int = 0)

Represents a matrix in coordinate format.

Parameters
entries : pyspark.RDD

An RDD of MatrixEntry objects or (int, int, float) tuples of the form (row index, column index, value).

numRows : int, optional

Number of rows in the matrix. A non-positive value means the number is unknown, in which case it is determined as the maximum row index plus one.

numCols : int, optional

Number of columns in the matrix. A non-positive value means the number is unknown, in which case it is determined as the maximum column index plus one.
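The inference rule for unknown dimensions can be illustrated without Spark. The following is a minimal sketch in plain Python; `infer_dims` is a hypothetical helper written for this illustration, not part of the PySpark API, and it assumes a non-empty, in-memory list of entries.

```python
from typing import Iterable, Tuple

Entry = Tuple[int, int, float]  # (row index, column index, value)


def infer_dims(entries: Iterable[Entry],
               num_rows: int = 0,
               num_cols: int = 0) -> Tuple[int, int]:
    """Hypothetical helper mirroring the documented rule; not Spark code."""
    entries = list(entries)
    if num_rows <= 0:
        # Unknown row count: use the largest row index plus one.
        num_rows = max(i for i, _, _ in entries) + 1
    if num_cols <= 0:
        # Unknown column count: use the largest column index plus one.
        num_cols = max(j for _, j, _ in entries) + 1
    return num_rows, num_cols


entries = [(0, 0, 1.2), (1, 0, 2.0), (2, 1, 3.7)]
dims_inferred = infer_dims(entries)        # both dimensions inferred
dims_explicit = infer_dims(entries, 7, 6)  # both dimensions given
```

With these entries the inferred shape is 3 x 2 and the explicit shape is 7 x 6, which matches the numRows() and numCols() examples further down.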

Methods

numCols()

Get or compute the number of columns.

numRows()

Get or compute the number of rows.

toBlockMatrix([rowsPerBlock, colsPerBlock])

Convert this matrix to a BlockMatrix.

toIndexedRowMatrix()

Convert this matrix to an IndexedRowMatrix.

toRowMatrix()

Convert this matrix to a RowMatrix.

transpose()

Transpose this CoordinateMatrix.

Attributes

entries

Entries of the CoordinateMatrix stored as an RDD of MatrixEntries.

Methods Documentation

numCols() → int

Get or compute the number of columns.

Examples

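>>> from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
>>> # `sc` is assumed to be an active SparkContext.
>>> entries = sc.parallelize([MatrixEntry(0, 0, 1.2),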
...                           MatrixEntry(1, 0, 2),
...                           MatrixEntry(2, 1, 3.7)])
>>> mat = CoordinateMatrix(entries)
>>> print(mat.numCols())
2
>>> mat = CoordinateMatrix(entries, 7, 6)
>>> print(mat.numCols())
6
numRows() → int

Get or compute the number of rows.

Examples

>>> entries = sc.parallelize([MatrixEntry(0, 0, 1.2),
...                           MatrixEntry(1, 0, 2),
...                           MatrixEntry(2, 1, 3.7)])
>>> mat = CoordinateMatrix(entries)
>>> print(mat.numRows())
3
>>> mat = CoordinateMatrix(entries, 7, 6)
>>> print(mat.numRows())
7
toBlockMatrix(rowsPerBlock: int = 1024, colsPerBlock: int = 1024) → pyspark.mllib.linalg.distributed.BlockMatrix

Convert this matrix to a BlockMatrix.

Parameters
rowsPerBlock : int, optional

Number of rows that make up each block. The blocks forming the final rows may have fewer rows than this.

colsPerBlock : int, optional

Number of columns that make up each block. The blocks forming the final columns may have fewer columns than this.
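How the block sizes partition a matrix can be sketched with plain arithmetic. The example below is an illustration only: `block_grid` is a hypothetical helper, not a PySpark function, and it describes the logical grid of blocks without saying anything about which blocks Spark actually stores.

```python
from typing import Tuple


def block_grid(num_rows: int, num_cols: int,
               rows_per_block: int = 1024,
               cols_per_block: int = 1024) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Hypothetical helper: logical block grid and shape of the last block."""
    # Ceiling division: a partial block at the edge still counts as a block.
    n_block_rows = (num_rows + rows_per_block - 1) // rows_per_block
    n_block_cols = (num_cols + cols_per_block - 1) // cols_per_block
    # The final block row/column holds whatever is left over.
    last_rows = num_rows - (n_block_rows - 1) * rows_per_block
    last_cols = num_cols - (n_block_cols - 1) * cols_per_block
    return (n_block_rows, n_block_cols), (last_rows, last_cols)


# A 7 x 5 matrix split into 3 x 2 blocks.
grid, last_block = block_grid(7, 5, rows_per_block=3, cols_per_block=2)

# The same matrix with the default 1024 x 1024 block size.
default_grid, default_last = block_grid(7, 5)
```

With 3 x 2 blocks, the 7 x 5 matrix needs a 3 x 3 grid whose bottom-right block is only 1 x 1. With the defaults, the whole matrix fits in a single 7 x 5 block.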

Examples

>>> entries = sc.parallelize([MatrixEntry(0, 0, 1.2),
...                           MatrixEntry(6, 4, 2.1)])
>>> mat = CoordinateMatrix(entries).toBlockMatrix()
>>> # This CoordinateMatrix will have 7 effective rows, due to
>>> # the highest row index being 6, and the ensuing
>>> # BlockMatrix will have 7 rows as well.
>>> print(mat.numRows())
7
>>> # This CoordinateMatrix will have 5 columns, due to the
>>> # highest column index being 4, and the ensuing
>>> # BlockMatrix will have 5 columns as well.
>>> print(mat.numCols())
5
toIndexedRowMatrix() → pyspark.mllib.linalg.distributed.IndexedRowMatrix

Convert this matrix to an IndexedRowMatrix.

Examples

>>> entries = sc.parallelize([MatrixEntry(0, 0, 1.2),
...                           MatrixEntry(6, 4, 2.1)])
>>> mat = CoordinateMatrix(entries).toIndexedRowMatrix()
>>> # This CoordinateMatrix will have 7 effective rows, due to
>>> # the highest row index being 6, and the ensuing
>>> # IndexedRowMatrix will have 7 rows as well.
>>> print(mat.numRows())
7
>>> # This CoordinateMatrix will have 5 columns, due to the
>>> # highest column index being 4, and the ensuing
>>> # IndexedRowMatrix will have 5 columns as well.
>>> print(mat.numCols())
5
toRowMatrix() → pyspark.mllib.linalg.distributed.RowMatrix

Convert this matrix to a RowMatrix.

Examples

>>> entries = sc.parallelize([MatrixEntry(0, 0, 1.2),
...                           MatrixEntry(6, 4, 2.1)])
>>> mat = CoordinateMatrix(entries).toRowMatrix()
>>> # This CoordinateMatrix will have 7 effective rows, due to
>>> # the highest row index being 6, but the ensuing RowMatrix
>>> # will only have 2 rows since there are only entries on 2
>>> # unique rows.
>>> print(mat.numRows())
2
>>> # This CoordinateMatrix will have 5 columns, due to the
>>> # highest column index being 4, and the ensuing RowMatrix
>>> # will have 5 columns as well.
>>> print(mat.numCols())
5
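The difference in row counts between the two row-oriented conversions can be reproduced in plain Python. This is a rough sketch of the behavior described in the comments above, not Spark's implementation: entries are grouped by row index, and dropping the indices leaves only the rows that contain at least one entry. Sorting by row index is an assumption made here for readability.

```python
from collections import defaultdict

entries = [(0, 0, 1.2), (6, 4, 2.1)]  # (row index, column index, value)
num_cols = max(j for _, j, _ in entries) + 1

# Group values by row index, as an indexed-row representation would.
by_row = defaultdict(dict)
for i, j, value in entries:
    by_row[i][j] = value
indexed_rows = sorted(by_row.items())

# With indices kept, the row count follows the largest row index.
indexed_num_rows = max(i for i, _ in indexed_rows) + 1

# With indices dropped, only rows that hold an entry remain.
rows = [[cols.get(j, 0.0) for j in range(num_cols)] for _, cols in indexed_rows]
row_matrix_num_rows = len(rows)
```

Here `indexed_num_rows` is 7 while `row_matrix_num_rows` is 2, and both representations keep 5 columns.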
transpose() → pyspark.mllib.linalg.distributed.CoordinateMatrix

Transpose this CoordinateMatrix.

Examples

>>> entries = sc.parallelize([MatrixEntry(0, 0, 1.2),
...                           MatrixEntry(1, 0, 2),
...                           MatrixEntry(2, 1, 3.7)])
>>> mat = CoordinateMatrix(entries)
>>> mat_transposed = mat.transpose()
>>> print(mat_transposed.numRows())
2
>>> print(mat_transposed.numCols())
3
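In coordinate format, transposition amounts to swapping the row and column index of every entry. A minimal plain-Python sketch of that idea, using tuples in place of MatrixEntry objects and not reflecting how Spark executes it:

```python
entries = [(0, 0, 1.2), (1, 0, 2.0), (2, 1, 3.7)]  # (row, column, value)

# Swap the two indices of each entry; values are unchanged.
transposed = [(j, i, value) for i, j, value in entries]

# Dimensions inferred as the largest index plus one.
num_rows = max(i for i, _, _ in transposed) + 1
num_cols = max(j for _, j, _ in transposed) + 1
```

The transposed entries give 2 rows and 3 columns, the same shape the example above reports.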

Attributes Documentation

entries

Entries of the CoordinateMatrix stored as an RDD of MatrixEntries.

Examples

>>> mat = CoordinateMatrix(sc.parallelize([MatrixEntry(0, 0, 1.2),
...                                        MatrixEntry(6, 4, 2.1)]))
>>> entries = mat.entries
>>> entries.first()
MatrixEntry(0, 0, 1.2)