IndexedRowMatrix
class pyspark.mllib.linalg.distributed.IndexedRowMatrix(rows: pyspark.rdd.RDD[Union[Tuple[int, VectorLike], pyspark.mllib.linalg.distributed.IndexedRow]], numRows: int = 0, numCols: int = 0)
Represents a row-oriented distributed Matrix with indexed rows.
- Parameters
- rows : pyspark.RDD
An RDD of IndexedRows or (int, vector) tuples, or a DataFrame consisting of an int-typed column of indices and a vector-typed column.
- numRows : int, optional
Number of rows in the matrix. A non-positive value means unknown, in which case the number of rows is determined by the maximum row index plus one.
- numCols : int, optional
Number of columns in the matrix. A non-positive value means unknown, in which case the number of columns is determined by the size of the first row.
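The defaulting rule for numRows and numCols can be sketched in plain Python. This is only an illustration of the documented behaviour on a local list of (index, vector) tuples; `infer_dims` is a hypothetical helper, not part of PySpark.

```python
from typing import Sequence, Tuple


def infer_dims(rows: Sequence[Tuple[int, Sequence[float]]],
               num_rows: int = 0, num_cols: int = 0) -> Tuple[int, int]:
    """Hypothetical helper mirroring the documented defaults; not a Spark API."""
    if num_rows <= 0:
        # Unknown row count: use the largest row index plus one.
        num_rows = max(index for index, _ in rows) + 1
    if num_cols <= 0:
        # Unknown column count: use the size of the first row.
        num_cols = len(rows[0][1])
    return num_rows, num_cols


rows = [(0, [1, 2, 3]), (6, [4, 5, 6])]
inferred = infer_dims(rows)        # dimensions left unknown
explicit = infer_dims(rows, 7, 6)  # dimensions supplied by the caller
print(inferred)
print(explicit)
```

Note that two stored rows with indices 0 and 6 give an inferred row count of 7, because the count follows the largest index rather than the number of stored rows.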
Methods
columnSimilarities(): Compute all cosine similarities between columns.
computeGramianMatrix(): Computes the Gramian matrix A^T A.
computeSVD(k[, computeU, rCond]): Computes the singular value decomposition of the IndexedRowMatrix.
multiply(matrix): Multiply this matrix by a local dense matrix on the right.
numCols(): Get or compute the number of cols.
numRows(): Get or compute the number of rows.
toBlockMatrix([rowsPerBlock, colsPerBlock]): Convert this matrix to a BlockMatrix.
toCoordinateMatrix(): Convert this matrix to a CoordinateMatrix.
toRowMatrix(): Convert this matrix to a RowMatrix.
Attributes
rows: Rows of the IndexedRowMatrix stored as an RDD of IndexedRows.
Methods Documentation
columnSimilarities() → pyspark.mllib.linalg.distributed.CoordinateMatrix
Compute all cosine similarities between columns.
Examples
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)
>>> cs = mat.columnSimilarities()
>>> print(cs.numCols())
3
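To make "cosine similarities between columns" concrete, here is a plain-Python sketch on a local copy of the same two rows. It is not the Spark implementation; the `cosine` helper and the choice to list each column pair once (i < j) are assumptions made for this illustration.

```python
import math
from itertools import combinations

# Local copy of the two rows used in the example above.
matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]

# Transpose so that each entry is one column of the matrix.
columns = list(zip(*matrix))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One similarity per unordered pair of distinct columns (i < j).
similarities = {(i, j): cosine(columns[i], columns[j])
                for i, j in combinations(range(len(columns)), 2)}

for (i, j), value in sorted(similarities.items()):
    print(i, j, round(value, 4))
```

The row indices (0 and 6) play no role here: only the column vectors enter the computation.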
computeGramianMatrix() → pyspark.mllib.linalg.Matrix
Computes the Gramian matrix A^T A.
Notes
This cannot be computed on matrices with more than 65535 columns.
Examples
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)
>>> mat.computeGramianMatrix()
DenseMatrix(3, 3, [17.0, 22.0, 27.0, 22.0, 29.0, 36.0, 27.0, 36.0, 45.0], 0)
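The values in the example output can be checked by hand. The following plain-Python sketch computes A^T A for a local copy of the same matrix and flattens it column by column, which is the order in which DenseMatrix lists its values; it only illustrates the arithmetic and is not how Spark distributes the computation.

```python
# Local copy of the matrix A from the example above (2 rows, 3 columns).
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

n = len(A[0])

# gram[i][j] = sum over rows r of A[r][i] * A[r][j], i.e. (A^T A)[i][j].
gram = [[sum(row[i] * row[j] for row in A) for j in range(n)]
        for i in range(n)]

# Flatten column by column, the order in which DenseMatrix lists its values.
column_major = [gram[i][j] for j in range(n) for i in range(n)]
print(column_major)
```

Because the Gramian is an n x n matrix for an input with n columns, its size depends only on the column count, which is where the column limit in the note applies.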
computeSVD(k: int, computeU: bool = False, rCond: float = 1e-09) → pyspark.mllib.linalg.distributed.SingularValueDecomposition[pyspark.mllib.linalg.distributed.IndexedRowMatrix, pyspark.mllib.linalg.Matrix]
Computes the singular value decomposition of the IndexedRowMatrix.
The given row matrix A of dimension (m X n) is decomposed into U * s * V' where
- U: (m X k) (left singular vectors) is an IndexedRowMatrix whose columns are the eigenvectors of (A X A')
- s: DenseVector consisting of the square roots of the eigenvalues (singular values) in descending order.
- V: (n X k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A' X A)
For more specific details on the implementation, please refer to the Scala documentation.
- Parameters
- k : int
Number of leading singular values to keep (0 < k <= n). Fewer than k values may be returned if some singular values are numerically zero, or if not enough Ritz values converge before the maximum number of Arnoldi update iterations is reached (which can happen when matrix A is ill-conditioned).
- computeU : bool, optional
Whether or not to compute U. If set to True, U is computed as A * V * s^-1.
- rCond : float, optional
Reciprocal condition number. All singular values smaller than rCond * s[0] are treated as zero, where s[0] is the largest singular value.
- Returns
pyspark.mllib.linalg.distributed.SingularValueDecomposition
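The rCond rule can be sketched as a simple threshold on a descending list of singular values. `apply_rcond` is a hypothetical helper written for this illustration, and the third input value is made up to show a numerically zero entry being dropped.

```python
def apply_rcond(singular_values, r_cond=1e-9):
    """Hypothetical helper: drop values smaller than r_cond * s[0]."""
    # singular_values is assumed to be sorted in descending order,
    # so singular_values[0] is the largest one.
    threshold = r_cond * singular_values[0]
    return [s for s in singular_values if s >= threshold]


# The last value is an invented, numerically negligible singular value.
kept = apply_rcond([3.4641, 3.1623, 1e-12])
print(kept)
```

This is one way a request for k values can yield fewer than k: values below the threshold are treated as zero.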
Examples
>>> rows = [(0, (3, 1, 1)), (1, (-1, 3, 1))]
>>> irm = IndexedRowMatrix(sc.parallelize(rows))
>>> svd_model = irm.computeSVD(2, True)
>>> svd_model.U.rows.collect()
[IndexedRow(0, [-0.707106781187,0.707106781187]), IndexedRow(1, [-0.707106781187,-0.707106781187])]
>>> svd_model.s
DenseVector([3.4641, 3.1623])
>>> svd_model.V
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, ...0.0], 0)
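The singular values in this example can be verified from the statement that they are the square roots of the eigenvalues of A A'. The sketch below does this in plain Python for the 2 x 3 example matrix, using the closed-form eigenvalues of a symmetric 2 x 2 matrix. It is a hand check under that assumption, not the Arnoldi-based procedure the description refers to.

```python
import math

# Local copy of the 2 x 3 matrix A from the example above.
A = [[3.0, 1.0, 1.0],
     [-1.0, 3.0, 1.0]]

# B = A A' is a symmetric 2 x 2 matrix [[a, b], [b, d]].
a = sum(x * x for x in A[0])
b = sum(x * y for x, y in zip(A[0], A[1]))
d = sum(y * y for y in A[1])

# Closed-form eigenvalues of a symmetric 2 x 2 matrix, largest first.
mean = (a + d) / 2.0
radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
eigenvalues = [mean + radius, mean - radius]

# Singular values are the square roots of those eigenvalues.
singular_values = [math.sqrt(ev) for ev in eigenvalues]
print([round(s, 4) for s in singular_values])
```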
multiply(matrix: pyspark.mllib.linalg.Matrix) → pyspark.mllib.linalg.distributed.IndexedRowMatrix
Multiply this matrix by a local dense matrix on the right.
- Parameters
- matrix : pyspark.mllib.linalg.Matrix
A local dense matrix whose number of rows must match the number of columns of this matrix.
- Returns
pyspark.mllib.linalg.distributed.IndexedRowMatrix
Examples
>>> mat = IndexedRowMatrix(sc.parallelize([(0, (0, 1)), (1, (2, 3))]))
>>> mat.multiply(DenseMatrix(2, 2, [0, 2, 1, 3])).rows.collect()
[IndexedRow(0, [2.0,3.0]), IndexedRow(1, [6.0,11.0])]
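The example is easier to follow once the DenseMatrix values are read in column-major order, so `DenseMatrix(2, 2, [0, 2, 1, 3])` is the matrix with rows [0, 1] and [2, 3]. The plain-Python sketch below reproduces the product locally; `column_major_to_rows` is a hypothetical helper for this illustration only.

```python
def column_major_to_rows(num_rows, num_cols, values):
    """Hypothetical helper: unpack column-major values into a list of rows."""
    return [[values[j * num_rows + i] for j in range(num_cols)]
            for i in range(num_rows)]


# Rows of the distributed matrix from the example above.
left = [[0.0, 1.0], [2.0, 3.0]]
# The local matrix, unpacked from its column-major value list.
right = column_major_to_rows(2, 2, [0.0, 2.0, 1.0, 3.0])

# Each output row is the corresponding input row times the local matrix.
product = [[sum(row[k] * right[k][j] for k in range(len(right)))
            for j in range(len(right[0]))]
           for row in left]
print(right)
print(product)
```

Each output row depends only on the matching input row, which is why the row indices carry over unchanged to the result.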
numCols() → int
Get or compute the number of cols.
Examples
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6]),
...                        IndexedRow(2, [7, 8, 9]),
...                        IndexedRow(3, [10, 11, 12])])
>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numCols())
3
>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numCols())
6
numRows() → int
Get or compute the number of rows.
Examples
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6]),
...                        IndexedRow(2, [7, 8, 9]),
...                        IndexedRow(3, [10, 11, 12])])
>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numRows())
4
>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numRows())
7
toBlockMatrix(rowsPerBlock: int = 1024, colsPerBlock: int = 1024) → pyspark.mllib.linalg.distributed.BlockMatrix
Convert this matrix to a BlockMatrix.
- Parameters
- rowsPerBlock : int, optional
Number of rows that make up each block. The blocks forming the final rows are not required to have the given number of rows.
- colsPerBlock : int, optional
Number of columns that make up each block. The blocks forming the final columns are not required to have the given number of columns.
Examples
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toBlockMatrix()
>>> # This IndexedRowMatrix will have 7 effective rows, due to
>>> # the highest row index being 6, and the ensuing
>>> # BlockMatrix will have 7 rows as well.
>>> print(mat.numRows())
7
>>> print(mat.numCols())
3
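The remark that the final blocks need not have the requested size can be illustrated with a little arithmetic. The sketch below splits a dimension into consecutive blocks in plain Python; `block_sizes` is a hypothetical helper, and the sketch only assumes that every block except possibly the last is full, without describing how Spark lays blocks out internally.

```python
def block_sizes(total, per_block):
    """Hypothetical helper: sizes of consecutive blocks covering `total` items."""
    full, remainder = divmod(total, per_block)
    # All full-size blocks first, then one smaller block if anything is left.
    return [per_block] * full + ([remainder] if remainder else [])


default_rows = block_sizes(7, 1024)  # 7 rows with the default rowsPerBlock
small_rows = block_sizes(7, 3)       # 7 rows with rowsPerBlock = 3
small_cols = block_sizes(3, 2)       # 3 columns with colsPerBlock = 2
print(default_rows, small_rows, small_cols)
```

With the default block size of 1024, the 7 x 3 matrix from the example fits in a single block in each direction.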
toCoordinateMatrix() → pyspark.mllib.linalg.distributed.CoordinateMatrix
Convert this matrix to a CoordinateMatrix.
Examples
>>> rows = sc.parallelize([IndexedRow(0, [1, 0]),
...                        IndexedRow(6, [0, 5])])
>>> mat = IndexedRowMatrix(rows).toCoordinateMatrix()
>>> mat.entries.take(3)
[MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 0.0), MatrixEntry(6, 0, 0.0)]
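As the example output suggests, each element of a row becomes one (row index, column index, value) entry, zeros included. A plain-Python sketch of that expansion on a local list follows. Tuples stand in for MatrixEntry objects here, and the ordering is simply that of the local list rather than anything guaranteed for a distributed RDD.

```python
# Local copy of the indexed rows from the example above.
rows = [(0, [1, 0]), (6, [0, 5])]

# One (row index, column index, value) triple per stored vector element.
entries = [(i, j, float(v)) for i, vec in rows for j, v in enumerate(vec)]
print(entries[:3])
```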
toRowMatrix() → pyspark.mllib.linalg.distributed.RowMatrix
Convert this matrix to a RowMatrix.
Examples
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toRowMatrix()
>>> mat.rows.collect()
[DenseVector([1.0, 2.0, 3.0]), DenseVector([4.0, 5.0, 6.0])]
Attributes Documentation
rows
Rows of the IndexedRowMatrix stored as an RDD of IndexedRows.
Examples
>>> mat = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                                        IndexedRow(1, [4, 5, 6])]))
>>> rows = mat.rows
>>> rows.first()
IndexedRow(0, [1.0,2.0,3.0])