pyspark.pandas.DataFrame.spark.frame

spark.frame(index_col: Union[str, List[str], None] = None) → pyspark.sql.dataframe.DataFrame

Return the current DataFrame as a Spark DataFrame. DataFrame.spark.frame() is an alias of DataFrame.to_spark().

Parameters
index_col: str or list of str, optional, default: None

Column name or names to use in the Spark DataFrame for the pandas-on-Spark index. The index name in pandas-on-Spark is ignored. By default (None), the index is dropped.

See also

DataFrame.to_spark
DataFrame.pandas_api
DataFrame.spark.frame

Examples

By default, this method drops the index, as shown below.

>>> df = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df.to_spark().show()  
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  4|  7|
|  2|  5|  8|
|  3|  6|  9|
+---+---+---+
>>> df = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df.spark.frame().show()  
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  4|  7|
|  2|  5|  8|
|  3|  6|  9|
+---+---+---+

If index_col is set, the index is kept as a column with the specified name.

>>> df.to_spark(index_col="index").show()  
+-----+---+---+---+
|index|  a|  b|  c|
+-----+---+---+---+
|    0|  1|  4|  7|
|    1|  2|  5|  8|
|    2|  3|  6|  9|
+-----+---+---+---+

Keeping the index column is useful when you want to call some Spark APIs and then convert the result back to a pandas-on-Spark DataFrame without creating a default index, which can affect performance.

>>> spark_df = df.to_spark(index_col="index")
>>> spark_df = spark_df.filter("a == 2")
>>> spark_df.pandas_api(index_col="index")  
       a  b  c
index
1      2  5  8

For a multi-index, pass a list of column names to index_col, one per index level.

>>> new_df = df.set_index("a", append=True)
>>> new_spark_df = new_df.to_spark(index_col=["index_1", "index_2"])
>>> new_spark_df.show()  
+-------+-------+---+---+
|index_1|index_2|  b|  c|
+-------+-------+---+---+
|      0|      1|  4|  7|
|      1|      2|  5|  8|
|      2|      3|  6|  9|
+-------+-------+---+---+

Likewise, it can be converted back to a pandas-on-Spark DataFrame.

>>> new_spark_df.pandas_api(
...     index_col=["index_1", "index_2"])  
                 b  c
index_1 index_2
0       1        4  7
1       2        5  8
2       3        6  9
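
Because the pandas-on-Spark index name is ignored, index_col must be supplied explicitly for every index level. The following is a minimal sketch of one way to derive those names from the frame itself. choose_index_cols and to_spark_keeping_index are hypothetical helpers, not part of pyspark.pandas; the index_<level> placeholder and the _idx suffix are arbitrary choices. The sketch assumes flat (single-level) column labels, and it avoids names that collide with data columns on the assumption that such overlaps are not accepted.

```python
from typing import Any, List, Optional, Sequence


def choose_index_cols(
    index_names: Sequence[Optional[Any]],
    data_columns: Sequence[Any],
) -> List[str]:
    """Pick one Spark column name per index level (hypothetical helper)."""
    # Names already used by data columns must not be reused for the index.
    taken = {str(col) for col in data_columns}
    chosen: List[str] = []
    for level, name in enumerate(index_names):
        # Reuse the level's own name when it has one; otherwise use a placeholder.
        candidate = str(name) if name is not None else f"index_{level}"
        # Append a suffix until the name is unique among data and index columns.
        while candidate in taken:
            candidate = f"{candidate}_idx"
        taken.add(candidate)
        chosen.append(candidate)
    return chosen


def to_spark_keeping_index(psdf):
    """Convert a pandas-on-Spark DataFrame to Spark, keeping every index level.

    Hypothetical wrapper: it only relies on psdf.index.names, psdf.columns,
    and psdf.to_spark(index_col=...).
    """
    index_col = choose_index_cols(psdf.index.names, psdf.columns)
    return psdf.to_spark(index_col=index_col)
```

For the multi-index frame above (index names None and 'a', data columns 'b' and 'c'), choose_index_cols would yield ['index_0', 'a'], and the same list could be passed to pandas_api(index_col=...) to restore the index afterwards.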