pyspark.pandas.read_spark_io
pyspark.pandas.read_spark_io(path: Optional[str] = None, format: Optional[str] = None, schema: Union[str, StructType] = None, index_col: Union[str, List[str], None] = None, **options: Any) → pyspark.pandas.frame.DataFrame
Load a DataFrame from a Spark data source.
- Parameters
- path : string, optional
Path to the data source.
- format : string, optional
Specifies the input data source format. Some common ones are:
‘delta’
‘parquet’
‘orc’
‘json’
‘csv’
- schema : string or StructType, optional
Input schema. If None, Spark tries to infer the schema automatically. The schema can be either a Spark StructType or a DDL-formatted string such as col0 INT, col1 DOUBLE.
- index_col : str or list of str, optional, default: None
Name of the Spark column, or list of column names, to use as the index.
- options : dict
All other options passed directly into Spark’s data source.
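As an illustration of the schema and options parameters described above, the sketch below assembles a DDL-formatted schema string and the keyword arguments for a read_spark_io call. The helpers ddl_schema and reader_kwargs, the path /tmp/example/data.csv, and the column names are hypothetical and not part of pyspark; header is assumed here as an example of a CSV option forwarded to Spark's data source. The actual read is left commented out because it needs a running Spark session.

```python
from typing import Any, Dict, Mapping


def ddl_schema(columns: Mapping[str, str]) -> str:
    """Join column-name/type pairs into a DDL-formatted schema string.

    Hypothetical helper, not part of pyspark. Assumes plain column names
    that need no backtick quoting.
    """
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


def reader_kwargs(
    path: str, fmt: str, columns: Mapping[str, str], **options: Any
) -> Dict[str, Any]:
    """Collect keyword arguments for ps.read_spark_io (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "path": path,
        "format": fmt,
        "schema": ddl_schema(columns),
    }
    # Any extra keys are the ones read_spark_io forwards to the data source.
    kwargs.update(options)
    return kwargs


# Placeholder path and columns, chosen only for illustration.
kwargs = reader_kwargs(
    "/tmp/example/data.csv",
    "csv",
    {"col0": "INT", "col1": "DOUBLE"},
    header=True,
)
print(kwargs["schema"])

# With a running Spark session, the call would then be:
# import pyspark.pandas as ps
# psdf = ps.read_spark_io(**kwargs)
```

Building the schema string in one place keeps the column types next to the read call, which may help when automatic schema inference is not wanted.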
See also
DataFrame.to_spark_io
read_table
read_delta
read_parquet
Examples
>>> ps.range(1).to_spark_io('%s/read_spark_io/data.parquet' % path)
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.parquet' % path, format='parquet', schema='id long')
   id
0   0

>>> ps.range(10, 15, num_partitions=1).to_spark_io('%s/read_spark_io/data.json' % path,
...                                                format='json', lineSep='__')
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.json' % path, format='json', schema='id long', lineSep='__')
   id
0  10
1  11
2  12
3  13
4  14
You can preserve the index across the round trip, as shown below.
>>> ps.range(10, 15, num_partitions=1).to_spark_io('%s/read_spark_io/data.orc' % path,
...                                                format='orc', index_col="index")
>>> ps.read_spark_io(
...     path=r'%s/read_spark_io/data.orc' % path, format="orc", index_col="index")
       id
index
0      10
1      11
2      12
3      13
4      14