pyspark.pandas.read_csv

pyspark.pandas.read_csv(path: str, sep: str = ',', header: Union[str, int, None] = 'infer', names: Union[str, List[str], None] = None, index_col: Union[str, List[str], None] = None, usecols: Union[List[int], List[str], Callable[[str], bool], None] = None, squeeze: bool = False, mangle_dupe_cols: bool = True, dtype: Union[str, numpy.dtype, pandas.core.dtypes.base.ExtensionDtype, Dict[str, Union[str, numpy.dtype, pandas.core.dtypes.base.ExtensionDtype]], None] = None, nrows: Optional[int] = None, parse_dates: bool = False, quotechar: Optional[str] = None, escapechar: Optional[str] = None, comment: Optional[str] = None, encoding: Optional[str] = None, **options: Any) → Union[pyspark.pandas.frame.DataFrame, pyspark.pandas.series.Series]

Read CSV (comma-separated) file into DataFrame or Series.
Parameters
path : str
The path string storing the CSV file to be read.
sep : str, default ','
Delimiter to use. Must be a non-empty string.
header : int, default 'infer'
Row number to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file; if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.
names : str or array-like, optional
List of column names to use. If the file contains no header row, then you should explicitly pass header=None. Duplicates in this list will cause an error to be issued. If a string is given, it should be a DDL-formatted string in Spark SQL, which is preferred to avoid schema inference for better performance. See the examples below.
index_col : str or list of str, optional, default: None
Index column of table in Spark.
usecols : list-like or callable, optional
Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.
squeeze : bool, default False
If the parsed data only contains one column then return a Series.
mangle_dupe_cols : bool, default True
Duplicate columns will be specified as 'X0', 'X1', … 'XN', rather than 'X' … 'X'. Passing in False will cause data to be overwritten if there are duplicate names in the columns. Currently only True is allowed.
dtype : Type name or dict of column -> type, default None
Data type for data or columns, e.g. {'a': np.float64, 'b': np.int32}. Use str or object together with suitable na_values settings to preserve and not interpret dtype.
nrows : int, default None
Number of rows to read from the CSV file.
parse_dates : boolean or list of ints or names or list of lists or dict, default False
Currently only False is allowed.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
escapechar : str (length 1), default None
One-character string used to escape other characters.
comment : str, optional
Indicates the character that marks the start of a line which should not be parsed.
encoding : str, optional
Indicates the encoding to use when reading the file.
options : dict
All other options passed directly into Spark's data source; see the examples below.
Returns
DataFrame or Series
See also
DataFrame.to_csv
Write DataFrame to a comma-separated values (csv) file.
Examples
>>> import pyspark.pandas as ps
>>> ps.read_csv('data.csv')
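If the file has no header row, pass the column names explicitly; a DDL-formatted string fixes the schema up front and avoids inference. A sketch, assuming a header-less file whose two columns are named 'a' and 'b' for illustration:

>>> ps.read_csv('data.csv', header=None, names=['a', 'b'])
>>> ps.read_csv('data.csv', header=None, names='a INT, b STRING')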
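To build the index from a column of the file, name it in index_col (the column name 'id' is assumed here for illustration):

>>> ps.read_csv('data.csv', index_col='id')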
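Columns can be selected by position, by name, or with a callable applied to each column name; the column names are again assumed:

>>> ps.read_csv('data.csv', usecols=[0, 1])
>>> ps.read_csv('data.csv', usecols=['a', 'b'])
>>> ps.read_csv('data.csv', usecols=lambda name: name in {'a', 'b'})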
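Types can be forced per column with dtype, and nrows caps how many rows are read; the mapping below is an illustrative sketch, not a required schema:

>>> import numpy as np
>>> ps.read_csv('data.csv', dtype={'a': np.float64, 'b': str}, nrows=10)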
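Quoting and comment handling follow the usual CSV conventions; for instance, to treat double quotes as the quote character and skip lines starting with '#':

>>> ps.read_csv('data.csv', quotechar='"', comment='#')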
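Remaining keyword arguments are forwarded verbatim to Spark's CSV data source; for example, Spark's multiLine option lets quoted records span physical lines. This is a sketch of the mechanism, not an exhaustive option list; consult the Spark data source documentation for the available options:

>>> ps.read_csv('data.csv', multiLine=True)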