pyspark.sql.avro.functions.from_avro

pyspark.sql.avro.functions.from_avro(data: ColumnOrName, jsonFormatSchema: str, options: Optional[Dict[str, str]] = None) → pyspark.sql.column.Column

Converts a binary column of Avro format into its corresponding catalyst value. The specified schema must match the read data, otherwise the behavior is undefined: it may fail or return arbitrary result. To deserialize the data with a compatible and evolved schema, the expected Avro schema can be set via the option avroSchema.

Parameters
dataColumn or str

the binary column.

jsonFormatSchemastr

the avro schema in JSON string format.

optionsdict, optional

options to control how the Avro record is parsed.

Notes

Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of “Apache Avro Data Source Guide”.

Examples

>>> from pyspark.sql import Row
>>> from pyspark.sql.avro.functions import from_avro, to_avro
>>> data = [(1, Row(age=2, name='Alice'))]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> avroDf = df.select(to_avro(df.value).alias("avro"))
>>> avroDf.collect()
[Row(avro=bytearray(b'\x00\x00\x04\x00\nAlice'))]
>>> jsonFormatSchema = '''{"type":"record","name":"topLevelRecord","fields":
...     [{"name":"avro","type":[{"type":"record","name":"value","namespace":"topLevelRecord",
...     "fields":[{"name":"age","type":["long","null"]},
...     {"name":"name","type":["string","null"]}]},"null"]}]}'''
>>> avroDf.select(from_avro(avroDf.avro, jsonFormatSchema).alias("value")).collect()
[Row(value=Row(avro=Row(age=2, name='Alice')))]