pyspark.RDD.saveAsTextFile

RDD.saveAsTextFile(path: str, compressionCodecClass: Optional[str] = None) → None

Save this RDD as a text file, using string representations of elements.

Parameters
pathstr

path to text file

compressionCodecClassstr, optional

fully qualified classname of the compression codec class i.e. “org.apache.hadoop.io.compress.GzipCodec” (None by default)

Examples

>>> from tempfile import NamedTemporaryFile
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> sc.parallelize(range(10)).saveAsTextFile(tempFile.name)
>>> from fileinput import input
>>> from glob import glob
>>> ''.join(sorted(input(glob(tempFile.name + "/part-0000*"))))
'0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n'

Empty lines are tolerated when saving to text files.

>>> from tempfile import NamedTemporaryFile
>>> tempFile2 = NamedTemporaryFile(delete=True)
>>> tempFile2.close()
>>> sc.parallelize(['', 'foo', '', 'bar', '']).saveAsTextFile(tempFile2.name)
>>> ''.join(sorted(input(glob(tempFile2.name + "/part-0000*"))))
'\n\n\nbar\nfoo\n'

Using compressionCodecClass

>>> from tempfile import NamedTemporaryFile
>>> tempFile3 = NamedTemporaryFile(delete=True)
>>> tempFile3.close()
>>> codec = "org.apache.hadoop.io.compress.GzipCodec"
>>> sc.parallelize(['foo', 'bar']).saveAsTextFile(tempFile3.name, codec)
>>> from fileinput import input, hook_compressed
>>> result = sorted(input(glob(tempFile3.name + "/part*.gz"), openhook=hook_compressed))
>>> ''.join([r.decode('utf-8') if isinstance(r, bytes) else r for r in result])
'bar\nfoo\n'