wholeTextFiles(path: str, minPartitions: Optional[int] = None, use_unicode: bool = True) → pyspark.rdd.RDD[Tuple[str, str]]¶
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file. The text files must be encoded as UTF-8.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000 hdfs://a-hdfs-path/part-00001 ... hdfs://a-hdfs-path/part-nnnnn
rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then
(a-hdfs-path/part-00000, its content) (a-hdfs-path/part-00001, its content) ... (a-hdfs-path/part-nnnnn, its content)
Small files are preferred, as each file will be loaded fully in memory.
>>> dirPath = os.path.join(tempdir, "files") >>> os.mkdir(dirPath) >>> with open(os.path.join(dirPath, "1.txt"), "w") as file1: ... _ = file1.write("1") >>> with open(os.path.join(dirPath, "2.txt"), "w") as file2: ... _ = file2.write("2") >>> textFiles = sc.wholeTextFiles(dirPath) >>> sorted(textFiles.collect()) [('.../1.txt', '1'), ('.../2.txt', '2')]