pyspark.sql.SparkSession.sql

SparkSession.sql(sqlQuery: str, args: Dict[str, Any] = {}, **kwargs: Any) → pyspark.sql.dataframe.DataFrame

Returns a DataFrame representing the result of the given query. When args is specified, the given string represents a parameterized query where the parameters are replaced after parsing the query text. When kwargs is specified, this method formats the given string by using the Python standard formatter before the query string is parsed.

Parameters
sqlQuery : str

SQL query string.

args : Dict[str, Any], optional

A dictionary of named parameters, where each key is referenced in the query with a leading : marker and each value is the Python object to substitute. (For more details about acceptable Python objects, read the Spark SQL Guide topic on Data Types.)

Note: sql() converts dictionary values to Column literal expressions with the lit() method.

kwargs : dict

Other variables that the user wants to set, which can be referenced in the query.

Added optional argument kwargs to specify the mapping of variables in the query. This feature is experimental and unstable.

Returns
DataFrame

Examples

Executing a SQL query.

>>> spark.sql("SELECT * FROM range(10) where id > 7").show()
+---+
| id|
+---+
|  8|
|  9|
+---+

Executing a SQL query with an integer parameter.

>>> spark.sql("SELECT * FROM range(10) where id > :param", args={"param": 7}).show()
+---+
| id|
+---+
|  8|
|  9|
+---+
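
Multiple named parameters can be bound in the same call. As a minimal sketch reusing the same range table (all markers must have a matching key in args):

>>> # Bind two named parameters, :low and :high, in one query.
>>> spark.sql(
...     "SELECT * FROM range(10) WHERE id > :low AND id < :high",
...     args={"low": 2, "high": 5}).show()
+---+
| id|
+---+
|  3|
|  4|
+---+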

Executing a SQL query with a string parameter.

>>> spark.sql("SELECT * FROM employee where name = :param", args={"param": 'John'}).show()
+---+----+
| id|name|
+---+----+
|100|John|
+---+----+
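
Because the values in args are converted to Column literal expressions with lit() (see the note above), typed Python objects such as datetime.date can be passed directly, assuming your Spark version supports this conversion; a minimal sketch:

>>> import datetime
>>> # The date value is bound as a DATE literal.
>>> spark.sql("SELECT :d AS d", args={"d": datetime.date(2023, 1, 1)}).show()
+----------+
|         d|
+----------+
|2023-01-01|
+----------+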

Executing a SQL query with variables substituted via the standard Python string formatter.

>>> spark.sql(
...     "SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}", bound1=7, bound2=9
... ).show()
+---+
| id|
+---+
|  8|
+---+
>>> mydf = spark.range(10)
>>> spark.sql(
...     "SELECT {col} FROM {mydf} WHERE id IN {x}",
...     col=mydf.id, mydf=mydf, x=tuple(range(4))).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
+---+
>>> spark.sql('''
...   SELECT m1.a, m2.b
...   FROM {table1} m1 INNER JOIN {table2} m2
...   ON m1.key = m2.key
...   ORDER BY m1.a, m2.b''',
...   table1=spark.createDataFrame([(1, "a"), (2, "b")], ["a", "key"]),
...   table2=spark.createDataFrame([(3, "a"), (4, "b"), (5, "b")], ["b", "key"])).show()
+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
|  2|  5|
+---+---+

It is also possible to reference Column objects from a DataFrame in the query.

>>> mydf = spark.createDataFrame([(1, 4), (2, 4), (3, 6)], ["A", "B"])
>>> spark.sql("SELECT {df.A}, {df[B]} FROM {df}", df=mydf).show()
+---+---+
|  A|  B|
+---+---+
|  1|  4|
|  2|  4|
|  3|  6|
+---+---+
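
In versions that allow it, args and kwargs can be combined in a single call, binding named parameters while formatting in DataFrames and Columns; a sketch reusing mydf from the previous example:

>>> # Format in the DataFrame and its columns, then bind :minB as a literal.
>>> spark.sql(
...     "SELECT {df[A]} FROM {df} WHERE {df[B]} > :minB",
...     args={"minB": 5}, df=mydf).show()
+---+
|  A|
+---+
|  3|
+---+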