

First, let's create a DataFrame to work with PySpark aggregate functions. All examples provided here are also available at the PySpark Examples GitHub project.

df = spark.createDataFrame(data=simpleData, schema=schema)
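The definitions of spark, simpleData, and schema are not included in this excerpt. A minimal stand-in that would make the line above runnable might look like the following; the employee_name column and all row values are illustrative guesses, while department and salary are the two columns the examples below actually aggregate.

from pyspark.sql import SparkSession

# Local session for experimentation; any existing SparkSession works too.
spark = SparkSession.builder.appName("PySparkAggregateFunctions").getOrCreate()

# Hypothetical sample rows; only "department" and "salary" are referenced
# by the aggregate examples in this article.
simpleData = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
]
schema = ["employee_name", "department", "salary"]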
Now let's see how to aggregate data in PySpark. The examples below use the aggregate functions from the pyspark.sql.functions module:

from pyspark.sql.functions import (collect_list, collect_set, count,
    countDistinct, first, grouping, kurtosis, last, max, mean, min)

collect_list() function returns all values from an input column, with duplicates.

df.select(collect_list("salary")).show(truncate=False)

collect_set() function returns all values from an input column, with duplicate values eliminated.

df.select(collect_set("salary")).show(truncate=False)

countDistinct() function returns the number of distinct elements in a column or a combination of columns.

df2 = df.select(countDistinct("department", "salary"))
print("Distinct Count of Department & Salary: " + str(df2.collect()))

count() function returns the number of elements in a column.

print("count: " + str(df.select(count("salary")).collect()))

grouping() indicates whether a given input column is aggregated or not: it returns 1 for aggregated or 0 for not aggregated in the result. grouping() can only be used with GroupingSets/Cube/Rollup, so if you try it directly on the salary column you will get the error below.

Exception in thread "main" org.apache.spark.sql.AnalysisException:
grouping() can only be used with GroupingSets/Cube/Rollup
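Since grouping() only works together with rollups, cubes, or grouping sets, a minimal working sketch might pair it with cube(); the cube() call, the sum() aggregate, and the aliases are illustrative choices, not taken from the article's examples.

from pyspark.sql.functions import grouping, sum

# cube("department") produces one row per department plus a grand-total row;
# grouping("department") returns 1 on the grand-total row (where the column
# was aggregated away) and 0 on the per-department rows.
df.cube("department").agg(
    grouping("department").alias("is_aggregated"),
    sum("salary").alias("total_salary"),
).show(truncate=False)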
first() function returns the first element in a column. When ignoreNulls is set to true, it returns the first non-null element.

df.select(first("salary")).show(truncate=False)

last() function returns the last element in a column. When ignoreNulls is set to true, it returns the last non-null element.

df.select(last("salary")).show(truncate=False)

kurtosis() function returns the kurtosis of the values in a group.

df.select(kurtosis("salary")).show(truncate=False)

max() function returns the maximum value in a column.

df.select(max("salary")).show(truncate=False)

min() function returns the minimum value in a column.

df.select(min("salary")).show(truncate=False)

mean() function returns the average of the values in a column.

df.select(mean("salary")).show(truncate=False)
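The ignoreNulls flag on first() and last() only matters when the column actually contains nulls, which the salary column here does not. A small hypothetical sketch; the bonus column and its values are invented for illustration.

from pyspark.sql.functions import first, last

# Hypothetical single-column frame with nulls to show the effect of ignoreNulls.
df_nulls = spark.createDataFrame([(None,), (500,), (None,)], "bonus INT")

# Note: first()/last() depend on row order, which is not guaranteed after
# shuffles; on tiny unshuffled data like this the result is stable.
df_nulls.select(
    first("bonus").alias("first_any"),                          # may be NULL
    first("bonus", ignoreNulls=True).alias("first_non_null"),   # 500
    last("bonus", ignoreNulls=True).alias("last_non_null"),     # 500
).show(truncate=False)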

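Everything above aggregates the whole DataFrame at once. The same functions also compose with groupBy() for per-group results; a sketch using the department and salary columns from the earlier examples, with illustrative aliases.

from pyspark.sql.functions import count, max, mean, sum

# A single pass over the data computes all per-department aggregates at once.
df.groupBy("department").agg(
    count("salary").alias("count"),
    sum("salary").alias("sum_salary"),
    mean("salary").alias("avg_salary"),
    max("salary").alias("max_salary"),
).show(truncate=False)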