Use PySpark to easily crush messy data at scale and discover proven techniques to create testable, immutable, and easily
English Pages 182 Year 2019
Table of contents:
Installing Pyspark and Setting up Your Development Environment
Getting Your Big Data into the Spark Environment Using RDDs
Big Data Cleaning and Wrangling with Spark Notebooks
Aggregating and Summarizing Data into Useful Reports
Powerful Exploratory Data Analysis with MLlib
Putting Structure on Your Big Data with SparkSQL
Transformations and Actions
Immutable Design
Avoiding Shuffle and Reducing Operational Expenses
Saving Data in the Correct Format
Working with the Spark Key/Value API
Testing Apache Spark Jobs
Leveraging the Spark GraphX API