Geared for experienced developers, the Spark Developer | Introduction to Spark for Big Data, Hadoop & Machine Learning course gives students a comprehensive, hands-on exploration of enterprise-grade Spark programming, working with Spark's major components to build complete data science solutions. Students will leave this course with the skills they need to begin working with Spark in a practical, real-world environment. The course is taught in Python but can also be offered in R or Java with advance notice and planning. Our team will work with you to coordinate the languages, tools, and environment that best fit your organization and needs.


* Actual course outline may vary depending on offering center. Contact your sales representative for more information.

Learning Objectives

This course is approximately 50% hands-on, combining expert lecture, real-world demonstrations, and group discussions with machine-based practical labs and exercises. Working in a hands-on learning environment led by our expert practitioner, students will explore:
Spark Essentials
Spark SQL
Spark MLlib
Spark Streaming
Streaming with Kafka
Data Flow with NiFi
Spark GraphX
Performance and Tuning
Cluster Mode
Spark - the Big Picture

  • Getting Started

  • Our Data and our problem set
    Accessing the cluster, the data, and the tools
    The Continuous Workshop approach
    "Let's build a model together"
    Focus on analysis, exploration, data munging, algorithms
    Tooling and fundamentals as necessary to get the job done

  • Spark Overview

  • Data Science: The State of the Art
    Hadoop, YARN, and Spark
    Architectural Overview
    MLlib Overview
    HDFS data - Accessing
    Lab Focus
    Working with HDFS data
    Distributed vs. Local Run Modes
    Spark vs. Other tools (when is Spark the right tool for the job?)
    Spark vs. SAS
    Spark Languages (Java, R, Python, and Scala)
    Hello, Spark

  • Spark Essentials

  • Spark Core
    Spark SQL
    Spark and Hive
    Spark Streaming
    Spark API

  • DataFrames

  • DataFrames and Resilient Distributed Datasets (RDDs)
    Adding variables to a DataFrame
    DataFrame Types
    DataFrame Operations
    Dependent vs. Independent variables
    Map/Reduce with DataFrames

  • Spark SQL

  • Spark SQL Overview
    Data stores: HDFS, Cassandra, HBase, Hive, and S3
    Table Definitions

  • Spark MLlib

  • MLlib overview
    MLlib Algorithms Overview
    Classification Algorithms
    Regression Algorithms
    Lab Focus
    Brief Comparison to SAS
    Here's your split, how to tune regression
    Decision Trees and forests
    Lab Focus
    Brief Comparison to SAS
    Stepwise approach to Decision Trees
    Working with Exit Criteria
    Recommendation with ALS
    Clustering Algorithms
    Lab Focus
    Key Clustering Algorithms
    Choosing Clustering Algorithms
    Working with key algorithms
    Machine Learning Pipelines
    Linear Algebra (SVD, PCA)
    Statistics in MLlib

  • Spark Streaming

  • Streaming overview
    Real-time data ingestion
    Window Operations

  • Streaming with Kafka

  • Kafka overview
    Kafka and Spark Streaming

  • Data Flow with NiFi

  • Apache NiFi overview
    NiFi data flows with Spark/R

  • Spark GraphX

  • GraphX overview
    ETL with GraphX
    Graph computation

  • Performance and Tuning

  • Broadcast variables
    Memory Management

  • Cluster Mode

  • Standalone Cluster
    Masters and Workers
    Working with large data sets

  • Spark - the Big Picture

  • Spark in Real-Time and near-Real-Time Decision Support Systems
    Spark in the Enterprise
    Best Practices


This is an introductory-level and beyond course. Typical attendees include systems administrators, testers, and others in technical data-related roles who need to learn to use Spark for data analysis or data processing.




Students should have incoming skills equivalent to the courses below, or should have attended them as a prerequisite:
Data Science Overview | Tools, Tech & Modern Roles in the Data Driven Enterprise
Python Primer for Data Science | Hands-on Technical Overview

Attending students should have the following background:
Basic knowledge of Python programming (or knowledge of R and the ability to pick up Python easily)
Basic prior exposure to Java syntax (those without that background can copy and paste the labs)
Introduction to SQL (familiarity with SQL basics)
Basic knowledge of statistics, probability, and data science


Length: 3.0 days (24 hours)




To request a custom delivery, please chat with an expert.