PySpark: Python API for Apache Spark, a distributed computing framework
Enables big data processing using RDDs, DataFrames, and Datasets
Runs on clusters
SparkSession: the entry point for using Spark
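A minimal sketch of creating a SparkSession (the app name and local master are illustrative, not from these notes):
from pyspark.sql import SparkSession

# Build (or reuse) the single entry point for DataFrame and SQL functionality.
spark = SparkSession.builder \
    .appName("example-app") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)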
RDD (Resilient Distributed Dataset)
Low-level distributed collection of objects
immutable, fault-tolerant, parallel operations
operations:
transformations (lazy): map, filter, flatMap
actions: collect, count, take, reduce
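A small sketch of these RDD operations (the sample data and lambdas are invented for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello world", "hello spark"])

# transformations are lazy: nothing executes until an action is called
words = rdd.flatMap(lambda line: line.split(" "))      # flatMap
spark_words = words.filter(lambda w: w == "spark")     # filter
lengths = words.map(lambda w: len(w))                  # map

# actions trigger execution
print(words.collect())                                 # collect
print(spark_words.count())                             # count
print(words.take(2))                                   # take
print(lengths.reduce(lambda a, b: a + b))              # reduce

spark.stop()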
DataFrame: a distributed collection of data organized into named columns
Key operations: select, filter, groupBy/agg, withColumn, join (see the sketch below)
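A sketch of a DataFrame and a few of these operations (column names and rows are invented for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "sales", 3000), ("bob", "sales", 4000), ("carol", "hr", 3500)],
    ["name", "dept", "salary"],
)

df.select("name", "salary").show()                     # select named columns
df.filter(col("salary") > 3200).show()                 # filter rows
df.groupBy("dept").agg(avg("salary")).show()           # groupBy / agg
df.withColumn("bonus", col("salary") * 0.1).show()     # derive a new column

spark.stop()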
Performance features (illustrated in the sketch after this list)
lazy evaluation: execution is deferred until an action is called
partitioning
caching
broadcast variables
user-defined functions (UDFs)
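A minimal sketch tying these features together (the data, partition count, and lookup table are invented for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("perf-features").getOrCreate()

df = spark.range(0, 1_000_000)           # lazy: nothing runs yet

# partitioning: control the number of partitions
df = df.repartition(8)

# caching: keep the data in memory across multiple actions
df.cache()
print(df.count())                        # first action materializes and caches

# broadcast variable: ship a small lookup table to every executor once
lookup = spark.sparkContext.broadcast({0: "even", 1: "odd"})

# user-defined function using the broadcast lookup
parity = udf(lambda n: lookup.value[n % 2], StringType())
df.withColumn("parity", parity(col("id"))).show(5)

spark.stop()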
Download Apache Kafka
wget https://archive.apache.org/dist/kafka/3.5.1/kafka_2.13-3.5.1.tgz
Extract the archive
tar -xzf kafka_2.13-3.5.1.tgz
Change the configuration
cd kafka_2.13-3.5.1
nano config/server.properties
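What to change depends on your setup; a hedged example of lines commonly edited in config/server.properties (the hostname mirrors the one used in the producer/consumer commands below):
# broker id must be unique per broker in the cluster
broker.id=0
# interface/port the broker listens on
listeners=PLAINTEXT://0.0.0.0:9092
# address clients should use to reach this broker
advertised.listeners=PLAINTEXT://pchuanvn.com:9092
# where Kafka stores its log segments
log.dirs=/tmp/kafka-logs
# ZooKeeper connection string
zookeeper.connect=localhost:2181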
Start ZooKeeper by running the following command in the Kafka directory:
bin/zookeeper-server-start.sh config/zookeeper.properties
In a new terminal, start the Kafka server:
bin/kafka-server-start.sh config/server.properties
Start a console producer:
bin/kafka-console-producer.sh --bootstrap-server pchuanvn.com:9092 --topic test
Start a console consumer:
bin/kafka-console-consumer.sh --bootstrap-server pchuanvn.com:9092 --topic test --from-beginning
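The same produce/consume flow can be sketched from Python; this assumes the third-party kafka-python package (not mentioned in these notes) is installed:
# Sketch only: assumes `pip install kafka-python` and a broker at pchuanvn.com:9092 as above.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="pchuanvn.com:9092")
producer.send("test", b"hello from python")
producer.flush()

consumer = KafkaConsumer(
    "test",
    bootstrap_servers="pchuanvn.com:9092",
    auto_offset_reset="earliest",   # same effect as --from-beginning
    consumer_timeout_ms=5000,       # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)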
Install JDK
sudo apt-get -y install openjdk-8-jdk-headless
Install Python 3
sudo apt install python3
Install PySpark (download the Spark binary distribution)
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 spark
export SPARK_HOME=/home/pc/spark
export PATH=$PATH:$SPARK_HOME/bin
Test the installation
To make PySpark accessible from the command line, add the following lines to your ~/.bashrc or ~/.zshrc file. I am using the .bashrc file, so I am adding the following lines.
vi ~/.bashrc
Add the following lines at the end of the .bashrc file.
export SPARK_HOME=/home/pc/spark
export PATH=$PATH:$SPARK_HOME/bin
Now load the environment variables into the current session by sourcing the .bashrc file:
source ~/.bashrc
With this, the Apache Spark installation on Linux Ubuntu is complete. Now let's run a sample example that comes with the Spark binary distribution.
Here I will be using spark-submit to submit an example Python file that estimates the value of Pi; the argument 10 is the number of partitions to use. You can find spark-submit in the $SPARK_HOME/bin directory.
Run the Spark example
cd $SPARK_HOME
./bin/spark-submit examples/src/main/python/pi.py 10
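For reference, the bundled pi.py is roughly equivalent to this simplified sketch (not the exact file shipped with Spark):
# Monte Carlo estimation of Pi distributed over N partitions.
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = 10
n = 100000 * partitions

def inside(_):
    # throw a random dart at the unit square; hit if it lands inside the quarter circle
    x, y = random(), random()
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()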
Set up a standalone master
Navigate to the Spark configuration directory:
cd $SPARK_HOME/conf
Copy the template configuration file:
cp spark-env.sh.template spark-env.sh
Edit spark-env.sh to specify the master host:
echo "export SPARK_MASTER_HOST=pchuanvn" >> spark-env.sh
Start the master node:
$SPARK_HOME/sbin/start-master.sh