PySpark: Python API for Apache Spark, a distributed computing framework
Enables big data processing using RDDs, DataFrames, and Datasets
Runs on clusters
SparkSession: the entry point for using Spark
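A minimal sketch of creating a SparkSession (the app name and local master are illustrative, not from these notes):
from pyspark.sql import SparkSession

# Build (or reuse) the single entry point for DataFrame and SQL functionality.
spark = SparkSession.builder \
    .appName("example-app") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)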
RDD (Resilient Distributed Dataset)
Low-level distributed collection of objects
immutable, fault-tolerant, parallel operations
operations:
transformations (lazy): map, filter, flatMap
actions: collect, count, take, reduce
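A small sketch of these RDD operations (the sample data and lambdas are invented for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello world", "hello spark"])

# transformations are lazy: nothing executes until an action is called
words = rdd.flatMap(lambda line: line.split(" "))      # flatMap
spark_words = words.filter(lambda w: w == "spark")     # filter
lengths = words.map(lambda w: len(w))                  # map

# actions trigger execution
print(words.collect())                                 # collect
print(spark_words.count())                             # count
print(words.take(2))                                   # take
print(lengths.reduce(lambda a, b: a + b))              # reduce

spark.stop()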
DataFrame: a distributed collection of data organized into named columns
Key operations: select, filter, groupBy/agg, withColumn, join (see the sketch below)
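A sketch of a DataFrame and a few of these operations (column names and rows are invented for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "sales", 3000), ("bob", "sales", 4000), ("carol", "hr", 3500)],
    ["name", "dept", "salary"],
)

df.select("name", "salary").show()                     # select named columns
df.filter(col("salary") > 3200).show()                 # filter rows
df.groupBy("dept").agg(avg("salary")).show()           # groupBy / agg
df.withColumn("bonus", col("salary") * 0.1).show()     # derive a new column

spark.stop()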
Performance features (illustrated in the sketch after this list)
lazy evaluation: execution is deferred until an action is called
partitioning
caching
broadcast variables
user-defined functions (UDFs)
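A minimal sketch tying these features together (the data, partition count, and lookup table are invented for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("perf-features").getOrCreate()

df = spark.range(0, 1_000_000)           # lazy: nothing runs yet

# partitioning: control the number of partitions
df = df.repartition(8)

# caching: keep the data in memory across multiple actions
df.cache()
print(df.count())                        # first action materializes and caches

# broadcast variable: ship a small lookup table to every executor once
lookup = spark.sparkContext.broadcast({0: "even", 1: "odd"})

# user-defined function using the broadcast lookup
parity = udf(lambda n: lookup.value[n % 2], StringType())
df.withColumn("parity", parity(col("id"))).show(5)

spark.stop()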
Download Apache Kafka
wget https://archive.apache.org/dist/kafka/3.5.1/kafka_2.13-3.5.1.tgz
Extract the archive
tar -xzf kafka_2.13-3.5.1.tgz
Change the configuration
cd kafka_2.13-3.5.1
nano config/server.properties
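What to change depends on your setup; a hedged example of lines commonly edited in config/server.properties (the hostname mirrors the one used in the producer/consumer commands below):
# broker id must be unique per broker in the cluster
broker.id=0
# interface/port the broker listens on
listeners=PLAINTEXT://0.0.0.0:9092
# address clients should use to reach this broker
advertised.listeners=PLAINTEXT://pchuanvn.com:9092
# where Kafka stores its log segments
log.dirs=/tmp/kafka-logs
# ZooKeeper connection string
zookeeper.connect=localhost:2181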
Start ZooKeeper by running the following command in the Kafka directory:
bin/zookeeper-server-start.sh config/zookeeper.properties
In a new terminal, start the Kafka server:
bin/kafka-server-start.sh config/server.properties
Start a console producer:
bin/kafka-console-producer.sh --bootstrap-server pchuanvn.com:9092 --topic test
Start a console consumer:
bin/kafka-console-consumer.sh --bootstrap-server pchuanvn.com:9092 --topic test --from-beginning
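The same produce/consume flow can be sketched from Python; this assumes the third-party kafka-python package (not mentioned in these notes) is installed:
# Sketch only: assumes `pip install kafka-python` and a broker at pchuanvn.com:9092 as above.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="pchuanvn.com:9092")
producer.send("test", b"hello from python")
producer.flush()

consumer = KafkaConsumer(
    "test",
    bootstrap_servers="pchuanvn.com:9092",
    auto_offset_reset="earliest",   # same effect as --from-beginning
    consumer_timeout_ms=5000,       # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)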
Install JDK
sudo apt-get -y install openjdk-8-jdk-headless
Install Python 3
sudo apt install python3
Install PySpark (download the Spark binary distribution)
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 spark
export SPARK_HOME=/home/pc/spark
export PATH=$PATH:$SPARK_HOME/bin
Test the installation
To make PySpark accessible from the command line, add the following lines to your ~/.bashrc or ~/.zshrc file. I am using the .bashrc file, so I am adding the following lines.
vi ~/.bashrc
Add the following lines at the end of the .bashrc file.
export SPARK_HOME=/home/pc/spark
export PATH=$PATH:$SPARK_HOME/bin
Now load the environment variables into the current session by sourcing the .bashrc file:
source ~/.bashrc
With this, the Apache Spark installation on Linux Ubuntu is complete. Now let's run a sample example that comes with the Spark binary distribution.
Here I will be using spark-submit to submit an example Python file that estimates the value of Pi; the argument 10 is the number of partitions to use. You can find spark-submit in the $SPARK_HOME/bin directory.
Run the Spark example
cd $SPARK_HOME
./bin/spark-submit examples/src/main/python/pi.py 10
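For reference, the bundled pi.py is roughly equivalent to this simplified sketch (not the exact file shipped with Spark):
# Monte Carlo estimation of Pi distributed over N partitions.
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = 10
n = 100000 * partitions

def inside(_):
    # throw a random dart at the unit square; hit if it lands inside the quarter circle
    x, y = random(), random()
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()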
Set up a standalone master
Navigate to the Spark configuration directory:
cd $SPARK_HOME/conf
Copy the template configuration file:
cp spark-env.sh.template spark-env.sh
Edit spark-env.sh to specify the master host:
echo "export SPARK_MASTER_HOST=pchuanvn" >> spark-env.sh
Start the master node:
$SPARK_HOME/sbin/start-master.sh