Question: What is the use of Spark Local mode?
Answer: Spark can run in cluster mode as well as local mode. Local mode is used only to experiment with and test your jobs. In this case both the driver and the executors run on the same machine as separate threads.
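For example, a minimal sketch of running a job in local mode with spark-submit (the class and jar names are placeholders; local[4] runs the driver and executors as threads in a single JVM using 4 cores, and local[*] would use all available cores):
spark-submit --master "local[4]" --class com.example.MyApp my-app.jar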
Question: Explain the key components of an Apache Spark application.
Answer: In a Spark cluster, one of the main components is the cluster manager, which keeps track of the available resources. Then there is the driver program, which is responsible for launching the executors on the various nodes in the cluster. The executors run the tasks on those nodes.
Question: What different types of APIs are supported by Apache Spark?
Answer: Spark has two different kinds of APIs:
1. Unstructured API: the low-level API (mostly, you would not be using it).
2. Structured API: the higher-level API, which you will be using most of the time.
Question: What is the use of SparkSession object?
Answer: When you write your application, the first thing you create is a SparkSession object, which represents the entry point to your Spark cluster. Hence, you need to instantiate a SparkSession in your application. When you use Spark in interactive mode (spark-shell or pyspark), this object is implicitly created for you and is referred to as spark. The SparkSession is part of the driver; hence, for one application you have one driver and one SparkSession object. The SparkSession class is available as org.apache.spark.sql.SparkSession (Scala/Java) or pyspark.sql.session.SparkSession (Python).
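A minimal sketch of creating a SparkSession in a standalone Scala application (the application name is a placeholder); in spark-shell/pyspark you skip this because the spark object already exists:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MySparkApp")   // placeholder name
  .master("local[*]")      // omit this when the master is supplied via spark-submit
  .getOrCreate()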
Question: What is DataFrame in Spark?
Answer: DataFrame was introduced as part of the Spark structured API. You can think of it as similar to a relational database table, having rows and columns, but distributed across the nodes in the cluster. Suppose you have 1,000 rows with 5 columns each and 4 nodes in the cluster; Spark may then place roughly 250 rows on each node (not guaranteed, as it depends on many factors and configurations). Python and R also have the concept of a DataFrame, but those are not distributed across nodes and reside on a single computer/node. You can convert a Python (pandas) DataFrame or an R DataFrame into a (distributed) Spark DataFrame.
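As a small illustration (the column names and values are invented), assuming an existing SparkSession named spark, e.g. in spark-shell:
import spark.implicits._

// A local Scala collection turned into a distributed DataFrame of rows and columns
val employeeDF = Seq((1, "Amit", 120000.0), (2, "Sara", 90000.0), (3, "John", 75000.0))
  .toDF("id", "name", "salary")

employeeDF.show()
employeeDF.rdd.getNumPartitions   // how many partitions the rows were split into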
Question: What are the reasons or benefits of distributing rows over multiple nodes in the cluster?
Answer: There are mainly two reasons for distributing your data (as an RDD, DataFrame, or Dataset) across the nodes in the cluster:
1. The data is too large to fit on a single node's disk/memory.
2. Distributed data allows the computation to be performed in parallel, so you can use the power of all the computers/nodes in the cluster.
Question: Which abstractions are available in Spark for holding data?
Answer: The following abstractions are available for holding data in Spark:
1. RDD (unstructured): Resilient Distributed Dataset; you should avoid using it unless you absolutely need it.
2. DataFrame
3. Dataset
4. SQL table
Question: What do you mean by partitions?
Answer: A DataFrame/Dataset has rows that are distributed across the nodes in the cluster. Each chunk of rows residing together on a node is a partition of the DataFrame.
Question: How does partitioning affect parallelism?
Answer: Partitions reside on different nodes. Suppose you have 1,000 nodes and your DataFrame is also split into 1,000 partitions; then you can have 1,000 computations running in parallel, because each node processes a separate set of rows from your DataFrame. Now suppose you have only 1 partition of your data; then having 1,000 nodes in the cluster gives no benefit, because the computation can start only on the one node that holds the data. Similarly, if you split your data into 1,000 partitions on a single node, the computation still does not run in parallel, because there is only one executor to process the data. So for effective parallelism you need both partitioned data and enough nodes/executors available to process it.
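For example (the numbers are illustrative), assuming an existing SparkSession named spark:
// Example data: a Dataset of 1,000,000 rows
val df = spark.range(0, 1000000)

df.rdd.getNumPartitions        // current number of partitions

// Redistribute into 8 partitions so that up to 8 tasks can run in parallel
val repartitionedDF = df.repartition(8)
repartitionedDF.rdd.getNumPartitions   // 8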
Question: In Spark, once a DataFrame/Dataset is created you cannot modify it, so how do you get your work done?
Answer: Dataset, DataFrame, and RDD are all immutable in Spark. Hence, to work with them you apply transformations, which create a new DataFrame/Dataset/RDD from your existing one.
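A short sketch, assuming a DataFrame employeeDF with a salary column as in the earlier example:
import org.apache.spark.sql.functions.col

// filter does not modify employeeDF; it returns a brand-new DataFrame
val highPaidDF = employeeDF.filter(col("salary") > 100000)

employeeDF.count()   // still all the original rows
highPaidDF.count()   // only the filtered rows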
Question: What are the types of transformations available in Spark?
Answer: There are two types of transformations in Spark (a small sketch follows the list):
- Narrow dependencies (narrow transformations): These are the result of operations such as map and filter, where each output partition is computed from data in a single input partition only, i.e. it is self-contained. A partition of the output RDD/Dataset/DataFrame contains records that originate from a single partition in the parent RDD/DataFrame, and only a limited subset of partitions is needed to calculate the result. Spark groups consecutive narrow transformations into a single stage.
- Wide dependencies (wide transformations): These are the result of operations such as groupByKey and reduceByKey. The data required to compute the records in a single output partition may reside in many partitions of the parent RDD/Dataset/DataFrame: all of the tuples/rows with the same key must end up in the same partition, processed by the same task. To satisfy this, Spark must perform a shuffle, which exchanges partitions between the nodes in the cluster (and is a common cause of performance issues) and results in a new stage with a new set of partitions.
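A small sketch contrasting the two kinds, again assuming the employeeDF with name and salary columns used earlier:
import org.apache.spark.sql.functions.{col, sum}

// Narrow: each output partition is computed from a single input partition
val narrowDF = employeeDF
  .filter(col("salary") > 50000)
  .select(col("name"), col("salary"))

// Wide: rows with the same key must be shuffled to the same partition
val wideDF = employeeDF
  .groupBy(col("name"))
  .agg(sum(col("salary")).as("total_salary"))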
Question: What do you mean by in-memory pipeline?
Answer: For narrow transformations you can apply more than one function/transformation purely in memory by chaining (piping) them, for example:
myDataFrame.where("x % 2 = 0").filter("x % 4 = 0")
Here the output of where is piped into filter, and the whole chain is evaluated in memory without writing intermediate results to disk.
Question: When you apply a transformation that requires shuffling, can in-memory pipelining still be applied?
Answer: No. If shuffling is required, the in-memory pipeline does not work across the shuffle boundary; during a shuffle Spark writes the data to disk.
Question: What is lazy evaluation?
Answer: Spark is lazily evaluated, i.e. the data inside an RDD/DataFrame/Dataset is not loaded or transformed until an action is executed that triggers the execution. You build up a series of transformations on the RDD/DataFrame/Dataset, but the actual execution happens only when an action is encountered.
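A sketch of the idea, assuming the employeeDF used earlier:
import org.apache.spark.sql.functions.col

// Nothing runs here; Spark only records the transformations
val lazyDF = employeeDF
  .filter(col("salary") > 50000)
  .select(col("name"))

// Execution is triggered only now, by the action
lazyDF.count()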
Question: How does lazy evaluation help?
Answer: When you create a series of transformation steps, Spark builds a plan for the execution, and this plan is optimized by Spark itself before it runs on the cluster. Hence, once you have defined all your transformation steps, Spark creates a logical plan and then a physical plan.
Question: What do you mean by predicate pushdown?
Answer: Suppose you are reading data from an Oracle database table called Employee which has 10 million records in it, and you want to process data only for the employees whose salary is more than 100K. If you write your Spark program in such a way that you fetch all the data from the Employee table (select * from Employee) and only at the end of the program filter for the employees with salary more than 100K (employeeDF.filter("salary >= 100000")), then you are unnecessarily fetching the entire Employee table into Spark memory. In the background, Spark optimizes your code and sees that the predicate can be applied to the source data, so it pushes the predicate down to the source itself. You can assume Spark will filter the data at the source with a query like select * from Employee where salary >= 100000.
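A hedged sketch of how this could look when reading over JDBC; the connection details, table, and column names are placeholders, and the pushed-down filter shows up under PushedFilters in the physical plan:
import org.apache.spark.sql.functions.col

// Placeholder JDBC connection details
val employeeDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "EMPLOYEE")
  .option("user", "app_user")
  .option("password", "****")
  .load()

// Spark pushes this filter down to the database instead of pulling all 10 million rows
val highPaidDF = employeeDF.filter(col("SALARY") >= 100000)
highPaidDF.explain()   // look for PushedFilters in the scan node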
Question: How do you define actions?
Answer: An action is an operation that triggers the execution of RDD/DataFrame/Dataset transformations and returns a value to the Spark driver (the user program). Simply put, an action evaluates the lineage graph.
You can think of an action as a valve: until an action is fired, the data to be processed is not even in the pipes, i.e. the transformations. Only actions materialize the entire processing pipeline with real data.
Question: What are the different types of actions?
Answer: There are three types of actions, as below (one example of each is shown after the list):
1. Actions to view data on the console
2. Actions to collect data to the driver
3. Actions to write data to output sources, e.g. a Hive table
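One example of each category, assuming the employeeDF from earlier (the output path is a placeholder):
// 1. View data on the console
employeeDF.show(10)

// 2. Collect data back to the driver
val allRows  = employeeDF.collect()
val rowCount = employeeDF.count()

// 3. Write data to an output sink (a Hive table, files, etc.)
employeeDF.write.mode("overwrite").parquet("/tmp/employee_out")   // placeholder path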
Question: What do you mean by a Job in Spark?
Answer: Many people confuse a Spark application with a Spark job. A Spark application can contain many Spark jobs: when your application is submitted, each action encountered after a series of transformations results in a separate job, as below.
Transformation-1 -> Transformation-2 -> Transformation-3 -> Action -> Transformation-4 -> Transformation-5 -> Transformation-6 -> Action
So in total two Spark jobs are created (in one application). You can use the Spark Web UI to check how many Spark jobs were created for your Spark application.
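A sketch of an application that produces two jobs, again assuming the employeeDF used earlier:
import org.apache.spark.sql.functions.{col, upper}

val filteredDF = employeeDF.filter(col("salary") > 50000)            // transformation
val namesDF    = filteredDF.select(col("name"))                      // transformation
namesDF.count()                                                      // action -> Job 1

val upperDF = namesDF.withColumn("name_upper", upper(col("name")))   // transformation
upperDF.show()                                                       // action -> Job 2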
Question: While loading data using the DataFrameReader (spark.read) you can use an option called inferSchema; what does it do?
Answer: When you load data with inferSchema enabled, as below,
spark.read.option("inferSchema", "true")
Spark reads a sample of the data from the actual file and tries to infer the type of each column.
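For example, reading a CSV file (the file path is a placeholder):
val csvDF = spark.read
  .option("header", "true")        // first line contains the column names
  .option("inferSchema", "true")   // sample the file and guess each column's type
  .csv("/data/employee.csv")       // placeholder path

csvDF.printSchema()                // shows the types Spark inferred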
Question: How do you read explain plan?
Answer: You should read an explain plan from top to bottom, where the top relates to your final result and the bottom to the input data source.
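For example, assuming the employeeDF used earlier:
import org.apache.spark.sql.functions.col

employeeDF
  .filter(col("salary") > 100000)
  .select(col("name"))
  .explain()   // the final projection appears near the top, the scan of the source at the bottom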
Question: What is a Lineage Graph?
Answer: An RDD/DataFrame/Dataset lineage graph (aka operator graph) is a graph of the parent RDDs/Datasets/DataFrames. It is built as a result of applying transformations and describes which transformations need to be executed once an action has been called. It represents the logical plan, so at any given point in time Spark knows how to recompute any partition by re-performing all of the operations it recorded on the same input data.
Question: In a DataFrame, is renaming a column an action or a transformation?
Answer: Renaming a column is a transformation because it does not trigger execution. “withColumnRenamed” is the function with which you rename a column.
Question: How is “RelationalGroupedDataset” different?
Answer: When you apply a groupBy operation on a DataFrame it results in a “RelationalGroupedDataset”; however, you cannot apply an action directly on it. You first have to apply an aggregation function such as count(), sum(), max(), min(), etc., and only then can you call an action on the result.
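A short sketch, assuming an employeeDF that also has a dept column (the column names are illustrative):
import org.apache.spark.sql.functions.{avg, col}

// groupBy alone returns a RelationalGroupedDataset; it has no actions of its own
val grouped = employeeDF.groupBy(col("dept"))

// An aggregation turns it back into a DataFrame, on which actions work
val avgSalaryDF = grouped.agg(avg(col("salary")).as("avg_salary"))
avgSalaryDF.show()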
Question: How would you submit an application to the cluster?
Answer: You can use the spark-submit tool, and you can also provide various arguments, for example describing which resources you want to use from the cluster.
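A hedged example of submitting to a YARN cluster; the class, jar, and resource values are placeholders, but the flags are standard spark-submit options:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  my-app.jar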
Question: What is the difference between DataFrame and Dataset?
Answer: A DataFrame is a distributed collection of untyped Row objects. A Dataset is similarly a distributed collection of rows, but those objects are typed (they carry the data type information for each column at compile time). The Dataset API is available only in the Scala/Java APIs of Spark; the same goal is achieved in Python using DataFrames only (because Python is a dynamically typed language, while Scala/Java are statically typed). If you say Dataset[Employee], it means every row in the Dataset represents one Employee class object.
Question: How can you convert DataFrame into Dataset?
Answer: We have to use the “as” operator, see below:
val heDS = heDF.as[Training]
Here heDF is a DataFrame, Training is a case class, and heDS is a Dataset[Training].
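A slightly fuller sketch of the same idea (the case class fields and values are invented), assuming spark-shell:
import spark.implicits._

// The case class describes the type of every row
case class Training(name: String, durationDays: Int)

val heDF = Seq(("Spark SQL", 3), ("Kafka", 2)).toDF("name", "durationDays")

// DataFrame (untyped Row objects) -> Dataset[Training] (typed objects)
val heDS = heDF.as[Training]
heDS.map(t => t.name.toUpperCase).show()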
Question: How is Structured Streaming different from the DStream API?
Answer: The major difference is that with the DStream API you have to learn separate APIs for RDD and DStream work. With the innovation of Spark SQL, the Spark team made sure that you learn only one API, which you can then use across other libraries such as Structured Streaming, GraphFrames, etc. Virtually all code written for a batch job should work as is with Structured Streaming.
Question: What do you mean by schema in Spark?
Answer: A schema is the same concept you have seen with other data formats like XML, JSON, etc. In a schema we define the names of the columns and their data types, i.e. whether a column is an Integer, Date, String, etc.
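For example, a schema can be defined programmatically with StructType and passed to the reader instead of inferring it (the column names and path are placeholders):
import org.apache.spark.sql.types._

val employeeSchema = StructType(Seq(
  StructField("id",     IntegerType, nullable = false),
  StructField("name",   StringType,  nullable = true),
  StructField("salary", DoubleType,  nullable = true)
))

val employeeDF = spark.read
  .schema(employeeSchema)          // use the explicit schema instead of inferring it
  .option("header", "true")
  .csv("/data/employee.csv")       // placeholder path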
Question: How does the Catalyst optimizer maintain the different data types across the various programming languages?
Answer: With the innovation of the Spark Catalyst optimizer many new things were introduced; one of them is that Spark maintains its own type information. Spark keeps this mapping internally, and a lookup table exists for each of the programming languages: Scala, Java, Python, SQL, and R.
Question: What is the “Row” type in Spark?
Answer: “Row” is a type in Spark that is optimized for computation; its internal format avoids JVM object creation and garbage-collection overhead. A Row represents a record in a Dataset or DataFrame.
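A minimal sketch of working with Row directly (the values are invented):
import org.apache.spark.sql.Row

val row = Row(1, "Amit", 120000.0)

// Fields are accessed by position; the types are not checked at compile time
val id: Int        = row.getInt(0)
val name: String   = row.getString(1)
val salary: Double = row.getDouble(2)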
Question: Can you describe the various phases that submitted Spark structured API or Spark SQL code goes through?
Answer: Your Spark SQL code goes through mainly the 5 phases below to produce the final result.
Step-1: Application or code submission
Step-2: If the code is valid, Spark Catalyst creates a logical plan.
Step-3: Using the logical plan, Spark Catalyst may create more than one physical plan.
Step-4: Cost-based optimization is then applied to the physical plans, and finally the physical plan with the lowest cost is selected and executed (to compute the cost, it checks, for example, how big the partitions and tables are). The physical plan compiles your DataFrame, Dataset, and SQL code down to a series of RDD operations.
Step-5: This physical plan is executed on the various nodes of the cluster and the final result is generated.
Question: What do you mean by a resolved or unresolved logical plan?
Answer: Spark reads your code and at first creates an unresolved logical plan. The plan is called unresolved because the tables and columns it refers to have not yet been checked for existence. To resolve these tables and columns, Spark uses the catalog (a repository of all table and DataFrame information) in the analyzer/resolver.
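You can see all of these plans for a query with explain(true), assuming the employeeDF used earlier:
import org.apache.spark.sql.functions.col

employeeDF
  .filter(col("salary") > 100000)
  .explain(true)
// Prints, in order: Parsed Logical Plan (unresolved), Analyzed Logical Plan (resolved
// against the catalog), Optimized Logical Plan, and finally the Physical Plan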