Question: Why Spark, when Hadoop already exists?
Answer: Below are a few reasons.
· Iterative Algorithms: MapReduce is generally not well suited to iterative algorithms such as machine learning and graph processing. These algorithms run the same steps over the data again and again, so they need the data to stay in memory; fewer writes to disk and fewer transfers over the network mean better performance.
· In-Memory Processing: MapReduce writes intermediate results to disk and reads them back, which is not good for fast processing. Spark keeps data in memory (configurable), which saves a lot of time by avoiding the repeated disk reads and writes that happen in Hadoop MapReduce.
· Near real-time data processing: Spark also supports near real-time streaming workloads via the Spark Streaming framework.
· Rich and Simple API: Many improvements were made from Spark 1.x to Spark 2.x, most notably a rich API over Dataset and DataFrame, and improved SQL support.
Question: Why are both Spark and Hadoop needed?
Answer: Spark is often called a cluster computing engine or simply an execution engine. Spark borrows many concepts from Hadoop MapReduce, and the two work well together. Spark with HDFS and YARN gives better performance and also simplifies work distribution on the cluster: HDFS is the storage engine for huge volumes of data, and Spark is the processing engine (in-memory and more efficient data processing).
· HDFS: Used as the storage engine for Spark as well as Hadoop.
· YARN: A framework for managing the cluster using a pluggable scheduler.
· More than MapReduce: With Spark you can run MapReduce-style algorithms as well as higher-level operators such as map(), filter(), reduceByKey(), groupByKey(), etc., as shown in the sketch below.
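For illustration, here is a minimal word-count style sketch of those operators, assuming a local[2] master and made-up input data:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setMaster("local[2]").setAppName("OperatorsSketch")
val sc = new SparkContext(conf)
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "yarn"))
// map() and filter() are simple element-wise transformations.
val pairs = words.map(word => (word, 1)).filter { case (word, _) => word.nonEmpty }
// reduceByKey() aggregates the values per key (a classic word count).
val counts = pairs.reduceByKey(_ + _)
// groupByKey() instead collects all values for each key.
val grouped = pairs.groupByKey()
counts.collect().foreach(println)
sc.stop()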
Question: How do you differentiate between Hadoop and Spark?
Answer: Hadoop is more of an ecosystem, which provides a platform for storage (in different data formats), a compute engine, cluster management, and a distributed file system (HDFS). Spark is essentially a compute engine only, which integrates well with Hadoop; you could even say Spark works as one of the compute engines for Hadoop.
Spark does not have its own storage engine, but it can connect to various storage systems like HDFS, the local file system, RDBMS, etc.
Question: What are the basic functionalities of Spark Core?
Answer: Spark Core is the heart of the entire Spark engine and provides the following functionality:
· Managing the memory pool
· Scheduling tasks on the cluster
· Recovering from failed jobs
· Integrating with various storage systems like RDBMS, HDFS, AWS S3, etc.
· Providing the RDD API, which is the basis for the higher-level APIs
Spark Core abstracts the native APIs and lower-level technicalities away from the end user.
Question: What are the main improvements in Spark 2.0 compared to Spark 1.x?
Answer: The major improvements in Spark 2.0 are to its API, SQL 2003 support, and the addition of Structured Streaming on the streaming side. Support for UDFs in the R language has also been added.
· API: The Dataset and DataFrame APIs are merged.
However, the overall architecture of Spark 2.0 is the same as Spark 1.x. Spark Core internally still works on a Directed Acyclic Graph (DAG) and RDDs.
Question: Which kinds of data processing are supported by Spark?
Answer: Spark offers three kinds of data processing: batch, interactive (Spark Shell), and stream processing, with a unified API and data structures.
Question: How do you define SparkContext?
Answer: It is the entry point for a Spark job. Each Spark application starts by instantiating a SparkContext; a Spark application is an instance of SparkContext, or you can say a SparkContext constitutes a Spark application. However, from Spark 2.0 onwards it is highly recommended to use SparkSession instead of SparkContext. In the Spark shell both are available: sc for SparkContext and spark for SparkSession.
A SparkContext or SparkSession is essentially a client of Spark’s execution environment and it acts as the master of your Spark application.
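As a minimal sketch (the master URL and application name below are placeholders), a SparkSession is created with the builder API, and the underlying SparkContext remains accessible from it:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("SessionSketch")
  .getOrCreate()
// The underlying SparkContext is still available when you need the RDD API.
val sc = spark.sparkContext
println(sc.appName)
spark.stop()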
Question: How can you define SparkConf?
Answer: Spark properties control most application settings and are configured separately for each application. These properties can be set directly on a SparkConf passed to your SparkContext. SparkConf allows you to configure some of the common properties (e.g. master URL and application name), as well as arbitrary key-value pairs through the set() method. For example, we could initialize an application with two threads as follows:
val conf = new SparkConf().setMaster("local[2]").setAppName("CountingSheep")
val sc = new SparkContext(conf)
Note that we run with local[2], meaning two threads, which represents “minimal” parallelism and can help detect bugs that only exist when we run in a distributed context.
Question: What are the ways to configure Spark properties, ordered from least important to most important?
Answer: There are the following ways to set properties for Spark and user programs (in order of importance, from least to most important; a short sketch follows the list):
· conf/spark-defaults.conf - the default
· --conf - the command line option used by spark-shell and spark-submit
· SparkConf
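As a sketch of that precedence (the property values here are just examples), a key set programmatically on SparkConf wins over the same key passed via --conf or conf/spark-defaults.conf:
import org.apache.spark.SparkConf
// Set directly in code: highest precedence.
val conf = new SparkConf()
  .setAppName("PrecedenceSketch")
  .set("spark.executor.memory", "2g")  // overrides --conf spark.executor.memory=... and spark-defaults.conf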
Question: What is the Default level of parallelism in Spark?
Answer: The default level of parallelism is the number of partitions used when a user does not specify one explicitly.
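For example (the values below are illustrative), the default can be controlled via the spark.default.parallelism property and inspected on the SparkContext:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("ParallelismSketch")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)
println(sc.defaultParallelism)                      // 8
println(sc.parallelize(1 to 100).getNumPartitions)  // uses the default when no partition count is given
sc.stop()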
Question: Is it possible to have multiple SparkContext/SparkSession in single JVM?
Answer: Yes, if spark.driver.allowMultipleContexts is set to true (default: false). When true, Spark logs warnings instead of throwing exceptions when you create a SparkContext while other SparkContexts/SparkSessions are already active in the same JVM.
Question: Can RDD be shared between SparkContexts?
Answer: No. When an RDD is created, it belongs to and is completely owned by the Spark context it originated from; RDDs cannot be shared between SparkContexts.
Question: What is the advantage of broadcasting values across Spark Cluster?
Answer: Spark transfers the value to the Spark executors only once, and tasks can share it without incurring repeated network transmissions when it is requested multiple times.
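A minimal sketch of broadcasting a small read-only value and reading it inside tasks (the lookup map and local master here are made up):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setMaster("local[2]").setAppName("BroadcastSketch")
val sc = new SparkContext(conf)
val countryNames = Map("IN" -> "India", "US" -> "United States")
// Shipped to each executor once; tasks read the local copy via .value.
val broadcastNames = sc.broadcast(countryNames)
val codes = sc.parallelize(Seq("IN", "US", "IN"))
val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))
resolved.collect().foreach(println)
sc.stop()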
Question: Can we broadcast an RDD?
Answer: Yes, but you should not broadcast an RDD to use it in tasks, and Spark will warn you if you do. It will not stop you, though.
Question: How can you stop SparkSession and what is the impact if stopped?
Answer: You can stop a SparkSession using the SparkSession.stop() method. Stopping a Spark session stops the Spark runtime environment and effectively shuts down the entire Spark application.
Question: How would you set the amount of memory to allocate to each executor?
Answer: SPARK_EXECUTOR_MEMORY sets the amount of memory to allocate to each executor.
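As a sketch, the same setting is also commonly expressed through the spark.executor.memory property (the "4g" value is just an example):
import org.apache.spark.SparkConf
// Programmatic counterpart of the SPARK_EXECUTOR_MEMORY environment variable.
val conf = new SparkConf()
  .setAppName("ExecutorMemorySketch")
  .set("spark.executor.memory", "4g")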
Question: How do you define RDD?
Answer: A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
· Resilient: Fault-tolerant, so it is able to recompute missing or damaged partitions on node failures with the help of the RDD lineage graph.
· Distributed: across clusters.
· Dataset: is a collection of partitioned data.
· Typed: Data in RDD are strongly typed.
· Lazy evaluation: Transformation (creating new RDD from existing RDD) is lazy.
· Immutable: Once you create an RDD, its content cannot be changed.
· Parallel Processing: A single RDD, distributed across the nodes in the cluster, can be worked upon in parallel.
· Caching: You can cache an RDD in memory if you need it later on, rather than recomputing it again and again, which gives a performance boost (a short caching sketch follows this list).
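A short caching sketch (a local master and generated data are assumed): the second action reuses the cached partitions instead of recomputing them.
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setMaster("local[2]").setAppName("CachingSketch")
val sc = new SparkContext(conf)
val evens = sc.parallelize(1 to 1000000).filter(_ % 2 == 0)
// cache() keeps the partitions in memory once the first action has computed them.
evens.cache()
println(evens.count())  // first action: computes and caches
println(evens.sum())    // second action: served from the cached partitions
sc.stop()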
Question: What is the use of Spark for Data Scientists?
Answer: Data scientists want to use Spark for interactive queries, to answer questions and build statistical models.
Question: How is a Data Engineer likely to use Apache Spark?
Answer: They write jobs that can be run repeatedly or just once. Data engineers can take the models built by data scientists and regularly run them on existing or new data. They can also prepare data for further analysis, which includes creating data ingest pipelines.
Question: Why is it said that Spark is a unified platform?
Answer: Spark is a single platform that supports multiple things: loading data, running SQL queries on the data, applying machine learning, and processing streaming data. Most of the time a different platform is created for each individual case, but the good thing about Apache Spark 2.x is that you only need to learn a single API model and can use it across the various libraries. So it combines many things in a single platform, and you can even create your own libraries on top of it.
Question: What does it mean that Spark 2.x supports the Structured APIs?
Answer: Spark 2.x has new APIs that work on top of, and are converted into, the underlying RDDs. The Structured APIs mainly include three things: DataFrame, Dataset, and SQL (a short sketch follows).
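A short sketch of the three, written as if typed into spark-shell; the people.json path and the Person fields are placeholders:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[2]").appName("StructuredSketch").getOrCreate()
import spark.implicits._
// DataFrame: rows with a schema, but untyped.
val df = spark.read.json("people.json")
// Dataset: the same data viewed through a strongly typed case class
// (in compiled code, define Person at the top level).
case class Person(name: String, age: Long)
val ds = df.as[Person]
// SQL: register a temporary view and query it with plain SQL.
df.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()
spark.stop()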
Question: Can we use Spark as a Storage engine?
Answer: No, Spark is purely a computation engine, which can run computations in parallel on thousands of computer systems. Storing the data is your responsibility: you can use HDFS, AWS S3, Azure, Google Cloud, Hadoop Hive, Cassandra, or even Apache Kafka. Spark holds data only for computation, not for long-term persistence. Since it can load data from diverse sources, it has no need for another storage layer of its own. Most BigData platforms, like Cloudera Hadoop, already bundle Spark in their package. Spark's main focus is optimized computation on data of any size using distributed computing.
Question: What is the Role of Cluster Manager in Apache Spark?
Answer: Spark can run on more than one node, but to manage the resources in the cluster efficiently it needs a Cluster Manager. If you are already using Hadoop, Spark supports the YARN cluster manager; if not, Spark comes with its own cluster manager. The Cluster Manager is responsible for managing the resources in the cluster: when you submit an application, it decides how much and what kind of resources should be allocated to it, using various parameters and scheduling algorithms. Spark also supports the open-source Mesos cluster manager.
Question: What are the main components of a Spark application?
Answer: There are mainly two components in a Spark Application.
1. A Driver process: It is responsible for running your main function (like the main() method in Java). It can be executed on one of the nodes in the cluster or on a gateway node (which is not part of the Spark cluster).
2. A set of Executor processes
Question: What is the responsibility of the Driver process in a Spark cluster?
Answer: As mentioned, the driver can run on a gateway node or on any node in the cluster. Remember that there is one driver per application; it is the heart of your application, your application cannot run without it, and it remains alive until your submitted application finishes. It is responsible for the following activities:
- Keeps the information about your application
- Responding to the user program's input
- Distributing and scheduling work across the executors in the cluster.
Question: What are executors?
Answer: Executors are responsible for executing the work and have mainly the two responsibilities below:
1. Executing the code assigned by the driver
2. Reporting the state of the computation on that executor back to the driver node.
Question: How many executors a node can have in a cluster?
Answer: It depends on the node's capacity, such as CPU and memory, which you should consider carefully. Through configuration you can decide how many executors should run on each node (a sketch follows).
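As an illustrative sketch (all numbers are made up and depend on your node capacity), the usual knobs are spark.executor.instances, spark.executor.cores, and spark.executor.memory:
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("ExecutorSizingSketch")
  .set("spark.executor.instances", "10")  // total executors for the application
  .set("spark.executor.cores", "4")       // CPU cores per executor
  .set("spark.executor.memory", "8g")     // heap memory per executor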