Question: If you have existing queries written in Hive, what changes do you need to make to run them on Spark?
Answer: The Spark SQL module can run SQL queries in Spark, and it can also use the Hive metastore, so there is no need to change anything to run Hive queries in Spark SQL; it can even use UDFs defined in Hive.
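For illustration, a minimal sketch (assuming Spark was built with Hive support and hive-site.xml is on the classpath; the table name is hypothetical):

import org.apache.spark.sql.SparkSession

// Build a session that talks to the existing Hive metastore.
val spark = SparkSession.builder()
  .appName("HiveOnSpark")
  .enableHiveSupport()   // reuse the Hive metastore and Hive UDFs
  .getOrCreate()

// The same HiveQL query you ran in Hive works unchanged.
val result = spark.sql("SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept")
result.show()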
Question: What is the Driver Program in Spark?
Answer: The driver program is the main program of a Spark application; in the case of the REPL, the spark shell itself is the driver program. The driver program creates the SparkSession or SparkContext instance. It is the driver program's responsibility to communicate with the cluster manager and distribute tasks to the worker nodes.
Question: Which daemon processes are part of the standalone cluster manager?
Answer: As mentioned earlier, the standalone cluster manager is not recommended for production use cases; consider it only for testing and POC purposes. When you start Spark with the standalone cluster manager, it creates two daemon processes:
· Master Node
· Worker Node
Question: How do you define a Worker Node in a Spark cluster?
Answer: Worker nodes are the slave nodes, and the actual processing of your data happens on them. Each worker node constantly communicates with the master node and reports the availability of its resources. Generally, one worker process is started on each node in your cluster; it is responsible for running your application on that particular node and for monitoring that application.
Question: What are executors in Spark?
Answer: Each Spark application you submit has its own executor processes, and the executors exist only while your application runs. The driver program uses these executors to run tasks on the worker nodes; they also help keep data in memory and can spill it to disk when required.
Question: What is a Task in Spark?
Answer: A task is the unit of work that is sent to an executor running on a worker node. It is a command sent to the executor by the driver program, produced by serializing your function object. It is the executor process's responsibility to deserialize this function object and execute it on the data of the RDD partition that resides on that node.
Question: Define some uses of the SparkSession object.
Answer: It is the entry point to your Spark cluster; once you have hold of a SparkSession, you can create new DataFrames from existing data, and create accumulators and broadcast variables on that cluster.
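A brief sketch of those uses (the data and names are made up for illustration; accumulators and broadcast variables are created through the session's underlying SparkContext):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SessionUses").getOrCreate()
import spark.implicits._

// Create a DataFrame from existing in-memory data.
val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// Accumulator and broadcast variable via the underlying SparkContext.
val errorCount = spark.sparkContext.longAccumulator("errors")
val lookup = spark.sparkContext.broadcast(Map("alice" -> "admin"))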
Question: How many SparkSession objects should you create per JVM?
Answer: You can create as many as you want, but it is always preferred that you create only one SparkSession object per JVM. So suppose you have already created a SparkSession object in a JVM and want a new one: you first need to stop the existing one by calling its stop() method. As mentioned previously, when you start the spark shell (REPL), a SparkSession object is created by default and assigned to a variable named spark, so you should not create a new SparkSession object in that shell.
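A small sketch of that pattern (the application names are illustrative):

import org.apache.spark.sql.SparkSession

// getOrCreate() returns the session that already exists in this JVM, if any.
val spark1 = SparkSession.builder().appName("first").getOrCreate()

// To start a truly new session, stop the existing one first.
spark1.stop()
val spark2 = SparkSession.builder().appName("second").getOrCreate()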
Question: Which object represents the primary entry point into the Spark 2.0 System?
Answer: SparkSession is the primary object you should use from Spark 2.0 onwards, because it has all the capabilities that the more specific objects such as SQLContext, HiveContext, and SparkContext have. In the REPL, the Spark command-line utility, it is available as the spark object.
Question: When do you use client and cluster mode to submit your application?
Answer: There are two deployment modes for a Spark application:
· Client Mode: Use this mode when your gateway machine is collocated with the worker machines. In this mode the driver is launched as part of the spark-submit process and acts as a client to the cluster.
· Cluster Mode: Use this mode when your application is submitted from a machine far away from the worker machines, such as your own laptop or a computer that is not part of the Spark cluster; here the driver runs inside the cluster.
Question: What are the standard ways to pass functions in the Spark framework, using Scala?
Answer: There are mainly two ways in which functions can be passed:
· Anonymous functions (lambda functions): This is a good option when you have to pass some small, simple piece of functionality. See the example after this list, in which we split data and return the split values.
· Static singleton methods: When you need to do more complex operations on the data, you should use this approach. If you know Java, these are like static methods, which are associated with the class (a singleton object in Scala) and not with an instance.
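The splitting example mentioned above, in both styles (the sample data and the object name are made up; sc is the SparkContext available in the spark shell):

val lines = sc.parallelize(Seq("hadoop exam spark", "spark sql"))

// 1. Anonymous (lambda) function: split each line and return the split values.
val words = lines.flatMap(line => line.split(" "))

// 2. Static singleton method: the same logic kept in an object, useful for more complex cases.
object TextUtil {
  def tokenize(line: String): Seq[String] = line.split(" ").toSeq
}
val words2 = lines.flatMap(TextUtil.tokenize)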
Question: Which shared variables are provided by the Spark framework?
Answer: In Spark, shared variables are variables that provide data sharing globally across the nodes. This is implemented with the two variable types below, each with a different purpose.
· Broadcast variables: Read-only variables cached on each node of the cluster. These variables cannot be updated on an individual node. They are meant for the same data that you want to share across the nodes during data processing.
· Accumulators: These variables can be updated on each individual node; the final value is the aggregate of the values sent by each individual node.
Question: Please give a scenario in which you would use the broadcast and accumulator shared variables.
Answer: Broadcast variables: You can use them as cached data on each node. Whenever a small, frequently used dataset is needed throughout data processing, you can ask Spark to cache this small dataset on each node using a broadcast variable and refer to the cached data during the calculation.
You set the broadcast variable in the driver program, and it is retrieved by the worker nodes in the cluster. Remember, a broadcast variable is retrieved and cached only when the first read request is made.
Accumulators: You can think of them as global counters. Remember they are not read-only variables; on each worker node the executor updates its counter independently, and the driver program then accumulates the values from all worker nodes to generate the aggregated result.
So you can use them when you need to count something, such as how many messages were not processed correctly. Using an accumulator, an individual count of unprocessed messages is generated on each node, and at the end all the counts are accumulated on the driver side, so you know how many messages were not processed.
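A minimal sketch of that scenario (the validation rule, sample data, and names are made up; sc is the SparkContext from the shell):

// Small lookup set cached on every node via a broadcast variable.
val validCodes = sc.broadcast(Set("OK", "WARN"))

// Global counter for messages that fail validation.
val badMessages = sc.longAccumulator("badMessages")

val messages = sc.parallelize(Seq("OK|msg1", "FAIL|msg2", "WARN|msg3"))
val processed = messages.filter { m =>
  val code = m.split("\\|")(0)
  val ok = validCodes.value.contains(code)
  if (!ok) badMessages.add(1)   // each executor updates its count independently
  ok
}
processed.count()                // the action triggers execution
println(s"Messages not processed: ${badMessages.value}")   // aggregated on the driver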
Question: How do you define the ETL process?
Answer: ETL stands for extraction, transformation, and loading. This is where we create data pipelines for data movement and transformation. In short there are three stages (nowadays the order of the ETL steps can be changed, and sometimes it becomes ELT):
· Extract: You extract data from source systems such as RDBMS, flat files, social networking feeds, web log files, etc. The data can be in various formats such as XML, CSV, JSON, Parquet, or Avro, and the frequency of data retrieval can be defined as daily, hourly, etc.
· Transform: In this step you transform the data into what your downstream system expects. For example, you can create a JSON file from a text file (changing the file format), or filter valid and invalid records. In this step you may perform many sub-steps to clean your data as the next step expects.
· Load: This step refers to sending the data to the defined sink. In the Hadoop world it could be HDFS or Hive tables; in the case of an RDBMS it could be MySQL or Oracle, and for NoSQL it could be Cassandra or MongoDB.
However, please note that although Spark is not an ETL tool, you can get an entire ETL job done using the Spark framework alone.
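A minimal ETL sketch in Spark (the paths, column name, and filter rule are hypothetical; spark is an existing SparkSession):

import spark.implicits._

// Extract: read raw CSV data from the landing zone.
val raw = spark.read.option("header", "true").csv("hdfs:///landing/orders.csv")

// Transform: keep only valid rows (non-null, positive amounts).
val valid = raw.filter($"amount".isNotNull && $"amount".cast("double") > 0)

// Load: write the cleaned data to the sink as Parquet.
valid.write.mode("overwrite").parquet("hdfs:///warehouse/orders_parquet")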
Question: How do you save data from an RDD to a text file?
Answer: You have to use the RDD's saveAsTextFile("destination_path") method. Similarly, various other methods are available for other file formats.
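For example (the path is illustrative; sc is the SparkContext from the shell):

val numbers = sc.parallelize(1 to 100)
numbers.saveAsTextFile("hdfs:///tmp/numbers_out")   // writes one part file per partition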
Question: What is a Spark DataFrame and what are its basic properties?
Answer: You can visualize a Spark DataFrame as a table in a relational database. It has the following features:
· It is distributed over the Spark cluster nodes.
· Data is organized in columns.
· It is immutable (to modify it, you have to create a new DataFrame).
· It is processed in memory.
· You can apply a schema to the data.
· It also gives you a domain-specific language (DSL).
· It is evaluated lazily.
In one line: a DataFrame is an immutable distributed collection of data organized into named columns. DataFrames help take away the RDD's complexity.
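A small illustration of those properties (the data is made up; spark is an existing SparkSession):

import spark.implicits._

// Distributed, columnar, immutable, lazily evaluated collection with a schema.
val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
people.printSchema()

// DSL-style transformations return a new DataFrame; the original is unchanged.
val adults = people.filter($"age" >= 18).select($"name")
adults.show()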
Question: What are the main differences between Dataset and DataFrame?
Answer: As you may remember, the Dataset API was introduced in Spark 1.6 as a separate API from DataFrame, and the two were unified in Spark 2.0 (a DataFrame is now a Dataset of Row). The Dataset API is type-safe and can operate on compiled lambda functions. A DataFrame holds untyped Row objects, which means you can catch syntax errors at compile time, but any type mismatch can only be caught at run time.
A Dataset, since it holds strongly typed objects, lets you catch both syntax errors and type mismatch errors at compile time.
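A short sketch of the difference (the case class and data are hypothetical; spark is an existing SparkSession):

import spark.implicits._

case class Employee(name: String, salary: Double)

// Dataset: strongly typed, so a wrong field name or type is a compile-time error.
val ds = Seq(Employee("Alice", 1000.0)).toDS()
val highPaidDs = ds.filter(e => e.salary > 500.0)   // checked by the compiler

// DataFrame: untyped Rows, so a bad column name or type only fails at run time.
val df = ds.toDF()
val highPaidDf = df.filter($"salary" > 500.0)       // "salary" is just a string here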
Question: How do you read a JSON file in Spark?
Answer: JSON is semi-structured data, and Spark provides an easy way to read JSON data, as below:
spark.read.json("hadoopexam.json")
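For example, after loading you can inspect the inferred schema (the file name is taken from the line above):

val df = spark.read.json("hadoopexam.json")
df.printSchema()   // Spark infers the schema from the JSON records
df.show(5)
// For a file containing a single multi-line JSON document, add .option("multiLine", "true")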
Question: What is a SequenceFile?
Answer:
SequenceFiles are also key-value pairs, but they store their keys and values in binary format, and this is one of the most widely used Hadoop file formats. Spark provides an API for conveniently using this file format. A SequenceFile also contains a header, and sync markers help a reader synchronize to a record boundary from any position in the file. For compression, you can enable either block-level or record-level compression on a SequenceFile.
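A small sketch of writing and reading a SequenceFile with the RDD API (the path is illustrative; sc is the SparkContext from the shell):

// Write: an RDD of key-value pairs can be saved directly as a SequenceFile.
val pairs = sc.parallelize(Seq(("spark", 1), ("hadoop", 2)))
pairs.saveAsSequenceFile("hdfs:///tmp/seq_demo")

// Read: keys and values are converted back from their binary Writable form.
val loaded = sc.sequenceFile[String, Int]("hdfs:///tmp/seq_demo")
loaded.collect().foreach(println)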
Question: What is Kryo?
Answer: Kryo is a serialization framework. Java's default serialization mechanism is quite slow, so other serialization frameworks are available and Spark can support them; one of these is Kryo. So when you work with object files, you should consider using Kryo.
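A sketch of enabling Kryo (the registered class is hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Trade(symbol: String, qty: Int)

val conf = new SparkConf()
  .setAppName("KryoDemo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Trade]))   // registering classes avoids writing full class names

val spark = SparkSession.builder().config(conf).getOrCreate()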
Question: While connecting to the HDFS filesystem, which information does the Spark API need?
Answer: You mainly need the NameNode URL and port, together with the file path, for example as shown below.
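For example (the NameNode host, port, and path are placeholders; spark is an existing SparkSession):

// hdfs://<namenode-host>:<port>/<path>
val logs = spark.read.text("hdfs://namenode.example.com:8020/data/logs/access.log")
logs.show(5, truncate = false)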
Question: What do you need to load data from an AWS S3 bucket in Spark?
Answer: You can read data from an AWS (Amazon Web Services) S3 bucket. You need the following three things:
· The URL of the file stored in the bucket
· The AWS Access Key ID
· The AWS Secret Access Key
Once you have this information, you can load data from the S3 bucket as shown below.
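A sketch of that (assuming the hadoop-aws / s3a connector is on the classpath; the bucket, object key, and credentials are placeholders):

// Pass the credentials to the underlying Hadoop configuration.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hadoopConf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")

// Load the file using its bucket URL.
val df = spark.read.csv("s3a://my-bucket/data/sample.csv")
df.show(5)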
Question: How do SparkContext, SQLContext, and HiveContext relate to each other?
Answer: SparkContext provides the entry point to the Spark system; to create a SQLContext you need a SparkContext object. HiveContext provides a superset of the functionality provided by the basic SQLContext. However, since Spark 2.0 there is the SparkSession object, which is the preferred way to enter the Spark system; SparkSession is the unification of all three of SparkContext, SQLContext, and HiveContext.
Question: Can you describe which Spark projects you have used?
Answer: Besides Spark Core, Spark has several other projects, as below:
· Spark SQL: This project helps you work with structured data; you can mix SQL queries and the Spark programming API to get your expected results.
· Spark Structured Streaming: This is good for processing streaming data. It helps you create fault-tolerant streaming applications.
· MLlib: This API is quite rich for writing machine learning applications. You can use Python, Scala, or R to write applications with Spark's machine learning library.
· GraphX/GraphFrames: APIs for graphs and graph-parallel computations.
Question: What is the difference when you run Spark applications on YARN versus the standalone cluster manager?
Answer: When you run Spark applications on YARN, the application processes are managed by the YARN ResourceManager and NodeManagers.
Similarly, when you run on Spark standalone, the application processes are managed by the Spark Master and Worker nodes.
Question: How do you compare a MapReduce job and a Spark application?
Answer: Spark has many advantages over a Hadoop MapReduce job; let's describe each side.
MapReduce: The highest-level unit of computation in MapReduce is a job. A job's responsibilities include loading data, applying a map function, shuffling the output, then running a reduce function, and finally writing the data to persistent storage.
Spark application: The highest-level unit of computation is an application. A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests. So a Spark application can consist of more than just a single MapReduce-style job.
MapReduce starts a new process for each task. In contrast, a Spark application can have processes running on its behalf even when it is not running any job, and multiple tasks can run within the same executor. Combining extremely fast task startup with in-memory data storage results in orders-of-magnitude faster performance than MapReduce.
Question: Please explain the Spark execution model.
Answer: The Spark execution model has the following concepts:
· Driver: An application maps to a single driver process. The driver process manages the job flow, schedules tasks, and is available for the entire time the application is running. Typically, this driver process is the same as the client process used to initiate the job, although when running on YARN the driver can run inside the cluster. In interactive mode, the shell itself is the driver process.
· Executor: For a single application/driver, a set of executor processes is distributed across the hosts in the cluster. The executors are responsible for performing work in the form of tasks, as well as for storing any data that you cache. An executor's lifetime depends on whether dynamic allocation is enabled. An executor has a number of slots for running tasks and will run many concurrently throughout its lifetime.
· Stage: A stage is a collection of tasks that run the same code, each on a different subset of the data.
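A small illustration of how a job splits into stages (the input path is a placeholder; sc is the SparkContext from the shell):

val counts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split("\\s+"))       // narrow transformations stay in the same stage
  .map(word => (word, 1))
  .reduceByKey(_ + _)             // the shuffle here starts a new stage
counts.collect()                  // the action submits the job, which runs as two stages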
Question: What is Dynamic Allocation?
Answer: Dynamic allocation allows Spark (only on YARN) to dynamically scale the cluster resources allocated to your application based on the workload. When dynamic allocation is enabled and a Spark application has a backlog of pending tasks, it can request more executors. When the application becomes idle, its executors are released and can be acquired by other applications.
When Spark dynamic resource allocation is enabled, all resources are allocated to the first submitted job that needs them, causing subsequent applications to be queued up. To allow applications to acquire resources in parallel, allocate resources to pools, run the applications in those pools, and enable preemption for applications running in those pools.
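A sketch of typical settings (the values are illustrative; the external shuffle service must also be running on the cluster nodes):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DynamicAllocationDemo")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.shuffle.service.enabled", "true")   // lets executors be removed without losing shuffle data
  .getOrCreate()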