Question: How would you control the number of partitions of a DataFrame?
Answer: You can control the number of partitions of a DataFrame using repartition or coalesce operations.
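A minimal PySpark sketch (assuming a SparkSession named spark already exists):

df = spark.range(1000)                 # example DataFrame
df = df.repartition(200)               # increase to 200 partitions (full shuffle)
print(df.rdd.getNumPartitions())       # 200
df = df.coalesce(10)                   # reduce to 10 partitions without a full shuffle
print(df.rdd.getNumPartitions())       # 10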
Question: What are the possible operations on a DataFrame?
Answer: DataFrames support two kinds of operations:
· Transformations - lazy operations that return another DataFrame.
· Actions - operations that trigger computation and return values.
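For example, a minimal sketch assuming a SparkSession named spark:

df = spark.range(1000)                    # a DataFrame of numbers 0..999
evens = df.filter(df.id % 2 == 0)         # transformation: lazy, returns another DataFrame
print(evens.count())                      # action: triggers computation and returns 500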
Question: How does a DataFrame help parallel job processing?
Answer: Spark runs jobs in parallel: a DataFrame is split into partitions and each partition is processed by a separate executor process. Within a partition, however, data is processed sequentially.
Question: What is a transformation?
Answer: A transformation is a lazy operation on a DataFrame that returns another DataFrame, like map, flatMap, filter, reduceByKey, join, or cogroup. Transformations are not executed immediately; they are only executed after an action has been called.
Question: What are Preferred Locations?
Answer: A preferred location (also known as a locality preference or placement preference) is a block location of an HDFS file where a partition should be computed.
The getPreferredLocations method specifies the placement preferences for a partition of a DataFrame.
Question: Please tell me how execution starts and ends on a DataFrame or Spark job.
Answer: The execution plan starts with the earliest DataFrames (those with no dependencies on other DataFrames or that reference cached data) and ends with the DataFrame that produces the result of the action that was called.
Question: Give examples of transformations that do trigger jobs.
Answer: A couple of transformations do trigger jobs, e.g. sortBy and zipWithIndex.
Question: How can you use a machine learning library written in Python, such as the SciKit library, with the Spark engine?
Answer: A machine learning tool written in Python, e.g. the SciKit library, can be used through the Pipeline API in Spark MLlib or by calling pipe().
Question: Why is Spark good at low-latency iterative workloads, e.g. graphs and machine learning?
Answer: Machine learning algorithms, for instance logistic regression, require many iterations before producing an optimal model. Similarly, graph algorithms traverse all the nodes and edges repeatedly. Any algorithm that needs many iterations before producing a result can improve its performance when the intermediate partial results are stored in memory or on very fast solid state drives.
Spark can cache/store intermediate data in memory, which makes model building and training faster.
Also, when graph algorithms are processed, they traverse the graph one connection per iteration with the partial result kept in memory. Less disk access and network traffic can make a huge difference when you need to process lots of data.
Question: Data is spread across all the nodes of the cluster; how does Spark try to process this data?
Answer: By default, Spark tries to read data into a DataFrame from the nodes that are close to it. Since Spark usually accesses distributed, partitioned data, it creates partitions to hold the data chunks and to optimize transformation operations.
Question: How would you hint at the minimum number of partitions for a transformation?
Answer: You can request a minimum number of partitions using the second input parameter of many transformations.
The preferred way to set the number of partitions for a DataFrame is to pass it directly as the second input parameter of the parallelize method, as in the sketch below.
In the example below it is set to 400. The partitioning into 400 splits is done by Hadoop's TextInputFormat, not by Spark, and it works much faster. The code also spawns 400 concurrent tasks to try to load file.txt directly into 400 partitions.
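Since the original code sample is not reproduced here, the following is a hedged sketch of what it likely looks like (file.txt is a placeholder path; assumes a SparkContext named sc):

# Hint at least 400 partitions when reading the file;
# the actual splitting is done by Hadoop's TextInputFormat
lines = sc.textFile("file.txt", minPartitions=400)

# parallelize also takes the number of partitions as its second parameter
rdd = sc.parallelize(range(1000000), 400)
print(lines.getNumPartitions(), rdd.getNumPartitions())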
Question: How many concurrent tasks can Spark run for a DataFrame partition?
Answer: Spark can run only 1 concurrent task for every partition of a DataFrame, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your DataFrame to have at least 50 partitions (and probably 2-3x that).
As for choosing a "good" number of partitions, you generally want at least as many as the number of executors for parallelism. You can get this computed value by calling spark.sparkContext.defaultParallelism.
Question: What limits the maximum size of a partition?
Answer: The maximum size of a partition is ultimately limited by the available memory of an executor.
Question: When Spark works with file.txt.gz, how many partitions can be created?
Answer: When using textFile with compressed files (file.txt.gz rather than file.txt or similar), Spark disables splitting, which creates a DataFrame with only 1 partition (reads against gzipped files cannot be parallelized). In this case, to change the number of partitions you should repartition.
Note that because Spark disables splitting for compressed files and creates a DataFrame with only 1 partition, it is helpful to read the file (e.g. spark.read.textFile('demo.gz')) and then repartition it with df.repartition(100), as in the sketch below.
With those lines, you end up with exactly 100 partitions of roughly equal size. The same approach works for both RDDs and DataFrames.
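A minimal sketch of the approach (demo.gz is a placeholder; it uses the PySpark reader spark.read.text and assumes a SparkSession named spark):

df = spark.read.text("demo.gz")          # gzip is not splittable, so this yields 1 partition
print(df.rdd.getNumPartitions())         # 1
df = df.repartition(100)                 # redistribute into exactly 100 roughly equal partitions
print(df.rdd.getNumPartitions())         # 100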
Question: What is the coalesce transformation?
Answer: The coalesce transformation is used to change the number of partitions. It can trigger RDD or DataFrame shuffling depending on the second, boolean shuffle input parameter (which defaults to false).
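A sketch of the RDD variant with the shuffle flag (assuming a SparkContext named sc):

rdd = sc.parallelize(range(1000), 100)
fewer = rdd.coalesce(10)                   # default shuffle=False: merges partitions, no full shuffle
more = rdd.coalesce(200, shuffle=True)     # shuffle=True: full shuffle, can even increase the partition count
print(fewer.getNumPartitions(), more.getNumPartitions())   # 10 200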
Question: What is the difference between the cache() and persist() methods of a DataFrame?
Answer: A DataFrame can be cached (using the cache() operation) or persisted (using the persist() method). The cache() operation is a synonym of persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
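For example, a sketch assuming a SparkSession named spark:

from pyspark import StorageLevel

df = spark.range(1000)
df.persist(StorageLevel.MEMORY_ONLY)     # pick an explicit storage level
df.count()                               # materialize the cache with an action
df.unpersist()                           # drop it from the cache

df.cache()                               # shorthand for persist() with the default storage level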
Question: What is Shuffling?
Answer: Shuffling is the process of repartitioning (redistributing) data across partitions; it may move data across JVMs or even over the network when the data is redistributed among executors.
Avoid shuffling at all costs. Think about ways to leverage existing partitions. Use partial aggregation to reduce data transfer.
Question: Does shuffling change the number of partitions?
Answer: No. By default, shuffling doesn't change the number of partitions, only their content.
Question: What is the difference between groupByKey and reduceByKey?
Answer: Avoid groupByKey and use reduceByKey or combineByKey instead.
groupByKey shuffles all the data, which is slow.
reduceByKey shuffles only the results of sub-aggregations in each partition of the data.
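A classic word-count sketch comparing the two (assuming a SparkContext named sc):

pairs = sc.parallelize(["a", "b", "a", "c", "b", "a"]).map(lambda w: (w, 1))

# groupByKey ships every (word, 1) pair across the network before counting
slow_counts = pairs.groupByKey().mapValues(lambda ones: sum(ones))

# reduceByKey pre-aggregates within each partition, then shuffles only the partial sums
fast_counts = pairs.reduceByKey(lambda a, b: a + b)

print(sorted(fast_counts.collect()))   # [('a', 3), ('b', 2), ('c', 1)]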
Question: What is checkpointing?
Answer: Checkpointing is the process of truncating the lineage graph and saving the actual intermediate data to a reliable distributed (e.g. HDFS) or local file system.
You mark a DataFrame for checkpointing by calling df.checkpoint(). The DataFrame/RDD is saved to a file inside the checkpoint directory and all references to its parent RDDs are removed. This function has to be called before any job has been executed on the RDD/DataFrame.
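A minimal sketch (the checkpoint directory is a placeholder; assumes a SparkSession named spark):

# A reliable directory (e.g. on HDFS) must be set before checkpointing
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = spark.range(1000000)
checkpointed = df.checkpoint()   # saves the data and returns a DataFrame with a truncated lineage
print(checkpointed.count())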
Question: What do you mean by Dependencies in RDD lineage graph?
Answer: A dependency is a connection between RDDs that results from applying a transformation.
Question: Define Spark architecture
Answer: Spark uses a master/worker architecture. There is a driver that talks to a single coordinator called master that manages workers in which executors run. The driver and the executors run in their own Java processes.
Question: What is the purpose of Driver in Spark Architecture?
Answer: A Spark driver is the process that creates and owns an instance of SparkContext. It is your Spark application that launches the main method in which the instance of SparkContext is created.
1. The driver splits a Spark application into tasks and schedules them to run on executors.
2. A driver is where the task scheduler lives and spawns tasks across workers.
3. A driver coordinates workers and overall execution of tasks.
Question: What are the workers?
Answer: Workers or slaves are running Spark instances where executors live to execute tasks. They are the compute nodes in Spark. A worker receives serialized/marshalled tasks that it runs in a thread pool.
Question: Please explain how workers work when a new job is submitted to them.
Answer: When a SparkSession is created, each worker starts one executor. The executor is a separate Java process (a new JVM), and it loads the application jar into this JVM. The executors then connect back to your driver program, and the driver sends them commands such as foreach, filter, and map. As soon as the driver quits, the executors shut down.
Question: Please define executors in detail?
Answer: Executors are distributed agents responsible for executing tasks. Executors provide in-memory storage for RDDs that are cached in Spark applications. When executors are started they register themselves with the driver and communicate directly to execute tasks.
Question: What is DAGScheduler and how does it work?
Answer: DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling, i.e. after a DataFrame action has been called, it becomes a job that is then transformed into a set of stages submitted as TaskSets for execution.
DAGScheduler uses an event queue architecture in which a thread can post DAGSchedulerEvent events, e.g. a new job or stage being submitted, that DAGScheduler reads and executes sequentially.
Question: What is stage, with regards to Spark Job execution?
Answer: A stage is a set of parallel tasks, one per partition of a DataFrame, that compute partial results of a function executed as part of a Spark job.
Question: What is a task, with regards to Spark job execution?
Answer: A task is an individual unit of work for executors to run. It is an individual unit of physical execution (computation) that runs on a single machine on a part of your application's data. All tasks in a stage must complete before moving on to the next stage.
1. A task can also be considered a computation in a stage on a partition in a given job attempt.
2. A Task belongs to a single stage and operates on a single partition (a part of a DataFrame).
3. Tasks are spawned one by one for each stage and data partition.
Question: What is speculative execution of tasks?
Answer: Speculative tasks, or task stragglers, are tasks that run slower than most of the other tasks in a job.
Speculative execution of tasks is a health-check procedure that looks for tasks to be speculated, i.e. tasks running slower in a stage than the median of all successfully completed tasks in a taskset. Such slow tasks are re-launched on another worker. Spark does not stop the slow tasks but runs a new copy in parallel.
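A sketch of the related configuration (the values shown are illustrative, not recommendations):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("speculation-demo")
         .config("spark.speculation", "true")              # enable speculative execution
         .config("spark.speculation.multiplier", "1.5")    # how much slower than the median counts as slow
         .config("spark.speculation.quantile", "0.75")     # fraction of tasks that must finish first
         .getOrCreate())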
Question: Which cluster managers can be used with Spark?
Answer: Apache Mesos, Hadoop YARN, Spark standalone, and Spark local mode (a local node or a single JVM; the driver and executors run in the same JVM, so the same node is used for execution).
Question: What is a BlockManager?
Answer: BlockManager is a key-value store for blocks that acts as a cache. It runs on every node, i.e. the driver and the executors, in a Spark runtime environment. It provides interfaces for putting and retrieving blocks both locally and remotely into various stores, i.e. memory, disk, and off-heap.
A BlockManager manages the storage for most of the data in Spark, i.e. blocks that represent cached RDD partitions, intermediate shuffle data, and broadcast data.
Question: What is Data locality / placement?
Answer: Spark relies on data locality or data placement or proximity to data source that makes Spark jobs sensitive to where the data is located. It is therefore important to have Spark running on Hadoop YARN cluster if the data comes from HDFS.
With HDFS the Spark driver contacts NameNode about the DataNodes (ideally local) containing the various blocks of a file or directory as well as their locations (represented as InputSplits), and then schedules the work to the SparkWorkers. Spark’s compute nodes / workers should be running on storage nodes.
Question: What is master URL in local mode?
Answer: You can run Spark in local mode using local, local[n], or the most general local[*].
The URL says how many threads can be used in total:
· local uses 1 thread only.
· local[n] uses n threads.
· local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
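For example, a sketch of starting a session in local mode:

from pyspark.sql import SparkSession

# Use all cores available to the JVM; local[2] would use exactly 2 threads
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-demo")
         .getOrCreate())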
Question: Define components of YARN?
Answer: The YARN components are:
ResourceManager: runs as the master daemon and manages ApplicationMasters and NodeManagers.
ApplicationMaster: a lightweight process that coordinates the execution of an application's tasks and asks the ResourceManager for resource containers for them. It monitors tasks, restarts failed ones, etc., and can run any type of task, be it a MapReduce, Giraph, or Spark task.
NodeManager: offers resources (memory and CPU) as resource containers.
NameNode: holds the information about where each block is stored on HDFS (strictly speaking an HDFS component rather than a YARN one).
Container: can run tasks, including ApplicationMasters.
Question: What is a Broadcast Variable?
Answer: Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
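A minimal sketch (the lookup table is made up for illustration; assumes a SparkSession named spark):

# Ship a read-only lookup table to every executor once, not with every task
lookup = spark.sparkContext.broadcast({"US": "United States", "IN": "India"})

codes = spark.sparkContext.parallelize(["US", "IN", "US"])
names = codes.map(lambda c: lookup.value.get(c, "unknown"))
print(names.collect())   # ['United States', 'India', 'United States']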
Question: How would you define Spark accumulators?
Answer: Accumulators are similar to counters in the Hadoop MapReduce framework; they give information about the completion of tasks, how much data has been processed, etc.
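A minimal sketch (assumes a SparkSession named spark; the "bad record" rule is made up for illustration):

# Count malformed records across all executors
bad_records = spark.sparkContext.accumulator(0)

def check(line):
    if "," not in line:
        bad_records.add(1)          # executors only add; the driver reads the total

spark.sparkContext.parallelize(["a,b", "oops", "c,d"]).foreach(check)
print(bad_records.value)            # 1 (read on the driver after the action)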
Question: What is the Apache Parquet format?
Answer: Apache Parquet is a columnar storage format.
Question: What is Apache Spark Structured Streaming?
Answer: Spark Streaming helps process live streaming data. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window. Structured Streaming is the newer engine built on the Spark SQL/DataFrame API that treats a live stream as a continuously growing table.
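A minimal Structured Streaming sketch of a socket word count (host and port are placeholders; assumes a SparkSession named spark):

from pyspark.sql.functions import explode, split

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Treat the stream as an unbounded table of lines and keep a running word count
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()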