Question: How are Spark Streaming applications impacted by Dynamic Allocation?
Answer: When Dynamic Allocation is enabled in Spark, executors are removed when they are idle. However, Dynamic Allocation is not effective for Spark Streaming. In Spark Streaming, data comes in every batch, and executors run whenever data is available. If the executor idle timeout is less than the batch duration, executors are constantly added and removed; if it is greater than the batch duration, executors are never removed. Hence, it is recommended that you disable Dynamic Allocation for Spark Streaming by setting “spark.dynamicAllocation.enabled” to false.
Question: When you submit a Spark Streaming application in local mode rather than on Hadoop YARN, why must it have at least two threads?
Answer: As discussed previously, when a Spark Streaming application is executed, it requires at least two threads: one for receiving data and one for processing that data.
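For example, a minimal local-mode sketch (the application name and batch interval are illustrative):
```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]" reserves two threads: one for the receiver and one for processing.
// With "local[1]" the receiver would occupy the only thread and no data would be processed.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingLocalExample")
val ssc = new StreamingContext(conf, Seconds(10))
```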
Question: How do you enable fault-tolerant data processing in Spark Streaming?
Answer: If the Driver host for a Spark Streaming application fails, it can lose data that had been received but not yet processed. To ensure that no data is lost, you can use Spark Streaming recovery. Spark writes incoming data to HDFS as it is received and uses this data to recover state if a failure occurs.
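A minimal sketch of enabling such recovery, assuming a hypothetical HDFS checkpoint directory and that DStream checkpointing with a receiver write-ahead log is what is meant here:
```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/spark/streaming-checkpoint"   // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("RecoverableStreamingApp")
    // Persist received data to a write-ahead log on HDFS before it is processed.
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)   // metadata and RDD checkpoints go to HDFS
  // ... define DStreams and output operations here ...
  ssc
}

// On restart, the driver rebuilds the context from the checkpoint if one exists.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```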
Question: When you use Spark Streaming with AWS S3 (or another cloud store), which storage is recommended for checkpoints?
Answer: When running a Spark Streaming application with a cloud service as the underlying storage layer, use ephemeral HDFS on the cluster to store checkpoints, instead of a cloud store such as Amazon S3 or Microsoft ADLS.
Question: Can Hive SQL syntax be used with Spark SQL?
Answer: Hive and Impala tables and the related SQL syntax are interchangeable in most cases. Because Spark SQL uses the underlying Hive infrastructure, you can issue DDL statements, DML statements, and queries using HiveQL syntax. For interactive query performance, you can access the same tables through Impala using impala-shell or the Impala JDBC and ODBC interfaces.
Question: Can you use SQL to read data that is outside a Hive or Impala table?
Answer: If the data files are outside of a Hive or Impala table, you can use SQL to read JSON or Parquet data directly into a DataFrame.
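For example (a minimal sketch; the paths and the SparkSession value `spark` are assumed):
```scala
// Read self-describing files straight into DataFrames, no Hive/Impala table needed.
val jsonDF    = spark.read.json("hdfs:///data/events_json/")
val parquetDF = spark.read.parquet("hdfs:///data/events_parquet/")

// A temporary view then allows plain SQL over the file data.
jsonDF.createOrReplaceTempView("events")
spark.sql("SELECT eventType, COUNT(*) AS cnt FROM events GROUP BY eventType").show()
```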
Question: Give examples of the file types supported by Spark?
Answer: Spark supports many file types, including text files, RCFiles, SequenceFiles, Hadoop InputFormats, Avro, Parquet, and various compression formats of the supported file types.
Question: If Spark is installed on EC2 instances and needs to access data stored in an S3 bucket, but you do not want to provide credentials, how can you do that?
Answer: Because both EC2 and S3 are Amazon services, we can leverage IAM roles. This mode of operation associates the authorization with individual EC2 instances instead of with each Spark app or the entire cluster.
Run EC2 instances with instance profiles associated with IAM roles that have the permissions you want. Requests from a machine with such a profile authenticate without credentials.
Question: What are Avro files?
Answer: Avro is a serialization system that supports rich data structures with a binary encoding. Avro is language independent and uses .avro as the file extension. By default, Avro data files are not compressed, but it is recommended to enable compression to reduce disk usage and increase read and write performance. Avro data files support Deflate and Snappy compression. Snappy is faster, but Deflate is slightly more compact.
You do not need to specify configuration to read a compressed Avro data file. However, to write an Avro data file, you must specify the type of compression. How you specify compression depends on the component.
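A sketch of reading and writing compressed Avro from Spark, assuming the spark-avro module is on the classpath (org.apache.spark:spark-avro for Spark 2.4+, or the older com.databricks spark-avro package, typically added with --packages); paths are illustrative:
```scala
// Reading needs no compression setting; the codec is recorded in the file itself.
val df = spark.read.format("avro").load("hdfs:///data/input_avro/")

// Writing requires choosing the codec explicitly.
spark.conf.set("spark.sql.avro.compression.codec", "snappy")   // or "deflate"
df.write.format("avro").save("hdfs:///data/output_avro/")
```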
Question: What are the limitations when you process Avro files using Spark?
Answer: Because Spark is converting data types, keep the following in mind:
1. Enumerated types are erased - Avro enumerated types become strings when they are read into Spark, because Spark does not support enumerated types.
2. Unions on output - Spark writes everything as unions of the given type along with a null option.
3. Avro schema changes - Spark reads everything into an internal representation. Even if you just read and then write the data, the schema for the output is different.
4. Spark schema reordering - Spark reorders the elements in its schema when writing them to disk so that the elements being partitioned on are the last elements.
Question: How do you submit Spark applications?
Answer: To submit a Spark application, you use the “spark-submit” utility. A common invocation for submitting to YARN is shown below.
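A representative command (the class name, JAR, and resource sizes are illustrative, not from the original text):
```
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  myapp.jar arg1 arg2
```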
Question: When you submit a Spark application with configuration set in multiple places, what precedence order is followed?
Answer: The precedence order of configuration properties, from highest to lowest, is as below.
1. Properties passed to SparkConf.
2. Arguments passed to spark-submit, spark-shell, or pyspark.
3. Properties set in spark-defaults.conf.
Question: What are the advantages of running Spark on YARN cluster manager?
Answer: There are many advantages of running Spark on the YARN cluster manager:
1. You can dynamically share and centrally configure the same pool of cluster resources among all frameworks that run on YARN.
2. You can use all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
3. You choose the number of executors to use; in contrast, Spark Standalone requires each application to run an executor on every host in the cluster.
4. Spark can run against Kerberos-enabled Hadoop clusters and use secure authentication between its processes.
Question: What steps are followed when you submit a Spark application to the YARN cluster manager?
Answer: Spark orchestrates its operations through the driver program. When the driver program is run, the Spark framework initializes executor processes on the cluster hosts that process your data. The following occurs when you submit a Spark application to a cluster:
1. The driver is launched and invokes the main method in the Spark application.
2. The driver requests resources from the cluster manager to launch executors.
3. The cluster manager launches executors on behalf of the driver program.
4. The driver runs the application. Based on the transformations and actions in the application, the driver sends tasks to executors.
5. Tasks are run on executors to compute and save results.
6. If dynamic allocation is enabled, after executors are idle for a specified period, they are released.
7. When the driver's main method exits or calls SparkContext.stop, it terminates any outstanding executors and releases resources from the cluster manager.
Question: How can you monitor and debug the Spark application submitted over YARN?
Answer: To obtain information about Spark application behavior you can consult YARN logs and the Spark web application UI.
Question: What general problem do you see when you need to run pyspark?
Answer: Managing dependencies and making them available for Python jobs on a cluster can be difficult. To determine which dependencies are required on the cluster, you must understand that Spark code runs in executor processes distributed throughout the cluster. If the Python transformations you define use any third-party libraries, such as NumPy or nltk, the Spark executors require access to those libraries when they run on remote hosts.
Question: In which scenarios is Python preferred over Scala and Java for Spark applications?
Answer: Apache Spark provides APIs in non-JVM languages such as Python. Many data scientists use Python because it has a rich variety of numerical libraries with a statistical, machine-learning, or optimization focus.
Question: If you need a single file to be available on every node of the cluster while your program runs, how can you provide it to your application?
Answer: If you need only a single file that must be transferred to each node, you can use the --py-files option when submitting the application with spark-submit, specifying the local path to the file. Alternatively, you can use the programmatic option, the sc.addPyFile() function.
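For example (file names and paths are illustrative):
```
# Ship a single .py dependency to every executor at submit time.
spark-submit --master yarn --py-files /local/path/dependency.py my_app.py

# Programmatic alternative from inside the application:
#   sc.addPyFile("/local/path/dependency.py")
```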
Question: When you need functionality from multiple files, how do you provide them?
Answer: If you use functionality from multiple Python files, you can package them as an egg or zip file, because the --py-files flag also accepts a path to an egg or zip file.
Question: What can be a problem with distributing egg files?
Answer: Sending egg files is problematic because packages that contain native code must be compiled for the specific host on which they will run. When doing distributed computing with industry-standard hardware, you must assume that the hardware is heterogeneous. However, because of the required C compilation, a Python egg built on a client host is specific to the client CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files, you should install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.
Question: You have been given a code snippet that applies only the map, flatMap, and filter transformations; do you think data shuffling will be applied across partitions on different nodes?
Answer: When the data in one partition does not depend on data in another partition, and the transformation does not require a partition from a different node, no shuffling is applied. All three of these transformations, map, flatMap, and filter, do not require shuffling data across partitions, as the sketch below illustrates.
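A representative pipeline of this kind (an illustrative sketch using only narrow transformations, not the exact snippet from the question; assumes a SparkContext `sc`):
```scala
val lines = sc.textFile("hdfs:///data/input.txt")   // illustrative input path
val result = lines
  .flatMap(line => line.split(" "))   // each output partition depends on a single input partition
  .map(word => word.toLowerCase)
  .filter(word => word.nonEmpty)
// No join, repartition, or *ByKey operation is involved, so no shuffle across nodes occurs.
```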
Question: Why does having more stages impact the performance of a Spark application?
Answer: Spark code with more stages suffers in performance because at each stage boundary data is persisted (either in cache or on disk) and may be shuffled across partitions. Wherever these two kinds of I/O (network and disk) come into the picture, there is a large impact on performance, and stage boundaries in a Spark job are what cause them.
Question: What impact does setting or changing numPartitions in a transformation have on performance?
Answer: Transformations that can trigger a stage boundary typically accept a numPartitions argument, which specifies into how many partitions to split the data in the child stage. Just as the number of reducers is an important parameter in MapReduce jobs, the number of partitions at stage boundaries can determine an application's performance.
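For example (a sketch; `pairs` is an assumed RDD of key-value pairs, `spark` an assumed SparkSession, and 200 an arbitrary illustrative value):
```scala
// The second argument to reduceByKey sets the number of partitions,
// and therefore the number of tasks, in the child stage.
val counts = pairs.reduceByKey((a, b) => a + b, 200)

// DataFrame/Dataset equivalent: the partition count used after wide (shuffle) operations.
spark.conf.set("spark.sql.shuffle.partitions", "200")
```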
Question: Which methods should be avoided so that less data is shuffled across partitions?
Answer: When choosing an arrangement of transformations, minimize the number of shuffles and the amount of data shuffled. Shuffles are expensive operations: all shuffle data must be written to disk and then transferred over the network. repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles. Not all of these transformations are equal.
Question: If you have a small dataset that needs to be joined with a much bigger dataset, what approach will you use?
Answer: Since one dataset is small and the other is very big, we should consider using a broadcast variable, which helps improve overall performance. To avoid shuffles when joining two datasets, you can use broadcast variables: when one of the datasets is small enough to fit in memory in a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor. A map transformation can then reference the hash table to do lookups.
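A minimal sketch of that pattern, assuming hypothetical RDDs `smallRdd` and `largeRdd` of (key, value) pairs and a SparkContext `sc`:
```scala
// Collect the small dataset to the driver and build a hash table from it.
val lookup: Map[String, String] = smallRdd.collect().toMap
val lookupBc = sc.broadcast(lookup)

// Map-side "join": each task reads the broadcast table locally, so no shuffle is needed.
val joined = largeRdd.map { case (key, value) =>
  (key, value, lookupBc.value.getOrElse(key, "unknown"))
}
```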
Question: When is it advantageous to have a shuffle?
Answer: When you are working with a huge volume of data, more processing power is available, and the application is compute intensive, a shuffle lets the data be processed in parallel using all the available CPUs. Another use case is aggregation: if you want to apply an aggregate function to a huge volume of data, a single thread on the driver becomes the bottleneck. You should shuffle the data across the nodes and then apply the aggregate function locally on each node, so that the data is aggregated in parallel first and the final aggregation is done in the driver program.
Question: Which two resources are used by a Spark application but managed by neither YARN nor Spark?
Answer: The two main resources that Spark and YARN manage are CPU and memory. Disk and network I/O affect Spark performance as well, but neither Spark nor YARN actively manage them.
Question: When you deploy Spark on the YARN cluster manager, how does ApplicationMaster memory come into the picture?
Answer: The ApplicationMaster, which is a non-executor container that can request containers from YARN, requires memory and CPU that must be accounted for. In client deployment mode, they default to 1024 MB and one core. In cluster deployment mode, the ApplicationMaster runs the Spark application driver, so consider bolstering its resources with the --driver-memory and --driver-cores flags.
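A sketch of how this is typically sized (the values, class, and JAR name are illustrative):
```
# Cluster deployment mode: the ApplicationMaster hosts the driver,
# so size it through the driver flags.
spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 4g --driver-cores 2 \
  --class com.example.MyApp myapp.jar

# Client deployment mode: the ApplicationMaster is separate from the driver;
# its resources can be tuned with spark.yarn.am.memory and spark.yarn.am.cores.
```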
Question: What is Parquet file?
Answer: An open source, column-oriented binary file format for Hadoop that supports very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and allows adding more encodings as they are invented and implemented. Encoding and compression are separated, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty, when possible.
Question: What is the Beeswax application?
Answer: Beeswax is a Hue application with which you can perform queries on Hive; you can create tables, load data, and run and manage Hive queries.
Question: Describe LZO compression?
Answer: A free, open source compression library. LZO compression provides a good balance between data size and speed of compression. The LZO compression algorithm is the most efficient of the codecs, using very little CPU. Its compression ratios are not as good as others, but its compression is still significant compared to the uncompressed file sizes. Unlike some other formats, LZO compressed files are splittable, enabling MapReduce to process splits in parallel.
Question: Define MapReduce algorithm?
Answer: A distributed processing framework for processing and generating large data sets and an implementation that runs on large clusters of industry-standard machines.
The processing model defines two types of functions: a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
A MapReduce job partitions the input data set into independent chunks that are processed by the map functions in a parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce functions. Typically both the input and the output of the job are stored in a distributed filesystem.
The implementation provides an API for configuring and submitting jobs and job scheduling and management services; a library of search, sort, index, inverted index, and word co-occurrence algorithms; and the runtime. The runtime system partitions the input data, schedules the program's execution across a set of machines, handles machine failures, and manages the required inter-machine communication.
Question: Define a stage in Spark?
Answer: In Spark, a collection of tasks that all execute the same code, each on a different partition. Each stage contains a sequence of transformations that can be completed without shuffling the data.
Question: Define a task in Apache Spark application?
Answer: A unit of work on a partition of a DataFrame.
Question: What are the main components of the YARN architecture?
Answer: A general architecture for running distributed applications. YARN specifies the following components:
· ResourceManager - A master daemon that authorizes submitted jobs to run, assigns an ApplicationMaster to them, and enforces resource limits.
· ApplicationMaster - A supervisory task that requests the resources needed for executor tasks. An ApplicationMaster runs on a different NodeManager for each application. The ApplicationMaster requests containers, which are sized by the resources a task requires to run.
· NodeManager - A worker daemon that launches and monitors the ApplicationMaster and task containers.
· JobHistory Server - Keeps track of completed applications.
The ApplicationMaster negotiates with the ResourceManager for cluster resources—described in terms of a number of containers, each with a certain memory limit—and then runs application-specific processes in those containers. The containers are overseen by NodeManagers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
Question: What is the ZooKeeper service?
Answer: A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. In a CDH cluster, ZooKeeper coordinates the activities of high-availability services, including HDFS, Oozie, Hive, Solr, YARN, HBase, and Hue.
Question: What are case classes, and why do you use them in Spark?
Answer: Case classes are the Scala way of creating Java-style POJOs. They implicitly create accessor methods for the object's members. In Spark they are used to assign a schema to a DataSet/DataFrame object.
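For example (a sketch; the Person class, input path, and application name are illustrative):
```scala
import org.apache.spark.sql.SparkSession

// The case class members become the column names and types of the Dataset schema.
case class Person(name: String, age: Long)

val spark = SparkSession.builder.appName("CaseClassExample").getOrCreate()
import spark.implicits._   // provides the encoder needed by .as[Person]

val people = spark.read.json("hdfs:///data/people.json").as[Person]
people.filter(_.age > 21).show()
```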
Question: What is a Catalyst optimizer?
Answer: It is a core part of Spark SQL and is written in Scala. It helps Spark with:
· Schema inference from JSON data
You can say that it helps Spark SQL generate a query plan that can easily be converted into a directed acyclic graph (DAG) of RDDs; once the DAG is created, it is ready to execute. The optimizer does a lot of work before producing the optimized query plan, and its main purpose is to create an optimized DAG.
Both query plans and optimized query plans are internally represented as trees. The Catalyst optimizer includes various other libraries that help it work on these trees.
Question: What types of plans can be created by the Catalyst optimizer?
Answer: The Catalyst optimizer creates the following two kinds of query plans:
1. Logical plan: It defines the computations on the DataSets, without defining how to carry out the specific computations.
2. Physical plan: It defines the computation on the datasets in a form that can be executed to get the expected result. Generally, multiple physical plans are generated by the optimizer, and a cost-based optimizer then selects the least costly plan to execute the query.
Question: How do you print all the plans created by the Catalyst optimizer for a query?
Answer: Use the explain(Boolean) method, something similar to the example below.
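A minimal sketch (the input path and query are illustrative; `spark` is an assumed SparkSession, as in spark-shell):
```scala
import spark.implicits._

val df = spark.read.json("hdfs:///data/people.json")

// explain(true) prints the parsed, analyzed, and optimized logical plans
// plus the chosen physical plan; explain() alone prints only the physical plan.
df.filter($"age" > 21).groupBy($"name").count().explain(true)
```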
Question: What is Project Tungsten?
Answer: You can say that this is one of the largest changes to Spark's execution engine. It focuses on CPU and memory rather than I/O and network, because in Spark, CPU and memory were the major performance bottlenecks.
Before Spark 2.0, many CPU cycles were wasted: rather than being used for computation, they were spent reading and writing intermediate data to the CPU cache.
Project Tungsten helps improve the efficiency of memory and CPU usage so that performance can approach the limits of the hardware.
Question: Do you see any pain points with regard to Spark DStreams?
Answer: There are some common issues with Spark DStreams:
· Timestamp: It uses the time at which an event entered the Spark system rather than the timestamp attached to the event itself.
· API: You have to write different code for batch and stream processing.
· Failure conditions: The developer has to manage various failure conditions.
Question: Please list some advantages of Structured Streaming compared to DStreams?
Answer: As we have seen in the previous question, DStreams have various issues; that is why Structured Streaming was introduced in Apache Spark. Its advantages include:
· It is Fast
· Fault-tolerant
· Exactly-once stream processing approach
· Input data can be thought of as an append-only table (growing continuously)
· Trigger: You can specify a trigger, which checks for input data at a defined time interval.
· API: The high level API is built on the SparkSQL engine and is tightly integrated with SQL queries, DataFrame and DataSet APIs.
Question: You need to select a small dataset from stored data; however, it is not recommended to use Spark for that. Why?
Answer: You should not use Spark for such use cases because Spark has to go through all of the stored files to find your result. You should consider using an RDBMS or some other storage system that indexes particular columns of the data; data retrieval will be faster in that case.
Question: What is Structured Streaming?
Answer: Previously, streaming data was processed using DStreams and their API. With Structured Streaming, stream data is processed using the built-in Spark SQL engine.
Question: What are the benefits of using Structured Streaming with Spark SQL?
Answer: In previous versions of Spark, you had to learn a different API to process streaming data, which made things harder for developers. But with Structured Streaming:
· Computation: You write streaming computations the same way as batch computations.
· Continuous data handling: The Spark SQL engine takes care of running the streaming computation and updating the final result as data continues to arrive.
· DataFrame/DataSet API: You can use the same DataFrame/DataSet API for streaming aggregations and for joins between stream data and batch data.
· Fault tolerance: The system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs (WAL).
Question: How is Structured Streaming handled internally by Spark?
Answer: As mentioned previously, Structured Streaming is processed by the Spark SQL engine, which processes the data stream as a series of small batch jobs. These micro-batches allow latencies as low as 100 milliseconds while ensuring that data is processed exactly once (very important).
Question: Is there any option available so that streaming data can be processed with latencies below 100 milliseconds?
Answer: In Spark 2.3, a new low-latency processing mode called “Continuous Processing” was introduced, with which end-to-end latencies can be as low as 1 millisecond with at-least-once guarantees. The good thing about this feature is that you do not have to change your Dataset/DataFrame code; just change the mode to get low latency if your application needs it.
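A minimal sketch of switching modes (assumes an existing streaming DataFrame named `streamingDF`; the sink, interval, and checkpoint path are illustrative):
```scala
import org.apache.spark.sql.streaming.Trigger

// Only the trigger changes; the Dataset/DataFrame code stays the same.
val query = streamingDF.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))   // interval between checkpoints, not a batch interval
  .option("checkpointLocation", "hdfs:///checkpoints/continuous")
  .start()
```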
Question: What do you think about the internal implementation of Structured Streaming that allows it to keep the same API?
Answer: The important point here is how the programming model is implemented in Spark Structured Streaming. Structured Streaming treats live data as a table that is continuously appended to, and the code you write looks just like batch code; Spark queries the data in that table, but the queries are executed incrementally on this unbounded table. Each query run is like running a query on static data, and each piece of new data can be thought of as new rows appended to the existing unbounded table.
Question: What is event-time data, and what is it used for?
Answer: The data you receive may or may not contain a time embedded in the message contents; if it does, that time is called event time. Suppose you want to calculate how many events were generated in the last 5 minutes: events may be received out of order or duplicated, and sometimes, because of problems such as network failures, an event arrives quite late. The system can use the event time embedded in the message contents to know the exact time of each event. This is very useful in the IoT world.
Question: What is the use of watermarking in Structured Streaming?
Answer: As discussed in the previous question, events can be received in any order, and a time may be embedded in the event itself. Since Spark 2.1, you can specify a watermark: if a message/event is older than the specified threshold, it is discarded.
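A sketch of an event-time window with a watermark (assumes a streaming DataFrame `events` with columns eventTime (timestamp) and deviceId, and a SparkSession `spark`; the thresholds are illustrative):
```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

val counts = events
  .withWatermark("eventTime", "10 minutes")               // drop events more than 10 minutes late
  .groupBy(window($"eventTime", "5 minutes"), $"deviceId") // group by 5-minute event-time windows
  .count()
```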
Question: How can we use the DataFrame/DataSet API with Structured Streaming?
Answer: From Spark 2.0 onwards, the DataFrame and DataSet APIs have been enhanced so that they treat streaming, unbounded data the same way as static, bounded data. Using the common SparkSession entry point, you can work on streaming data by applying the same operations/API.
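For example, the batch-style DataFrame operations below run unchanged on a streaming source (the directory, schema, and console sink are illustrative):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()
import spark.implicits._

// Streaming file source: each new file in the directory becomes new rows.
val schema = new StructType().add("eventType", StringType).add("value", DoubleType)
val stream = spark.readStream.schema(schema).json("hdfs:///data/incoming/")

// Exactly the same DataFrame operations you would use on batch data.
val agg = stream.groupBy($"eventType").count()

val query = agg.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```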
Question: Can you give some examples of streaming data sources?
Answer: Spark provides several built-in data sources for components that are popular and used ubiquitously, for example:
· File: Any new file arriving in a directory can be treated as a stream of data.
· Kafka: Reads data from the Kafka messaging engine.
· Socket: Reads text data from a socket (UTF-8 only); avoid using it in production because it does not provide end-to-end fault tolerance.
· Rate: Generates a fixed number of rows every second.
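For example, the rate and socket sources can be opened like this (host, port, and rate are illustrative; the socket source is intended only for testing):
```scala
// Rate source: produces rows with (timestamp, value) columns at a fixed rate.
val rateDF = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// Socket source: reads UTF-8 lines from a TCP socket; not fault tolerant, testing only.
val socketDF = spark.readStream.format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
```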