If you are looking for Apache Spark interview questions and answers, whether you are experienced or a fresher, you are in the right place. Many reputed organizations around the world offer opportunities in this area. According to Apache Spark industry estimates, the Apache Spark market is expected to grow to more than $5 billion by 2020, from just $180 million. So you still have the chance to move ahead in your career in Apache Spark development. GangBoard offers advanced Apache Spark interview questions and answers that help you crack your Apache Spark interview and land your dream career as an Apache Spark developer.
Best Apache Spark Interview Questions and Answers
If you believe you have the skills to be part of the future of Apache Spark, GangBoard is here to guide you in developing your career. Various Fortune 1000 organizations around the world use Apache Spark to meet the needs of their customers, and it is used across numerous industries. To help you advance in an Apache Spark career, this page provides detailed information in the form of Apache Spark job interview questions and answers. The questions and answers are prepared by industry experts with 10+ years of experience. They are useful to freshers and experienced candidates alike who are looking for a challenging new job at a reputed company. The questions and answers are simple and include examples for better understanding.
With these Apache Spark interview questions and answers, many students have been placed in reputed companies with high salary packages. So use our Apache Spark interview questions and answers to grow in your career.
Q1. What is Apache Spark?
Answer: Spark is an in-memory parallel data processing framework. It supports batch processing and stream processing as well as interactive analytics.
Q2. What are the three ways to create an RDD in Spark?
Answer: The three ways to create an RDD in Spark are:
- By using parallelized collection
- By loading an external dataset
- From an existing RDD
Q3. Can we create an RDD from an existing RDD?
Answer: Yes, by applying transformations to an RDD we can create a new RDD.
Q4. In how many ways can we create an RDD?
Answer: There are three possible ways to create an RDD.
Q5. Can we create an RDD from a dataset such as a .txt file?
Answer: Yes, by loading the dataset we can create an RDD.
Q6. Can we run Spark without using HDFS?
Answer: Yes. HDFS is needed only for storage, so Spark can run without it.
Q7. Does Spark support standalone mode?
Answer: Yes, it supports standalone mode.
Q8. What are the types of transformations in Spark?
Answer: Narrow and wide transformations are available in Spark.
Q9. Give some examples of narrow transformations.
Answer: map and filter.
Q10. Give some examples of wide transformations.
Answer: groupByKey and reduceByKey.
Q11. What are the Components in Spark?
Answer: Spark SQL, Spark Streaming, MLlib and GraphX.
Q12. What is Spark SQL?
Answer: It is a component of Spark which provides support for structured and semi-structured data. The DataFrame API appeared in Spark release 1.3.0.
Q13. What is a Spark Dataset?
Answer: Dataset is an extension of the DataFrame API which provides a type-safe, object-oriented programming interface.
Q14. What are the limitations of DataFrame?
Answer: DataFrame does not have provision for compile-time type safety.
Q15. What is a transformation in Spark?
Answer: A transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output.
Q16. What is an action in Spark?
Answer: Actions return the final results of RDD computations. An action triggers execution using the lineage graph and, after carrying out all intermediate transformations, returns the final result to the driver.
Q17. Give some examples of transformations in Spark.
Answer: map, flatMap and filter.
Q18. Give some examples of actions in Spark.
Answer: count(), collect() and reduce(func).
Q19. What does collect do in Spark?
Answer: It returns all the elements in the RDD to the driver.
Q20. What are the different storage levels in Spark?
Answer: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER and DISK_ONLY.
Q21. What is the default storage level in Spark?
Answer: MEMORY_ONLY is the default storage level used when an RDD is cached with cache().
Q22. What is coalesce()?
Answer: It is used to decrease the number of partitions in an RDD. It avoids a full shuffle of the RDD.
Q23. What is repartition()?
Answer: It is used to increase (or decrease) the number of partitions. It creates new partitions from the existing partitions by shuffling the data.
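The difference between the two answers above can be pictured with a toy model in plain Scala. This is only a sketch: partitions are modelled as a `Vector[Vector[A]]`, the object name `PartitionSketch` and its helpers are hypothetical, and the way Spark actually places data on a cluster is more involved than this.

```scala
// Toy model (an assumption for illustration): an RDD's partitions as nested Vectors.
object PartitionSketch {
  type Partitions[A] = Vector[Vector[A]]

  // coalesce-like: whole neighbouring partitions are merged, so no element
  // is redistributed on its own (no full shuffle).
  def coalesce[A](parts: Partitions[A], n: Int): Partitions[A] = {
    require(n > 0 && n <= parts.length, "this merge can only reduce the partition count")
    parts.zipWithIndex
      .groupBy { case (_, i) => i * n / parts.length } // target group of each partition
      .toVector
      .sortBy(_._1)
      .map { case (_, group) => group.flatMap(_._1) }
  }

  // repartition-like: every element is dealt out again (a full shuffle),
  // so the partition count can go up as well as down.
  def repartition[A](parts: Partitions[A], n: Int): Partitions[A] = {
    require(n > 0, "at least one partition is needed")
    val all = parts.flatten.zipWithIndex
    Vector.tabulate(n)(p => all.collect { case (x, i) if i % n == p => x })
  }

  def main(args: Array[String]): Unit = {
    val parts: Partitions[Int] = Vector(Vector(1, 2), Vector(3), Vector(4, 5), Vector(6))
    println(coalesce(parts, 2))
    println(repartition(parts, 3))
  }
}
```

In the model, `coalesce` keeps the elements of each original partition together, while `repartition` moves individual elements, which is why the latter costs a shuffle.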
Q24. What is the role of Driver in Spark?
Answer: The driver is the program which creates the Spark Context, connecting to a given Spark Master. It declares the transformations and actions on RDDs and submits such requests to the master.
Q25. What are the deployment modes in Spark?
Answer: Cluster mode and Client Mode
Q26. Why is Spark faster than Hadoop?
Answer: Spark is faster than Hadoop MapReduce because it processes data in memory.
Q27. What is accumulator?
Answer: An accumulator is a shared variable which is used for aggregating information across the cluster.
Q28. What are the two types of shared variable available in Apache Spark?
Answer: Broadcast Variable and Accumulator.
Q29. What is the Broadcast variable?
Answer: It allows the programmer to keep a read-only variable cached on each machine instead of shipping a copy of it with tasks.
Q30. What is a Spark DStream?
Answer: DStream is the basic abstraction of Spark Streaming. It represents a continuous stream of data.
Q31. What is the map transformation in Spark?
Answer: The map transformation takes a function as input, applies that function to each element of the RDD and returns another RDD. The element type of the result can be different from the input type.
Q32. What is the flatMap transformation in Spark?
Answer: flatMap is used when we want to produce multiple elements for each input element. For each input element the function returns a list of elements, and flatMap flattens these lists into a single RDD.
Q33. What is the reduce action in Spark?
Answer: reduce takes as input a function with two parameters of the same type and outputs a single value of that type.
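The three operations from Q31 to Q33 have the same shape on ordinary Scala collections, so they can be tried without a cluster. The sketch below uses a plain `List` as a stand-in for an RDD; that substitution, the object name `MapFlatMapReduce` and the sample data are assumptions made for illustration only.

```scala
// Plain Scala collections stand in for an RDD here (illustrative assumption);
// map, flatMap and reduce on an RDD take functions of the same shape.
object MapFlatMapReduce {
  val lines: List[String] = List("spark is fast", "spark is lazy")

  // map: exactly one output element per input element; the type may change (String to Int)
  def lineLengths(in: List[String]): List[Int] = in.map(_.length)

  // flatMap: zero or more output elements per input element, flattened into one collection
  def words(in: List[String]): List[String] = in.flatMap(_.split(" ").toList)

  // reduce: combines two values of the same type until a single value of that type remains
  def total(in: List[Int]): Int = in.reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(lineLengths(lines))
    println(words(lines))
    println(total(lineLengths(lines)))
  }
}
```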
Q34. What is lazy evaluation in Spark?
Answer: When we apply a transformation to an RDD, it does not immediately produce output; instead Spark builds a DAG of all the transformations. The transformations are evaluated only after you perform an action. This is called lazy evaluation.
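The same idea can be observed with a lazy view over a Scala collection. This is an analogy, not Spark itself: the view plays the role of the recorded transformations and `toList` plays the role of an action. The object name `LazyEvaluationSketch` and the call counter are hypothetical helpers.

```scala
// Analogy in plain Scala (an assumption for illustration): a collection view is lazy,
// much like a chain of transformations that has not yet met an action.
object LazyEvaluationSketch {
  // Returns (calls made before forcing, calls made after forcing, final result)
  def run(data: List[Int]): (Int, Int, List[Int]) = {
    var calls = 0
    // Declaring the pipeline only records the steps; nothing is computed yet
    val pipeline = data.view.map { x => calls += 1; x * 2 }.filter(_ > 2)
    val before = calls
    // Forcing the view is the point where the recorded steps are executed
    val result = pipeline.toList
    (before, calls, result)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, result) = run(List(1, 2, 3))
    println(s"calls before forcing: $before, after forcing: $after, result: $result")
  }
}
```

The counter stays at zero until the result is requested, which mirrors how declaring transformations costs nothing until an action runs.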
Q35. What is MLlib in Spark?
Answer: MLlib is a distributed machine learning framework built on top of Spark.
Q36. What is GraphX in Spark?
Answer: GraphX is a distributed graph-processing framework built on top of Spark. It provides different APIs for expressing graph computation.
Q37. What is the Spark shell?
Answer: Spark Shell is a Spark application which is written in Scala. It offers a command line environment with auto-completion, which is helpful in developing our own standalone Spark application.
Q38. Name some functions of SparkContext.
Answer: It is used to create RDDs, accumulators and broadcast variables, to access all Spark services, to run jobs and to get the status of the Spark application, including starting and cancelling jobs.
Q39. Name some functions of a Spark executor.
Answer: It runs the tasks that make up the application and returns the results to the driver. It provides in-memory storage for RDDs cached by the user.
Q40. Which are the Programming languages supported by Spark?
Answer: Java, Python, Scala, SQL and R.
Q41. What is DAG in Spark?
Answer: DAG is a set of Vertices and Edges, where vertices represent the RDDs and the edges represent the Operation to be applied on RDD.
Q42. What is Caching in Spark Streaming?
Answer: Caching Streaming is storing streaming RDD in memory. It is a mechanism to speed up applications that access the same RDD multiple times.
Q43. Which file systems are supported by Spark?
Answer: HDFS, Local File system & Amazon S3.
Q44. What is RDD?
Answer: A Resilient Distributed Dataset (RDD) is a fault-tolerant collection of partitioned data that is operated on in parallel. An RDD is immutable and distributed in nature.
Q45. Write some input sources for Spark Streaming.
Answer: TCP Sockets, Stream of files, Apache Kafka, Apache Flume, Kinesis etc.
Q46. Can we use Hive on Spark?
Answer: Yes, by creating a HiveContext.
Q47. What is the pipe() operation in Spark?
Answer: Spark programs are written in Scala, Java and Python. However, if one wants to pipe the data through a program written in another language, Spark provides a general mechanism for that in the form of the pipe() method.
Q48. What are the data sources available in Spark SQL?
Answer: Parquet, Avro, JSON and Hive tables
Q49. What is a partition in Spark?
Answer: A partition in spark is a logical division of data stored on a Node in the cluster. Partitions are basic units of parallelism in Apache Spark.
Q50. What are the types of partitioning in Apache Spark?
Answer: The types of partitioning in Apache Spark are as follows:
- Hash Partitioning
- Range Partitioning
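The two schemes listed above can be sketched as plain functions that map a key to a partition index. This is a minimal illustration rather than Spark's own partitioner classes: the object name `PartitioningSketch` and its functions are hypothetical, and the range bounds are passed in by hand here, whereas Spark derives them by sampling the data.

```scala
// Minimal sketch of the two partitioning ideas (names are hypothetical).
object PartitioningSketch {
  // Non-negative modulo, so a key with a negative hashCode still gets a valid index
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hash partitioning: the partition index is derived from the key's hashCode
  def hashPartition(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  // Range partitioning: sorted upper bounds split the key space into contiguous ranges;
  // keys above the last bound fall into one extra final partition
  def rangePartition(key: Int, upperBounds: Seq[Int]): Int = {
    val idx = upperBounds.indexWhere(key <= _)
    if (idx >= 0) idx else upperBounds.length
  }

  def main(args: Array[String]): Unit = {
    println(hashPartition("customer-42", 4))
    println(rangePartition(15, Seq(10, 20)))
  }
}
```

Hash partitioning spreads keys evenly but without order, while range partitioning keeps nearby keys in the same partition, which helps when the data has to stay sorted.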
Q51. What are the types of Cluster managers in Spark?
Answer: Standalone, YARN and Mesos.
Q52) How does Spark compare with Hadoop?
Answer: Compared with Hadoop, Spark is reliable and fast, and it easily processes highly complex data.
Q53) What is the Hive metastore and where is it stored?
Answer: The Hive metastore holds the schemas, that is, the HQL data definitions, of the data underlying Hive tables. It is kept in a relational database (an embedded Derby database by default).
Q54) Hadoop MapReduce vs. Spark?
- Hadoop MapReduce stores and processes the data in multiple steps through MapReduce
- Spark processes the data by running on top of a YARN environment
Q55) How does an RDD work?
Answer: Resilient Distributed Datasets are fault-tolerant collections that are operated on in parallel. An RDD is created in one of two ways: by using the Spark framework's parallelize method on an existing collection, or by loading an external dataset.
Q56) What are the components of Spark?
- Spark Core: base engine for large-scale parallel and distributed data processing
- Spark Streaming: used for processing real-time streaming data
- Spark SQL: integrates relational processing with Spark's functional programming API
- GraphX: graphs and graph-parallel computation
- MLlib: performs machine learning in Apache Spark
Q57) What are the different storage formats and compression techniques you have used in your project?
Q58) What is a DStream in Spark?
Answer: A discretized stream (DStream) is the basic abstraction provided by Spark Streaming.
Q60) Do DStreams cache the data?
Answer: Yes, a DStream carries streaming data and can store it in memory.
Q61) Which datasets are used in Spark?
Answer: JSON datasets are used in Spark.
Q62) Difference between RDD and DataFrame?
RDD:
- Optimization: no inbuilt optimization engine is available in RDD
- Serialization: it uses Java serialization
- Compile-time type safety
- Efficiently processes data that is structured as well as unstructured
- The schema needs to be defined manually
- The RDD API is slower to perform simple grouping and aggregation operations
DataFrame:
- Optimization: optimization takes place using the Catalyst optimizer, through analyzing a logical plan, logical plan optimization, physical planning and code generation to compile to Java bytecode
- Serialization: it uses off-heap storage (in memory) in binary format
- Run-time type validation
- Efficiently processes data that is structured as well as semi-structured
- The schema is defined automatically
- The DataFrame API is faster at simple grouping and aggregation operations
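The contrast between compile-time type safety and run-time type validation in Q62 can be illustrated with a plain Scala analogy. This is not Spark code: the `Customer` case class stands in for typed records, a `Map[String, Any]` stands in for a row addressed by column name, and all names here are hypothetical.

```scala
import scala.util.Try

// Analogy only (an assumption for illustration): typed records versus rows looked up by name.
object TypeSafetySketch {
  final case class Customer(id: Int, name: String)

  // Typed access: a misspelled field such as c.nmae would be rejected by the compiler
  def typedName(c: Customer): String = c.name

  // Access by column name: a wrong name is only detected when the code runs
  def untypedName(row: Map[String, Any], column: String): Try[String] =
    Try(row(column).toString)

  def main(args: Array[String]): Unit = {
    val row = Map[String, Any]("id" -> 1, "name" -> "Ann")
    println(typedName(Customer(1, "Ann")))
    println(untypedName(row, "name"))
    println(untypedName(row, "nmae")) // fails at run time, not at compile time
  }
}
```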
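Q63) How to convert an RDD to a DataFrame?
import spark.implicits._ // enables .toDF() on an RDD of case class objects
case class Customers(custid: Int, cname: String, lname: String)
val peopleDF = spark.sparkContext.textFile("D:/Hadoop/Spark/SparkScala/customer_data.csv")
.map(_.split(",")) // split each CSV line into its fields
.map(attribute => Customers(attribute(0).trim.toInt, attribute(1), attribute(2)))
.toDF()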
Q64) How to programmatically specify a schema for a DataFrame?
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val schemaMap = List("id", "name", "salary").map(field => StructField(field, StringType, true))
val schema = StructType(schemaMap)
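Q65) How to remove the special character "#" from 100s of columns in a DataFrame?
val columns = "#cust_id|#cust_name| #odr_date| #shipdt| #Courer| #recvd_dt|#returned or not|#returned dt|#reson of return"
// split the header string into column names and strip "#" and surrounding spaces from each
val cleanedColumns = columns.split('|').map(_.replace("#", "").trim)
val createNewDF = createDF.toDF(cleanedColumns: _*)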
Q66) Load a csv/text file and remove the header and footer?
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// loadDF is assumed to be a DataFrame already loaded from the file; take() brings every row to the driver
val headerFooterRemovedDF = loadDF.take(loadDF.count.toInt).drop(1).dropRight(1)
val schemaDefine = "id|name|date|type|status".split('|').map(col => StructField(col, StringType, true))
val schema = StructType(schemaDefine)
val finalDF = spark.createDataFrame(spark.sparkContext.parallelize(headerFooterRemovedDF), schema)
Q67) Take the first 10 records and last 10 records of a file and combine both using DataFrame?
val loadDF = spark.read.format("csv").option("path", "file:///home/maria_dev/Files/assignment_table.csv").load()
val combineDF = loadDF.take(10) ++ loadDF.take(loadDF.count.toInt).takeRight(10)
val schemaDesign = loadDF.first.toSeq.map(c => c.toString.trim).map(col => StructField(col, StringType, true))
val schema = StructType(schemaDesign)
val createDF = spark.createDataFrame(spark.sparkContext.parallelize(combineDF), schema)
Q68) How to calculate executor memory?
Configuration of the cluster is as below:
Nodes = 10
Cores per node = 16 (minus 1 for the operating system)
RAM per node = 61 GB (minus 1 GB for the Hadoop daemons)
Identifying the number of cores:
The number of cores is the number of concurrent tasks an executor can run in parallel, so the general rule of thumb for the optimal value is 5 (--executor-cores 5)
Identifying the number of executors:
Number of executors per node = number of usable cores / concurrent tasks (5 in general)
15 / 5 = 3 executors on each node
Number of nodes * number of executors per node = number of executors (for the Spark job)
10 * 3 = 30 (--num-executors 30)
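The arithmetic above can be written as a few small functions, which makes it easy to redo for a different cluster. This is only a sketch: the object name `ExecutorSizing` and its function names are hypothetical, and the last step (splitting the usable RAM of a node evenly between its executors) is an assumption that does not appear in the calculation above.

```scala
// Sketch of the sizing arithmetic from Q68 (names are hypothetical).
object ExecutorSizing {
  // Cores left for Spark on a node after reserving some for the operating system
  def usableCores(coresPerNode: Int, reservedCores: Int): Int =
    coresPerNode - reservedCores

  // Executors per node = usable cores / concurrent tasks per executor
  def executorsPerNode(coresPerNode: Int, reservedCores: Int, coresPerExecutor: Int): Int =
    usableCores(coresPerNode, reservedCores) / coresPerExecutor

  // Executors for the whole job = nodes * executors per node
  def totalExecutors(nodes: Int, perNode: Int): Int = nodes * perNode

  // Assumption (not stated in the text): usable RAM is split evenly between a node's executors
  def memoryPerExecutorGb(ramPerNodeGb: Int, reservedGb: Int, perNode: Int): Int =
    (ramPerNodeGb - reservedGb) / perNode

  def main(args: Array[String]): Unit = {
    val perNode = executorsPerNode(coresPerNode = 16, reservedCores = 1, coresPerExecutor = 5)
    println(s"executors per node: $perNode")
    println(s"total executors: ${totalExecutors(10, perNode)}")
    println(s"memory per executor (GB): ${memoryPerExecutorGb(61, 1, perNode)}")
  }
}
```

With the figures from the question this gives 3 executors per node, 30 executors in total and, under the even-split assumption, 20 GB per executor. A real deployment also has to leave room for per-executor memory overhead, so the value passed to `--executor-memory` would be somewhat lower than that figure.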
Q69) What is Spark SQL?
Answer: Spark SQL is a Spark interface for working with structured as well as semi-structured data. It can load data from a variety of structured sources such as text files, JSON files and Parquet files, among others.
Q70) What are a few commonly used Spark ecosystem components?
- Spark SQL (Shark)
- Spark Streaming
Q71) What is meant by a Parquet file?
Answer: Parquet is a columnar file format supported by many data processing systems. Spark SQL performs both read and write operations on Parquet files, and it is considered one of the best formats for big data analytics.
Q72) What is the Catalyst framework?
Answer: Catalyst is the optimization framework in Spark SQL. It allows Spark SQL to automatically transform SQL queries by adding new optimizations, which produces a faster processing system.
Q73) What is the use of BlinkDB?
Answer: BlinkDB is a query engine for running interactive SQL queries on huge volumes of data. It renders query results marked with meaningful error bars. BlinkDB helps users balance query accuracy with response time.
Q74) What are the various data sources available in Spark SQL?
- Parquet file
- JSON Datasets
- Hive tables
Q75) How is Spark SQL different from HQL and SQL?
Answer: Spark SQL is a special component on the Spark Core engine that supports SQL and Hive Query Language without changing any syntax. It is possible to join a SQL table and an HQL table.
Q76) What does the Spark Engine do?
Answer: The Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.
Q77) What operations does an RDD support?
Answer: Transformations and actions.
Q78) Which file systems does Spark support?
- Hadoop Distributed File System (HDFS).
- Local File system.