Apache Spark Interview Questions and Answers

by GangBoard Admin, April 17, 2019

If you are searching for Apache Spark interview questions and answers, whether you are experienced or a fresher, you are in the right place. There are plenty of opportunities at many reputed companies around the world. According to industry estimates, the Apache Spark market is expected to grow from just $180 million to more than $5 billion by 2020, so you still have the opportunity to move ahead in your career in Apache Spark development. GangBoard offers advanced Apache Spark interview questions and answers that help you crack your Apache Spark interview and land your dream career as an Apache Spark developer.

Best Apache Spark Interview Questions and Answers

Do you believe you have the skills to be part of the future of Apache Spark development? GangBoard is here to guide you in nurturing your career. Various Fortune 1000 companies around the world use Apache Spark to meet the needs of their customers, and it is deployed across many industries. To help you grow in an Apache Spark role, this page provides detailed information in the form of Apache Spark interview questions and answers, prepared by industry experts with more than ten years of experience. They are useful to both freshers and experienced candidates looking for a new, challenging job at a reputed company. Our questions and answers are simple and include plenty of examples for better understanding.

With these Apache Spark interview questions and answers, many students have been placed in reputed companies with well-paying jobs. So use them to grow in your career.

Q1. What is Apache Spark?

Answer: Spark is an in-memory parallel data processing framework. It supports batch processing, stream processing and interactive analytics.

Q2. What are the three ways to create an RDD in Spark?

Answer: The three ways to create an RDD in Spark are (a short sketch follows the list):

  1. By using parallelized collection
  2. By loading an external dataset
  3. From an existing RDD
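
As a quick illustration, here is a minimal Scala sketch of all three approaches, assuming a spark-shell session where sc (the SparkContext) is already defined; the file path is a placeholder:

```scala
// 1. Parallelized collection
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. External dataset (hypothetical path)
val fromFile = sc.textFile("data/input.txt")

// 3. From an existing RDD, by applying a transformation
val fromExisting = fromCollection.map(_ * 2)
```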

Q3. Can we create an RDD from an existing RDD?

Answer: Yes, by applying transformations to an existing RDD we can create a new RDD.

Q4. In how many ways can we create an RDD?

Answer: There are three possible ways to create an RDD, as listed above.

Q5. Can we create an RDD from a dataset such as a .txt file?

Answer: Yes, by loading the dataset (for example with sc.textFile) we can create an RDD.

Q6. Can we run Spark without using HDFS?

Answer: Yes. HDFS is just one storage option; Spark can also read from the local file system, Amazon S3 and other sources.

Q7. Does Spark support standalone mode?

Answer: Yes, it supports standalone mode.

Q8. What are the types of transformations in Spark?

Answer: Narrow and wide transformations are available in Spark.

Q9. Give some examples of narrow transformations.

Answer: map() and filter().

Q10. Give some examples of wide transformations.

Answer: groupByKey() and reduceByKey().
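
To make the distinction concrete, here is a small sketch (assuming a spark-shell session with sc defined). Narrow transformations can be computed within each partition, while wide transformations must move records with the same key across partitions:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow: each output partition depends on a single input partition
val filtered = pairs.filter { case (_, v) => v > 1 }
val doubled  = pairs.map { case (k, v) => (k, v * 2) }

// Wide: records with the same key are shuffled across partitions
val grouped = pairs.groupByKey()
val summed  = pairs.reduceByKey(_ + _)
```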

Q11. What are the components of Spark?

Answer: Spark SQL, Spark Streaming, MLlib and GraphX.

Q12. What is Spark SQL?

Answer: It is the component of Spark that provides support for structured and semi-structured data. The DataFrame API appeared in Spark release 1.3.0.

Q13. What is Spark Dataset?

Answer: A Dataset is an extension of the DataFrame API that provides a type-safe, object-oriented programming interface.

Q14. What are the limitations of Data frame?

Answer: A DataFrame does not provide compile-time type safety; column names and types are checked only at run time.
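
A minimal sketch of the difference, assuming a spark-shell session where spark (the SparkSession) is predefined; Person is a made-up case class for the example:

```scala
import spark.implicits._

case class Person(name: String, age: Int)

val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
val df = ds.toDF()

ds.map(_.age + 1).show()    // field access is checked at compile time
df.select("age").show()     // column name is checked only at run time
// df.select("agee").show() // a typo here still compiles, but fails when run
```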

Q15. What is a transformation in Spark?

Answer: A transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output.

Q16. What is Action in Spark?

Answer: Actions return the final results of RDD computations. An action triggers execution using the lineage graph and, after carrying out all intermediate transformations, returns the final result to the driver.

Q17. Give some examples of transformations in Spark.

Answer: map(), flatMap() and filter().

Q18. Give some examples of actions in Spark.

Answer: count(), collect() and reduce(func).

Q19. What does collect() do in Spark?

Answer: It returns all the elements of the RDD to the driver.
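
A short sketch of transformations versus actions, assuming a spark-shell session with sc defined:

```scala
val nums = sc.parallelize(1 to 5)

// Transformation: recorded in the lineage graph, nothing runs yet
val squared = nums.map(n => n * n)

// Actions: trigger execution and return results to the driver
squared.count()        // 5
squared.collect()      // Array(1, 4, 9, 16, 25)
squared.reduce(_ + _)  // 55
```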

Q20. What are the different storage levels in Spark?

Answer: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER and DISK_ONLY.

Q21. What is the default storage level in Spark?

Answer: MEMORY_ONLY.
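
For illustration, a minimal sketch of setting storage levels (spark-shell assumed; the input path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("data/input.txt")   // hypothetical path

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), the default
val cached = logs.map(_.toLowerCase).cache()

// An explicit level: spill partitions to disk when memory runs out
val persisted = logs.filter(_.nonEmpty).persist(StorageLevel.MEMORY_AND_DISK)
```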

Q22. What is coalesce()?

Answer: coalesce() is used to decrease the number of partitions in an RDD. It avoids a full shuffle.

Q23. What is repartition()?

Answer: repartition() is used to increase the number of partitions. It creates new partitions from the existing ones by fully shuffling the data.
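
A small sketch contrasting the two (spark-shell assumed):

```scala
val rdd = sc.parallelize(1 to 100, 8)   // start with 8 partitions

val fewer = rdd.coalesce(2)             // shrink without a full shuffle
fewer.getNumPartitions                  // 2

val more = rdd.repartition(16)          // full shuffle to redistribute data
more.getNumPartitions                   // 16
```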

Q24. What is the role of Driver in Spark?

Answer: The driver is the program that creates the SparkContext and connects to a given Spark master. It declares the transformations and actions on RDDs and submits those requests to the master.

Q25. What are the deployment modes in Spark?

Answer: Cluster mode and Client Mode

Q26. Why is Spark faster than Hadoop?

Answer: Spark is faster than Hadoop MapReduce because it keeps intermediate data in memory rather than writing it to disk between processing steps.

Q27. What is an accumulator?

Answer: An accumulator is a shared variable that is used for aggregating information across the cluster.

Q28. What are the two types of shared variable available in Apache Spark?

Answer: Broadcast Variable and Accumulator.

Q29. What is a broadcast variable?

Answer: It allows the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with every task.
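
A minimal sketch of both shared variables, assuming a spark-shell session on Spark 2.x (where sc.longAccumulator is available):

```scala
// Broadcast: a read-only lookup table shipped once to each executor
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks add to it; the driver reads the aggregated value
val misses = sc.longAccumulator("misses")

sc.parallelize(Seq("a", "b", "c")).foreach { key =>
  if (!lookup.value.contains(key)) misses.add(1)
}
println(misses.value)   // 1 ("c" is not in the lookup table)
```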

Q30. What is a Spark DStream?

Answer: A DStream (discretized stream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data as a sequence of RDDs.
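
For illustration, a minimal word-count sketch over a DStream; the host and port are placeholders, and sc is the spark-shell SparkContext:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))       // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source

// Each batch of lines arrives as an RDD inside the DStream
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```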

Q31. What is the map transformation in Spark?

Answer: The map transformation takes a function as input and, after applying that function to each element of an RDD, returns another RDD. The return type can differ from the input type.

Q32. What is the flatMap transformation in Spark?

Answer: flatMap is used when we want to produce zero or more output elements for each input element. Each input element is mapped to a sequence of outputs, and those sequences are flattened into a single RDD.

Q33. What is the reduce action in Spark?

Answer: reduce() takes as input a function with two parameters of the same type and returns a single value of that type.
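
A short sketch putting map, flatMap and reduce side by side (spark-shell assumed):

```scala
val lines = sc.parallelize(Seq("hello world", "hello spark"))

val lengths = lines.map(_.length)          // one output per input: 11, 11
val words   = lines.flatMap(_.split(" "))  // zero or more outputs per input:
                                           // hello, world, hello, spark
val total = lengths.reduce(_ + _)          // action, folds to one value: 22
```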

Q34. What is lazy evaluation in Spark?

Answer: When we apply a transformation to an RDD, Spark does not immediately produce output; it records the transformation in a DAG. Transformations are evaluated only when an action is performed. This is called lazy evaluation.

Q35. What is MLib in Spark?

Answer: MLlib is a distributed machine learning framework built on top of Spark.

Q36. What is GraphX in Spark?

Answer: GraphX is a distributed graph-processing framework built on top of Spark. It provides different APIs for expressing graph computation.

Q37. What is the Spark shell?

Answer: The Spark shell is a Spark application written in Scala. It offers a command-line environment with auto-completion, which is helpful when developing our own standalone Spark applications.

Q38. Write some functions of SparkContext.

Answer: SparkContext is used to create RDDs, accumulators and broadcast variables, to access Spark services and to run jobs. It also reports the status of the Spark application and can start and cancel jobs.

Q39. Write some functions of a Spark executor.

Answer: Executors run the tasks that make up the application and return the results to the driver. They also provide in-memory storage for RDDs cached by the user.

Q40. Which programming languages are supported by Spark?

Answer: Java, Python, Scala, SQL and R.

Q41. What is DAG in Spark?

Answer: A DAG (directed acyclic graph) is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs.

Q42. What is Caching in Spark Streaming?

Answer: Caching in Spark Streaming means storing a streaming RDD in memory. It is a mechanism to speed up applications that access the same RDD multiple times.

Q43. Which file systems are supported by Spark?

Answer: HDFS, the local file system and Amazon S3.

Q44. What is RDD?

Answer: A Resilient Distributed Dataset (RDD) is a fault-tolerant collection of partitioned data that can be processed in parallel. RDDs are immutable and distributed in nature.

Q45. Write some input sources for Spark Streaming.

Answer: TCP Sockets, Stream of files, Apache Kafka, Apache Flume, Kinesis etc.

Q46. Can we use Hive on Spark?

Answer: Yes, by creating a HiveContext (or, in Spark 2.x, a SparkSession with Hive support enabled).
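
A minimal sketch using the Spark 2.x API; the database and table names are placeholders (on Spark 1.x one would use new HiveContext(sc) instead):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-on-spark")
  .enableHiveSupport()   // lets Spark SQL read Hive tables and the metastore
  .getOrCreate()

spark.sql("SELECT COUNT(*) FROM some_db.some_table").show()  // hypothetical table
```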

Q47. What is the pipe() operation in Spark?

Answer: Spark programs are written in Scala, Java or Python. However, if one wants to pipe data through code written in another language, Spark provides a general mechanism for that in the form of the pipe() method.
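
A small sketch, assuming a Unix-like environment where the tr command is on the PATH:

```scala
val rdd = sc.parallelize(Seq("hello", "world"))

// Each element is written to the external command's stdin; each line of
// its stdout becomes an element of the resulting RDD
val upper = rdd.pipe("tr a-z A-Z")
upper.collect()   // Array(HELLO, WORLD)
```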

Q48. What are the data sources available in Spark SQL?

Answer: Parquet, Avro, JSON and Hive tables

Q49. What is a partition in Spark?

Answer: A partition in Spark is a logical division of the data stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark.

Q50. What are the types of partitioning in Apache Spark?

Answer: The types of partitioning in Apache Spark are as follows (a short sketch follows the list):

  • Hash Partitioning
  • Range Partitioning
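
A minimal sketch of both partitioners (spark-shell assumed):

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

// Hash partitioning: the partition is derived from the key's hash code
val hashed = pairs.partitionBy(new HashPartitioner(4))

// Range partitioning: keys are sampled and split into ordered ranges
val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))
```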

Q51. What are the types of Cluster managers in Spark?

Answer: Standalone, YARN and Mesos.
