If you are looking for Data Science with Python interview questions and answers, whether you are experienced or a fresher, you are in the right place. Many reputed organizations around the world offer opportunities in this field. According to industry estimates, the Data Science with Python market is expected to grow to more than $5 billion by 2020, from just $180 million. You therefore still have the chance to move ahead in your career in Data Science with Python development. GangBoard offers advanced Data Science with Python interview questions and answers that help you crack your interview and land your dream career as a Data Science with Python developer.
Best Data Science with Python Interview Questions and Answers
Do you believe you have what it takes to be part of the future of Data Science with Python? GangBoard is here to guide you and nurture your career. Many Fortune 1000 organizations around the world use Data Science with Python to meet the needs of their customers, and it is applied across numerous industries. To help you grow in a Data Science with Python role, this page provides detailed interview questions and answers prepared by industry experts with 10+ years of experience. They are useful for freshers and experienced candidates looking for a challenging new job at a reputed company, and they are kept simple, with examples for better understanding.
With these Data Science with Python interview questions and answers, many students have been placed in reputed companies with high salary packages. Use them to grow in your career.
Q1) What is Data science? What is the role of Machine Learning in Data science?
Answer: Data science is a blend of tools and algorithms whose goal is to discover hidden patterns in raw data. Machine learning plays a central role: data science uses machine learning principles to analyse data and make predictions about the future.
Q2) How will you define supervised and unsupervised learning?
Answer: Supervised learning is a machine learning method in which all the data is labelled and the algorithm learns to predict the output from the input data. In unsupervised learning, all the data is unlabelled and the algorithm learns the inherent structure of the input data.
Q3) What do you mean by Type I error and Type II error in hypothesis testing?
Answer: A Type I error occurs when you reject the null hypothesis although it is actually true. It is also known as a ‘false positive’. A Type II error occurs when you fail to reject the null hypothesis although it is actually false. It is also known as a ‘false negative’.
Q4) How will you evaluate your regression model based on R2, Adjusted R2 and tolerance?
Answer: Tolerance is an indicator of multicollinearity: for each predictor it equals 1 - R2 obtained by regressing that predictor on the other predictors, so a high tolerance is desirable. R2 and Adjusted R2 are both important for model evaluation. R2 never decreases when more variables are added, whether or not they improve prediction accuracy, whereas Adjusted R2 increases only when an additional variable improves the model more than would be expected by chance; otherwise it decreases. So we can use Adjusted R2 and predicted R2 to include the correct number of variables in our regression model.
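As a small illustration of how Adjusted R2 penalizes extra variables, the sketch below applies the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1). The function name and the sample numbers are made up for the example.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical numbers: a sixth predictor nudges R2 from 0.800 to 0.801.
before = adjusted_r2(0.800, n_samples=50, n_predictors=5)
after = adjusted_r2(0.801, n_samples=50, n_predictors=6)

# Adjusted R2 goes down even though R2 went up slightly.
print(round(before, 4), round(after, 4))
```

Here the tiny gain in R2 does not justify the extra variable, so the adjusted value drops.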
Q5) What is Logistic regression? How will you evaluate your Logistic regression model?
Answer: Logistic regression is a classification technique that predicts a binary outcome from a linear combination of predictor variables.
The following methods are used for evaluating a logistic regression model:
- Since it predicts probabilities, we can use the AUC-ROC curve along with the confusion matrix to assess performance.
- Instead of the R2 used in linear regression, we can use the AIC value, a measure of fit that penalizes the model for the number of coefficients. We always prefer the model with the lower AIC value.
- Null deviance and residual deviance can be used to judge the model. Null deviance indicates how well the response is predicted by a model with only an intercept, and residual deviance indicates how well it is predicted once the independent variables are added. Lower values are better, and a residual deviance well below the null deviance shows that the predictors improve the fit.
Q6) What is the difference between ANOVA and t-test?
Answer: The t-test and ANOVA (Analysis of Variance) are both used to examine whether group means differ from one another. The key differences between these two statistical methods are:
- The t-test compares two groups to examine whether their means differ, using the t-distribution, which is appropriate when the population standard deviation is not known and the sample size is small.
- ANOVA is used to compare the means of two or more groups (typically three or more) in a single test, to find out whether at least one group mean differs from the others.
Q7) What is the difference between Overfitting and Underfitting?
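Answer: These are the main differences between overfitting and underfitting:
- Overfitting happens when a statistical model or machine learning algorithm captures the noise in the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well (low bias but high variance).
- Underfitting happens when a model or algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model does not fit the data well enough (high bias but low variance).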
Q8) What are the steps involved in an analytics project?
Answer: The following are the various steps involved in an analytics project:
- Understanding the business problem
- Data Exploration
- Data preparation for modelling
- Model building and analysing the results
- Model validation using a new data set
- Implementing the model and analysing its performance over a period of time
Q9) What are the main packages used in Python for Data science and Machine Learning?
Answer: There are many libraries for data science in Python. Important ones include NumPy and pandas.
Q10) How will you define the number of clusters in the K-Means clustering algorithm?
Answer: In the K-Means algorithm, “K” defines the number of clusters. The following methods are used to find the optimal number of clusters:
- Using a WSS (within-cluster sum of squares) plot, we can find the bending point (the ‘elbow’), and that point should be taken as K in K-Means.
- We can use a CH (Calinski-Harabasz) plot, where the K with the highest CH value is taken as K for K-Means clustering.
Q11) Differentiate between univariate, bivariate and multivariate analysis.
Answer: These are descriptive statistical analysis techniques that are distinguished by the number of variables involved in the analysis. For example, a pie chart of sales by region involves only one variable, so it is a univariate analysis.
If the analysis studies the relationship or differences between two variables, it is known as bivariate analysis.
Analysis that deals with more than two variables, to understand how much effect the variables have on the response, is referred to as multivariate analysis.
Q12) How is kNN different from K-Means clustering?
Answer: These are two different machine learning algorithms used for different purposes. The main differences are:
- K-Means is an unsupervised learning algorithm, whereas kNN is a supervised learning algorithm.
- K-Means is a clustering algorithm, whereas kNN is a classification (or regression) algorithm.
- The K-Means algorithm divides a data set into clusters such that each cluster is homogeneous and the points in each cluster are close to each other.
- The kNN algorithm tries to classify an unlabelled observation based on its k (which can be any number) nearest neighbours.
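To make the kNN side of the comparison concrete, here is a minimal pure-Python sketch of a k-nearest-neighbours classifier. The function name `knn_predict`, the Euclidean distance metric and the toy data are assumptions chosen for illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy labelled data: kNN is supervised, so every point carries a label.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (1.1, 1.0)))
```

Unlike K-Means, nothing is learned up front here: the labelled training points are simply consulted at prediction time.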
Q13) What are the assumptions required for linear regression?
Answer: The major assumptions include:
- There is a linear relationship between the dependent variable and the independent variables, meaning the model you are creating actually fits the data.
- There is minimal multicollinearity between explanatory variables.
- The variance around the regression line is the same for all values of the predictor variable (homoscedasticity).
Q14) You are given a dataset and you have built a decision tree model on top of it. You got an accuracy of 98%. Why shouldn’t you be happy with your model performance?
Answer: The likely problem is that the dataset is imbalanced, so we can’t rely on the 98% accuracy: the model may only be predicting the majority class correctly. To evaluate the model we should therefore use sensitivity, specificity and the F measure to determine class-wise performance. If minority class performance is found to be poor, we can take the following steps:
- Use under-sampling, over-sampling or SMOTE to balance the data.
- Assign class weights so that the minority classes get a larger weight.
- Alter the prediction threshold by doing probability calibration and find the optimal threshold using the AUC-ROC curve.
Q15) Where is the Naive Bayes algorithm mostly used for classification?
Answer: The most common areas are:
- Making predictions in real time, because it is an eager learning classifier and is therefore fast at prediction time.
- Text classification and sentiment analysis, where Naive Bayes is widely used because of its good performance on multiclass problems and its independence assumption.
Q16) What is the difference between Covariance and correlation?
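Answer: Correlation is a measure of how strongly two random variables are related. Covariance measures how two variables vary together, but its magnitude depends on the units of the variables. Correlation is the scaled form of covariance: the covariance divided by the product of the two standard deviations, which gives a unitless value between -1 and 1.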
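The scaling relationship between the two measures can be checked with a few lines of plain Python. This is only a sketch: the helper names and the height and weight figures are invented for the example.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

heights_cm = [150.0, 160.0, 170.0, 180.0]
weights_kg = [50.0, 58.0, 66.0, 80.0]
heights_m = [h / 100 for h in heights_cm]

# Changing the unit rescales the covariance but leaves the correlation unchanged.
print(covariance(heights_cm, weights_kg), covariance(heights_m, weights_kg))
print(correlation(heights_cm, weights_kg), correlation(heights_m, weights_kg))
```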
Q17) What is Gradient descent?
Answer: It is a first-order optimization algorithm. We can find the minimum of a convex function by starting at an arbitrary point and repeatedly taking steps in the downward direction, which is the negative direction of the gradient. After several iterations, and with a suitable step size, we eventually converge to the minimum.
Q18) How is gradient descent helpful in ML?
Answer: When fitting a model, the minimum corresponds to the coefficients with the minimum error, or the best line of fit. The learning rate α determines the size of the steps we take in the downward direction. In practice we mostly use Stochastic Gradient Descent (SGD) to find a local minimum. In SGD, instead of taking a step after evaluating the entire training set, we take a small batch of training data at random to determine our next step. This is computationally more efficient and may lead to faster convergence.
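The idea can be sketched by fitting a straight line with plain (batch) gradient descent on the mean squared error. The function name, learning rate, step count and data below are assumptions for illustration; an SGD variant would compute the same gradients on a random mini-batch at each step instead of the full data set.

```python
def fit_line(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of MSE = (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient; the learning rate sets the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

With a learning rate that is too large the updates would overshoot and diverge, which is why the step size matters as much as the direction.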
Q19) What is Regularization?
Answer: This is a form of regression that constrains or regularizes or shrinks the coefficient estimates towards zero relative to the least squares estimate. λ represents the tuning parameter- as λ increases, flexibility decreases → decreased variance but increased bias. The tuning parameter is key in determining the sweet spot between under and over-fitting. In addition, while Ridge will always produce a model with p variables, Lasso can force coefficients to be equal to zero.
- Lasso (L1): $\min \; \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$
- Ridge (L2): $\min \; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$
Q20) What do you mean by Imbalanced Classes?
Answer: Imbalanced classes in the training data lead to poor classifiers. The imbalance can result in a lot of false positives and leaves few training examples for the smaller class. Solutions include forcing balanced data by removing observations from the larger class, replicating data from the smaller class, or heavily weighting the training examples toward instances of the smaller class.
Q21) What is the bias-variance trade-off?
Answer: The bias-variance trade-off is an inherent part of predictive modeling, where models with lower bias tend to have higher variance and vice versa. The goal is to achieve low bias and low variance.
- Bias: error from incorrect assumptions made to make the target function easier to learn (high bias → missing relevant relations, or underfitting).
- Variance: error from sensitivity to fluctuations in the dataset, or how much the target estimate would differ if different training data were used (high variance → modeling noise, or overfitting).
Q22) What are the evaluation metrics for classification algorithms?
Answer: The confusion matrix, with its counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), is used to evaluate the model:
Accuracy: ratio of correct predictions over total predictions. Misleading when class sizes are substantially different. accuracy = (TP + TN) / (TP + TN + FN + FP)
Precision: how often the classifier is correct when it predicts positive: precision = TP / (TP + FP)
Recall: how often the classifier is correct for all positive instances: recall = TP / (TP + FN)
F-Score: single measurement to describe performance: F = 2 * (precision * recall) / (precision + recall)
ROC Curves: plot true positive rates against false positive rates for various thresholds, where the threshold is the point at which the model decides whether a data point is positive or negative (e.g. if >0.8, classify as positive). The best possible area under the ROC curve (AUC) is 1, while a random classifier scores 0.5, which corresponds to the main diagonal line.
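The formulas above translate directly into code. The sketch below uses a hypothetical, heavily imbalanced confusion matrix to show how accuracy can look excellent while precision and recall are poor; the function name and the counts are invented, and no guard against division by zero is included.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Hypothetical imbalanced test set: 10 actual positives, 990 actual negatives.
accuracy, precision, recall, f_score = classification_metrics(tp=2, fp=3, tn=987, fn=8)
print(accuracy, precision, recall, round(f_score, 4))
```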
Q23) What are Ensemble, Bagging and Boosting?
Answer: Ensemble learning is the strategy of combining many different classifiers/models into one predictive model. It revolves around the idea of voting: a so-called “wisdom of crowds” approach. The most predicted class will be the final prediction.
Bagging: an ensemble method that works by taking B bootstrapped subsamples of the training data and constructing B trees, each tree training on a distinct subsample. The predictions of the B trees are then combined by voting or averaging.
Boosting: the main idea is to improve the model where it is not performing well by using information from previously constructed classifiers. It is a slow learner and has three tuning parameters: the number of classifiers B, the learning parameter λ, and the interaction depth d (which controls the interaction order of the model).
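Q24) Explain the Naïve Bayes algorithm.
Answer: Naïve Bayes classifies an input vector X by computing the posterior probability of each class Ci with Bayes’ rule and choosing the most probable class: P(Ci|X) = [P(X|Ci) * P(Ci)] / P(X), where: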
- P(Ci): the prior probability of belonging to class i
- P(X): normalizing constant, or probability of seeing the given input vector over all possible input vectors
- P(X|Ci): the conditional probability of seeing input vector X given we know the class is Ci
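As a worked illustration of the formula, the sketch below computes the posterior for a made-up two-class spam example. The priors and word likelihoods are invented numbers; multiplying the per-word likelihoods reflects the “naïve” assumption that features are independent given the class.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule: P(Ci|X) = P(X|Ci) * P(Ci) / P(X) for every class Ci."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(X), the normalizing constant
    return {c: joint[c] / evidence for c in joint}

# Hypothetical numbers for a message containing the words "free" and "winner".
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": 0.8 * 0.6,   # P("free"|spam) * P("winner"|spam)
    "ham": 0.1 * 0.05,   # P("free"|ham) * P("winner"|ham)
}
result = posterior(priors, likelihoods)
print(result)
```

The class with the larger posterior is the prediction; here that is "spam", despite its smaller prior.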
Q25) Explain K-Means clustering?
Answer: K-Means clustering is a simple and elegant algorithm to partition a dataset into K distinct, non-overlapping clusters. Choose a K. Randomly assign a number between 1 and K to each observation; these serve as the initial cluster assignments. Then iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using a distance metric).
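The two steps above can be sketched in pure Python as follows. This is an illustrative implementation, not a production one: the function name, the Euclidean distance metric, the fixed seed and the way an empty cluster is re-seeded with a random data point are all assumptions.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of floats) into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Initial step: randomly assign each observation to one of the k clusters.
    assignments = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # (a) Each centroid is the per-feature mean of the points in its cluster.
        centroids = []
        for cluster in range(k):
            members = [p for p, a in zip(points, assignments) if a == cluster]
            if not members:
                # Assumption: re-seed an empty cluster with a random data point.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        # (b) Reassign each observation to the closest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
    return assignments, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cluster_ids, centres = kmeans(points, k=2)
print(cluster_ids, centres)
```

Which cluster gets label 0 or 1 depends on the random start, so only the grouping itself is meaningful.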
Q26) What is the difference between supervised and unsupervised learning?
Answer: Supervised: If you’re learning a task under supervision, someone is present judging whether you’re getting the right answer. Similarly, supervised learning means having a full set of labeled data while training the algorithm. Fully labeled means that each example in the training dataset is tagged with the answer the algorithm should come up with on its own. For instance, when shown a new image, the model compares it to the training examples to predict the correct label.
Unsupervised: The main aim of unsupervised learning is to model the distribution of the data in order to learn more about it. Algorithms are left to their own devices to discover and present the interesting structure in the data.
Q27) Mention Any Five Algorithms of Machine Learning.
- Decision Trees
- Neural Network
- Probabilistic networks
- Nearest Neighbor
- Support vector Machines
Q28) Explain why data cleaning is important in analysis.
Answer: Data cleaning is very important for data analysis in data science. It helps to access the data quickly, optimize the data, free up memory, reduce data storage costs, reduce data access time, and build predictions for future data analysis.
Q29) Why is a Python tuple preferred over a Python list?
Answer: A tuple is a little lighter than a list: it uses less memory and is faster to create, which can matter for very large collections. A tuple is preferable mainly when the data is immutable, meaning it is not going to be changed by the programmer or user, and it also prevents unexpected data modification or data corruption.
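The immutability point is easy to demonstrate. The variable names below are arbitrary; the example shows that item assignment works on a list, fails on a tuple, and that a tuple can serve as a dictionary key.

```python
point = (3, 4)          # tuple: immutable
readings = [3, 4]       # list: mutable

readings[0] = 99        # allowed: lists can be changed in place
try:
    point[0] = 99       # not allowed: tuples reject item assignment
except TypeError as exc:
    error_message = str(exc)

# Because this tuple is immutable and hashable, it can be used as a dictionary key.
distances = {point: 5.0}
print(readings, error_message, distances[point])
```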
Q30) What skills are required to learn data science with Python?
Answer: Anyone who decides to learn or move into data science with Python needs basic Python programming knowledge, such as data types, control statements, loops, and data structures like tuples, dictionaries and lists. They should also be strong in analytical and prediction skills, and know the common libraries and concepts well, such as vectors, matrices, arrays, NumPy and pandas.
Q31) Write the syntax for creating a string variable.
u_str = str(input("Enter the variable as string"))
Q32) What are the types of techniques in Machine Learning?
Answer: There are two techniques of Machine Learning:
- Genetic Programming
- Inductive Learning
Q33) List out the Supervised Learning Functions.
- Annotate Strings
- Speech recognition
- Predict time Series
Q34) Explain the split(), sub() and subn() methods of the “re” module.
split() – uses a regex pattern to split a given string into a list.
sub() – finds all substrings where the regex pattern matches and replaces them with a different string.
subn() – works like sub(), but returns the new string together with the number of replacements made.
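A short example of the three functions from Python's standard `re` module follows; the sample text and patterns are made up for illustration.

```python
import re

text = "2021-03-15, 2022-11-02; 2023-01-30"

# split(): break the string wherever the pattern matches.
parts = re.split(r"[,;]\s*", text)

# sub(): replace every match and return the new string.
masked = re.sub(r"\d{4}", "YYYY", text)

# subn(): same as sub(), but returns a (new_string, number_of_replacements) tuple.
masked_again, n_replaced = re.subn(r"\d{4}", "YYYY", text)

print(parts)
print(masked)
print(n_replaced)
```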
Q35) What are the types of joins?
Answer: There are two types of joins
- Inner Joins
- Outer Joins
Q36) Write a query that returns the details of each department and a count of the number of students in each, given the following tables:
STUDENTS containing: Stu_ID (Primary key) and Stu_Name
STUDENTS_DEPT containing: Stu_ID (Foreign key) and Dept_ID (Foreign key)
DEPT containing: Dept_ID (Primary key) and Dept_Name
select a.Dept_Name, count(b.Stu_ID) from DEPT a left join STUDENTS_DEPT b on a.Dept_ID = b.Dept_ID group by a.Dept_ID, a.Dept_Name
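One way to try such a query is with Python's built-in `sqlite3` module and an in-memory database. The sketch below uses a LEFT JOIN starting from DEPT so that departments without students still appear with a count of zero. The sample rows are invented, and SQLite is used only for convenience; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENTS (Stu_ID INTEGER PRIMARY KEY, Stu_Name TEXT);
    CREATE TABLE DEPT (Dept_ID INTEGER PRIMARY KEY, Dept_Name TEXT);
    CREATE TABLE STUDENTS_DEPT (
        Stu_ID INTEGER REFERENCES STUDENTS(Stu_ID),
        Dept_ID INTEGER REFERENCES DEPT(Dept_ID)
    );
    INSERT INTO STUDENTS VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO DEPT VALUES (10, 'Maths'), (20, 'Physics'), (30, 'History');
    INSERT INTO STUDENTS_DEPT VALUES (1, 10), (2, 10), (3, 20);
""")

# LEFT JOIN keeps departments with no students; COUNT(b.Stu_ID) ignores the NULLs.
rows = conn.execute("""
    SELECT a.Dept_Name, COUNT(b.Stu_ID)
    FROM DEPT a
    LEFT JOIN STUDENTS_DEPT b ON a.Dept_ID = b.Dept_ID
    GROUP BY a.Dept_ID, a.Dept_Name
    ORDER BY a.Dept_ID
""").fetchall()
print(rows)
conn.close()
```

Counting `b.Stu_ID` rather than `1` matters here: `count(1)` would report one student for a department that has none.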
Q37) Write an algorithm for sorting a numeric dataset in Python.
u_list = ["101", "204", "710", "806", "909"]
u_list = [int(k) for k in u_list]
u_list.sort()
print(u_list)
Q39) How will you reverse a list?
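A list can be reversed either in place or by building a reversed copy. The short sketch below shows the usual built-in options; the variable names are arbitrary.

```python
numbers = [1, 2, 3, 4]

reversed_copy = numbers[::-1]        # slicing returns a new reversed list
also_copy = list(reversed(numbers))  # reversed() returns an iterator
numbers.reverse()                    # reverse() changes the list in place

print(reversed_copy, also_copy, numbers)
```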
Q40) What are the types of database request connections that Flask allows?
Answer: Flask allows three ways to establish a database connection around a request:
- before_request()
- after_request()
- teardown_request()
Q41) Explain the use of the // division operator in Python.
6//3 = 2
6.0//3.0 = 2.0
It is the floor division operator: it divides two operands and returns the quotient rounded down to the nearest whole number, so for positive operands only the digits before the decimal point remain.
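A few extra cases, including a negative operand, show that the result is rounded down rather than truncated towards zero:

```python
print(6 // 3)        # 2
print(6.0 // 3.0)    # 2.0
print(7 // 2)        # 3
print(-7 // 2)       # -4, because the result is rounded down, not towards zero
print(divmod(7, 2))  # (3, 1): quotient and remainder together
```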
Q42) Mention the Different types of sequence learning process?
- Sequence Prediction
- Sequence generation
- Sequence recognition
- Sequential decision
Q43) Write the syntax for accessing a module written in Python from C.
Answer: module = PyImport_ImportModule("<modulename>");
Q44) Write the Components of relational evaluation techniques.
- Data Acquisition
- Ground Truth Acquisition
- Cross Validation Technique
- Significance Test
- Query Type
- Scoring Metric
Q45) How will you remove last object from a list?
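The last element of a list is usually removed with `pop()`, which also returns it, or with `del` when the value is not needed. A minimal sketch with arbitrary sample data:

```python
items = ["a", "b", "c"]

last = items.pop()   # removes and returns the last element
print(last, items)

del items[-1]        # removes the last element without returning it
print(items)
```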
Q46) What are the Various Methods for Sequential Supervised Learning?
Answer: Various methods to solve sequential supervised learning problems are:
- Sliding-window methods
- Recurrent sliding windows
- Hidden Markov models
- Maximum entropy Markov models
- Conditional random fields
- Graph transformer networks
Q47) What are the types of paradigms of ensemble methods?
Answer: There are two paradigms of ensemble methods:
- Sequential ensemble methods
- Parallel ensemble methods
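Q48) Give an example of unzipping.
coordinate = ['x1', 'y1', 'z1']
value = [33, 34, 35, 20, 69]
# zip() stops at the shortest input, so only three pairs are produced
result = zip(coordinate, value)
resultList = list(result)
# zip(*...) unzips the pairs back into separate tuples
c1, v1 = zip(*resultList)
print('c1 =', c1)
print('v1 =', v1)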
Q49) In what areas is pattern recognition used?
Answer: Pattern recognition is used in:
- Data mining
- Information retrieval
- Speech Recognition
- Computer Vision
Q50) How would you create an empty NumPy array?
Answer: To create an empty NumPy array, we have two options:
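Two common ways to do this, assuming NumPy is installed and imported as `np`, are sketched below:

```python
import numpy as np

empty_1d = np.array([])            # zero elements, shape (0,)
empty_2d = np.empty(shape=(0, 0))  # zero rows and zero columns

print(empty_1d.shape, empty_1d.size)
print(empty_2d.shape, empty_2d.size)
```

Both arrays contain no elements; they differ only in their number of dimensions.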
Q51) How would you make a Python script executable on Unix?
Answer: To make a Python script executable, it should satisfy two conditions:
- The script file must be in executable mode (for example, set with chmod +x).
- The first line of the file must start with a hash symbol (#), i.e. the shebang line #!/usr/local/bin/python
Q52) How can we create an empty NumPy array?
Answer: There are two ways to create an empty NumPy array: numpy.array([]) and numpy.empty(shape=(0, 0)). The second displays as:
array([], shape=(0, 0), dtype=float64)
Q53) What are the supported data types in Python?
Answer: Python has five standard data types: numbers, strings, lists, tuples and dictionaries.
Q54) Write a one-liner that will count the number of capital letters in a file.
Answer: The straightforward version reads the file and loops over every character:
with open(SOME_LARGE_FILE) as fh:
    count = 0
    text = fh.read()
    for character in text:
        if character.isupper():
            count += 1
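The loop above can also be condensed into the one-liner the question asks for, using a generator expression inside `sum()`. To keep the sketch self-contained it first writes a small temporary file; the file contents are made up, and in practice you would open your own file instead.

```python
import os
import tempfile

# Write a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Hello World\nPython and NumPy\n")
    sample_path = tmp.name

# The one-liner: stream the file line by line instead of reading it all at once.
with open(sample_path) as fh:
    count = sum(1 for line in fh for character in line if character.isupper())

print(count)
os.remove(sample_path)
```

Iterating over the file object avoids loading a large file into memory in one go, which `fh.read()` would do.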
Q55) Explain sequence learning.
Answer: Sequence learning is a method of learning and teaching in a logical manner.
Q59) What is Flask? Is Flask equivalent to an MVC model?
Answer: Flask is a minimalistic web framework. It behaves much like a Model-View-Controller (MVC) framework, although it does not enforce that structure.
Q60) How do you get the indices of the N maximum values in a NumPy array?
Answer: Sort the indices with argsort() and take the last N, for example arr.argsort()[-N:][::-1].
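A common way to get the indices of the N largest values is NumPy's `argsort`. The sketch below assumes NumPy is installed; the array contents and the value of `n` are made up.

```python
import numpy as np

scores = np.array([12, 45, 7, 89, 33, 61])
n = 3

# argsort() returns indices in ascending order of value, so take the last n
# and reverse them to list the largest value first.
top_indices = np.argsort(scores)[-n:][::-1]
print(top_indices)          # indices of the 3 largest values
print(scores[top_indices])  # the values themselves
```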
Q61) Write the syntax for creating a string variable.
A = input("string variable ")
How do you find the count of data?
By using a count query.
Q62) Given a list of tweets, find the 10 most used hashtags.
Answer: Store every hashtag in a dictionary that maps it to the number of times it occurs, then take the ten entries with the highest counts.
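The dictionary-of-counts idea can be sketched with `collections.Counter`, which is a dictionary subclass built for counting. The function name, the regular expression used to detect hashtags, the case-folding and the sample tweets are assumptions for illustration.

```python
import re
from collections import Counter

def top_hashtags(tweets, n=10):
    """Return the n most common hashtags as (hashtag, count) pairs."""
    counts = Counter()
    for tweet in tweets:
        # Lower-case so that #Python and #python are counted together.
        counts.update(tag.lower() for tag in re.findall(r"#\w+", tweet))
    return counts.most_common(n)

tweets = [
    "Learning #Python for #DataScience",
    "#python tips and tricks",
    "New #MachineLearning course #python",
]
print(top_hashtags(tweets, n=2))
```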
Q63) What are the types of Bias?
Answer: The types of bias include:
- Sampling Bias
- Time interval
Q64) What are the types of techniques in Machine Learning?
Answer: There are two techniques of Machine Learning:
- Genetic Programming
- Inductive Learning
Q65) What are the Different Data Structures in R?
Q66) Write a program to convert uppercase letters to lowercase.
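A minimal sketch for converting uppercase letters to lowercase uses the built-in `str.lower()` method; the sample string is arbitrary.

```python
text = "Data SCIENCE with Python"

lowered = text.lower()     # every cased character becomes lowercase
swapped = text.swapcase()  # swaps case instead: upper becomes lower and vice versa

print(lowered)
print(swapped)
```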
Q67) Explain why data cleaning is important in analysis.
Answer: In data science, cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process: as the number of data sources increases, the time taken to clean the data increases sharply because of the number of sources and the volume of data they generate. Cleaning alone might take up to 85% of the time, making it a critical part of the data analysis task.
Q68) Which algorithm finds the best approximate solution to the knapsack problem in a given time?
Answer: A greedy algorithm, which at each step takes the option that looks most promising before moving on to the next.
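One possible greedy heuristic for the 0/1 knapsack problem picks items in order of value per unit of weight. The sketch below is an illustration under that assumption; the function name and the items are hypothetical, and the result is an approximation rather than a guaranteed optimum.

```python
def greedy_knapsack(items, capacity):
    """Greedy 0/1 knapsack heuristic: take items in order of value per unit weight.

    items is a list of (name, value, weight) tuples. The result is fast to
    compute but is an approximation, not necessarily the optimal packing.
    """
    chosen, total_value, remaining = [], 0, capacity
    for name, value, weight in sorted(items, key=lambda it: it[1] / it[2], reverse=True):
        if weight <= remaining:
            chosen.append(name)
            total_value += value
            remaining -= weight
    return chosen, total_value

# Hypothetical items: (name, value, weight)
items = [("a", 60, 10), ("b", 100, 20), ("c", 120, 30)]
print(greedy_knapsack(items, capacity=50))
```

In this example the greedy choice yields a value of 160, while taking items "b" and "c" would give 220, which shows why the greedy answer is only approximate.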