If you are looking for Data Science with Python interview questions and answers, whether you are experienced or a fresher, you are in the right place. Many reputed organizations around the world offer opportunities in this field. According to industry estimates, the Data Science with Python market is expected to grow to more than $5 billion by 2020, from just $180 million. So you still have the chance to move ahead in your career in Data Science with Python development. Gangboard offers advanced Data Science with Python interview questions and answers that help you crack your interview and land your dream career as a Data Science with Python developer.
Best Data Science with Python Interview Questions and Answers
If you believe you have what it takes to be part of the future of Data Science with Python, GangBoard is here to guide you in building your career. Many Fortune 1000 organizations around the world use Data Science with Python to meet the needs of their customers, and it is used across numerous industries. To help you grow in a Data Science with Python role, this page provides detailed information in the form of interview questions and answers. The questions and answers are prepared by industry experts with 10+ years of experience. They are useful to freshers and experienced candidates looking for a challenging new job at a reputed company. The questions and answers are simple and include examples for better understanding.
With these Data Science with Python interview questions and answers, many students have been placed in reputed companies with high salary packages. Use them to grow in your career.
Q1) What is Data science? What is the role of Machine Learning in Data science?
Answer: Data science is a blend of tools and algorithms whose goal is to discover hidden patterns in raw data. Machine learning plays a central role: data science uses machine learning principles to analyse data and make predictions about the future.
Q2) How will you define supervised and unsupervised learning?
Answer: Supervised learning is a machine learning method in which all data is labelled and the algorithm learns to predict the output from the input data. In unsupervised learning all data is unlabelled and the algorithm learns the inherent structure of the input data.
Q3) What do you mean by Type I error and Type II error in hypothesis testing?
Answer: A Type I error occurs when you reject the null hypothesis although it is actually true. It is also known as a ‘false positive’. A Type II error occurs when you fail to reject the null hypothesis although it is actually false. It is also known as a ‘false negative’.
Q4) How will you evaluate your regression model based on R2, Adjusted R2 and tolerance?
Answer: Tolerance is used as an indicator of multicollinearity. It equals 1 minus the R2 obtained by regressing one predictor on the others (the reciprocal of the VIF), so a high tolerance is desirable. It is important to consider both R2 and adjusted R2 for model evaluation. R2 never decreases when more variables are added, whether or not they improve prediction accuracy, whereas adjusted R2 increases only when an additional variable genuinely improves the model; otherwise it stays flat or decreases. So we can use adjusted R2 and predicted R2 to choose the right number of variables for our regression model.
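The relationship between R2 and adjusted R2 can be sketched in a few lines of plain Python. The function names and the numbers below are hypothetical and only illustrate the standard formulas; they are not taken from any particular library.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    tss = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - rss / tss


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise R2 for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


def tolerance(r2_of_predictor_on_others):
    """Tolerance = 1 - R2 from regressing one predictor on the rest (1 / VIF)."""
    return 1 - r2_of_predictor_on_others


# Hypothetical numbers: a fourth predictor nudges R2 up only slightly,
# so adjusted R2 goes down even though R2 went up.
adj_with_3 = adjusted_r_squared(0.800, n_samples=30, n_predictors=3)
adj_with_4 = adjusted_r_squared(0.801, n_samples=30, n_predictors=4)
print(adj_with_3, adj_with_4)
```

In this made-up case the extra variable raises R2 but lowers adjusted R2, which is the signal that it is probably not worth including.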
Q5) What is Logistic regression? How will you evaluate your Logistic regression model?
Answer: Logistic regression is a classification technique that predicts a binary outcome from a linear combination of predictor variables.
The following methods are used for evaluating a logistic regression model:
- Since it predicts probabilities, we can use the AUC-ROC curve along with the confusion matrix to assess performance.
- Instead of R2 as in linear regression, we can use the AIC value, a measure of fit that penalizes the model for its number of coefficients. We prefer the model with the lower AIC value.
- The null deviance and residual deviance can be used to judge how well the model fits. Null deviance measures how well the response is predicted by a model with only an intercept, and residual deviance measures how well it is predicted once the independent variables are added. Lower values are better, and a residual deviance well below the null deviance indicates that the predictors improve the fit.
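As a rough illustration of the points above, the following sketch shows how a logistic model turns a linear combination of predictors into a probability, and how AIC is computed from a log-likelihood. The helper names and the numbers are assumptions for illustration; real projects would normally rely on a statistics library.

```python
from math import exp


def predict_probability(features, coefficients, intercept):
    """Logistic (sigmoid) transform of a linear combination of predictors."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1 / (1 + exp(-z))


def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical coefficients and log-likelihoods, for illustration only.
p = predict_probability([0.0, 0.0], coefficients=[1.0, 2.0], intercept=0.0)
aic_small_model = aic(log_likelihood=-10.0, n_parameters=3)
aic_large_model = aic(log_likelihood=-9.8, n_parameters=6)
print(p, aic_small_model, aic_large_model)
```

Here the larger model fits slightly better but pays a bigger penalty for its extra coefficients, so the smaller model has the lower AIC.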
Q6) What is the difference between ANOVA and t-test?
Answer: The t-test and ANOVA (Analysis of Variance) are both used to examine whether group means differ from one another. The key differences between these two statistical methods are:
- The t-test compares two groups to examine whether their means differ from one another, using the t-distribution, which is appropriate when the population standard deviation is not known and the sample size is small.
- ANOVA is a statistical method used to compare the means of two or more groups (typically three or more) by testing whether at least one group mean differs from the others.
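The link between the two tests can be seen in a small sketch: for exactly two groups, the one-way ANOVA F statistic equals the square of the pooled-variance t statistic. The function names and sample values below are made up for illustration, and only the test statistics are computed, not p-values.

```python
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def anova_f_statistic(*groups):
    """One-way ANOVA F: between-group variance over within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [6.2, 6.8, 7.1, 6.5, 7.0]
t_stat = pooled_t_statistic(group_a, group_b)
f_stat = anova_f_statistic(group_a, group_b)
print(t_stat ** 2, f_stat)  # the two values agree up to rounding
```

With three or more groups only the ANOVA function applies, since a single t statistic compares just two means.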
Q7) What is the difference between Overfitting and Underfitting?
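Answer: These are the main differences between overfitting and underfitting:
- Overfitting happens when a statistical model or machine learning algorithm captures the noise in the data. Intuitively, overfitting occurs when the model or the algorithm fits the training data too well (low bias but high variance).
- Underfitting happens when the model or the algorithm is too simple to capture the underlying trend in the data, so it performs poorly even on the training data (high bias but low variance).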
Q8) What are the steps involved in an analytics project?
Answer: The following are the various steps involved in an analytics project:
- Understanding the business problem
- Data Exploration
- Data preparation for modelling
- Model building and analysing the results
- Model validation using a new data set
- Implementing the model and analysing its performance over time.
Q9) What are the main packages used in Python for Data science and Machine Learning?
Answer: There are a lot of libraries for data science in Python. Widely used ones include NumPy, pandas, SciPy, Matplotlib and scikit-learn.
Q10) How will you define the number of clusters in the K-Means clustering algorithm?
Answer: In the K-Means algorithm, “K” defines the number of clusters. The methods used to find the optimal number of clusters are the following:
- Using a WSS (within-cluster sum of squares) plot, we can find the bending point (the “elbow”), and that point should be taken as K in K-Means.
- We can use a CH (Calinski-Harabasz) plot, where the number of clusters with the highest CH value is taken as K for the K-Means clustering.
Q11) Differentiate between univariate, bivariate and multivariate analysis?
Answer: These are descriptive statistical analysis techniques that are distinguished by the number of variables involved in the analysis. For example, a pie chart of sales by region involves only one variable and is known as univariate analysis.
If the analysis attempts to find the relationship or differences between two variables, it is known as bivariate analysis.
Analysis that deals with more than two variables, to understand how much effect the variables have on the response, is referred to as multivariate analysis.
Q12) How is kNN different from K-Means clustering?
Answer: These are two different machine learning algorithms used for different purposes. The main differences are:
- K-Means is an unsupervised learning algorithm and kNN is a supervised learning algorithm.
- K-Means is a clustering algorithm whereas kNN is a classification (or regression) algorithm.
- The K-Means algorithm divides a data set into clusters such that each cluster is homogeneous and the points in each cluster are close to each other.
- The kNN algorithm tries to classify an unlabelled observation based on its k (can be any number) surrounding neighbours.
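A minimal kNN classifier shows why it is supervised: it needs labelled training points and simply votes among the k nearest ones. This is a plain-Python sketch with hypothetical data and a hypothetical function name, not a tuned implementation.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled points."""
    labelled = zip(train_points, train_labels)
    neighbours = sorted(labelled, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled training data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, query=(2, 2)))
print(knn_predict(points, labels, query=(7, 8)))
```

K-Means, by contrast, would receive only `points` and would have to discover the two groups without ever seeing `labels`.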
Q13) What are the assumptions required for linear regression?
Answer: There are four major assumptions:
- There is a linear relationship between the dependent variable and the independent variables, meaning the model you are creating actually fits the data
- The errors (residuals) are normally distributed and independent of each other
- There is minimal multicollinearity between explanatory variables
- Homoscedasticity: the variance around the regression line is the same for all values of the predictor variable.
Q14) You are given a dataset and you have built a decision tree model on top of it. You got an accuracy of 98%. Why shouldn’t you be happy with your model performance?
Answer: A likely problem is that the dataset is imbalanced, so we can’t rely on the 98% accuracy because the model may only be predicting the majority class correctly. Hence, in order to evaluate the model we should use sensitivity, specificity and F measure to determine the class-wise performance. If minority class performance is found to be poor, we can take the following steps:
- Use under-sampling, oversampling or SMOTE to balance the data
- Assign weights to the classes such that the minority classes get a larger weight
- Alter the prediction threshold by calibrating the predicted probabilities and finding the optimal threshold using the AUC-ROC curve.
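The 98% figure can be reproduced with a toy example. The sketch below uses made-up labels (2 positives out of 100) and a hypothetical model that always predicts the majority class; the function name is an assumption for illustration.

```python
def accuracy_sensitivity_specificity(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity (recall on positives) and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical imbalanced data: 2 positives, 98 negatives.
y_true = [1] * 2 + [0] * 98
# A useless model that always predicts the majority class.
y_pred = [0] * 100

acc, sens, spec = accuracy_sensitivity_specificity(y_true, y_pred)
print(acc, sens, spec)
```

The accuracy is 0.98 even though the model never finds a single positive case, which is why class-wise metrics are needed here.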
Q15) Where do we mostly use the Naive Bayes algorithm for classification?
Answer: The most common areas are:
- Making predictions in real time, because it is an eager learning classifier
- Text classification and sentiment analysis, where Naive Bayes is widely used because of its good performance in multiclass problems and its independence assumption.
Q16) What is the difference between Covariance and correlation?
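Answer: Covariance measures how two random variables vary together; its sign gives the direction of the relationship, but its magnitude depends on the units of the variables. Correlation measures how strongly two random variables are related. It is the scaled form of covariance (covariance divided by the product of the two standard deviations), so it is unitless and always lies between -1 and 1.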
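A short sketch makes the scaling relationship concrete. The helper names and data are hypothetical; the formulas are the usual sample covariance and Pearson correlation.

```python
from statistics import mean


def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) ** 0.5 * covariance(y, y) ** 0.5)


# Hypothetical data: y is a perfect linear function of x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
y_rescaled = [100 * v for v in y]  # same relationship, different units

print(covariance(x, y), covariance(x, y_rescaled))
print(correlation(x, y), correlation(x, y_rescaled))
```

Changing the units of `y` multiplies the covariance by 100 but leaves the correlation unchanged.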
Q17) What is Gradient descent?
Answer: It is a first-order optimization algorithm. We can find the minimum of a convex function by starting at an arbitrary point and repeatedly taking steps in the downward direction, which is given by the negative of the gradient. With a suitable step size, after enough iterations we will converge to the minimum.
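The procedure can be sketched on a one-dimensional convex function. The function f(x) = (x - 3)^2, the starting point and the step size are all illustrative assumptions.

```python
def gradient_descent(grad, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * grad(x)
    return x


# Hypothetical convex function f(x) = (x - 3)^2 with gradient 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=10.0)
print(minimum)  # close to 3, the true minimiser
```

With this step size the distance to the minimum shrinks by a constant factor on every iteration; a step size that is too large would make the iterates diverge instead.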
Q18) How is gradient descent helpful in ML?
Answer: In machine learning, gradient descent is used to minimize the error (loss) function. The minimum corresponds to the coefficients with the minimum error, or the line of best fit. The learning rate α determines the size of the steps we take in the downward direction. In practice we mostly use Stochastic Gradient Descent (SGD) to find a local minimum. SGD: instead of taking a step after evaluating the entire training set, we take a small batch of training data at random to determine our next step. This is computationally more efficient and may lead to faster convergence.
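A mini-batch SGD fit of a straight line can be sketched as follows. The data, the function name and all hyperparameters (learning rate, batch size, epochs, seed) are assumptions chosen for a small noise-free example, not recommended defaults.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.1, batch_size=4, epochs=500, seed=0):
    """Fit y = w * x + b by mini-batch stochastic gradient descent on MSE."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch only.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)  # approximately 2 and 1
```

Each update looks at only four observations, so a single step is cheap; the price is that individual steps are noisier than full-batch gradient descent.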
Q19) What is Regularization?
Answer: Regularization is a form of regression that constrains (shrinks) the coefficient estimates towards zero relative to the least squares estimate. λ is the tuning parameter: as λ increases, flexibility decreases → decreased variance but increased bias. The tuning parameter is key in finding the sweet spot between underfitting and overfitting. In addition, while Ridge will always produce a model with all p variables, Lasso can force some coefficients to be exactly zero.
- Lasso (L1): $\min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$
- Ridge (L2): $\min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$
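The different behaviour of the two penalties is easiest to see with a single predictor and no intercept, where both problems have a closed-form solution. This simplified setting, the function names and the data are assumptions made for illustration.

```python
def ridge_coefficient(xs, ys, lam):
    """Minimiser of sum((y - b*x)^2) + lam * b^2 for one predictor."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_coefficient(xs, ys, lam):
    """Minimiser of sum((y - b*x)^2) + lam * |b| (soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return (1 if sxy >= 0 else -1) * shrunk / sxx


# Hypothetical data with least squares slope 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for lam in (0.0, 14.0, 56.0):
    print(lam, ridge_coefficient(xs, ys, lam), lasso_coefficient(xs, ys, lam))
```

As λ grows, the ridge coefficient shrinks towards zero without reaching it, while the lasso coefficient becomes exactly zero once λ is large enough.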
Q20) What do you mean by Imbalanced Classes?
Answer: Imbalanced classes occur when one class has far more training examples than another. This imbalance in the training data leads to poor classifiers: it can result in many false predictions for the minority class, which has too few training examples to learn from. Solutions include forcing balanced data by removing observations from the larger class, replicating data from the smaller class, or weighting the training examples more heavily toward instances of the smaller class.
Q21) What is the bias-variance trade-off?
Answer: The bias-variance trade-off is an inherent part of predictive modeling, where models with lower bias tend to have higher variance and vice versa. The goal is to achieve both low bias and low variance.
- Bias: error from incorrect assumptions made to make the target function easier to learn (high bias → missing relevant relations, or underfitting)
- Variance: error from sensitivity to fluctuations in the dataset, or how much the target estimate would differ if different training data was used (high variance → modeling noise, or overfitting).
Q22) What are the evaluation metrics in Classification algorithm?
Answer: The confusion matrix is used to evaluate the model:
Accuracy: ratio of correct predictions over total predictions. Misleading when class sizes are substantially different. accuracy = (TP + TN) / (TP + TN + FN + FP)
Precision: how often the classifier is correct when it predicts positive: precision = TP / (TP + FP)
Recall: how often the classifier is correct for all positive instances: recall = TP / (TP + FN)
F-Score: single measurement to describe performance: F = 2 * (precision * recall) / (precision + recall)
ROC Curves: plots true positive rates and false positive rates for various thresholds, or where the model determines if a data point is positive or negative (e.g. if >0.8, classify as positive). Best possible area under the ROC curve (AUC) is 1, while random is 0.5, or the main diagonal line.
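The formulas above can be computed directly from the four cells of a confusion matrix. The counts below are hypothetical and the function name is an assumption for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical confusion matrix counts.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

Returning 0.0 for an undefined precision or recall is one possible convention; other tools may raise a warning or return NaN instead.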
Q23) What are Ensemble, Bagging and Boosting?
Answer: Ensemble learning is the strategy of combining many different classifiers/models into one predictive model. It revolves around the idea of voting: a so-called “wisdom of crowds” approach. The most predicted class will be the final prediction.
Bagging: an ensemble method that works by taking B bootstrapped subsamples of the training data and constructing B trees, each tree trained on a distinct subsample.
Boosting: the main idea is to improve the model where it is not performing well by using information from previously constructed classifiers. It is a slow learner and has 3 tuning parameters: number of classifiers B, learning parameter λ, and interaction depth d (controls the interaction order of the model).
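Two building blocks of bagging, bootstrap sampling and majority voting, can be sketched in plain Python. The helper names are hypothetical, and the base learners that would be trained on each subsample are left out.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) observations from `data` with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Final ensemble prediction: the most frequently predicted class."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
training_data = list(range(10))  # hypothetical observation ids
subsamples = [bootstrap_sample(training_data, rng) for _ in range(3)]  # B = 3

# Hypothetical predictions from three classifiers for one observation.
print(majority_vote(["spam", "ham", "spam"]))
```

Because sampling is with replacement, each subsample usually repeats some observations and omits others, which is what makes the B models differ from one another.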
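Q24) Explain the Naïve Bayes algorithm?
Answer: Naïve Bayes is a probabilistic classifier that applies Bayes’ theorem under the “naïve” assumption that the features are conditionally independent given the class. For a class Ci and an input vector X: P(Ci|X) = [P(X|Ci) * P(Ci)] / P(X), where: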
- P(Ci): the prior probability of belonging to class i
- P(X): normalizing constant, or probability of seeing the given input vector over all possible input vectors
- P(X|Ci): the conditional probability of seeing input vector X given we know the class is Ci
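The formula can be worked through with a toy example. The priors and per-word likelihoods below are invented numbers, and the function name is an assumption; the code multiplies the per-feature likelihoods, which is where the independence assumption enters.

```python
from math import prod


def naive_bayes_posteriors(priors, likelihoods, features):
    """Return P(Ci|X) for every class, assuming independent features."""
    # Numerator of Bayes' rule: P(X|Ci) * P(Ci), with P(X|Ci) as a product.
    joint = {
        c: priors[c] * prod(likelihoods[c][f] for f in features)
        for c in priors
    }
    evidence = sum(joint.values())  # P(X), the normalizing constant
    return {c: joint[c] / evidence for c in joint}


# Hypothetical probabilities for a tiny spam filter.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.5, "meeting": 0.1},
    "ham": {"offer": 0.1, "meeting": 0.4},
}
posteriors = naive_bayes_posteriors(priors, likelihoods, features=["offer"])
print(posteriors)
```

The predicted class is the one with the largest posterior, which here is "spam" for a message containing the word "offer".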
Q25) Explain K-Means clustering?
Answer: K-Means clustering is a simple and elegant algorithm to partition a dataset into K distinct, non-overlapping clusters. Choose a K. Randomly assign a number between 1 and K to each observation. These serve as the initial cluster assignments. Then iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using a distance metric).
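Steps (a) and (b) above can be sketched in plain Python as follows. The function name, the Euclidean distance, the 0-based cluster numbers and the handling of empty clusters are assumptions of this sketch rather than part of the description above.

```python
import random
from math import dist


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (tuples of p features) into k groups."""
    rng = random.Random(seed)
    # Initial step: randomly assign a cluster number (0..k-1) to each point.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step (a): each centroid is the per-feature mean of its members.
        centroids = []
        for cluster in range(k):
            members = [p for p, lab in zip(points, labels) if lab == cluster]
            if not members:
                # Assumption: re-seed an empty cluster with a random point.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        # Step (b): reassign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two well-separated groups of points.
data = [(1, 1), (2, 2), (3, 3), (11, 11), (12, 12), (13, 13)]
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```

On less clean data the result depends on the random initial assignment, so it is common to run the algorithm several times with different seeds and keep the best clustering.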
Q26) What is the difference between supervised and unsupervised learning?
Answer: Supervised: If you’re learning a task under supervision, someone is present judging whether you’re getting the right answer. Similarly, supervised learning means having a full set of labeled data while training the algorithm. Fully labeled means that each example in the training dataset is tagged with the answer the algorithm should come up with on its own. When shown a new example (for instance, a new image), the model compares it to the training examples to predict the correct label.
Unsupervised: The main aim of unsupervised learning is to model the distribution of the data in order to learn more about it. Algorithms are left to their own devices to discover and present the interesting structure in the data.