**Data Science Interview Questions & Answers**

In this blog on data science interview questions and answers, I will introduce you to the most frequently asked questions in Data Science, Analytics and Machine Learning interviews. It is a guide to the concepts you need to clear a Data Science interview. To get in-depth knowledge of Data Science, you can enroll in the live Data Science Certification Training by Mildaintrainings, which includes 24/7 support.

Mildaintrainings, a pioneer in IT technical training, offers these data science interview questions and answers to help you crack your interview and land a role as a Data Science Developer. Mildaintrainings also provides demo interview sessions to help you prepare, as all our trainers are working professionals and corporate trainers.

**Below are the most frequently asked data science interview questions, with answers by experts:**

**Q1. What is Data Science?**

**Ans:** Data Science involves using automated methods to analyze massive amounts of data and to extract knowledge from them. By combining aspects of statistics, computer science, applied mathematics, and visualization, data science can turn the vast amounts of data the digital age generates into new insights and new knowledge.

**Q2. What are the important skills to have in Python with regard to data analysis?**

**Ans:** The following are some of the important skills that come in handy when performing data analysis using Python.

- Good understanding of the built-in data types especially lists, dictionaries, tuples and sets.
- Mastery of N-dimensional NumPy arrays.
- Mastery of pandas dataframes.
- Ability to perform element-wise vector and matrix operations on NumPy arrays. This requires the biggest shift in mindset for someone coming from a traditional software development background who’s used to for loops.
- Knowing that you should use the Anaconda distribution and the conda package manager.
- Familiarity with scikit-learn.
- Ability to write efficient list comprehensions instead of traditional for loops.
- Ability to write small, clean functions (important for any developer), preferably pure functions that don’t alter objects.
- Knowing how to profile the performance of a Python script and how to optimize bottlenecks.

The skills above will help you tackle most problems in data analytics and machine learning.
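As a small illustration of the shift from loops to element-wise operations mentioned in the list, here is a minimal sketch. The `prices` and `quantities` arrays are made-up example data, not something from a real dataset.

```python
import numpy as np

# Hypothetical example data: unit prices and quantities for three orders.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Traditional for loop, as someone from a software background might write it.
totals_loop = []
for price, quantity in zip(prices, quantities):
    totals_loop.append(price * quantity)

# List comprehension: same logic, more compact.
totals_comp = [price * quantity for price, quantity in zip(prices, quantities)]

# Element-wise (vectorized) NumPy operation: no explicit loop at all.
totals_vec = prices * quantities

print(totals_vec)
```

All three versions compute the same totals. On large arrays the vectorized form is usually the fastest, because the loop runs inside NumPy rather than in the Python interpreter.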

**Q3. What is Selection Bias?**

**Ans:** Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. It is the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
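The distortion can be seen in a toy example. The sketch below assumes a hypothetical population of satisfaction scores from 1 to 100 and a survey that only satisfied customers answer; the numbers are invented for illustration.

```python
from statistics import mean

# Hypothetical population: customer satisfaction scores from 1 to 100.
population = list(range(1, 101))

# Biased selection: only satisfied customers (score > 50) answer the survey.
biased_sample = [score for score in population if score > 50]

# Every 10th score, used here as a simple stand-in for a representative sample.
representative_sample = population[4::10]

population_mean = mean(population)
biased_mean = mean(biased_sample)
representative_mean = mean(representative_sample)

print(population_mean, biased_mean, representative_mean)
```

The biased sample overstates the average score considerably, while the evenly spread sample stays close to the population mean.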

**Q4. What is the difference between “long” and “wide” format data?**

**Ans:** In the wide format, a subject’s repeated responses are in a single row, and each response is in a separate column. In the long format, each row is one time point per subject, so a subject with several responses spans several rows.
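The two layouts can be converted into each other with pandas. This is a minimal sketch; the column names `subject`, `t1`, `t2`, `time` and `score` are invented for the example.

```python
import pandas as pd

# Wide format: one row per subject, one column per time point.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "t1": [5.0, 6.0],
    "t2": [7.0, 8.0],
})

# Long format: one row per subject per time point.
long = wide.melt(id_vars="subject", var_name="time", value_name="score")

# And back again: pivot the long table into one column per time point.
back_to_wide = long.pivot(index="subject", columns="time", values="score").reset_index()

print(long)
print(back_to_wide)
```

Here `melt` goes from wide to long and `pivot` goes from long to wide.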

**Q5. What do you understand by the term Normal Distribution?**

**Ans:** Data can be distributed in many ways: skewed to the left, skewed to the right, or jumbled with no clear pattern. In a normal distribution, the values of a random variable are spread around a central value with no bias to the left or right, forming a symmetrical bell-shaped curve.
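The symmetry of the bell curve can be checked with Python's standard library. This sketch uses a standard normal distribution (mean 0, standard deviation 1) as an assumed example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
dist = NormalDist(mu=0.0, sigma=1.0)

# Probability mass within one and two standard deviations of the mean.
within_one_sd = dist.cdf(1.0) - dist.cdf(-1.0)
within_two_sd = dist.cdf(2.0) - dist.cdf(-2.0)

# Symmetry: the density is the same at equal distances left and right of the mean.
is_symmetric = abs(dist.pdf(-0.5) - dist.pdf(0.5)) < 1e-12

print(round(within_one_sd, 4), round(within_two_sd, 4), is_symmetric)
```

Roughly 68% of the values fall within one standard deviation of the mean and roughly 95% within two.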

**Q6. What is the goal of A/B Testing?**

**Ans:** A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example is comparing the click-through rate of two versions of a banner ad.
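One common way to analyze a click-through experiment like this is a two-proportion z-test. The answer above does not name a specific test, so treat the choice as an assumption; the function name and the click and view counts are also made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference in click-through rate between A and B."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: banner A got 100 clicks in 1000 views, banner B got 130.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

A small p-value suggests the difference in click-through rate is unlikely to be due to chance alone.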

**Q7. What do you understand by statistical power of sensitivity and how do you calculate it?**

**Ans:** Sensitivity is commonly used to validate the accuracy of a classifier (logistic regression, SVM, random forest, etc.). Sensitivity is “correctly predicted true events / total true events”, where correctly predicted true events are the events that were actually true and that the model also predicted as true.

**Calculation of sensitivity is pretty straightforward.**

**Sensitivity = ( True Positives ) / ( Positives in Actual Dependent Variable )**

where true positives are positive events which are correctly classified as positives.
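The calculation can be sketched in a few lines of Python. The function name `sensitivity` and the label lists are illustrative assumptions, with 1 marking a positive event and 0 a negative one.

```python
def sensitivity(y_true, y_pred):
    """True positives divided by all actual positives."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("sensitivity is undefined without actual positives")
    return true_positives / actual_positives


# Hypothetical labels: 5 actual positives, 3 of which the model caught.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

print(sensitivity(y_true, y_pred))
```

Note that the false positive in the example (an actual 0 predicted as 1) does not affect the result, because only actual positives appear in the denominator.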

**Q8. What are the differences between overfitting and underfitting?**

**Ans:** In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data, so as to be able to make reliable predictions on general unseen data.

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would also have poor predictive performance.
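Both effects can be illustrated by fitting polynomials of different degree to the same noisy data. This is a sketch under assumed conditions: the true relationship is quadratic, the noise level and sample sizes are arbitrary, and the test targets are noise-free so that the error measures distance from the underlying trend.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Training data: a quadratic trend plus random noise.
x_train = np.linspace(-3, 3, 12)
y_train = x_train**2 + rng.normal(scale=1.0, size=x_train.size)

# Test data: the noise-free underlying relationship at new x values.
x_test = np.linspace(-2.9, 2.9, 50)
y_test = x_test**2


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((model(x) - y) ** 2))


train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = mse(model, x_train, y_train)
    test_mse[degree] = mse(model, x_test, y_test)
    print(f"degree {degree}: train MSE {train_mse[degree]:.3f}, "
          f"test MSE {test_mse[degree]:.3f}")
```

The degree-1 model underfits: its error is high on both sets. The degree-9 model has the lowest training error because it can chase the noise, which typically costs it accuracy on the test set compared with the degree-2 model.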

**Q9. What is Cluster Sampling?**

**Ans:** Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
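A minimal sketch of the idea, assuming a hypothetical population of students grouped by school: whole clusters are drawn at random, and every element of a chosen cluster is included. The function name, school names and seed are illustrative.

```python
import random


def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly pick whole clusters and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}


# Hypothetical population: students grouped by school (the clusters).
schools = {
    "north": ["ann", "ben", "cal"],
    "south": ["dee", "eli"],
    "east": ["fay", "gus", "hal", "ivy"],
    "west": ["jon", "kim"],
}

sample = cluster_sample(schools, n_clusters=2)
print(sample)
```

The sampling unit here is the school, not the individual student, which is what distinguishes this from simple random sampling.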

**Q10: What is Systematic Sampling?**

**Ans:** Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is traversed in a circular manner, so once you reach the end of the list, selection continues from the top again. The most common form of systematic sampling is the equal-probability method, in which every k-th element is selected after a random start.
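The circular traversal can be sketched as follows. The function name, the fixed seed and the step size of `len(frame) // n` are assumptions made for this example.

```python
import random


def systematic_sample(frame, n, seed=0):
    """Pick n elements at a fixed step from a random start, wrapping around."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and the size of the frame")
    rng = random.Random(seed)
    step = len(frame) // n
    start = rng.randrange(len(frame))
    # The modulo makes the traversal circular: past the end, continue from the top.
    return [frame[(start + i * step) % len(frame)] for i in range(n)]


# Hypothetical ordered sampling frame of 100 customer IDs.
frame = list(range(100))
sample = systematic_sample(frame, n=10)
print(sample)
```

Because the step is fixed and at most `len(frame) / n`, the selected elements are always distinct and evenly spaced through the list.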

**Q11: Explain cross-validation.**

**Ans:** Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
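One widely used variant is k-fold cross-validation, which the answer above does not name explicitly, so take it as an assumed example. In the sketch below the "model" is deliberately trivial (it predicts the mean of the training targets) to keep the focus on the splitting; the function names and data are illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_val_mse(y, k):
    """Average validation MSE of a mean-predictor across k folds."""
    scores = []
    for train, validation in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train) / len(train)
        fold_mse = sum((y[i] - prediction) ** 2 for i in validation) / len(validation)
        scores.append(fold_mse)
    return sum(scores) / len(scores)


# Hypothetical target values.
y = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 4.0, 6.0, 5.0, 5.0]
print(cross_val_mse(y, k=5))
```

Each observation is used for validation exactly once and for training in all other folds, so the averaged score reflects performance on data the model did not see during fitting.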

**Q12: What are Recommender Systems?**

**Ans:** Recommender Systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.

**Q13. What is Linear Regression?**

**Ans:** Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
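For a single predictor, the fitted line can be computed with the ordinary least squares formulas. This is a minimal sketch with made-up data; the function name is an assumption, not part of any library.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of X and Y divided by variance of X.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line Y = 2X + 1.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

slope, intercept = fit_simple_linear_regression(x, y)
predicted_y = intercept + slope * 6  # predict Y for a new X value
print(slope, intercept, predicted_y)
```

In practice a library such as scikit-learn would normally be used, but the closed-form version shows what is being estimated.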

**Q14. What is Collaborative filtering?**

**Ans:** Collaborative filtering is the filtering process used by most recommender systems to find patterns or information through collaboration among multiple viewpoints, data sources and agents.
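One simple form is user-based collaborative filtering, sketched below. The approach (cosine similarity over shared items, then a similarity-weighted average) is one choice among many, and the users, films and ratings are invented for the example.

```python
from math import sqrt

# Hypothetical ratings on a 1 to 5 scale; alice has not rated film_d yet.
ratings = {
    "alice": {"film_a": 5, "film_b": 3, "film_c": 4},
    "bob": {"film_a": 5, "film_b": 3, "film_c": 4, "film_d": 5},
    "carol": {"film_a": 1, "film_b": 5, "film_d": 1},
}


def cosine_similarity(u, v):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(u[item] ** 2 for item in shared))
    norm_v = sqrt(sum(v[item] ** 2 for item in shared))
    return dot / (norm_u * norm_v)


def predict_rating(user, item, ratings):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    similarity_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        similarity_total += similarity
    return weighted_sum / similarity_total if similarity_total else None


predicted = predict_rating("alice", "film_d", ratings)
print(predicted)
```

Because alice's past ratings match bob's more closely than carol's, the prediction for `film_d` is pulled toward bob's high rating.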