Looking for Data Science Material ?
This Course is Your Secret Weapon to Landing Your Dream Job!
This course has 500+ High Impact Real Interview Questions to prepare for breakthrough interviews.
The time you devote to going through this course and crafting your own answers will provide you with a winning approach to make you a top candidate.
These Interview Q&As will broaden your knowledge base, leave you well prepared and confident for interviews, and act as a catalyst in the growth of your career.
Ans. The statistical power of a study (sometimes called sensitivity) is how likely the study is to distinguish an actual effect from one of chance.
It's the probability that the test correctly rejects the null hypothesis when the alternative is true. For example, a study with 80% power has an 80% chance of detecting an effect that is actually there.
A high statistical power means that the test results are likely valid. As the power increases, the probability of making a Type II error decreases.
A low statistical power means that the test results are questionable.
Statistical power helps you to determine if your sample size is large enough.
It is possible to perform a hypothesis test without calculating the statistical power, but if your sample size is too small, your results may be inconclusive when they would have been conclusive with a large enough sample.
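As a sketch, power can be computed for a simple two-sided one-sample z-test using scipy. The test type, helper name, and effect-size convention here are illustrative assumptions, not part of the answer above:

```python
from scipy.stats import norm

def z_test_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size: standardized effect (Cohen's d); n: sample size.
    Hypothetical helper for illustration only.
    """
    z_crit = norm.ppf(1 - alpha / 2)      # critical value, e.g. 1.96 for alpha=0.05
    shift = effect_size * (n ** 0.5)      # how far the alternative sits from the null
    # probability of landing beyond either critical boundary under the alternative
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)
```

Running `z_test_power(0.5, 32)` gives roughly 0.8, and increasing `n` increases the power, which is exactly the sample-size reasoning described above.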
Ans. Facts
India’s population in a year – 1.3 bill
Population breakup – Rural – 70% and Urban – 30%
Assumptions
Every year India’s population would grow steadily, but the growth won’t be very fast-paced.
Every man and woman will eventually marry (within or outside their own group). They won't die prematurely or choose not to marry, and people marry only once.
In rural areas the age of marriage is, on average, in the 15 – 35 year range; in urban areas, 20 – 35 years. India is a young country, and the 15 – 35 year range holds around 50% of the total population.
Rural estimation
Rural population = 70% * 1.3 bill = 910 mill ≈ 900 mill
Population within marriage age in a year = 50% * 900 mill = 450 mill
Number of marriages to happen = 450 / 2 = 225 mill marriages
These people will marry within a 20 year time period according to our assumptions.
Number of rural marriages in a year = 225 mill / 20 = 11.25 mill marriages
Urban estimation
Urban population = 30% * 1.3 bill = 390 mill ≈ 400 mill
Population within marriage age in a year = 50% * 400 mill = 200 mill
Number of marriages to happen = 200 / 2 = 100 mill marriages
These people will marry within a 15 year time period according to our assumptions.
Number of urban marriages in a year = 100 mill / 15 ≈ 6.7 mill marriages
Note and caveats
Many people die prematurely in accidents and won't marry. In addition, some people simply prefer not to marry at all. So, our market number is over-estimated. If we normalize it by introducing an error adjustment of around 15%–20%, the final number will be lower by around that much.
Answer = Approximately 14 million marriages occur in a year in India.
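The arithmetic above can be reproduced in a few lines. All figures are the answer's own assumptions, not census data, and the ~20% haircut at the end is the adjustment that lands on the quoted ~14 million:

```python
# Back-of-the-envelope sketch of the estimate above.
population = 1.3e9
rural = 0.70 * population           # ~910 million, rounded to 900M in the text
urban = 0.30 * population           # ~390 million, rounded to 400M in the text

rural_marriageable = 0.50 * 900e6   # 450M in the 15-35 band
urban_marriageable = 0.50 * 400e6   # 200M in the 20-35 band

rural_per_year = (rural_marriageable / 2) / 20   # 11.25M marriages/year
urban_per_year = (urban_marriageable / 2) / 15   # ~6.7M marriages/year

total = rural_per_year + urban_per_year          # ~17.9M
adjusted = total * 0.80                          # ~20% haircut -> ~14.3M
```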
Ans. Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical or redundant for classifying instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by reducing overfitting.
Pruning processes can be divided into two types (pre- and post-pruning).
Pre-pruning procedures prevent a complete induction of the training set by using a stopping criterion in the induction algorithm (e.g. a maximum tree depth or a minimum information gain). Pre-pruning methods are considered more efficient because they do not induce an entire tree; rather, trees remain small from the start.
Post-pruning is the most common way of simplifying trees. Here, nodes and subtrees are replaced with leaves to reduce complexity.
The procedures are differentiated on the basis of their approach in the tree (top-down or bottom-up).
Top-down fashion: It will traverse nodes and trim subtrees starting at the root
Bottom-up fashion: It will begin at the leaf nodes
There is a popular pruning algorithm called reduced error pruning, in which starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, the change is kept.
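As an illustration, scikit-learn exposes a different post-pruning method, minimal cost-complexity pruning, via the `ccp_alpha` parameter. This is not reduced error pruning, but it shows post-pruning shrinking a fully grown tree (the dataset choice is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A fully grown tree versus one post-pruned with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

# The pruned tree has fewer nodes than the fully grown one.
print(full.tree_.node_count, pruned.tree_.node_count)
```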
Ans. A forecast refers to a calculation or an estimation that uses data from previous events, combined with recent trends, to come up with a likely future outcome.
On the other hand, a prediction is an actual act of indicating that something will happen in the future with or without prior information.
Accuracy: A forecast is more accurate than a prediction. This is because forecasts are derived by analysing past data and present trends.
On the other hand, a prediction can be right or wrong. For example, if you predict the outcome of a football match, the result depends on how well the teams played no matter their recent performance or players.
Bias: Forecasting uses mathematical formulas and, as a result, is largely free from personal and intuitive bias.
On the other hand, predictions are in most cases subjective in nature.
Quantification: When using a model to do a forecast, it’s possible to come up with the exact quantity. For example, the World Bank uses economic trends, and the previous GDP values and other inputs to come up with a percentage value for a country’s economic growth.
However, when doing prediction, since there is no data for processing, one can only say whether the economy of a given country will grow or not.
Application: Forecasts are mainly applicable in fields such as economics and meteorology, where there is a lot of information about the subject matter.
On the contrary, prediction can be applied anywhere as long as there is an expected future outcome.
Ans. The Backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function are then considered to be a solution to the learning problem.
We need backpropagation because,
Calculate the error – how far the model output is from the actual output.
Minimum Error – Check whether the error is minimized or not.
Update the parameters – If the error is huge then, update the parameters (weights and biases). After that again check the error.
Repeat the process until the error becomes minimum.
Model is ready to make a prediction – Once the error becomes minimum, you can feed some inputs to your model and it will produce the output.
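The loop described above can be sketched for a single linear neuron trained with full-batch gradient descent. The toy data, learning rate, and stopping threshold are illustrative assumptions:

```python
import numpy as np

# Minimal sketch: one linear neuron learning y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    y_hat = w * x + b                  # forward pass
    error = np.mean((y_hat - y) ** 2)  # 1. calculate the error
    if error < 1e-6:                   # 2. check whether it is small enough
        break
    # 3. backpropagate: gradients of the MSE w.r.t. w and b
    grad_w = np.mean(2 * (y_hat - y) * x)
    grad_b = np.mean(2 * (y_hat - y))
    w -= lr * grad_w                   # 4. update the parameters
    b -= lr * grad_b
# 5. model is ready: w is close to 2 and b is close to 1
```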
Ans. Let's suppose you are being tested for a disease. If you have the illness, the test will always say you have the illness. However, if you don't have the illness, 5% of the time the test will say you have it, and 95% of the time it will give the accurate result that you don't.
Thus there is a 5% error rate in case you do not have the illness.
Out of 1000 people, 1 person who has the disease will get true positive result.
Out of the remaining 999 people, 5% will get a false positive result.
So close to 50 people will test positive for the disease without having it.
This means that out of 1000 people, 51 people will be tested positive for the disease even though only one person has the illness. There is only a 2% probability of you having the disease even if your reports say that you have the disease.
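The numbers above follow directly from Bayes' theorem; a quick check:

```python
# Reproduces the calculation above with Bayes' theorem.
prevalence = 1 / 1000        # 1 in 1000 people has the disease
sensitivity = 1.0            # the test always flags a true case
false_positive_rate = 0.05   # 5% of healthy people still test positive

# P(positive) = P(positive | disease) P(disease) + P(positive | healthy) P(healthy)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# P(disease | positive) by Bayes' theorem: about 0.0196, i.e. roughly 2%
p_disease_given_positive = sensitivity * prevalence / p_positive
```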
Ans. Most likely outliers will have a negligible effect because the nodes are determined based on the sample proportions in each split region (and not on their absolute values). However, different implementations choose split points of continuous variables differently: some consider all possible split points, others percentiles. In some poorly chosen cases (e.g. dividing the range between min and max into equidistant split points), outliers might lead to sub-optimal split points, but you shouldn't encounter these scenarios in popular implementations, and such split criteria are best avoided. On the whole, decision trees are quite robust to outliers.
Ans. The main difference between these two data types is the operations you can perform on them. Lists can contain elements of differing data types, whereas arrays are containers for elements of the same data type.
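In Python specifically, the built-in `list` versus the standard-library `array.array` illustrates this difference (a minimal example):

```python
from array import array

mixed = [1, "two", 3.0]         # a list can hold differing data types
nums = array("i", [1, 2, 3])    # an array is constrained to one type ('i' = int)

nums.append(4)                  # fine: same type
try:
    nums.append("five")         # rejected: wrong type
except TypeError:
    pass                        # arrays raise TypeError for mismatched types
```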
Ans. The idea in kNN methods is to identify ‘k’ samples in the dataset that are similar or close in the space. Then we use these ‘k’ samples to estimate the value of the missing data points. Each sample’s missing values are imputed using the mean value of the ‘k’-neighbors found in the dataset.
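For example, scikit-learn's `KNNImputer` implements this mean-of-k-neighbours imputation directly (the toy matrix below is illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, np.nan],
              [4.0, 5.0]])

# Fill the missing value with the mean of its 2 nearest neighbours,
# where distance is computed on the observed features only.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# The nearest rows to [3.0, nan] are [2.0, 3.0] and [4.0, 5.0],
# so the imputed value is (3.0 + 5.0) / 2 = 4.0.
```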
Ans. Techniques to reduce overfitting:
Increase training data.
Reduce model complexity.
Early stopping during the training phase.
Ridge Regularization and Lasso Regularization.
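As a sketch, the last item can be demonstrated by comparing Ridge (L2) and Lasso (L1) regularization against plain least squares. The synthetic data and penalty strengths below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))              # 20 features, only 3 truly matter
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty zeroes some out entirely

# The regularized models have smaller coefficient norms than plain OLS,
# which is what limits overfitting to noise in the training data.
```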