Looking for Data Science Material ?
This Course is Your Secret Weapon to Landing Your Dream Job!
This course has 500+ High Impact Real Interview Questions to prepare for breakthrough interviews.
The time you devote to going through this course and crafting your own answers will provide you with a winning approach to make you a top candidate.
These Interview Q&As will broaden your knowledge base, leave you well prepared and confident for interviews, and act as a catalyst in the growth of your career.
Ans. The statistical power of a study (sometimes called sensitivity) is how likely the study is to distinguish an actual effect from one of chance.
It's the probability that the test correctly rejects the null hypothesis when the alternative is true. For example, a study with 80% power has an 80% chance of detecting an effect that is actually there.
A high statistical power means that the test results are likely valid. As the power increases, the probability of making a Type II error decreases.
A low statistical power means that the test results are questionable.
Statistical power helps you to determine if your sample size is large enough.
It is possible to perform a hypothesis test without calculating the statistical power, but if your sample size is too small, your results may be inconclusive when they would have been conclusive with a large enough sample.
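As a sketch, power can be computed for a simple two-sided one-sample z-test using scipy. The test type, helper name, and effect-size convention here are illustrative assumptions, not part of the answer above:

```python
from scipy.stats import norm

def z_test_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size: standardized effect (Cohen's d); n: sample size.
    Hypothetical helper for illustration only.
    """
    z_crit = norm.ppf(1 - alpha / 2)      # critical value, e.g. 1.96 for alpha=0.05
    shift = effect_size * (n ** 0.5)      # how far the alternative sits from the null
    # probability of landing beyond either critical boundary under the alternative
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)
```

Running `z_test_power(0.5, 32)` gives roughly 0.8, and increasing `n` increases the power, which is exactly the sample-size reasoning described above.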
Ans. Facts
India’s population in a year – 1.3 bill
Population breakup – Rural – 70% and Urban – 30%
Assumptions
Every year India’s population would grow steadily, but the growth won’t be very fast-paced.
Every man and woman will eventually marry (within or outside their own group). They won't die prematurely or choose not to marry, and people marry only once.
In rural areas the age of marriage is, on average, in the 15 – 35 year range; in urban areas, 20 – 35 years. India is a young country, and the 15 – 35 year range holds around 50% of the total population.
Rural estimation
Rural population = 70% * 1.3 bill = 910 mill ≈ 900 mill
Population within marriage age in a year = 50% * 900 mill = 450 mill
Number of marriages to happen = 450 / 2 = 225 mill marriages
These people will marry within a 20 year time period according to our assumptions.
Number of rural marriages in a year = 225 mill / 20 = 11.25 mill marriages
Urban estimation
Urban population = 30% * 1.3 bill = 390 mill ≈ 400 mill
Population within marriage age in a year = 50% * 400 mill = 200 mill
Number of marriages to happen = 200 / 2 = 100 mill marriages
These people will marry within a 15 year time period according to our assumptions.
Number of urban marriages in a year = 100 mill / 15 ≈ 6.7 mill marriages
Note and caveats
Many people die prematurely in accidents and won't marry. In addition, some people simply prefer not to marry at all. So, our market number is over-estimated. If we normalize it by introducing an error adjustment of around 15%–20%, the final number will be lower by around that much.
Answer = Approximately 14 million marriages occur in a year in India.
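The arithmetic above can be reproduced in a few lines. All figures are the answer's own assumptions, not census data, and the ~20% haircut at the end is the adjustment that lands on the quoted ~14 million:

```python
# Back-of-the-envelope sketch of the estimate above.
population = 1.3e9
rural = 0.70 * population           # ~910 million, rounded to 900M in the text
urban = 0.30 * population           # ~390 million, rounded to 400M in the text

rural_marriageable = 0.50 * 900e6   # 450M in the 15-35 band
urban_marriageable = 0.50 * 400e6   # 200M in the 20-35 band

rural_per_year = (rural_marriageable / 2) / 20   # 11.25M marriages/year
urban_per_year = (urban_marriageable / 2) / 15   # ~6.7M marriages/year

total = rural_per_year + urban_per_year          # ~17.9M
adjusted = total * 0.80                          # ~20% haircut -> ~14.3M
```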
Ans. Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical or redundant for classifying instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by reducing overfitting.
Pruning processes can be divided into two types (pre- and post-pruning).
Pre-pruning procedures prevent a complete induction of the training set by using a stopping criterion in the induction algorithm (e.g. a maximum tree depth or a minimum information gain). Pre-pruning methods are considered more efficient because they do not induce an entire tree; rather, trees remain small from the start.
Post-pruning is the most common way of simplifying trees. Here, nodes and subtrees are replaced with leaves to reduce complexity.
The procedures are differentiated on the basis of their approach in the tree (top-down or bottom-up).
Top-down fashion: It will traverse nodes and trim subtrees starting at the root
Bottom-up fashion: It will begin at the leaf nodes
There is a popular pruning algorithm called reduced error pruning, in which starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, the change is kept.
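As an illustration, scikit-learn exposes a different post-pruning method, minimal cost-complexity pruning, via the `ccp_alpha` parameter. This is not reduced error pruning, but it shows post-pruning shrinking a fully grown tree (the dataset choice is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A fully grown tree versus one post-pruned with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

# The pruned tree has fewer nodes than the fully grown one.
print(full.tree_.node_count, pruned.tree_.node_count)
```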
Ans. A forecast refers to a calculation or an estimation that uses data from previous events, combined with recent trends, to come up with a likely future outcome.
On the other hand, a prediction is an actual act of indicating that something will happen in the future with or without prior information.
Accuracy: A forecast is more accurate than a prediction. This is because forecasts are derived by analysing past data and present trends.
On the other hand, a prediction can be right or wrong. For example, if you predict the outcome of a football match, the result depends on how well the teams played no matter their recent performance or players.
Bias: Forecasting uses mathematical formulas and, as a result, is largely free from personal and intuitive bias.
On the other hand, predictions are in most cases subjective in nature.
Quantification: When using a model to do a forecast, it’s possible to come up with the exact quantity. For example, the World Bank uses economic trends, and the previous GDP values and other inputs to come up with a percentage value for a country’s economic growth.
However, when doing prediction, since there is no data for processing, one can only say whether the economy of a given country will grow or not.
Application: Forecasts are mainly applicable in fields such as economics and meteorology, where there is a lot of information about the subject matter.
On the contrary, prediction can be applied anywhere as long as there is an expected future outcome.
Ans. The Backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function are then considered to be a solution to the learning problem.
We need backpropagation because,
Calculate the error – how far the model output is from the actual output.
Minimum Error – Check whether the error is minimized or not.
Update the parameters – If the error is huge then, update the parameters (weights and biases). After that again check the error.
Repeat the process until the error becomes minimum.
Model is ready to make a prediction – Once the error becomes minimum, you can feed some inputs to your model and it will produce the output.
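The loop described above can be sketched for a single linear neuron trained with full-batch gradient descent. The toy data, learning rate, and stopping threshold are illustrative assumptions:

```python
import numpy as np

# Minimal sketch: one linear neuron learning y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    y_hat = w * x + b                  # forward pass
    error = np.mean((y_hat - y) ** 2)  # 1. calculate the error
    if error < 1e-6:                   # 2. check whether it is small enough
        break
    # 3. backpropagate: gradients of the MSE w.r.t. w and b
    grad_w = np.mean(2 * (y_hat - y) * x)
    grad_b = np.mean(2 * (y_hat - y))
    w -= lr * grad_w                   # 4. update the parameters
    b -= lr * grad_b
# 5. model is ready: w is close to 2 and b is close to 1
```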
Ans. Let's suppose you are being tested for a disease. If you have the illness, the test will always say you have the illness. However, if you don't have the illness, 5% of the time the test will say you have it, and 95% of the time it will give the accurate result that you don't.
Thus there is a 5% error rate in case you do not have the illness.
Out of 1000 people, 1 person who has the disease will get true positive result.
Out of the remaining 999 people, 5% will get a false positive result.
So close to 50 people will test positive for the disease without having it.
This means that out of 1000 people, 51 people will be tested positive for the disease even though only one person has the illness. There is only a 2% probability of you having the disease even if your reports say that you have the disease.
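The numbers above follow directly from Bayes' theorem; a quick check:

```python
# Reproduces the calculation above with Bayes' theorem.
prevalence = 1 / 1000        # 1 in 1000 people has the disease
sensitivity = 1.0            # the test always flags a true case
false_positive_rate = 0.05   # 5% of healthy people still test positive

# P(positive) = P(positive | disease) P(disease) + P(positive | healthy) P(healthy)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# P(disease | positive) by Bayes' theorem: about 0.0196, i.e. roughly 2%
p_disease_given_positive = sensitivity * prevalence / p_positive
```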
Ans. Most likely outliers will have a negligible effect because the nodes are determined based on the sample proportions in each split region (and not on their absolute values). However, different implementations choose split points of continuous variables differently: some consider all possible split points, others percentiles. In some poorly chosen cases (e.g. dividing the range between min and max into equidistant split points), outliers might lead to sub-optimal split points, but you shouldn't encounter these scenarios in popular implementations, and such split criteria are best avoided. On the whole, decision trees are quite robust to outliers.
Ans. The main difference between these two data types is the operations you can perform on them. Lists can contain elements of differing data types, whereas arrays are containers for elements of the same data type.
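In Python specifically, the built-in `list` versus the standard-library `array.array` illustrates this difference (a minimal example):

```python
from array import array

mixed = [1, "two", 3.0]         # a list can hold differing data types
nums = array("i", [1, 2, 3])    # an array is constrained to one type ('i' = int)

nums.append(4)                  # fine: same type
try:
    nums.append("five")         # rejected: wrong type
except TypeError:
    pass                        # arrays raise TypeError for mismatched types
```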
Ans. The idea in kNN methods is to identify ‘k’ samples in the dataset that are similar or close in the space. Then we use these ‘k’ samples to estimate the value of the missing data points. Each sample’s missing values are imputed using the mean value of the ‘k’-neighbors found in the dataset.
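For example, scikit-learn's `KNNImputer` implements this mean-of-k-neighbours imputation directly (the toy matrix below is illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, np.nan],
              [4.0, 5.0]])

# Fill the missing value with the mean of its 2 nearest neighbours,
# where distance is computed on the observed features only.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# The nearest rows to [3.0, nan] are [2.0, 3.0] and [4.0, 5.0],
# so the imputed value is (3.0 + 5.0) / 2 = 4.0.
```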
Ans. Techniques to reduce overfitting:
Increase training data.
Reduce model complexity.
Early stopping during the training phase.
Ridge Regularization and Lasso Regularization.
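As a sketch, the last item can be demonstrated by comparing Ridge (L2) and Lasso (L1) regularization against plain least squares. The synthetic data and penalty strengths below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))              # 20 features, only 3 truly matter
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty zeroes some out entirely

# The regularized models have smaller coefficient norms than plain OLS,
# which is what limits overfitting to noise in the training data.
```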