
While driving to work today, I was listening to the radio when a claim by the RJ (radio jockey) struck me: 77% of Indians think that the living room defines their home. This got me wondering: what was their sample set, and how did they arrive at this proclamation? They definitely haven’t surveyed all of India to get these results. Nobody does; it would be prohibitively expensive and time-consuming. So I thought of jotting down my thoughts on how sampling is done, how we extrapolate results from a sample, and the dangers that lurk in these exercises.

To start with, let us consider this textbook example: a company looking to launch a new product does a market survey to find out the demand in a particular region. They take a sample of 1000 people, 500 young and 500 older (30+ years), and find that about 50% of them would be interested in the product. Buoyed by this, they decide to launch the product in that region, and then the truth hits home: only 20% of people actually buy the product. What could have gone wrong here? A number of things.

For starters, the sample contained equal numbers of young and older people, but what about their ratio in the general population? As it turns out, young people made up only 20% of the region, but the sample gave them 50% representation, so the results were not reliable. There are two ways to address this: either draw the sample in the same proportions as the population, or apply weights to each group’s responses so that they count in proportion to the group’s share of the population. This kind of mismatch is also one of the biggest reasons election predictions end up way off the results.
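To make the weighting idea concrete, here is a minimal sketch in Python. The 500/500 sample and the 20%/80% population split come from the example above; the per-group interest counts (400 of the young, 100 of the older respondents) are numbers I made up purely for illustration, as are the function names.

```python
# Hypothetical survey results: the "interested" counts are invented for
# illustration; only the 500/500 split and the 20%/80% population shares
# come from the example in the text.
sample = {
    "young": {"respondents": 500, "interested": 400},
    "older": {"respondents": 500, "interested": 100},
}
population_share = {"young": 0.20, "older": 0.80}


def unweighted_rate(sample):
    """Interest rate if every respondent counts equally."""
    total = sum(group["respondents"] for group in sample.values())
    interested = sum(group["interested"] for group in sample.values())
    return interested / total


def weighted_rate(sample, population_share):
    """Interest rate with each group scaled to its share of the population."""
    return sum(
        population_share[name] * group["interested"] / group["respondents"]
        for name, group in sample.items()
    )


print(f"unweighted: {unweighted_rate(sample):.0%}")
print(f"weighted:   {weighted_rate(sample, population_share):.0%}")
```

With these made-up counts, the raw survey says 50% while the weighted estimate is 32%, which shows how far an over-represented group can pull the headline number.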

The other thing that could have gone wrong is the purchasing capacity of young people. If the sample consisted largely of college students and the product in question was something like an iPhone, which is pretty expensive in markets like India, then in all likelihood the survey wouldn’t throw up actionable results: being interested in a product is not the same as being able to buy it. If that is indeed the case, we need to change our sampling strategy, and the methods that come in handy are stratified sampling and cluster sampling.

In stratified sampling, you divide the population into multiple strata and then take a random sample from each stratum. No stratum should be left out, and at the same time no stratum should be over-represented. Within each stratum, the sampling needs to be random.
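Here is a rough sketch of what that could look like in Python, assuming proportional allocation (each stratum gets a share of the sample equal to its share of the population). The function name `stratified_sample`, the toy population and the `age_group` field are all hypothetical, chosen only to mirror the young/older example.

```python
import random


def stratified_sample(population, stratum_of, sample_size, seed=None):
    """Draw a random sample from every stratum, proportional to its size."""
    rng = random.Random(seed)

    # Group the population by stratum.
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)

    sample = []
    for members in strata.values():
        # Proportional allocation; at least one person so no stratum is left out.
        # Rounding means the total can differ slightly from sample_size.
        k = max(1, round(sample_size * len(members) / len(population)))
        k = min(k, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy population: 20% young, 80% older (hypothetical data).
population = [
    {"id": i, "age_group": "young" if i < 2000 else "older"}
    for i in range(10000)
]
picked = stratified_sample(population, lambda p: p["age_group"], 1000, seed=42)
young = sum(1 for p in picked if p["age_group"] == "young")
print(len(picked), young)
```

Because the allocation follows the population shares, the young group ends up with 200 of the 1000 slots here instead of the 500 it had in the flawed survey.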

In cluster sampling, you subdivide the population or the strata into a number of clusters, and then randomly pick the clusters in which you will conduct the sampling. If the clusters are not picked at random, you run the risk of getting the results you were looking for rather than correct insights. As you may have noticed, people in some clusters will never be interviewed because their cluster was not picked. To a large extent, random selection of clusters covers for this, so it doesn’t distort the overall picture.
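A similar sketch for cluster sampling, again in Python and again with made-up names: `cluster_sample`, the `ward_*` clusters and their sizes are hypothetical. It picks the clusters at random first and then interviews a random handful of people inside each chosen cluster.

```python
import random


def cluster_sample(clusters, n_clusters, per_cluster, seed=None):
    """Randomly pick clusters, then randomly sample people within each one."""
    rng = random.Random(seed)

    # Stage 1: choose which clusters get surveyed at all.
    chosen = rng.sample(sorted(clusters), n_clusters)

    # Stage 2: sample people only inside the chosen clusters.
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen, sample


# Toy data: 50 wards of 200 people each (hypothetical).
clusters = {
    f"ward_{i}": [f"ward_{i}_person_{j}" for j in range(200)]
    for i in range(50)
}
chosen, sample = cluster_sample(clusters, n_clusters=5, per_cluster=20, seed=7)
print(chosen, len(sample))
```

Most wards are never visited, which keeps the survey cheap; the random choice in stage 1 is what stops the surveyor from steering the result by picking convenient wards.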

Wow… that was a lot of theory on sampling. Coming back to where we started: why did I feel that something was missing in the 77% claim? Primarily because if 77% of people really think something, then in all likelihood, if I talk to 10 people, 7 or 8 of them should agree, unless I happen to be talking only to people in my own stratum/cluster who don’t. I talked to some people at random, and only about 8% of them agreed. Do you see the danger of generalizations and percentages here? I may have talked to just 10 people, and if only one of them agreed, I have a “fact” of close to 10%!
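To put rough numbers on this, here is a small Python sketch. It assumes the people I talk to are an independent random draw from the population (which my informal chats certainly were not), and it uses the usual normal approximation for the margin of error, which is itself crude for a sample as small as 10. The function names are mine.

```python
from math import comb


def prob_at_most(k, n, p):
    """Binomial probability that at most k of n random people agree,
    if the true share of people who agree is p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from n responses."""
    return z * (p * (1 - p) / n) ** 0.5


# If 77% really agree, how likely is it that at most 1 of 10 random people do?
print(f"P(at most 1 of 10 agree): {prob_at_most(1, 10, 0.77):.6f}")

# How much can a 77% figure wobble for different sample sizes?
for n in (10, 100, 1000):
    print(f"n={n:>4}: 77% +/- {margin_of_error(0.77, n):.1%}")
```

The first number comes out tiny, so under these assumptions either the 77% claim or my handful of conversations (or both) does not reflect the population. The second part shows why a percentage without a sample size says very little: at 10 people the margin is around 26 percentage points, and it only drops below 3 at about 1000.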

The next time you hear such a claim, ask some questions: what was the sample size? How many cities were covered? A survey of 100 people at Bangalore airport will not represent what India thinks in general. Remember (for those familiar with Indian politics) that the NDA’s India Shining campaign for the 2004 general elections was based on such faulty sampling and fell flat at the hustings.