Correlation and Causality in Market Research
A key error that statisticians and researchers make is to force causal relationships onto data: they relate variables to one another and then try to assign a cause to the relation. Data can frequently be discrete and independent even though they come from the same sample. This is seen in ‘Chi-Square Test’, which analyzed the number of smokers and their gender in a sample. From this test, no causal relation can be drawn between smoking and gender. One can only say that a certain percentage of men and a certain percentage of women are smokers.
However, if the study had included details such as family status, income group, race, education, and so on, then a causal relationship could be established that links smoking to family background, race, or education. These linkages would again be very generic, but with a larger sample a better understanding of causal relationships could have been obtained. According to Hill, the chi-square test is the most frequently used test for indicating whether a relationship exists among categorical variables, and for determining whether a bivariate tabular association between two variables is statistically significant.
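As a rough illustration of the test discussed above, the following sketch computes the Pearson chi-square statistic for a gender-by-smoking table. The counts are hypothetical and are not the figures from ‘Chi-Square Test’; the function names are also illustrative assumptions. The p-value shortcut used here holds only for one degree of freedom, which is the case for a 2 x 2 table.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts.

    Pearson chi-square without continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def p_value_one_dof(stat):
    """Upper-tail p-value of a chi-square statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# HYPOTHETICAL counts: rows = men, women; columns = smokers, non-smokers
observed = [[30, 70], [20, 80]]
stat, dof = chi_square_independence(observed)
p = p_value_one_dof(stat)
print(f"chi-square = {stat:.3f}, dof = {dof}, p = {p:.3f}")
```

Whatever the p-value, the test speaks only to whether gender and smoking are associated in the sample. It says nothing about why one group smokes more than the other.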
However, the data to be tested and analyzed need to have an inherent relation. There is nothing to be gained by forcing a relationship between data and a cause. For a true causal relation to be developed, there must therefore be a hypothesis that is worth testing. If the data shown in exhibit 1 were used by a producer of anti-smoking pills to create a marketing plan, the data would lead to gross errors in the plan, since no causal relation has been established.
To properly establish the relationship between a cause and data, one can examine the analysis given in ‘Regression Analysis’. The table and the regression analysis show the values for pulse 1 and pulse 2. For observations 10, 32, and 35, the fit values are 83, 84, and 86. When the initial pulse is in the range of 76 to 80, the pulse 2 reading is between 118 and 128 and the fit value is between 83 and 86. We have therefore established that there is a causal relationship between the pulse reading difference and the initial pulse rate. Similarly, the table shows that observation 29 had a fit value of 105 and a standard deviation of 3.77, but this is an observation whose X value was large, which gave it a large influence. The causal link here is not determined, and all that can be said is that there is a large influence that has to be examined.
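The fit values and the influence of a large X value can be sketched with an ordinary least-squares line. This is a minimal sketch that assumes pulse 2 is regressed on pulse 1. The data are hypothetical, not the table from ‘Regression Analysis’, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def leverages(xs):
    """Leverage of each observation; values far above 2/n flag unusual X values."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1.0 / n + (x - mean_x) ** 2 / sxx for x in xs]


# HYPOTHETICAL readings: pulse1 = initial pulse, pulse2 = later pulse
pulse1 = [60, 64, 68, 72, 76, 80, 100]
pulse2 = [62, 66, 70, 76, 78, 84, 104]

intercept, slope = fit_line(pulse1, pulse2)
fits = [intercept + slope * x for x in pulse1]
for x, y, fit, h in zip(pulse1, pulse2, fits, leverages(pulse1)):
    print(f"pulse1={x:3d}  pulse2={y:3d}  fit={fit:6.1f}  leverage={h:.2f}")
```

In this made-up data the last reading has a much larger pulse 1 than the rest, so it receives the highest leverage. That is the kind of observation that has a large influence on the fitted line and deserves separate examination.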
If the above data were used by a doctor or a fitness center to understand the health of people, they could refer to the two pulse values and the difference between them, and from these understand the fitness of the people. This observation is possible because there was an inherent relation between the two pulses, and so a causal relationship could be established.