In this article, we’ll cover the main functions to conduct a statistical analysis. I’m working on a sample data set that describes the salaries of employees depending on several features.
I. Data exploration
The file I’m working on comes with a small description
DES. The description of the data is the following.
wage educ exper tenure nonwhite female married numdep smsa northcen south west construc ndurman trcommpu trade services profserv profocc clerocc servocc lwage expersq tenursq Obs: 526 1. wage average hourly earnings 2. educ years of education 3. exper years potential experience 4. tenure years with current employer 5. nonwhite =1 if nonwhite 6. female =1 if female 7. married =1 if married 8. numdep number of dependents 9. smsa =1 if live in SMSA 10. northcen =1 if live in north central U.S 11. south =1 if live in southern region 12. west =1 if live in western region 13. construc =1 if work in construc. indus. 14. ndurman =1 if in nondur. manuf. indus. 15. trcommpu =1 if in trans, commun, pub ut 16. trade =1 if in wholesale or retail 17. services =1 if in services indus. 18. profserv =1 if in prof. serv. indus. 19. profocc =1 if in profess. occupation 20. clerocc =1 if in clerical occupation 21. servocc =1 if in service occupation 22. lwage log(wage) 23. expersq exper^2 24. tenursq tenure^2
The first step is to load the data set in our path :
Then, we’ll plot a histogram of hourly salaries :
Now, let’s plot the histogram for the number of years of education.
We can take a look at the mean of each feature :
ans = 5.8961 12.5627 17.0171 5.1046 0.1027 ...
We can also take a look at other descriptive statistics :
std(wage1(:,:))' min(wage1)' max(wage1)' cov(wage1)' corrcoef(wage1(:,:))'
II. Linear regression
We now want to assess the effect of columns 2, 3 and 4 (education, experience, years with current employer) on the salary.
We can compute the Beta coefficients and the residuals :
The coefficients are the following :
beta = -2.8727 0.5990 0.0223 0.1693
The first coefficient is constant. Then, the model indicates that an additional year of education should increase the hourly salary by 0.60$.
Effect of each feature individually
We can also plot a histogram of the residuals. The distribution of the residual appears to be close to a normal one.
We can decide to remove outliers above 5 for example :
indices = find(u>2.5); u(indices) = ; hist(u)
Are the coefficients significant?
To assess the significance of each estimator, we need to estimate their standard deviation.
Then, we can easily compute the test statistic :
t = beta./std
t = -5.6300 16.6857 2.6471 11.1725
What should we compare the test statistics? Well, to the critical value! For example, since we estimate 4 parameters, the 95% critical value is :
C1 = tinv(0.95, n-4)
C1 = 1.6478
The T-stats are all greater than the critical value. For this reason, we reject the hypothesis that the variables have no impact on the output.
Effect of a single feature
If we are interested in the effect of a single feature, we can specify a value for its effect for example.
t_educ = (beta(2)-0.1)/std(2)
Effect of each feature on
Testing the effect of the features on the
log(wage) is a way to assess their importance in terms of the percentage change.
y=log(wage1(:,1)) beta=inv(X'*X)*X'*y std=sqrt(diag(sig2*inv(X'*X))) t = beta./std
Test is 2 features have the same effect
Now, let’s test the following hypothesis: the effect of education is the same as the effect of professional experience. This can be rewritten as : . However, estimating the variance of to build the test hypothesis would be challenging. We can apply a simple trick by building a new feature as the sum of the 2 features to test.
First of all, append the new feature to the model :
capitaltot = X(:,2) + X(:,3)
This introduces some multicolinearity which we should control by removing :
We can run the test :
beta=inv(X'*X)*X'*y std=sqrt(diag(sig2*inv(X'*X))) t = beta(4)/std(4)
This heads :
t = 0.4883
The T-test is smaller than the critical value. For this reason, we cannot reject the hypothesis that the effect in the percentage of education is the same as the one of professional experience.
Fisher Test for joint hypothesis
Let us now consider the hypothesis : . In such case, we should not apply 2 independent tests, since we are interested in the overall hypothesis.
We need to define a non-constrained model, a constrained model and test the difference between those two models.
% non-constrained model X0=[ones(n,1),wage1(:,[2,3,4])] y0=log(wage1(:,1)) beta0=inv(X0'*X0)*X0'*y0 u0=y0-X0*beta0 SSR0=u0'*u0
% model contraint X1=[ones(n,1),wage1(:,)] y1=log(wage1(:,1)) beta1=inv(X1'*X1)*X1'*y1 u1=y1-X1*beta1 SSR1=u1'*u1
The Fisher F-Stat is defined as :
Where K is the number of variables included in the non-constrained model (4 here).
K = 4 F = (SSR1-SSR0)/(SSR0) * (n-K)/2
This heads :
F = 80.1478
Let’s define new variables :
X=wage1 educ=X(:,2) exper=X(:,3) tenure=X(:,4) female=X(:,6) married=X(:,7) marrmale = (1-female).*married marrfem = female.*married singfem=female.*(1-married)
We create interaction variables :
- effect of a married male
- effect of a married female
- effect of a single female
The base case is a single male. We can define test statistics on those variables as we did above.
**Conclusion **: This was a quick introduction to Matlab. I hope you found it useful. Don’t hesitate to drop a comment if you have a question.
Like it? Buy me a coffee