In this article, we’ll cover the main functions used to conduct a statistical analysis in MATLAB. I’m working with a sample data set that describes the salaries of employees along with several of their features.
I. Data exploration
The file I’m working with comes with a small description file (DES). The description of the data is the following:
The data set contains 526 observations of the following 24 variables:
1. wage: average hourly earnings
2. educ: years of education
3. exper: years of potential experience
4. tenure: years with current employer
5. nonwhite: =1 if nonwhite
6. female: =1 if female
7. married: =1 if married
8. numdep: number of dependents
9. smsa: =1 if living in an SMSA (standard metropolitan statistical area)
10. northcen: =1 if living in the north central U.S.
11. south: =1 if living in the southern region
12. west: =1 if living in the western region
13. construc: =1 if working in the construction industry
14. ndurman: =1 if in the nondurable manufacturing industry
15. trcommpu: =1 if in the transport, communication or public utilities industry
16. trade: =1 if in wholesale or retail trade
17. services: =1 if in the services industry
18. profserv: =1 if in the professional services industry
19. profocc: =1 if in a professional occupation
20. clerocc: =1 if in a clerical occupation
21. servocc: =1 if in a service occupation
22. lwage: log(wage)
23. expersq: exper^2
24. tenursq: tenure^2
The first step is to load the data set, which must be in our path:
wage1=load('WAGE1.raw')   % load the raw data as a numeric matrix
[n,k]=size(wage1)         % n = number of observations, k = number of features
Then, we’ll plot a histogram of the hourly wages:
hist(wage1(:,1))
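By default, hist uses 10 bins; to get a finer view, we can pass the number of bins explicitly. A minimal sketch:
hist(wage1(:,1), 50)   % same histogram with 50 bins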
Now, let’s plot the histogram for the number of years of education.
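From the description above, educ is the second column of the data set, so a minimal sketch would be:
hist(wage1(:,2))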
We can take a look at the mean of each feature:
mean(wage1)'
ans =
5.8961
12.5627
17.0171
5.1046
0.1027
...
We can also take a look at other descriptive statistics:
std(wage1)'        % standard deviation of each feature
min(wage1)'
max(wage1)'
cov(wage1)         % covariance matrix (symmetric, so no transpose needed)
corrcoef(wage1)    % correlation matrix
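For instance, to focus on how each feature correlates with the wage (the first column), a minimal sketch:
R = corrcoef(wage1);   % full correlation matrix
R(:,1)                 % correlation of each variable with wage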
II. Linear regression
Compute coefficients
We now want to assess the effect of columns 2, 3 and 4 (education, experience, years with current employer) on the salary.
\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i\]
y=wage1(:,1)
X=[ones(n,1),wage1(:,[2,3,4])]
We can compute the beta coefficients and the residuals:
beta=inv(X'*X)*X'*y
u=y-X*beta
The coefficients are the following:
beta =
-2.8727
0.5990
0.0223
0.1693
The first coefficient is the constant (intercept). Then, the model indicates that an additional year of education should increase the hourly wage by $0.60.
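As a side note, MATLAB’s backslash operator solves the same least-squares problem and is numerically more stable than explicitly inverting X'*X; a minimal equivalent sketch:
beta = X\y   % same estimates as inv(X'*X)*X'*y, but better conditioned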
Residuals
We can also plot a histogram of the residuals. Their distribution appears to be close to a normal one.
We can decide to remove outliers, for example residuals above 2.5:
indices = find(u>2.5);   % indices of the large positive residuals
u(indices) = [];         % remove them
hist(u)
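To double-check normality, a normal probability plot can complement the histogram (this assumes the Statistics Toolbox, which provides normplot, is available):
normplot(u)   % points close to the reference line suggest normality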
Hypothesis testing
Are the coefficients significant?
To assess the significance of each estimator, we need to estimate its standard error.
u=y-X*beta;                      % recompute the residuals (the trimming above was only for the histogram)
sig2=u'*u/(n-4)                  % estimated error variance, with n-4 degrees of freedom
se=sqrt(diag(sig2*inv(X'*X)))    % standard errors (named se to avoid shadowing MATLAB's std function)
Then, we can easily compute the test statistics:
t = beta./se
t =
-5.6300
16.6857
2.6471
11.1725
What should we compare the test statistics to? Well, to the critical value! For example, since we estimate 4 parameters, the 95% critical value of the Student distribution with n-4 degrees of freedom is:
C1 = tinv(0.95, n-4)
C1 =
1.6478
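As a quick vectorized check, a minimal sketch:
abs(t) > C1   % returns 1 for each coefficient significant at the 5% level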
The t-statistics are all greater than the critical value in absolute value. For this reason, we reject the hypothesis \(H_0\) that the variables have no impact on the output.
Effect of a single feature
If we are interested in a single feature, we can also test a specific value for its effect. For example, to test whether an additional year of education increases the hourly wage by $0.10, i.e. \(H_0 : \beta_1 = 0.1\):
t_educ = (beta(2)-0.1)/se(2)
We then compare this statistic to the critical value as above.
Effect of each feature on log(wage)
Testing the effect of the features on log(wage) is a way to assess their importance in terms of percentage change: each coefficient is then read as the approximate proportional change in the wage for a one-unit increase in the feature (a coefficient of 0.09 on education, for example, would mean roughly a 9% wage increase per additional year).
y=log(wage1(:,1))
beta=inv(X'*X)*X'*y
u=y-X*beta;                      % residuals of the log-wage regression
sig2=u'*u/(n-4);                 % the error variance must be re-estimated on these residuals
se=sqrt(diag(sig2*inv(X'*X)))
t = beta./se
Test if 2 features have the same effect
Now, let’s test the following hypothesis: the effect of education is the same as the effect of professional experience. This can be written as \(H_0 : \beta_1 = \beta_2\). However, estimating the variance of \(\beta_1 - \beta_2\) to build the test statistic would be challenging. We can apply a simple trick instead: since \(\beta_1 x_1 + \beta_2 x_2 = (\beta_1 - \beta_2) x_1 + \beta_2 (x_1 + x_2)\), regressing on \(x_1\) and on the sum \(x_1 + x_2\) turns the hypothesis into a test that a single coefficient is zero.
First of all, build the new feature as the sum of education and experience:
capitaltot = X(:,2) + X(:,3)
Keeping educ, exper and their sum together would make the design matrix perfectly collinear, so we replace the exper column with the sum:
X=[X(:,[1,2,4]),capitaltot];
y=log(wage1(:,1))
We can now run the test \(H_0 : \beta_1 - \beta_2 = 0\), where \(\beta_1 - \beta_2\) is the coefficient on educ, the second column of the new design matrix:
beta=inv(X'*X)*X'*y
u=y-X*beta;                      % residuals of this new specification
sig2=u'*u/(n-4);                 % re-estimated error variance
se=sqrt(diag(sig2*inv(X'*X)))
t = beta(2)/se(2)
We then compare this statistic to the critical value, as before: if it is smaller in absolute value, we cannot reject the hypothesis that, in percentage terms, the effect of education is the same as the effect of professional experience; otherwise, we reject it.
Fisher test for a joint hypothesis
Let us now consider the joint hypothesis \(H_0 : \beta_1 = 0, \beta_2 = 0\), i.e. neither education nor experience has an effect. In such a case, we should not apply 2 independent tests, since we are interested in the overall hypothesis.
We need to define an unconstrained model and a constrained model, and test the difference between those two models.
% unconstrained model
X0=[ones(n,1),wage1(:,[2,3,4])]
y0=log(wage1(:,1))
beta0=inv(X0'*X0)*X0'*y0
u0=y0-X0*beta0
SSR0=u0'*u0
% constrained model (educ and exper removed)
X1=[ones(n,1),wage1(:,[4])]
y1=log(wage1(:,1))
beta1=inv(X1'*X1)*X1'*y1
u1=y1-X1*beta1
SSR1=u1'*u1
The Fisher F-statistic is defined as:
\[\hat{F} = \frac{SSR_1 - SSR_0}{SSR_0} \cdot \frac{n - K}{2}\]
where K is the number of parameters estimated in the unconstrained model (4 here, including the constant) and 2 is the number of restrictions being tested.
K = 4
F = (SSR1-SSR0)/(SSR0) * (n-K)/2
This yields:
F =
80.1478
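We can compare this statistic to the 95% critical value of the Fisher distribution with (2, n-K) degrees of freedom; a minimal sketch, assuming the Statistics Toolbox (finv) is available and using the hypothetical name C2 by analogy with C1 above:
C2 = finv(0.95, 2, n-K)
The F-statistic is far above this critical value (around 3), so we reject the joint hypothesis: education and experience do matter.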
Interaction effect
Let’s define new variables:
X=wage1
educ=X(:,2)
exper=X(:,3)
tenure=X(:,4)
female=X(:,6)
married=X(:,7)
marrmale = (1-female).*married
marrfem = female.*married
singfem=female.*(1-married)
We created 3 interaction variables:
- marrmale: effect of being a married male
- marrfem: effect of being a married female
- singfem: effect of being a single female
The base case is a single male. We can define test statistics on those variables as we did above, as shown in the sketch below.
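As a sketch, following exactly the same steps as above, we can regress log(wage) on these dummies together with the other features and read the t-statistics:
y = log(wage1(:,1));
X = [ones(n,1), educ, exper, tenure, marrmale, marrfem, singfem];
beta = inv(X'*X)*X'*y            % each dummy coefficient approximates the percentage wage gap with a single male
u = y - X*beta;
sig2 = u'*u/(n - size(X,2));     % 7 parameters are estimated here
se = sqrt(diag(sig2*inv(X'*X)));
t = beta./se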
**Conclusion**: This was a quick introduction to statistical analysis in MATLAB. I hope you found it useful. Don’t hesitate to drop a comment if you have a question.