Partial F backward method for selecting variables in a regression model using R

In an era when random forests and deep neural networks are the hot topics in the machine learning and statistics fields, it is easy to forget that simple methods can solve real problems too.

One of these methods is the regression model. I will explore it with an example from sports, specifically volleyball: I will try to predict the height of a volleyball player’s jump during an attack, using a set of independent variables considered to have potential to predict the dependent variable.

All the kinematic variables that will be used are:

x1 — Final horizontal speed (m/s) (VHF);
x2 — Average horizontal speed (m/s) (VHM);
x3 — Initial vertical speed (m/s) (VVI);
x4 — Average vertical speed (m/s) (VVM);
x5 — Average horizontal acceleration (m/s²) (AHM);
x6 — Initial vertical acceleration (m/s²) (AVI);
x7 — Average vertical acceleration (m/s²) (AVM);
x8 — Angular mean velocity of the hip joint (rad/s) (VAQ);
x9 — Angular mean velocity of the knee joint (rad/s) (VAJ);
x10 — Angular mean velocity of the ankle joint (rad/s) (VAT);
x11 — Angular mean velocity of the shoulder joint (rad/s) (VAO);
x12 — Maximum hip flexion angle (°) (AFMQ);
x13 — Maximum knee flexion angle (°) (AFMJ);
x14 — Angle of maximum ankle dorsiflexion (°) (ADFMT);
x15 — Angle of the body’s center of gravity at the beginning of the flight phase (°) (ACG);
x16 — Shoulder amplitude angle (°) (AO);

And the dependent variable to be predicted is:

y — Jump height (m) (HSM);

Key to units:

m/s — meters per second;

m/s² — meters per second squared;

rad/s — radians per second;

° — degrees;

m — meters.

The first step is to read the data, select the independent variables, and load the non-base R packages whose functions we will use.
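The original code appears only as images, so here is a minimal sketch of this setup step. The file name ‘volei.csv’ and the column layout are assumptions based on the variable list above, not the author’s actual data source.

```r
library(lmtest)  # non-base package; provides the Durbin-Watson test used later

dados <- read.csv("volei.csv")  # hypothetical file name
# keep only the 16 predictors and the response
dados <- dados[, c("VHF", "VHM", "VVI", "VVM", "AHM", "AVI", "AVM",
                   "VAQ", "VAJ", "VAT", "VAO", "AFMQ", "AFMJ",
                   "ADFMT", "ACG", "AO", "HSM")]
```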

With the data ready, I will start by fitting a full model (one in which all available variables are used to predict y). The function ‘lm’ (linear model) does this.
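A sketch of that call, assuming the data frame and column names from the setup above:

```r
# Full model: jump height regressed on all 16 kinematic variables.
modelo <- lm(HSM ~ ., data = dados)
```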

Now, I will run an analysis of variance to get the partial F statistic for each variable, which will be used to select the features of the final model. The logic here is to remove the least significant variable at each step.
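In R this can be done with drop1(), which reports the partial F statistic for removing each variable while all the others stay in the model (base R’s anova() would give sequential Type I F tests instead, which depend on variable order):

```r
# Partial F test for each term, given all the others.
drop1(modelo, test = "F")
```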

We get the following result:

The least significant variable according to the partial F statistic is the one with the largest p-value (shown in the last column of the image above). So, I will refit the model without the variable ‘VAQ’.
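A sketch of the refit, using update() to drop the term:

```r
# Refit without 'VAQ', the variable with the largest p-value.
modelo <- update(modelo, . ~ . - VAQ)
```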

Again, we do the ANOVA:

Image 2: Partial F statistics.

Now, the least significant variable is ‘VVM’. The same iterative process is repeated until only significant variables remain in the model; a sketch of the whole loop follows, and the final result is shown in Image 3.
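A hedged sketch of the full backward loop; the 5% stopping rule is an assumption consistent with the text:

```r
# Repeatedly drop the term with the largest partial-F p-value
# until every remaining term is significant at the 5% level.
repeat {
  testes <- drop1(modelo, test = "F")
  pvals  <- testes[["Pr(>F)"]][-1]                 # skip the <none> row
  if (all(pvals < 0.05, na.rm = TRUE)) break
  pior   <- rownames(testes)[-1][which.max(pvals)] # worst remaining term
  modelo <- update(modelo, as.formula(paste(". ~ . -", pior)))
}
```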

Image 3: F statistics for final variables.

As we can see in Image 3, the partial F-test shows that the 3 variables left in the model are statistically significant (different from 0) at a 5% significance level (p-value < 0.05).

Now, I will use the command ‘summary(modelo)’ to inspect the estimated coefficients, decide whether the intercept should be removed, and assess the overall fit of the model.
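That command, exactly as named in the text:

```r
# Coefficient estimates, t-tests, and R-squared for the current model.
summary(modelo)
```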

In the first column of Image 4 we can see the estimated coefficients. At the 5% significance level, the t-tests confirm the partial F results: the chosen variables are significantly different from 0, but the intercept is not (p-value = 0.81311 > 0.05). So, I will refit the final model without it to check whether the coefficient of determination (0.8919) increases.
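A sketch of that refit; appending ‘- 1’ to the formula drops the intercept (the three surviving variables, VHF, VVI and AVI, are taken from the end of the post):

```r
# Final model without the intercept.
modelo <- lm(HSM ~ VHF + VVI + AVI - 1, data = dados)
summary(modelo)
```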

Image 5: T-test for the final model without the intercept.

Removing the intercept, the R² jumps from 0.8919 to 0.9968 (keep in mind that R computes R² against a different baseline when the intercept is dropped, so the two values are not strictly comparable).

The model’s variables are defined. It is time to test the quality of the model by checking 3 assumptions about its residuals: normality, homogeneity of variances, and independence.

To check whether the residuals are normal, we will use the Shapiro-Wilk test. The null hypothesis being tested is that the residuals follow a normal distribution.
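In R:

```r
# Shapiro-Wilk test on the model residuals; H0: normality.
shapiro.test(residuals(modelo))
```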

Image 6: Shapiro-Wilk test for normality.

The test’s p-value (0.475) is higher than the 5% significance level (0.05), so we have no evidence to reject the null hypothesis: the residuals are consistent with a normal distribution.

To test the homoscedasticity of the residuals, we sort them in ascending order and split the sorted values into 2 groups (one with the lowest 50% of the values, the other with the highest 50%). Then we can run an F-test to check whether the variances of the two extreme groups are statistically equal. The intuition behind this trick: if the variances of the 2 most different halves that can be formed within a group are statistically equal, then the variance within the whole group must be homogeneous.
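A sketch of this split-and-compare step with base R’s var.test(), which performs an F test for the ratio of two variances:

```r
# Sort residuals, split into lower and upper halves, compare variances.
res <- sort(residuals(modelo))
n   <- length(res)
var.test(res[1:(n %/% 2)], res[(n %/% 2 + 1):n])
```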

A p-value of 0.4496 (> 0.05) leads us to not reject the null hypothesis that the true ratio of the 2 variances is equal to 1. In other words, we can assume the variances are statistically equal.

Finally, we need to check whether the residuals are independent. We’ll do this using the Durbin-Watson test, whose null hypothesis is that they are independent (no autocorrelation).
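Using dwtest() from the ‘lmtest’ package loaded earlier:

```r
# Durbin-Watson test; H0: residuals are not autocorrelated (independent).
dwtest(modelo)
```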

Image 8: Durbin-Watson test for independence.

Again we get a non-significant p-value (greater than the 5% significance level), so the null hypothesis is not rejected.

With 99.68% of the variance of the jump height explained by the 3 final variables (‘VHF’, ‘VVI’ and ‘AVI’) and all 3 residual assumptions satisfied, we conclude our study with a valid predictive model.
