Simple Linear Regression - An Underestimated Statistic
Simple Regression
In every research scenario, the trait we measure is influenced to some extent by many other traits. One can estimate the extent of this influence by performing a simple regression analysis.
Simple regression analysis considers two variables, X and Y (‘X’ denotes the independent variable and ‘Y’ denotes the dependent variable). It is used to study a cause and effect relationship between the dependent and independent variables, although a significant regression on its own does not prove causation. Plant breeders, for example, may be keen on determining the effect of individual traits on yield. In this case, one can opt for regression analysis.
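To make the idea concrete, here is a minimal sketch of how a simple regression line Y = a + bX can be fitted by ordinary least squares. The function name and the biomass and yield values are hypothetical and chosen only for illustration; they are not the data used in this post.

```python
def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares of X around their means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two different values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical example data (X = biomass, Y = seed yield)
biomass = [10.0, 12.0, 14.0, 16.0, 18.0]
seed_yield = [4.1, 5.0, 5.8, 7.1, 7.9]

intercept, slope = fit_simple_regression(biomass, seed_yield)
print(f"yield = {intercept:.3f} + {slope:.3f} * biomass")
```

The slope estimates how much Y changes for a one-unit change in X, which is the quantity reported as the X coefficient in a regression output.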
Step by Step guide for analysis (Excel)
1. Data file format in Excel
2. Click on the "Data" tab and select the "Data Analysis" option.
*Note: The Data Analysis option is not available by default; it must be activated from the Excel add-ins (click the File tab - click Options - click the Add-ins category - in the Manage box, select Excel Add-ins - click Go - check the Analysis ToolPak check box - click OK).
3. Now, in the Data Analysis dialog box, select the "Regression" option and click "OK".
4. Select the input data ranges for X and Y. In this case, let's take biomass as X and yield as Y. Then, under "Output options", choose where the results should be displayed: (1) select the output range option and click an empty cell in the same datasheet to display the results in the same sheet, or (2) select the new worksheet or new workbook option to display the results in a fresh sheet or in a fresh Excel file, respectively. Once ready, click "OK" to run the analysis.
5. The regression analysis results are displayed as given below.
How to Interpret?
*Note: In the results, first check the ANOVA table for significance. Only if the ‘F’ value is found to be significant should one proceed further with the study.
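For readers who want to see where the ‘F’ value in the ANOVA table comes from, the sketch below computes the sums of squares, mean squares and F for a simple regression. The function name and the data are hypothetical; the significance (p-value) still has to be read from the regression output or an F table with 1 and n - 2 degrees of freedom.

```python
def regression_anova(x, y):
    """Return the ANOVA quantities for a simple linear regression of y on x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must contain at least two different values")
    slope = sxy / sxx

    ss_total = sum((yi - mean_y) ** 2 for yi in y)  # total variation in Y
    ss_regression = slope * sxy                     # variation explained by X
    ss_residual = ss_total - ss_regression          # unexplained variation

    df_regression = 1        # one X variable
    df_residual = n - 2      # n observations minus two estimated parameters
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    if ms_residual == 0:
        raise ValueError("perfect fit: F is undefined")
    return {
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "ss_total": ss_total,
        "df_regression": df_regression,
        "df_residual": df_residual,
        "f_value": ms_regression / ms_residual,
    }


# Hypothetical example data (X = biomass, Y = seed yield)
biomass = [10.0, 12.0, 14.0, 16.0, 18.0]
seed_yield = [4.1, 5.0, 5.8, 7.1, 7.9]

anova = regression_anova(biomass, seed_yield)
print(f"F = {anova['f_value']:.2f} with "
      f"{anova['df_regression']} and {anova['df_residual']} degrees of freedom")
```

A large F means the variation explained by X is large relative to the residual variation.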
In Regression Statistics:
- Multiple R is the correlation coefficient, which shows how strong the linear relationship between the dependent and independent variables is.
- R square is the coefficient of determination (r x r), which gives the proportion of variation in the dependent variable (Y) that is explained by the independent variable (X).
- Adjusted R square is considered when more than one X variable is present.
- The standard error indicates how far, on average, the actual points deviate from the regression line.
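The four quantities listed above can be reproduced by hand. The following is a minimal sketch under the usual least-squares definitions, with a hypothetical function name and hypothetical data; the adjusted R square uses the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1) with k = 1 X variable.

```python
import math


def regression_statistics(x, y):
    """Return Multiple R, R square, adjusted R square and standard error."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain different values")

    r_square = (sxy * sxy) / (sxx * syy)   # r x r
    multiple_r = math.sqrt(r_square)       # reported as a non-negative value
    k = 1                                  # number of X variables
    adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
    ss_residual = syy * (1 - r_square)     # unexplained variation in Y
    standard_error = math.sqrt(ss_residual / (n - 2))
    return {
        "multiple_r": multiple_r,
        "r_square": r_square,
        "adjusted_r_square": adjusted_r_square,
        "standard_error": standard_error,
    }


# Hypothetical example data (X = biomass, Y = seed yield)
biomass = [10.0, 12.0, 14.0, 16.0, 18.0]
seed_yield = [4.1, 5.0, 5.8, 7.1, 7.9]

stats = regression_statistics(biomass, seed_yield)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

With a single X variable the adjusted R square is always a little lower than R square, which is why it matters mainly when several X variables are compared.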
R Square:
- The R square value gives the proportion of variance in the dependent variable that is explained by the independent variable. This value is multiplied by 100 to express it as a percentage.
- An R square value above 0.3 is taken here to indicate a strong influence of the independent variable on the dependent variable; what counts as strong depends on the field of study.
Interpretation:
In the above study, simple linear regression analysis was carried out between biomass and seed yield. The ‘F’ probability value in the ANOVA table is significant at 1%, which indicates the presence of a linear relationship between biomass and yield.
Further, an R square value of 0.94 (94%) was recorded. It indicates the percentage of variation in seed yield that is explained by biomass. Thus, it can be said that these two variables share a strong relationship with each other.
*Note: This is just an introduction to simple regression analysis. In future blogs, we will discuss simple linear regression analysis using R and how it can be used in screening studies to identify tolerance-contributing traits.
Author Details: P Mathankumar, PG Scholar (Genetics and Plant Breeding), Tamil Nadu Agricultural University.