R-squared vs Adjusted R-squared: Difference?

Last Updated: 24 Jun, 2024


R-squared and adjusted R-squared are performance metrics used in linear regression. Both describe how well the best-fit line captures the relationship between the output feature (y), also known as the dependent variable, and the input features (x), also known as the independent variables. In this article, we will look at the differences between R-squared and adjusted R-squared.

What is R-squared?

R-squared, or the coefficient of determination, is a statistical measure of how much of the variance in the actual data points is explained by the regression (best-fit) line. As a rough rule of thumb for a linear regression model, an R-squared between 0.75 and 0.95 is considered a good fit, whereas an R-squared of exactly 1.0 on the training data usually indicates overfitting.

[Tex]R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}[/Tex]
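
To see the formula in action, the sketch below computes R-squared directly from the definition on a tiny made-up dataset (the numbers are purely illustrative):

import numpy as np

# Tiny illustrative dataset: actual values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.1])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print("R-squared:", r2)  # matches sklearn.metrics.r2_score(y_true, y_pred)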


Example code:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# df is assumed to be a pandas DataFrame that has already been loaded
X = df[['feature1', 'feature2', ...]]  # select the feature columns
y = df['target']  # select the target column

# Create a linear regression model
model = LinearRegression()

# Fit the model to the data
model.fit(X, y)

# Make predictions on the training data
y_pred = model.predict(X)

# Calculate the R-squared score
r2 = r2_score(y, y_pred)

print("R-squared score:", r2)

What is Adjusted R-squared?

Adjusted R-squared is a performance metric that can be seen as a more refined version of R-squared, rewarding only those input features that genuinely help explain the target variable. It takes into account the number of predictors in the model and whether they are significant.

[Tex]\overline{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}[/Tex]

where n is the number of observations and k is the number of predictors.
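
For example, with R² = 0.90 from a model with n = 100 observations and k = 10 predictors (illustrative numbers, not from a real dataset):

[Tex]\overline{R}^2 = 1 - \frac{(1 - 0.90)(100 - 1)}{100 - 10 - 1} = 1 - \frac{9.9}{89} \approx 0.889[/Tex]

The adjusted value sits slightly below the raw R-squared, and the gap grows as predictors are added without improving the fit.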

Example code:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# df is assumed to be a pandas DataFrame that has already been loaded
X = df[['feature1', 'feature2', ...]]  # select the feature columns
y = df['target']  # select the target column

# Create a linear regression model
model = LinearRegression()

# Fit the model to the data
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Calculate the R-squared score
r2 = r2_score(y, y_pred)

# Calculate the adjusted R-squared score
n = len(y)      # number of samples
k = X.shape[1]  # number of features
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print("Adjusted R-squared score:", adjusted_r2)

Difference between R-squared and Adjusted R-squared

  1. The value of R-squared increases (or stays the same) whenever an independent variable is added, whereas the value of Adjusted R-squared increases only when the added variable actually helps explain the dependent variable.
  2. For an ordinary least-squares model with an intercept, R-squared on the training data cannot be negative, whereas Adjusted R-squared can be negative.
  3. Adjusted R-squared is therefore more reliable than R-squared for judging model quality, especially when comparing models with different numbers of predictors (see the sketch after this list).
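
A minimal sketch of point 1 on synthetic data (the dataset, seed, and feature counts are made up for illustration): appending pure-noise columns to the design matrix never lowers the training R-squared, but the adjusted value drops.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=(n, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.5, size=n)  # y depends only on x

def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Model 1: the single informative feature
r2_1 = r2_score(y, LinearRegression().fit(x, y).predict(x))

# Model 2: the same feature plus five pure-noise columns
X2 = np.hstack([x, rng.normal(size=(n, 5))])
r2_2 = r2_score(y, LinearRegression().fit(X2, y).predict(X2))

print("R-squared:         ", r2_1, "->", r2_2)  # never decreases
print("Adjusted R-squared:", adjusted_r2(r2_1, n, 1), "->", adjusted_r2(r2_2, n, 6))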

Interview Insights

How do you explain the difference between R-squared and Adjusted R-squared in an interview?

“R-squared and Adjusted R-squared are both metrics used to evaluate the fit of a regression model. R-squared indicates the proportion of the variance in the dependent variable that is predictable from the independent variables, ranging from 0 to 1. However, it always increases or remains the same when more predictors are added, regardless of their significance. Adjusted R-squared, on the other hand, adjusts for the number of predictors and can decrease if unnecessary predictors are added. This adjustment makes it a more reliable metric for comparing models with different numbers of predictors, as it penalizes for overfitting. In practice, while R-squared gives a quick estimate of model fit, Adjusted R-squared is preferred for model selection as it provides a more accurate measure by accounting for the number of predictors in the model.”

Follow-up questions on R-squared vs Adjusted R-squared

Can you explain what overfitting is and how Adjusted R-squared helps prevent it?

Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. This usually happens when the model is too complex, having too many parameters relative to the number of observations. Overfitted models may perform well on training data but poorly on unseen test data. Adjusted R-squared helps prevent overfitting by including a penalty for the number of predictors in the model. Unlike R-squared, which always increases with additional predictors, Adjusted R-squared can decrease if the new predictors do not improve the model significantly. This discourages adding irrelevant variables and helps in selecting a more parsimonious model.
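
A quick illustration with synthetic data (the shapes and seed here are arbitrary): a model with many parameters relative to the sample size can score a near-perfect R-squared on its training split while failing on held-out data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=(n, 30))      # 30 features, mostly noise
y = X[:, 0] + rng.normal(size=n)  # only the first feature matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

print("Train R2:", model.score(X_tr, y_tr))  # near 1: the coefficients fit the noise
print("Test R2:", model.score(X_te, y_te))   # much lower, often negative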

How would you decide which predictors to include in a regression model?

Deciding which predictors to include in a regression model involves several steps:

  1. Domain Knowledge: Start with variables that are known or hypothesized to influence the dependent variable based on theoretical understanding or previous studies.
  2. Correlation Analysis: Examine correlations between predictors and the dependent variable to identify potential candidates.
  3. Statistical Tests: Use statistical tests like t-tests for individual predictor significance and F-tests for overall model significance.
  4. Model Selection Criteria: Employ criteria like Adjusted R-squared, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), and Cross-Validation to compare models (see the sketch after this list).
  5. Stepwise Selection Methods: Use methods like forward selection, backward elimination, or stepwise regression to add or remove predictors systematically.
  6. Regularization Techniques: For high-dimensional data, use techniques like LASSO or Ridge regression to handle multicollinearity and select relevant predictors.
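
As a sketch of step 4, the snippet below compares two candidate models by adjusted R-squared and AIC using statsmodels (the data and feature sets are synthetic stand-ins for real candidate models):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
X_small = rng.normal(size=(n, 2))  # two informative features
y = X_small @ np.array([1.5, -2.0]) + rng.normal(size=n)
X_large = np.hstack([X_small, rng.normal(size=(n, 4))])  # plus four noise features

for name, X in [("small", X_small), ("large", X_large)]:
    res = sm.OLS(y, sm.add_constant(X)).fit()
    print(name, "adj R2:", round(res.rsquared_adj, 4), "AIC:", round(res.aic, 1))

The smaller model typically wins on both criteria here, since the extra columns are pure noise.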

Can you describe a situation where R-squared might be misleading?

R-squared can be misleading in several situations:

  • Overfitting: In models with many predictors, R-squared might be high due to overfitting, capturing noise rather than the true underlying relationship.
  • Non-linear Relationships: R-squared for a linear model only reflects the quality of the linear fit; it can be low even when a strong non-linear relationship exists and a non-linear model would fit well (see the sketch after this list).
  • Comparison Between Different Models: Comparing R-squared values between models with different numbers of predictors or different types of models (linear vs. polynomial) can be misleading, as it doesn’t account for model complexity.
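
To make the non-linearity point concrete, here is a small synthetic example: a straight line fitted to a purely quadratic relationship scores an R-squared near zero, while a quadratic fit on the same data scores near one.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(200, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.2, size=200)  # purely quadratic relationship

linear = LinearRegression().fit(x, y)
print("Linear R2:", linear.score(x, y))             # near 0

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
quadratic = LinearRegression().fit(X_poly, y)
print("Quadratic R2:", quadratic.score(X_poly, y))  # near 1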

What are some other metrics you can use to evaluate the performance of a regression model?

Other than R-squared and Adjusted R-squared, several metrics can be used to evaluate the performance of a regression model (a short scikit-learn sketch follows the list):

  • Mean Absolute Error (MAE): Average of the absolute differences between predicted and actual values.
  • Mean Squared Error (MSE): Average of the squared differences between predicted and actual values.
  • Root Mean Squared Error (RMSE): Square root of the MSE, providing a measure of the average magnitude of errors.
  • Mean Absolute Percentage Error (MAPE): Average of the absolute percentage errors between predicted and actual values.
  • AIC/BIC: Criteria for model selection that penalize model complexity.
  • Cross-Validation Scores: Assess model performance using different subsets of the data to ensure it generalizes well to unseen data.
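
The first four of these are one-liners in scikit-learn; a minimal sketch on illustrative numbers (mean_absolute_percentage_error requires scikit-learn 0.24 or newer):

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # illustrative values
y_pred = np.array([2.8, 5.3, 6.9, 9.1])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
mape = mean_absolute_percentage_error(y_true, y_pred)

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "MAPE:", mape)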

How would you handle multicollinearity in a regression model?

Multicollinearity occurs when predictors are highly correlated with each other, leading to unstable coefficient estimates. To handle multicollinearity:

  • Remove Highly Correlated Predictors: Identify and remove predictors with high correlation coefficients.
  • Principal Component Analysis (PCA): Transform predictors into a set of uncorrelated components.
  • Regularization Techniques: Use LASSO (L1 regularization) or Ridge (L2 regularization) regression, which can reduce the impact of multicollinearity by penalizing large coefficients.
  • Variance Inflation Factor (VIF): Calculate VIF for each predictor and remove those with high values (commonly VIF above 5 or 10), as shown in the sketch after this list.
  • Combine Predictors: If predictors are conceptually related, consider combining them into a single predictor.
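
A minimal VIF sketch using statsmodels, on synthetic data where column b is deliberately an almost-exact copy of column a:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
a = rng.normal(size=200)
df = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),  # nearly a duplicate of 'a'
    'c': rng.normal(size=200),                 # independent of the others
})

X = df.to_numpy()
for i, col in enumerate(df.columns):
    print(col, "VIF:", variance_inflation_factor(X, i))

Columns a and b show very large VIFs, flagging the near-duplication, while c stays close to 1.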

