Linear Regression using Salary and Years of Experience Data
Data Source: Salary_dataset.csv (Kaggle)
The salary data set includes two columns: YearsExperience, which will be our independent variable (X), and Salary, our dependent variable (Y).
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables (ChatGPT).
For this example: First, we want to see how strongly the two variables are related by fitting a regression line and calculating R-squared. Then we want to assess the significance of the relationship using the p-value to test the null hypothesis that there is no linear relationship between X and Y (X does not predict Y).
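To make the R-squared idea concrete, here is a minimal sketch on made-up numbers (the five data points are illustrative assumptions, not rows from Salary_dataset.csv). With a single predictor and an intercept, R-squared should equal the square of the Pearson correlation coefficient, and the sketch checks that against the residual-based definition.

```python
import numpy as np

# Made-up illustrative data (NOT taken from Salary_dataset.csv)
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40000.0, 45000.0, 52000.0, 58000.0, 65000.0])

# Pearson correlation coefficient r between the two variables
r = np.corrcoef(years, salary)[0, 1]
r_squared = r ** 2

# Same quantity from the fitted line: 1 - SS_residual / SS_total
slope, intercept = np.polyfit(years, salary, 1)  # highest degree first
predicted = intercept + slope * years
ss_res = np.sum((salary - predicted) ** 2)
ss_tot = np.sum((salary - salary.mean()) ** 2)
r_squared_from_fit = 1 - ss_res / ss_tot

print(f"r = {r:.4f}")
print(f"r squared = {r_squared:.4f}")
print(f"R-squared from residuals = {r_squared_from_fit:.4f}")
```

The two R-squared values should agree up to floating-point rounding. A value close to 1 means the line explains most of the variation in Y.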
Simple Linear Regression Formula
Formula: Y = b + aX + e
- Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
- Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
- a: the slope, i.e. the change in Y for a one-unit change in X.
- b: the intercept on the Y axis. This represents the value of Y when X is zero.
- e: the error term, i.e. the part of Y that the line does not explain.
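As a rough illustration of where a and b come from, the sketch below applies the standard least-squares formulas by hand: the slope is the sum of the products of the deviations of X and Y from their means divided by the sum of squared deviations of X, and the intercept follows from the means. The function name `least_squares_line` and the data are hypothetical, chosen only for this example.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (a, b) for the line Y = b + aX fitted by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of X and Y divided by the variation of X
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    b = y_mean - a * x_mean
    return a, b

# Made-up illustrative data (NOT taken from Salary_dataset.csv)
years = [1, 2, 3, 4, 5]
salary = [40000, 45000, 52000, 58000, 65000]

a, b = least_squares_line(years, salary)
print(f"slope a = {a:.1f}, intercept b = {b:.1f}")
# prints: slope a = 6300.0, intercept b = 33100.0
```

On this toy data each extra year of experience adds about 6300 to the predicted salary, and the line crosses the Y axis at 33100.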
The following Python code (which I used ChatGPT to optimize) calculates the regression line and the p-value, and evaluates the null hypothesis.
The steps are as follows:
1. Use pandas to import the CSV file into a data frame and convert each series to an array.
2. Fit a linear regression model using scikit-learn, which returns the slope and intercept of the regression line.
3. The statsmodels library is then used to calculate the R-squared value and the p-value of the slope.
4. The null hypothesis is then evaluated based on the p-value.
5. Matplotlib.pyplot is then used to plot the data points and the regression line on a graph, and scipy.stats refits the line for comparison.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
def load_data(csv_location):
    """Load CSV data into a pandas DataFrame."""
    df = pd.read_csv(csv_location)
    return df


def prepare_data(df):
    """Prepare independent and dependent variables for regression."""
    X = df[['YearsExperience']].values  # Extract as 2D array
    y = df['Salary'].values  # Extract as 1D array
    return X, y


def fit_sklearn_model(X, y):
    """Fit a linear regression model using sklearn."""
    model = LinearRegression()
    model.fit(X, y)
    return model


def fit_statsmodels_ols(X, y):
    """Fit a linear regression model using statsmodels OLS."""
    X_with_const = sm.add_constant(X)  # Add an intercept to the model
    model = sm.OLS(y, X_with_const).fit()
    return model


def plot_regression_line(df, intercept, slope):
    """Plot the regression line along with data points."""
    plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Data points')
    plt.plot(df['YearsExperience'], intercept + slope * df['YearsExperience'], color='red', label='Regression line')
    plt.title("Salary by Years of Experience")
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.legend()
    plt.show()


def main():
    csv_location = "salary_dataset.csv"
    df = load_data(csv_location)

    # Display basic statistics
    # print(df.describe())

    X, y = prepare_data(df)

    # Fit the model using sklearn
    sklearn_model = fit_sklearn_model(X, y)
    intercept, slope = sklearn_model.intercept_, sklearn_model.coef_[0]
    print("Calculation of Regression Line:\n")
    print(f"Intercept is: {intercept}")
    print(f"Slope is: {slope}")

    # Fit the model using statsmodels to get p-values and R-squared
    statsmodels_model = fit_statsmodels_ols(X, y)
    # print(statsmodels_model.summary())

    # Extract R-squared and p-values
    r_squared = statsmodels_model.rsquared
    p_values = statsmodels_model.pvalues
    print(f"R-squared: {r_squared}")
    # print(f"P-values: {p_values}")

    # Extracting specific p-values by index
    intercept_p_value = p_values[0]  # First p-value (intercept)
    slope_p_value = p_values[1]  # Second p-value (YearsExperience)
    # print(f"Intercept p-value: {intercept_p_value}")
    print(f"p-value (YearsExperience): {slope_p_value}")
    print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.")
    print("This is obtained from the t-distribution with n-2 degrees of freedom ")
    print("where n is the number of observations\n")

    # Decision rule at the 0.05 significance level
    if slope_p_value >= 0.05:
        print("P-value is not significant and therefore we fail to reject the null hypothesis")
    else:
        print("P-value is less than 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor X has a statistically significant effect on the outcome Y")

    # Plotting the regression line
    plot_regression_line(df, intercept, slope)

    # Fit a linear regression line using scipy.stats (for comparison)
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['YearsExperience'], df['Salary'])
    # plt.text(df['YearsExperience'].min(), df['Salary'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')


if __name__ == "__main__":
    main()
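
The script prints that the slope's p-value comes from a t-distribution with n-2 degrees of freedom. Here is a minimal sketch of that calculation done by hand, so the number is less of a black box. The helper name `slope_t_test` and the six data points are assumptions made up for illustration; the result is compared against `scipy.stats.linregress` as a sanity check.

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """Return (t_statistic, two_sided_p_value) for the null hypothesis slope = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    # Residual variance uses n - 2 degrees of freedom (two estimated parameters)
    residual_variance = np.sum(residuals ** 2) / (n - 2)
    slope_std_err = np.sqrt(residual_variance / sxx)
    t_stat = slope / slope_std_err
    # Two-sided p-value: probability of a t-statistic at least this extreme
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    return t_stat, p_value

# Made-up illustrative data (NOT taken from Salary_dataset.csv)
years = [1, 2, 3, 4, 5, 6]
salary = [40000, 45000, 52000, 58000, 65000, 69000]

t_stat, p_value = slope_t_test(years, salary)
reference = stats.linregress(years, salary)

print(f"t-statistic: {t_stat:.3f}")
print(f"p-value by hand: {p_value:.3g}")
print(f"p-value from linregress: {reference.pvalue:.3g}")
```

The hand-computed p-value should match the one reported by `linregress`, which tests the same null hypothesis with a two-sided test by default.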