Simple Linear Regression Example with Python

Linear Regression using Salary and Years of Experience Data

Data Source: Salary_dataset.csv (Kaggle)
The salary dataset includes two columns: YearsExperience, which will be our independent variable (X), and Salary, our dependent variable (Y).

Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable from the values of the independent variables (ChatGPT).

For this example, we first want to see whether there is a correlation between the two variables by fitting a regression line and calculating R-squared. Then we assess the significance of the relationship using the p-value to test the null hypothesis that there is no relationship between X and Y (i.e., X does not predict Y).
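As a quick side note (not part of the main script below): in simple linear regression with a single predictor, R-squared is just the square of the Pearson correlation between X and Y. A minimal sketch, assuming the same salary_dataset.csv file and column names used later in this post, would be:

import pandas as pd
from scipy import stats

df = pd.read_csv("salary_dataset.csv")  # assumed path, same as the full script below
r, _ = stats.pearsonr(df['YearsExperience'], df['Salary'])
print(f"Pearson r: {r:.4f}, r squared: {r**2:.4f}")  # r**2 matches the OLS R-squared for one predictor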

Simple Linear Regression
Formula
Formula: Y = b + aX + e

  1. Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
  2. Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
  3. a (slope): The change in Y for a one-unit change in X.
  4. b (intercept): The value of Y when X is zero, i.e., where the line crosses the Y axis.
  5. e (error term): The random error, i.e., the part of Y the line does not explain.
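To make the formula concrete, here is a quick worked example with made-up numbers (these are not values fitted from the dataset): if the intercept were b = 25,000 and the slope were a = 9,500, the model would predict a salary of 25,000 + 9,500 × 5 = 72,500 for someone with 5 years of experience.

# Hypothetical slope and intercept, not values fitted from the dataset
b, a = 25_000, 9_500
years = 5
print(b + a * years)  # predicted salary: 72500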

The following Python code (which I used ChatGPT to optimize) calculates the regression line and the p-value, and evaluates the null hypothesis.

The steps are as follows:
1. Use pandas to import the CSV file into a DataFrame and convert each series to an array.
2. Fit a linear regression model using sklearn, which returns the slope and intercept of the regression line.
3. The statsmodels library is then used to calculate the R-squared value and the p-value of the slope.
4. The null hypothesis is then evaluated based on the p-value.
5. scipy.stats and matplotlib.pyplot are then used to fit a comparison line and to plot the regression line on a graph.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

def load_data(csv_location):
    """Load CSV data into a pandas DataFrame."""
    df = pd.read_csv(csv_location)
    return df

def prepare_data(df):
    """Prepare independent and dependent variables for regression."""
    X = df[['YearsExperience']].values  # Extract as 2D array
    y = df['Salary'].values  # Extract as 1D array
    return X, y

def fit_sklearn_model(X, y):
    """Fit a linear regression model using sklearn."""
    model = LinearRegression()
    model.fit(X, y)
    return model

def fit_statsmodels_ols(X, y):
    """Fit a linear regression model using statsmodels OLS."""
    X_with_const = sm.add_constant(X)  # Add an intercept to the model
    model = sm.OLS(y, X_with_const).fit()
    return model

def plot_regression_line(df, intercept, slope):
    """Plot the regression line along with data points."""
    plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Data points')
    plt.plot(df['YearsExperience'], intercept + slope * df['YearsExperience'], color='red', label='Regression line')
    plt.title("Salary by Years of Experience")
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.legend()
    plt.show()

def main():
    csv_location = "salary_dataset.csv"
    df = load_data(csv_location)

    # Display basic statistics
    #print(df.describe())

    X, y = prepare_data(df)

    # Fit the model using sklearn
    sklearn_model = fit_sklearn_model(X, y)
    intercept, slope = sklearn_model.intercept_, sklearn_model.coef_[0]
    
    print("Calculation of Regression Line:\n")
    print(f"Intercept is: {intercept}")
    print(f"Slope is: {slope}")

    # Fit the model using statsmodels to get p-values and R-squared
    statsmodels_model = fit_statsmodels_ols(X, y)
    # print(statsmodels_model.summary())

    # Extract R-squared and p-values
    r_squared = statsmodels_model.rsquared
    p_values = statsmodels_model.pvalues

    print(f"R-squared: {r_squared}")
    #print(f"P-values: {p_values}")

    # Extracting specific p-values by index
    intercept_p_value = p_values[0]  # First p-value (intercept)
    slope_p_value = p_values[1]  # Second p-value (YearsExperience)

    #print(f"Intercept p-value: {intercept_p_value}")
    print(f"p-value (YearsExperience): {slope_p_value}")

    print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.") 
    print("This is obtained from the t-distribution with nāˆ’2 degrees of freedom ")
    print("where n is the number of observations\n")

    if slope_p_value > 0.05:
        print("P-value is not significant (greater than 0.05) and therefore we fail to reject the null hypothesis.")
    else:
        print("P-value is less than or equal to 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor X has a statistically significant effect on the outcome Y.")
    # Plotting the regression line
    plot_regression_line(df, intercept, slope)

    # Fit a linear regression line using scipy.stats (for comparison with sklearn/statsmodels)
    slope, intercept, r_value, p_value, std_err = stats.linregress(df['YearsExperience'], df['Salary'])
    print(f"\nscipy.stats.linregress comparison -> slope: {slope}, intercept: {intercept}, p-value: {p_value}")
    # plt.text(df['YearsExperience'].min(), df['Salary'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')

if __name__ == "__main__":
    main()
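To back up the printed explanation of where the p-value comes from, here is a minimal sketch (not part of the script above) that computes the slope's t-statistic and two-sided p-value by hand from the t-distribution with n-2 degrees of freedom. It assumes the same salary_dataset.csv file and column names, and its result should agree with the statsmodels output.

import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("salary_dataset.csv")  # assumed path, same as the script above
x = df['YearsExperience'].to_numpy()
y = df['Salary'].to_numpy()

n = len(x)
x_mean, y_mean = x.mean(), y.mean()

# Least-squares slope and intercept
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residual variance with n-2 degrees of freedom, then the slope's standard error
residuals = y - (intercept + slope * x)
se_slope = np.sqrt(np.sum(residuals ** 2) / (n - 2) / np.sum((x - x_mean) ** 2))

# t-statistic for the null hypothesis "slope = 0" and its two-sided p-value
t_stat = slope / se_slope
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.6f}")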


