Scikit-Learn Linear Regression Tutorial

Linear regression is a powerful tool in machine learning for predicting a continuous outcome based on one or more input variables. This tutorial will guide you through the basics of linear regression using the Scikit-Learn library in Python.

What is Linear Regression?

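Linear regression models a continuous target variable as a weighted sum of one or more input variables plus a constant. Fitting the model means finding the weights and the constant that minimize the sum of squared differences between the predicted and the observed target values (ordinary least squares).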

Key Concepts

Dependent Variable: The variable you are trying to predict.
Independent Variables: The variables used to predict the dependent variable.
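Coefficient: The weight the model assigns to an independent variable, that is, the change in the predicted value for a one-unit increase in that variable while the others are held constant.
Intercept: The predicted value of the dependent variable when all independent variables are zero.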
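
To make these terms concrete, here is a minimal sketch that fits Scikit-Learn's LinearRegression to a small made-up dataset lying on the line y = 2x + 1, then reads back the fitted coefficient and intercept. The data and the variable names are assumptions chosen for illustration; they are not part of the house-price example used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that lies exactly on the line y = 2x + 1 (illustrative only)
x_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # independent variable, shape (5, 1)
y_demo = 2.0 * x_demo.ravel() + 1.0                      # dependent variable

demo_model = LinearRegression().fit(x_demo, y_demo)

coefficient = demo_model.coef_[0]   # weight for the single independent variable
intercept = demo_model.intercept_   # prediction when the independent variable is 0

print("Coefficient:", round(coefficient, 6))
print("Intercept:", round(intercept, 6))
```

Because the data has no noise, the fitted values should come out very close to 2 and 1. With real data the fit is only approximate.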

Getting Started

To follow along, you'll need Python, Scikit-Learn, and pandas (used below to load the data). You can install both packages using pip:

pip install scikit-learn pandas
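
One quick way to confirm that the installation worked is to import the package and print its version. This is only a sanity check, not a required step, and the version printed depends on your environment.

```python
import sklearn

# A successful import and a printed version confirm the package is installed
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```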

Data Preparation

Before you can start building a linear regression model, you need to prepare your data. This typically involves loading the data, separating the features from the target variable, encoding any categorical variables as numbers, and splitting the data into training and testing sets.
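
The encoding and splitting steps can be sketched as follows. This is a minimal example under stated assumptions: the small DataFrame, the neighborhood column, the 25% test fraction, and the random_state value are all made up for illustration and do not come from the dataset used in the rest of this tutorial. It uses pandas' get_dummies for one-hot encoding and Scikit-Learn's train_test_split for the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical housing data; the 'neighborhood' column is invented for this sketch
houses = pd.DataFrame({
    "size": [850, 1200, 1500, 1800, 2100, 2400, 900, 1650],
    "neighborhood": ["north", "south", "north", "east", "south", "east", "north", "south"],
    "price": [150000, 210000, 260000, 320000, 355000, 410000, 160000, 290000],
})

# One-hot encode the categorical column so every feature is numeric
features = pd.get_dummies(houses[["size", "neighborhood"]], columns=["neighborhood"])
target = houses["price"]

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)

print(features.columns.tolist())
print("Training rows:", len(X_train), "Test rows:", len(X_test))
```

The test rows are set aside so that the model can later be scored on data it has not seen during fitting.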

Example Data

Let's say you have a dataset containing information about houses, with a size column (the size of the house in square feet) and a price column (the price of the house).

import pandas as pd

# Load the dataset
data = pd.read_csv('/path/to/your/data.csv')

# Separate the features (X) from the target variable (y).
# The double brackets keep X two-dimensional (a DataFrame), as Scikit-Learn expects.
X = data[['size']]
y = data['price']

Building the Model

Once you have your data prepared, you can build a linear regression model using Scikit-Learn's LinearRegression class.

from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Fit the model to the data
model.fit(X, y)

Making Predictions

After the model is trained, you can use it to make predictions on new data.

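# Make a prediction for a 1500-square-foot house.
# Use a DataFrame with the same column name as the training data so that
# Scikit-Learn does not warn about missing feature names.
new_data = pd.DataFrame({'size': [1500]})
prediction = model.predict(new_data)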

print("Predicted house price:", prediction[0])
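
Under the hood, a prediction from a fitted linear model is the intercept plus each coefficient times its feature value. The sketch below checks this on a few made-up size and price pairs (the numbers are invented for illustration) by comparing the output of predict with the manual calculation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data for illustration only
train = pd.DataFrame({
    "size": [1000, 1500, 2000, 2500],
    "price": [200000, 290000, 410000, 500000],
})
demo_model = LinearRegression().fit(train[["size"]], train["price"])

# Use a DataFrame with the same column name the model was fitted on
new_house = pd.DataFrame({"size": [1500]})
predicted = demo_model.predict(new_house)[0]

# Reproduce the prediction by hand: intercept + coefficient * size
manual = demo_model.intercept_ + demo_model.coef_[0] * 1500

print("predict():", predicted)
print("manual:   ", manual)
print("Match:", np.isclose(predicted, manual))
```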

Evaluating the Model

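It's important to evaluate your linear regression model on data it was not trained on, because scores computed on the training data tend to be overly optimistic. Scikit-Learn provides several metrics for regression models, such as Mean Squared Error (MSE), the average squared difference between predicted and actual values (lower is better), and R-squared, the proportion of the target's variance that the model explains (1.0 is a perfect fit).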

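from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing and refit on the training portion only.
# test_size and random_state are example values; adjust them for your data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)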

print("Mean Squared Error:", mse)
print("R-squared:", r2)
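
To see what these two metrics measure, the sketch below computes them by hand with NumPy and compares the results with Scikit-Learn's functions. The actual and predicted values are made-up numbers chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values (illustrative only)
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_hat = np.array([210.0, 240.0, 310.0, 340.0])

# MSE: the average of the squared prediction errors
manual_mse = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print("MSE:", manual_mse, "sklearn:", mean_squared_error(y_true, y_hat))
print("R-squared:", manual_r2, "sklearn:", r2_score(y_true, y_hat))
```

An MSE of 0 and an R-squared of 1 would indicate perfect predictions. On a test set, R-squared can be negative when the model does worse than always predicting the mean of the target.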