Linear regression is a powerful tool in machine learning for predicting a continuous outcome based on one or more input variables. This tutorial will guide you through the basics of linear regression using the Scikit-Learn library in Python.
What is Linear Regression?
Linear regression models the target variable as a weighted sum of the input variables plus a constant term. Fitting the model means finding the weights and the constant that minimize the squared differences between the predicted and observed values of the target.
Key Concepts
- Dependent Variable: The variable you are trying to predict.
- Independent Variables: The variables used to predict the dependent variable.
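- Coefficient: The weight given to each independent variable, that is, the change in the predicted value for a one-unit increase in that variable while the others are held fixed.
- Intercept: The predicted value of the dependent variable when all independent variables are zero.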
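To make these terms concrete, here is a minimal sketch that fits a model to a tiny synthetic dataset. The data is generated from y = 3x + 5 purely for illustration (it is not part of the housing example), so the fitted coefficient and intercept should come out close to 3 and 5.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 3 * x + 5 (illustrative only)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # independent variable, one column
y = 3.0 * X[:, 0] + 5.0                            # dependent variable

model = LinearRegression().fit(X, y)

coefficient = model.coef_[0]  # weight on the single independent variable
intercept = model.intercept_  # predicted value when the input is zero

print("Coefficient:", coefficient)
print("Intercept:", intercept)

# The prediction at x = 0 matches the intercept
prediction_at_zero = model.predict(np.array([[0.0]]))[0]
print("Prediction at zero:", prediction_at_zero)
```

After fitting, Scikit-Learn exposes the learned weights as `coef_` (one entry per independent variable) and the constant term as `intercept_`.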
Getting Started
To get started with linear regression, you'll need to have Python and Scikit-Learn installed. The examples in this tutorial also use pandas to load the data. You can install both using pip:
pip install scikit-learn pandas
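As a quick check that the installation worked, you can import the package and print its version. This is only a sanity check; the version number you see depends on your environment.

```python
import sklearn

# If this import succeeds, Scikit-Learn is installed in the active environment
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```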
Data Preparation
Before you can start building a linear regression model, you need to prepare your data. This involves loading the data, splitting it into training and testing sets, and encoding any categorical variables.
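The splitting and encoding steps can be sketched as follows. This is a minimal example under stated assumptions: the DataFrame, the `neighborhood` column, and all values are made up for illustration, and one-hot encoding with `pd.get_dummies` is just one common way to encode a categorical variable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical housing data; the 'neighborhood' column and all values are made up
data = pd.DataFrame({
    "size": [850, 1200, 1500, 1800, 2100, 2400, 900, 1650, 2000, 1300],
    "neighborhood": ["north", "south", "north", "east", "south",
                     "east", "north", "south", "east", "north"],
    "price": [150000, 210000, 260000, 320000, 365000,
              420000, 160000, 290000, 350000, 225000],
})

# One-hot encode the categorical column so every feature is numeric
X = pd.get_dummies(data[["size", "neighborhood"]], columns=["neighborhood"])
y = data["price"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training rows:", X_train.shape[0])
print("Testing rows:", X_test.shape[0])
```

Passing `drop_first=True` to `pd.get_dummies` drops one category per encoded column, which avoids perfectly collinear columns when the model also fits an intercept.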
Example Data
Let's say you have a dataset containing information about houses. The dataset includes the size of each house (in square feet) and its price, stored in columns named size and price.
import pandas as pd
# Load the dataset
data = pd.read_csv('/path/to/your/data.csv')
# Split the data into features and target variable
X = data[['size']]
y = data['price']
Building the Model
Once you have your data prepared, you can build a linear regression model using Scikit-Learn's LinearRegression class.
from sklearn.linear_model import LinearRegression
# Initialize the model
model = LinearRegression()
# Fit the model to the data
model.fit(X, y)
Making Predictions
After the model is trained, you can use it to make predictions on new data.
# Make a prediction
# Use the same column name as the training data so the features line up
new_data = pd.DataFrame({'size': [1500]}) # 1500 square feet
prediction = model.predict(new_data)
print("Predicted house price:", prediction[0])
Evaluating the Model
It's important to evaluate the performance of your linear regression model. Scikit-Learn provides several metrics for evaluating regression models, such as Mean Squared Error (MSE) and R-squared.
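from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows as a test set (test_size and random_state are example values)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Refit on the training rows only so the test rows stay unseen
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate the MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)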
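To see what these two metrics measure, the sketch below computes them by hand with NumPy and compares the results with Scikit-Learn's functions. The true and predicted values are made up for illustration and are not taken from the housing dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted prices, purely for illustration
y_true = np.array([200000.0, 250000.0, 300000.0, 350000.0])
y_hat = np.array([210000.0, 240000.0, 310000.0, 340000.0])

# MSE: the average of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Compare with the library implementations
print("MSE matches:", np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print("R-squared matches:", np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

MSE is expressed in squared units of the target, so lower is better. R-squared is unitless: a value of 1 means the predictions match the observed values exactly, and a value of 0 means the model does no better than always predicting the mean.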
Further Reading
For more information on linear regression and Scikit-Learn, check out the following resources: