In statistics, the “line of best fit” is a straight line drawn through a scatter plot of data points that best expresses the relationship between those points. Analysts use this line to summarize a trend and to estimate values beyond the observed data. Here’s how to plot a line of best fit in Python.
Understanding the Concept
The line of best fit is usually calculated using a method called linear regression. This method finds the line that minimizes the sum of the squares of the vertical distances from each data point to the line.
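To make the least-squares idea concrete, the slope and intercept can be computed directly from the standard closed-form formulas. The following is a minimal sketch using NumPy only; it reuses the sample points from the examples in this article, and the variable names are illustrative rather than part of any library.

```python
import numpy as np

# Sample points (same values as the sample data used in the examples)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3.6, 5, 7, 9])

# Closed-form least-squares estimates for a line y = slope * x + intercept
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Vertical distances from each point to the line, and their squared sum
residuals = y - (slope * x + intercept)
sse = np.sum(residuals ** 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, SSE={sse:.3f}")
```

For this data the slope comes out at about 1.74 and the intercept at about 0.10. Any other straight line through the same points would give a larger sum of squared vertical distances.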
Here is a diagram illustrating a scatter plot with a line of best fit:

Libraries Used
- NumPy: For numerical operations, especially handling arrays.
- Matplotlib: For plotting and visualizing the data.
- Scikit-learn: For the linear regression model.
- pandas: For loading data from a CSV file (used in Example 2).
Example 1: Basic Line of Best Fit
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample Data
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([2, 3.6, 5, 7, 9])
# Create and Fit Model
model = LinearRegression()
model.fit(x, y)
# Predict
y_pred = model.predict(x)
# Plotting
plt.scatter(x, y, label='Data Points')
plt.plot(x, y_pred, color='red', label='Line of Best Fit')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Line of Best Fit Example 1')
plt.legend()
plt.show()
Resulting Plot:

In this example:
- Sample data x and y are defined.
- A linear regression model is created and fitted to the data.
- The predict method generates the y-values for the line of best fit.
- Matplotlib is used to plot the scatter plot and the line of best fit.
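Once the model is fitted, the line itself can be inspected and used for estimates at new x-values. The sketch below assumes the same sample data as Example 1 and leaves out the plotting; the values in x_new are hypothetical inputs chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same sample data as Example 1
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([2, 3.6, 5, 7, 9])

model = LinearRegression().fit(x, y)

# The fitted line is y = slope * x + intercept
slope = model.coef_[0]
intercept = model.intercept_

# Estimate y for x-values outside the sample (hypothetical inputs)
x_new = np.array([[6], [7]])
y_new = model.predict(x_new)

# R^2 on the training data: 1.0 would mean every point lies on the line
r_squared = model.score(x, y)

print(f"y = {slope:.2f}x + {intercept:.2f}")
print("Predictions:", y_new)
print(f"R^2 = {r_squared:.4f}")
```

Estimates far outside the range of the original x-values should be treated with caution, since the linear relationship is only supported by the observed data.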
Example 2: Using Real-world Data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import pandas as pd
# Load data (replace 'your_data.csv' with your file path)
data = pd.read_csv('your_data.csv')
x = data['X'].values.reshape((-1, 1)) # Ensure X and Y are the correct column names
y = data['Y'].values
# Model
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)
# Plotting
plt.scatter(x, y, label='Data Points')
plt.plot(x, y_pred, color='red', label='Line of Best Fit')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Line of Best Fit Example 2')
plt.legend()
plt.show()
Resulting Plot:

This example extends the first one by:
- Loading data from a CSV file using pandas.
- Reshaping the input x into a two-dimensional column, as in Example 1, because scikit-learn expects one row per sample.
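Real CSV files often contain missing values or rows in no particular order. LinearRegression raises an error when the input contains NaN, so a cleaning step before fitting can help. The following is a sketch under stated assumptions: the CSV content is a small made-up sample held in memory, and the column names X and Y match the placeholders used in Example 2.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'your_data.csv'
csv_text = """X,Y
3,5.0
1,2.0
5,9.0
2,
4,7.0
"""

data = pd.read_csv(io.StringIO(csv_text))

# Drop rows with a missing X or Y, then sort by X so the line plots left to right
clean = data.dropna(subset=["X", "Y"]).sort_values("X")

# Shape the arrays the way model.fit(x, y) expects them
x = clean["X"].to_numpy().reshape((-1, 1))
y = clean["Y"].to_numpy()

print(x.ravel(), y)
```

Sorting is not required for fitting, but it keeps the plotted line from doubling back on itself when the rows are out of order.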
Example 3: Advanced Plot Customization
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample Data
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([2, 3.6, 5, 7, 9])
# Model
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)
# Plotting with customizations
plt.figure(figsize=(10, 6)) # Adjust figure size
plt.scatter(x, y, color='blue', marker='o', label='Data Points', s=50) # Customize scatter plot
plt.plot(x, y_pred, color='green', linestyle='--', linewidth=2, label='Line of Best Fit') # Customize line
plt.xlabel('Independent Variable', fontsize=12)
plt.ylabel('Dependent Variable', fontsize=12)
plt.title('Line of Best Fit Example 3', fontsize=14)
plt.grid(True) # Show grid
plt.legend()
plt.xticks(fontsize=10) # Customize ticks
plt.yticks(fontsize=10)
plt.show()
Resulting Plot:

Here, the plot is customized further:
- Figure size, marker style, line style, and colors are modified.
- Labels and titles have adjusted font sizes.
- A grid is added for better readability.
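One further customization is to show the fitted equation in the legend and to draw the line over evenly spaced x-values. The sketch below is only one possible approach: it swaps in np.polyfit in place of LinearRegression to get the slope and intercept, and the names label, x_line, and y_line are illustrative.

```python
import numpy as np

# Same sample data as Example 3, kept one-dimensional for np.polyfit
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3.6, 5, 7, 9])

# Degree-1 polynomial fit returns the coefficients as (slope, intercept)
slope, intercept = np.polyfit(x, y, deg=1)

# Text for the legend entry
label = f"y = {slope:.2f}x + {intercept:.2f}"

# Evenly spaced x-values spanning the data, and the matching points on the line
x_line = np.linspace(x.min(), x.max(), 100)
y_line = slope * x_line + intercept

print(label)
```

Passing x_line, y_line, and label to plt.plot in place of x, y_pred, and the fixed 'Line of Best Fit' label would then show the equation in the legend.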
Conclusion
Plotting a line of best fit in Python is straightforward with libraries like NumPy, Matplotlib, and scikit-learn. These examples provide a solid foundation for understanding and implementing linear regression in your data analysis projects.