Linear Regression Algorithm in Machine Learning

Shihara Dilshan
Jul 19, 2021

Hello everyone,

I hope you are all doing great. Today I am going to talk about something that is new to me, and maybe to you as well: Linear Regression, one of the most basic machine learning algorithms out there. So, simply put, what is Linear Regression?

Let’s make that question simple. Let’s talk about your GPA or your exam result. If I ask what determines your GPA or result, you would say it depends on how much time you spent studying, how many past papers you worked through, etc… For now, let’s assume your GPA or exam result depends only on how many hours you spent studying. So if you spent a lot of time studying, your result will probably be good, right? But if you spent less time studying, you will probably end up with a lower result.

OK, now let’s go back to your 6th grade math class. What is the equation you use when one variable depends linearly on another variable?

It is y = mx + c, where m is the slope of the line and c is the point where it crosses the y-axis.

This is the backbone of the Linear Regression algorithm in machine learning. Let’s say you have conducted a survey of 1000 random students and gathered data on their results and their study time. If you plot that data on a graph, you will probably end up with a graph where a straight line can be drawn through the data.
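To make the idea of drawing a line through the data concrete, here is a minimal sketch using numpy. The five data points are made up for illustration (they are not real survey results), and `np.polyfit` is just one convenient way to find m and c by least squares.

```python
import numpy as np

# Made-up study hours and GPAs, for illustration only (not real survey data).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
gpa = np.array([2.0, 2.5, 3.0, 3.5, 4.0])

# Fit a degree-1 polynomial, i.e. a straight line y = m*x + c, by least squares.
m, c = np.polyfit(hours, gpa, deg=1)

print(f"m = {m:.2f}, c = {c:.2f}")
print("Predicted GPA for 2.5 hours of study:", round(m * 2.5 + c, 2))
```

Because these points lie exactly on a line, the fit recovers that line. Real data would be noisier, and the line would only pass close to the points.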

But what happens if you get a graph that a line cannot be drawn through? Something like this.

Then, my friend, you can’t use linear regression to create a machine learning model that predicts anything useful from that data.

Another variation of this scenario: what if the student’s result depends on multiple attributes, like attendance, sleep time, IQ, etc…? That is not a very complex thing. In scenarios like that, we can draw a graph with multiple dimensions and still fit a linear shape through it (not exactly a line; with two attributes it is a plane). Don’t believe me? Take a look at the following image.

Like I said before, this is the backbone of the Linear Regression algorithm. It generates its own equation based on the data set we have given it. In other words, training the model means finding the values of m and c that fit the data best, so it can predict results with higher accuracy.
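As a rough preview of the multiple-attribute case, here is a small sketch with two inputs. The columns (study hours and sleep hours) and the formula used to build the GPA values are assumptions made up for this example, chosen so that we know which equation the model should recover.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: each row is [study hours, sleep hours].
X = np.array([[1, 6], [2, 8], [3, 5], [4, 7], [5, 6], [6, 8]], dtype=float)

# Made-up target built from a known equation, so we can check the result:
# GPA = 0.3 * study + 0.1 * sleep + 1.0
y = 0.3 * X[:, 0] + 0.1 * X[:, 1] + 1.0

model = LinearRegression().fit(X, y)

# One coefficient per attribute, plus the intercept (the "c" of the equation).
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

With two attributes the model learns one slope per attribute, so the equation becomes y = m1*x1 + m2*x2 + c.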

Let’s implement this in action using Python.

Create a Python script called app.py and open it using any IDE you like.

touch app.py

First of all, import all the necessary libraries.

import pandas as pd
import numpy as np
import sklearn.model_selection  # makes sklearn.model_selection.train_test_split available
from sklearn import linear_model

Next we need data. Let’s assume you have a CSV file with student data, in something similar to the format below, inside your current working directory.
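If you do not have such a file at hand, a throwaway one could be generated along these lines. Everything here is an assumption for practice purposes: the numbers are synthetic, the "absences" column is a hypothetical extra attribute, and only the column names "studytime" and "GPA" match what the rest of this article uses.

```python
import numpy as np
import pandas as pd

# Synthetic data for practice only; none of these numbers come from a real survey.
rng = np.random.default_rng(0)
n = 1000

studytime = rng.uniform(0, 10, size=n)      # hours spent studying
absences = rng.integers(0, 20, size=n)      # hypothetical extra column we will not use
noise = rng.normal(0, 0.2, size=n)
gpa = np.clip(1.5 + 0.25 * studytime + noise, 0, 4)

df = pd.DataFrame({"studytime": studytime, "absences": absences, "GPA": gpa})

# Write the file into the current working directory.
df.to_csv("./data.csv", index=False)
print(df.head())
```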

Let’s read this CSV file using the pandas library.

data = pd.read_csv("./data.csv")

This returns a pandas data frame. The data frame may contain attributes that we do not need, so let’s remove those unnecessary attributes.

data = data[["studytime", "GPA"]]

What we want is to create a machine learning model that can predict a student’s GPA based on study time and some other factors (but here I only use one factor for simplicity).

Now let’s separate the “GPA” attribute from the rest of the data set and create two numpy arrays, x for the inputs and y for the values to predict, so we can split our data set into two parts for training and testing our model.

x = np.array(data.drop(["GPA"], axis=1))
y = np.array(data["GPA"])

Then we can split our data set

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)

The train_test_split function divides our data set into two different data sets, one for training and one for testing. The test_size argument determines what fraction of the data goes to the test data set. For example, if we have 1000 rows and want 900 for training and 100 for testing, test_size should be equal to 0.1.

A test_size of 0.1 to 0.2 is a common choice. If you go well beyond that, be careful: the model has less data to learn from, so its accuracy may fall.
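The 900/100 split described above can be checked with a quick sketch. The arrays here are placeholder numbers rather than student data, and `random_state=42` is an arbitrary value added only to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows with a single input column.
x = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# test_size=0.1 sends 10% of the rows to the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=42
)

print(x_train.shape, x_test.shape)
```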

Now let’s create and train our model.

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)

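Optionally, if you want to see how well your model performs, you can use score. For LinearRegression it returns the coefficient of determination (R²) on the data you pass in, where 1.0 means a perfect fit.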

accuracy = linear.score(x_test, y_test)
print(accuracy)
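It may help to know what this number means. For a regression model, `score` is not a percentage of correct answers. It is the R² value, which measures how much of the variation in the test data the fitted line explains. The sketch below uses synthetic data (an assumption for illustration) to show that `score` matches `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data for illustration: GPA rises with study time, plus some noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 1.5 + 0.25 * x[:, 0] + rng.normal(0, 0.2, size=200)

# Simple manual split: first 180 rows for training, last 20 for testing.
x_train, x_test = x[:180], x[180:]
y_train, y_test = y[:180], y[180:]

linear = LinearRegression().fit(x_train, y_train)

score = linear.score(x_test, y_test)
manual = r2_score(y_test, linear.predict(x_test))

print(score, manual)
```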

Finally, we can see how our model predicts the results.

predictions = linear.predict(x_test)
for i in range(len(predictions)):
    print(predictions[i], x_test[i], y_test[i])

Full code

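import pandas as pd
import numpy as np
import sklearn.model_selection
from sklearn import linear_model

# Load the CSV and keep only the columns we need
data = pd.read_csv("./data.csv")
data = data[["studytime", "GPA"]]

# Inputs (x) and values to predict (y) as numpy arrays
x = np.array(data.drop(["GPA"], axis=1))
y = np.array(data["GPA"])

# Hold out 10% of the rows for testing
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)

# Train the model and print its R² score on the test set
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
accuracy = linear.score(x_test, y_test)
print(accuracy)

# Compare each prediction with its input and the actual value
predictions = linear.predict(x_test)
for i in range(len(predictions)):
    print(predictions[i], x_test[i], y_test[i])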

Shihara Dilshan

Associate Technical Lead At Surge Global | React | React Native | Flutter | NodeJS | AWS | Type Script | Java Script | Dart | Go