Prerequisites: This article assumes a basic understanding of the Python programming language.
Hello, my fellow readers! I am here with another new article, this time to walk you through the bias/variance trade-off concept in data science and machine learning with an experiment. I will keep releasing articles about data science and machine learning to simplify concepts for my audience, so follow my page if you haven't already. That being said, let's dive into the article.
What is this article about?
This article extends my previous article, which mostly focused on the theoretical side of the bias/variance trade-off. Here, we will experiment with those theories. If you haven't read it yet, I suggest you check it out using this link before you dive in.
As we discussed in the previous article, bias issues occur when we oversimplify our model of the data, and variance issues occur when we overcomplicate it. In other words, a bias issue appears when we limit the degrees of freedom of our model, and a variance issue appears when we give it too many. Bias makes our model prone to underfitting, and variance makes it prone to overfitting. That being said, let's see exactly how these problems occur.
To keep the article concise and not boring, I will only quote short snippets from my notebook, which is on GitHub. If you prefer to read the full notebook, here you go with the GitHub link.
Well, first we need to initialize some random x values and compute the y values from them. We will use a quadratic formula to compute the y values. To make it look like real-world data, we add some noise, Gaussian noise in our case.
import numpy as np
x = np.random.uniform(-10, 10, 100)
y = 2 * x**2 + 3 * x + 5 + np.random.normal(0, 20, 100)
This is what the graph of this equation looks like. As the plot below shows, the data is clearly quadratic.
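If you want to reproduce the plot yourself, a minimal sketch looks like this (the fixed seed and the output file name are my own additions, not from the original notebook):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; use plt.show() in a notebook instead
import matplotlib.pyplot as plt

np.random.seed(0)  # fixed seed for reproducibility (an assumption, not in the original)
x = np.random.uniform(-10, 10, 100)
y = 2 * x**2 + 3 * x + 5 + np.random.normal(0, 20, 100)

# Scatter the noisy quadratic data
plt.scatter(x, y, s=15)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Quadratic data with Gaussian noise")
plt.savefig("quadratic_data.png")
```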
Next, we have to split the data into a train and test set so that we can use the train set for training and the test set for testing.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
Well, we have already initialized the x (input) and y (output) values and split the data into training and test sets. Let's get to training the models now. To show you the bias issue, let's first train a linear regression model on this data. Since the data is quadratic, the model will underfit while trying to squeeze quadratic data into a linear model.
from sklearn.linear_model import LinearRegression
ln_model = LinearRegression()
ln_model.fit(x_train.reshape(-1, 1), y_train)
Well, the model is trained now. Let's check its accuracy on the training set (for a regressor, scikit-learn's score method actually returns the R² value, which I will loosely call accuracy here).
Oops! A 13% accuracy is very low: the model is underfitting the training set. This is ridiculous. Let's look at the accuracy on the test set.
Again! The test set accuracy is even worse than the train set accuracy. The performance is very low in both cases. This is what we call a bias issue: we are treating the data set as linear despite it being quadratic, so the model underfits because quadratic data can't fit into a linear model. We know it is underfitting because the performance is very low on both the learned (training) set and the new (test) set.
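The two checks above can be sketched as follows. This regenerates the data with fixed seeds (the seed values are my own additions for reproducibility, so the exact numbers will differ slightly from the notebook's 13%):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the data and split with fixed seeds (seeds are assumptions, not in the original)
np.random.seed(0)
x = np.random.uniform(-10, 10, 100)
y = 2 * x**2 + 3 * x + 5 + np.random.normal(0, 20, 100)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Fit the linear model and read off the R² score on both splits
ln_model = LinearRegression()
ln_model.fit(x_train.reshape(-1, 1), y_train)
train_r2 = ln_model.score(x_train.reshape(-1, 1), y_train)
test_r2 = ln_model.score(x_test.reshape(-1, 1), y_test)
print(f"train R²: {train_r2:.2f}, test R²: {test_r2:.2f}")
```

Both scores come out low, which is the signature of underfitting: the model performs poorly even on the data it was trained on.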
If a simple model causes a bias issue, let's use a complex model and measure its performance. A complex model will fix the bias issue, but it will land us in another one. Let's see how it goes together…
Well, let's train an 8th-degree polynomial model on the data. Let's initialize the model and fit the data.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=8)
x_train_poly = poly.fit_transform(x_train.reshape(-1, 1))
poly_model = LinearRegression()
poly_model.fit(x_train_poly, y_train)
In the above code, I am using the PolynomialFeatures class from the scikit-learn library. This class transforms the data into the polynomial degree we want, so that we can use scikit-learn's LinearRegression class to train a polynomial model. More about PolynomialFeatures can be found using this link.
Well, the model is trained now. Let’s check the accuracy of the training set.
Woah! We have very good accuracy on the training set; 92% is quite good. But don't get too excited, since the model might be overfitting the data. To check whether it is overfitting or not, let's have a look at the accuracy on new (test) data.
Oops! The accuracy on the test set is much worse. A drop from 92% to 80% is a bad sign: it shows that the model is overfitting the data and has a variance issue, since we gave it too many degrees of freedom.
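For completeness, here is a sketch of the same evaluation for the degree-8 model (again with my own fixed seeds, so the scores will not match the notebook's 92%/80% exactly). Note that the test inputs must be transformed with the already-fitted PolynomialFeatures object, not re-fitted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Recreate the data and split with fixed seeds (seeds are assumptions, not in the original)
np.random.seed(0)
x = np.random.uniform(-10, 10, 100)
y = 2 * x**2 + 3 * x + 5 + np.random.normal(0, 20, 100)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Expand both splits to degree-8 polynomial features
poly = PolynomialFeatures(degree=8)
x_train_poly = poly.fit_transform(x_train.reshape(-1, 1))
x_test_poly = poly.transform(x_test.reshape(-1, 1))  # transform only: reuse the fitted mapping

poly_model = LinearRegression()
poly_model.fit(x_train_poly, y_train)
train_r2 = poly_model.score(x_train_poly, y_train)
test_r2 = poly_model.score(x_test_poly, y_test)
print(f"train R²: {train_r2:.2f}, test R²: {test_r2:.2f}")
```

The gap between the training score and the test score is the variance symptom the article describes.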
Overfitting occurs when the model learns the training data too well by taking advantage of the degrees of freedom it is given. After training, the model is confused by new data because it memorized the training data rather than learning its underlying patterns.
What is the solution then?
Well, we have a variety of solutions; most of them are listed in the previous article. Generally, techniques like scaling the data help avoid many issues, and above all, visualizing the data is your best bet for choosing a good model for your dataset.
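As one concrete illustration of that advice (my own sketch, not from the original notebook): once a scatter plot reveals the quadratic shape of the data, a degree-2 polynomial model matches the data-generating process and performs well on both splits, avoiding both underfitting and overfitting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Recreate the data with fixed seeds (seeds are assumptions, not in the original)
np.random.seed(0)
x = np.random.uniform(-10, 10, 100)
y = 2 * x**2 + 3 * x + 5 + np.random.normal(0, 20, 100)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# A degree-2 model matches the true quadratic relationship
poly2 = PolynomialFeatures(degree=2)
quad_model = LinearRegression()
quad_model.fit(poly2.fit_transform(x_train.reshape(-1, 1)), y_train)

train_r2 = quad_model.score(poly2.transform(x_train.reshape(-1, 1)), y_train)
test_r2 = quad_model.score(poly2.transform(x_test.reshape(-1, 1)), y_test)
print(f"train R²: {train_r2:.2f}, test R²: {test_r2:.2f}")
```

Because the model capacity matches the data, the two scores stay close to each other, which is exactly what the bias/variance trade-off asks for.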
That's all about experimenting with the bias/variance trade-off; I hope you gained something from this article. If you have any feedback, let me know in the comment section. That being said, if you enjoyed the article, I guess I deserve a follow and a clap. Please share this article with your friends and family. Stay tuned!
Related articles from the author