Linear regression is a common statistical data analysis technique.

It is used to determine the extent to which there is a linear relationship between a dependent variable and one or more independent variables, and is widely used in machine learning applications.

There are two types of linear regression, simple linear regression and multiple linear regression.

In simple linear regression a single independent variable is used to predict the value of a dependent variable. In multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables. In both cases there is only a single dependent variable.

In this post I’m going to introduce basic concepts on linear regression and build a simple linear regression example using go and spago.

Simple regression

Simple linear regression uses traditional slope-intercept form, where $m$ and $b$ are the variables our algorithm will try to “learn” to produce the most accurate predictions.

$x$ represents our input data and $y$ represents our prediction.

$$ y = mx + b $$

Multivariable regression

A more complex, multi-variable linear equation might look like this, where $w$ represents the coefficients, or weights, our model will try to learn.

$$ f(x,y,z) = w_1x + w_2y, + w_3z $$

The variables $x$, $y$, and $z$ represent attributes or distinct pieces of information we have about each observation.

For example, take app download predictions, these predictions might include advertising spend across mediums like Facebook, Instagram, and Google.

$$ AppDownloads = w_1Facebook + w_2Instagram + w_3Google $$

But no need to worry about multivar things now, we’ll just be focusing on simple regressions for the rest of the post.

Prediction, Cost functions, Gradient descent, Training & Evaluation

Prediction

Let’s say our our prediction function outputs an estimate of sales given a company’s social media advertising spend and our current values for Weight and Bias.

$$ Sales = Weight \times SocialMedia + Bias $$

$Weight$: the coefficient for the $SocialMedia$ independent variable. In machine learning we call coefficients weights.

$SocialMedia$: the independent variable. In machine learning we call these variables features.

$Bias$: the intercept where our line intercepts the y-axis. In machine learning we can call intercepts bias. Bias offsets all predictions that we make.

Our algorithm will try to learn the correct values for $Weight$ and $Bias$. By the end of our training, our equation will approximate the line of best fit.

Cost function

While predictions are nice, we don’t really need it for this example, we need a cost function to help optimize our weights.

For our example we’ll use Mean Squared Error (MSE). MSE measures the average squared difference between an observation’s actual and predicted values. The output is a single number representing the cost, or score, associated with our current set of weights. Our goal is to minimize MSE to improve the accuracy of our model.

Gradient descent

To minimize MSE we use Gradient Descent to calculate the gradient of our cost function. I wrote a small post on Recursive Descent that shows an example of how to build a set of classes for building and evaluating partial derivatives which help in building gradient descent applications. spago will conveniently do this for us!

To solve for the gradient, we iterate through our data points using our new weight and bias values and take the average of the partial derivatives along the way.

The resulting gradient tells us the slope of our cost function at our current position (i.e. weight and bias) and the direction we should update to reduce our cost function (we move in the direction opposite the gradient). The size of our update is controlled by the learning rate.

Training & Evaluation

Training a model is the process of iteratively improving your prediction equation by looping through the dataset multiple times, each time updating the weight and bias values in the direction indicated by the slope of the cost function (gradient).

Training is complete when we reach an acceptable error threshold, or when subsequent training iterations fail to reduce our cost.

Before training we need to initialize our weights (set default values), set our hyperparameters (learning rate and number of iterations), and prepare to log our progress over each iteration.

If our model is working, we should see our cost decrease after every iteration.

An example with spago

Let’s begin by building a simple regression model with spago. I’m using go1.14.4 at the time of writing this post.

This example will only predict a single weight, not bias.

Create a dir and initialize your go modules.

mkdir spagolr && cd spagolr
go mod init example.com/spagolr

Install spago.

go get -u github.com/nlpodyssey/spago

Scaffold out the application.

// main.go

package main

import (
	"fmt"
	"github.com/nlpodyssey/spago/pkg/mat"
	"github.com/nlpodyssey/spago/pkg/ml/ag"
	"github.com/nlpodyssey/spago/pkg/ml/nn"
)

var (
	_ nn.Model     = &LinearRegression{}
	_ nn.Processor = &Processor{}
)

// LinearRegression – we'll use this struct to store the weights of our
// function using spago's nn.Param
type LinearRegression struct {
	Weights *nn.Param `type:"weights"`
}

// Processor – this will run our regression
type Processor struct {
	nn.BaseProcessor
	w ag.Node
}

func main() {
	fmt.Println("🚀")
}

Now we’ll write functions to do the following:

  1. Create a LinearRegression
  2. Create a new Processor to process the model
  3. Run the processor
func NewLinearRegression(in, out int) *LinearRegression {
	return &LinearRegression{
		Weights: nn.NewParam(mat.NewEmptyDense(out, in)),
	}
}

func (m *LinearRegression) NewProc(g *ag.Graph) nn.Processor {
	return &Processor{
		BaseProcessor: nn.BaseProcessor{
			Model:             m,
			Mode:              nn.Training,
			Graph:             g,
			FullSeqProcessing: false,
		},
		w: g.NewWrap(m.Weights),
	}
}

func (p *Processor) Forward(xs ...ag.Node) []ag.Node {
	ys := make([]ag.Node, len(xs))
	for i, x := range xs {
		ys[i] = p.Graph.Mul(p.w, x)
	}
	return ys
}

Now for the fun part, writing the main function to actually run and train our model. Let’s say for something super simple like $y=2x+1$

Your final code should look like the following

package main

import (
	"fmt"
	"github.com/nlpodyssey/spago/pkg/mat"
	"github.com/nlpodyssey/spago/pkg/mat/rand"
	"github.com/nlpodyssey/spago/pkg/ml/ag"
	"github.com/nlpodyssey/spago/pkg/ml/initializers"
	"github.com/nlpodyssey/spago/pkg/ml/losses"
	"github.com/nlpodyssey/spago/pkg/ml/nn"
	"github.com/nlpodyssey/spago/pkg/ml/optimizers/gd"
	"github.com/nlpodyssey/spago/pkg/ml/optimizers/gd/sgd"
)

var (
	_ nn.Model     = &LinearRegression{}
	_ nn.Processor = &Processor{}
)

// LinearRegression – we'll use this struct to store the weights of our
// function using spago's nn.Param
type LinearRegression struct {
	Weights *nn.Param `type:"weights"`
}

// Processor – this will run our regression
type Processor struct {
	nn.BaseProcessor
	w ag.Node
}

// NewLinearRegression ...
func NewLinearRegression(in, out int) *LinearRegression {
	return &LinearRegression{
		Weights: nn.NewParam(mat.NewEmptyDense(out, in)),
	}
}

// NewProc ...
func (m *LinearRegression) NewProc(g *ag.Graph) nn.Processor {
	return &Processor{
		BaseProcessor: nn.BaseProcessor{
			Model:             m,
			Mode:              nn.Training,
			Graph:             g,
			FullSeqProcessing: false,
		},
		w: g.NewWrap(m.Weights),
	}
}

// Forward ...
func (p *Processor) Forward(xs ...ag.Node) []ag.Node {
	ys := make([]ag.Node, len(xs))
	for i, x := range xs {
		ys[i] = p.Graph.Mul(p.w, x)
	}
	return ys
}

func main() {
	inputDim := 1  // takes variable 'x'
	outputDim := 1 // takes variable 'y'
	learningRate := 0.0001
	epochs := 100
	seed := 8686 // seed for random params initialization
	model := NewLinearRegression(inputDim, outputDim)
	criterion := losses.MSESeq                                                // mean squared error
	updater := sgd.New(sgd.NewConfig(learningRate, 0.0, false))               // stochastic gradient descent (no momentum etc.)
	optimizer := gd.NewOptimizer(updater, nn.NewDefaultParamsIterator(model)) // link the model to the optimizer

	// Random params initialization
	rndGen := rand.NewLockedRand(uint64(seed))
	nn.ForEachParam(model, func(param *nn.Param) {
		if param.Type() == nn.Weights {
			initializers.XavierUniform(param.Value(), 1.0, rndGen)
		}
	})

	// Create dummy data for training with a very basic linear equation i.e., y=2x+1.
	// Here, ‘x’ is the independent variable and y is the dependent variable.
	n := 110
	xValues := make([]float64, n)
	yValues := make([]float64, n)
	for i := 0; i < n; i++ {
		xValues[i] = float64(i)
		yValues[i] = 2*xValues[i] + 1
	}

	for epoch := 0; epoch < epochs; epoch++ {
		// get a new computational graph (cg)
		cg := ag.NewGraph()

		// Converting x and y values to graph nodes
		var inputs []ag.Node
		var labels []ag.Node
		for i := 0; i < n; i++ {
			inputs = append(inputs, cg.NewScalar(xValues[i]))
			labels = append(labels, cg.NewScalar(yValues[i]))
		}

		// get output (i.e. prediction) from the model, given the inputs
		outputs := model.NewProc(cg).Forward(inputs...)

		// get loss for the predicted output
		loss := criterion(cg, outputs, labels, true) // true = reduce mean

		// get gradients w.r.t to parameters
		cg.Backward(loss)

		// update parameters
		optimizer.Optimize()

		fmt.Printf("epoch %d, loss %.6f\n", epoch, loss.ScalarValue())
	}

	fmt.Printf("Learned coefficient: %.6f\n", model.Weights.ScalarValue())
}

Now let’s run our program which will train our model to learn the coefficient of the function we used to generate our test points.

$ go run main.go
epoch 0, loss 12344.992712
epoch 1, loss 4476.185291
epoch 2, loss 1623.076722
epoch 3, loss 588.583429
epoch 4, loss 213.492069
epoch 5, loss 77.489710
epoch 6, loss 28.177345
epoch 7, loss 10.297441
epoch 8, loss 3.814462
epoch 9, loss 1.463834
epoch 10, loss 0.611532
epoch 11, loss 0.302500
epoch 12, loss 0.190450
epoch 13, loss 0.149823
epoch 14, loss 0.135092
epoch 15, loss 0.129751
...
epoch 95, loss 0.126712
epoch 96, loss 0.126712
epoch 97, loss 0.126712
epoch 98, loss 0.126712
epoch 99, loss 0.126712
Learned coefficient: 2.013699 # wahoo!

We got pretty close! If we wanted to solve for something like say $y = 2x + 50$. We would have to tune our hyperparameters.

For example, a learning rate of 0.00000001, 1000 epochs, and 1100 test points yields:

$ go run main.go
epoch 0, loss 1305405.200653
epoch 1, loss 1294912.981077
epoch 2, loss 1284505.113139
epoch 3, loss 1274180.918700
epoch 4, loss 1263939.725069
epoch 5, loss 1253780.864968
epoch 6, loss 1243703.676479
epoch 7, loss 1233707.503008
epoch 8, loss 1223791.693240
epoch 9, loss 1213955.601094
epoch 10, loss 1204198.585685
...
epoch 990, loss 754.597889
epoch 991, loss 751.047094
epoch 992, loss 747.524845
epoch 993, loss 744.030914
epoch 994, loss 740.565072
epoch 995, loss 737.127093
epoch 996, loss 733.716753
epoch 997, loss 730.333831
epoch 998, loss 726.978106
epoch 999, loss 723.649359
Learned coefficient: 2.023235 # also really close!

For ml applications of linear regression, it’s hard to ignore the feature rich ecosystem that python provides, but it’s great to know that spago is available for teams, primarily writing backend applications in go, to build ml into their system without having to context switch between languages.

This is a big post, and if you were following along and something didn’t go quite right, let me know; an email or twitter dm works just fine!

Sources:

https://ml-cheatsheet.readthedocs.io/en/latest/

https://github.com/nlpodyssey/spago