What is small batch learning

One, batch gradient descent

In implementing the linear perceptron algorithm, we used gradient descent to minimize the cost function, and every weight update was computed from the entire training set. This variant is called batch gradient descent. Once the training set grows to several million or even hundreds of millions of records, batch gradient descent becomes impractical: every single weight update has to evaluate every record, which is wasteful. In that situation we can switch to stochastic gradient descent, or to mini-batch (small-batch) gradient descent, which updates on a small random subset (for example, 100 randomly selected samples).
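To make the idea concrete, here is a minimal sketch (not the implementation referred to elsewhere in this article) of batch gradient descent for a linear model with squared-error loss; the names X, y, eta, and n_iter are assumptions for illustration:

    import numpy as np

    def batch_gradient_descent(X, y, eta=0.01, n_iter=50):
        """Fit a linear model y ~ X @ w + b; each update uses all training samples."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(n_iter):
            errors = y - (X @ w + b)     # evaluated on the ENTIRE training set
            w += eta * X.T @ errors      # one weight update per full pass
            b += eta * errors.sum()
        return w, b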

Two, stochastic gradient descent

Stochastic gradient descent is also referred to as iterative or online gradient descent. The main difference between stochastic gradient descent and batch gradient descent lies in the weight update strategy. In batch gradient descent, each weight update is computed from the entire training set.
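The batch update formula itself is not reproduced in the text; for a linear model with prediction \hat{y}^{(i)}, learning rate \eta, and n training samples (an assumed setup consistent with the perceptron-style model above), it would take a form like

    w := w + \eta \sum_{i=1}^{n} \big(y^{(i)} - \hat{y}^{(i)}\big) x^{(i)}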


In stochastic gradient descent, by contrast, each weight update is computed from a single training sample.
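Again the formula is not shown in the original; under the same assumed setup, the stochastic update is applied once for each sample i in turn:

    w := w + \eta \big(y^{(i)} - \hat{y}^{(i)}\big) x^{(i)}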


Comparing the batch gradient descent and stochastic gradient descent weight update formulas shows that stochastic gradient descent updates the weights far more frequently: batch gradient descent updates the weights only once per pass over the entire training set, while stochastic gradient descent performs as many updates per pass as there are training samples. Compared to batch gradient descent, stochastic gradient descent therefore converges faster. Because each stochastic update is based on a single training sample, the error surface it follows is not as smooth as that of batch gradient descent, which also makes it easier for stochastic gradient descent to escape shallow local optima. If we compare the plot of the stochastic gradient descent loss against the number of iterations with the corresponding plot for batch gradient descent (recall that lg x is negative for x < 1), the faster convergence of stochastic gradient descent is clearly visible.


To get good results from stochastic gradient descent, it is important to present the training data in random order, so the training set should be shuffled before every epoch to avoid cycling through the samples in a fixed pattern. With stochastic gradient descent we can also adjust the learning rate η over the course of training, for example η = c1 / (t + c2), where t is the number of iterations so far and c1 and c2 are constants. Note that the solution found by stochastic gradient descent is not necessarily the global optimum, but it tends toward it, and using an adaptive learning rate lets us approach the global optimum more closely.

Python implementation of the stochastic gradient descent algorithm
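The code listing itself does not appear in this article; the following is a minimal sketch under the assumptions above (squared-error loss on a linear model, shuffling before each epoch, and the adaptive learning rate c1 / (t + c2)); the names X, y, n_epochs, c1, and c2 are illustrative:

    import numpy as np

    def sgd(X, y, n_epochs=15, c1=1.0, c2=10.0, seed=1):
        """Stochastic gradient descent: one weight update per training sample."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        b = 0.0
        t = 0
        for _ in range(n_epochs):
            for i in rng.permutation(len(y)):   # shuffle the training set every epoch
                t += 1
                eta = c1 / (t + c2)             # adaptive learning rate
                error = y[i] - (X[i] @ w + b)   # a single training sample
                w += eta * error * X[i]
                b += eta * error
        return w, b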

Three, online learning with stochastic gradient descent

We can also apply stochastic gradient descent to an online learning system. An online learning system updates the model in real time as new data arrives, without retraining it from scratch. This approach is typically used with massive data sets and in web systems, and it also reduces storage costs: once the model has been updated on the new data, we can keep the model and discard the data.

Python implementation code
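The implementation code is not included here; a rough sketch of how the SGD update above can serve an online learning system (OnlineLinearModel and partial_fit are hypothetical names) might look like this:

    import numpy as np

    class OnlineLinearModel:
        """Holds the current weights and updates them as new samples stream in."""
        def __init__(self, n_features, eta=0.01):
            self.w = np.zeros(n_features)
            self.b = 0.0
            self.eta = eta

        def partial_fit(self, x, y):
            """One SGD step on a single new sample; the sample can be discarded afterwards."""
            error = y - (x @ self.w + self.b)
            self.w += self.eta * error * x
            self.b += self.eta * error
            return self

Whenever a new record (x, y) arrives, call partial_fit(x, y) and then drop the record; only the model itself needs to be stored.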

Four, small batch learning

The small-batch (mini-batch) learning algorithm sits between stochastic gradient descent and batch gradient descent. In each training step it randomly selects a small subset of the training set, say 50 or 100 samples, and updates the weights on that subset. In practice this is the most commonly used of the three approaches. Compared to batch gradient descent, its weights are updated more frequently and it converges faster; compared to stochastic gradient descent, it lets us replace the per-sample for loop with vectorized operations, which improves performance, as sketched below.
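As a rough sketch of this idea (again with assumed names X, y, batch_size, and squared-error loss on a linear model), the per-sample loop of stochastic gradient descent is replaced by a vectorized update over a small random subset:

    import numpy as np

    def minibatch_gradient_descent(X, y, eta=0.01, n_epochs=20, batch_size=50, seed=1):
        """Mini-batch gradient descent: each update uses a small random subset of the data."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        b = 0.0
        n = len(y)
        for _ in range(n_epochs):
            idx = rng.permutation(n)                       # shuffle once per epoch
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]      # e.g. 50 or 100 samples
                errors = y[batch] - (X[batch] @ w + b)
                w += eta * X[batch].T @ errors / len(batch)  # vectorized, no per-sample loop
                b += eta * errors.mean()
        return w, b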