What is small batch learning
A batch gradient descent
In implementing the linear perceptron algorithm, we used a gradient descent algorithm to minimize the cost function. When we update the weights, we use all of the training data. This gradient descent algorithm is called the batch gradient descent algorithm. When the training data set reaches a scale of several million or even hundreds of millions of data, the batch gradient descent algorithm is somewhat powerless. Every time the weight is updated, all records are used for evaluation, resulting in wasted costs. Therefore, at this point we can use a random gradient descent or a small batch gradient descent (100 randomly selected data).
Two stochastic gradients
The stochastic gradient descent is also referred to as iterative gradient descent or online gradient descent. The main difference between the stochastic gradient descent algorithm and the batch gradient descent algorithm is that the weight update strategy is different. The weight update process consists of the data from the entire training set, and the batch gradient descent weight update applies to the entire training set
The weight update of the stochastic gradient descent applies to individual training data
From the batch gradient descent and stochastic gradient descent weight update formulas, it can be seen that the stochastic gradient descent weight update is more frequent than the batch gradient descent weight update because the batch gradient descent weight is updated only once for the entire training set. And the number of times the stochastic gradient descent is updated is equal to the size of the training set. Compared to the batch gradient descent, the convergence speed of the stochastic gradient descent is faster. Since the stochastic gradient weight update is based on a single training sample, the error surface is not as smooth as the gradient descent, which also makes it easier for the stochastic gradient descent to jump out of a small range of local optimal solutions. Let's compare the change in the stochastic gradient descent loss function to the number of iterations and the change in the batch gradient descent function to the number of iterations. Since the value of the lgx function is less than 0 for x <1, the speed of convergence of the stochastic gradient descent can be clearly seen. Faster than batch gradient descent
In order to get more accurate results through stochastic gradient descent, it is very important to train the training data in a random manner. Therefore, each iteration must interrupt the training set to prevent entering the loop. Using the stochastic gradient descent algorithm, we can adjust the learning rate η of the stochastic gradient descent algorithm according to the number of iterations, e.g. B. c1 / (η + c2), where c1 and c2 are constants. If you have to be careful, the optimal solution obtained by the stochastic gradient descent algorithm is not necessarily the global optimal solution, but tends to be the global optimal solution. By using an adaptive learning rate, the global optimal solution can be further approached.
Python implements a stochastic gradient descent algorithm
Three, stochastic gradient descent algorithms implement online learning
We can also apply the stochastic gradient descent algorithm to an online learning system. The online learning system trains the model in real time when new data is generated without resetting the weight of the model. This algorithm is typically used in massive data and web systems. This algorithm can also reduce the cost of storing data. After the training is over, we can save the model and then discard the data.
Python implementation code
Four, small batch learning
The small batch learning algorithm is located between the random gradient descent algorithm and the batch gradient descent algorithm. Each time the small batch learning algorithm is trained, 50 or 100 pieces of data are randomly selected from the training set for training. Usually this method is used to compare Many. Compared to the batch gradient descent, its weights are updated more frequently and the speed of convergence is faster. Compared to stochastic gradient descent, we can use vector operations to replace the for loop and improve the algorithm's performance.
- Are there immoral tax laws
- What are maharatnas
- Facebook Live has been removed
- What are an alternative to Mac Pro
- How do you access your favorite sites
- What are the limitations of the differential amplifier
- What is nfte
- What Marvel characters are immortal
- What are the biggest myths about philosophy
- Who designs the Indian government's mobile apps
- What is special about Mosin Nagant from 1939
- What is Hailee Steinfeld's best role
- Which degree should I aim for?
- Is MBA according to BSc helpful in agriculture
- What is SA SX in KVPY
- Which company recently issued welding shares?
- What is waste management
- How many swear words do you know
- What are the functions of the CIA
- Can someone tell what cabron means
- Why are online ads never streamed live?
- Who rejected the Nobel Peace Prize
- Can a Christian believe in extraterrestrials?
- What does Dhaka mean in Bangla