Introduction

CSE 257 Search and Optimization by Sicun Gao was my first course on machine learning as part of my Masters. I was drawn to this course since everyone I asked told me it was more of a math course than an ML course. If you know me, you know my general skepticism towards ML: all I had ever been exposed to were libraries that did the work for you, or basic techniques like gradient descent presented without any explanation of why they actually work, and I never understood why people cannot explain the actions of ML algorithms when they seem so deterministic. This course, full credit to the professor, changed my mindset about that.

Numerical Optimization

The very first class started with level sets and gradients (spoiler: these get used until the end of the course). While I had never heard the term level sets used alongside gradient descent, it intuitively made sense that this is how it would work, and the rest of the course kept bolstering that intuition. Then we learnt about the Taylor expansion in n dimensions, necessary and sufficient conditions for optimality, half-space descent directions versus the Newton direction, and a cheap-to-compute estimate of the Hessian. Does not look like anything in ML, and I like it.
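
As a rough illustration (my own toy example, not from the lecture notes), here is the difference between a plain gradient step and a Newton step on a simple quadratic, where the Newton step uses the Hessian to jump straight to the minimizer:

```python
import numpy as np

# Toy objective f(x) = 0.5 * x^T A x - b^T x, with A positive definite.
# (A and b are made up purely for illustration.)
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(x):
    return A @ x - b          # gradient of the quadratic

def hessian(x):
    return A                  # constant Hessian for a quadratic

x = np.zeros(2)
# Gradient descent step: move along the negative gradient (a descent
# direction in the half-space where the directional derivative is negative).
x_gd = x - 0.1 * grad(x)
# Newton step: rescale the gradient by the inverse Hessian; for a quadratic
# this lands exactly on the minimizer A^{-1} b in one step.
x_newton = x - np.linalg.solve(hessian(x), grad(x))
print(x_gd, x_newton)
```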

Next, we learn about the three components of the Swiss Army knife of probabilistic ascent/descent algorithms:

  • Simulated Annealing: A seemingly very simple (too simple, in fact) algorithm with roots in materials physics rather than math that, in the limit, guarantees convergence to a global minimum, unlike gradient descent, which only works reliably on convex problems and at best finds a local minimum.
  • Cross Entropy Methods: A maximum likelihood estimation technique that flips the idea of searching the state space into searching a space of distributions over points in the state space. It looks like it adds complexity but in fact removes some (a rough sketch of this idea follows the list).
  • Search Gradient: Why randomly sample so much in CEM when you can use gradient descent to move the distribution to where we need it to be? Do that, and we have the search gradient.
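
Here is a minimal cross-entropy method sketch, with a made-up Gaussian search distribution and a made-up test function, just to show the flip from searching over points to searching over a distribution:

```python
import numpy as np

def cem_minimize(f, dim, iters=50, pop=100, elite_frac=0.2):
    """Minimal cross-entropy method sketch: search over a Gaussian
    distribution of candidate points instead of the points themselves."""
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        # Sample candidates from the current distribution.
        samples = mean + std * np.random.randn(pop, dim)
        scores = np.array([f(x) for x in samples])
        # Keep the elite (lowest-cost) samples and refit the distribution,
        # which is the maximum likelihood Gaussian over the elites.
        elite = samples[np.argsort(scores)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-8
    return mean

# Example: minimize a shifted quadratic (a made-up test function).
print(cem_minimize(lambda x: np.sum((x - 3.0) ** 2), dim=2))
```

The search gradient then keeps the same "distribution over points" view but replaces the elite-resampling step with a gradient update on the distribution's parameters.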

The next part is more of an Algorithms 101, where we learn about graph traversal (spoiler: becomes important later) such as breadth-first and depth-first search, Dijkstra's algorithm (uniform cost search), and the informed search algorithm, A*. It also goes into random-sampling-based search like Rapidly Exploring Random Trees, which mimic brain cells finding a connection.
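
For flavour, here is a minimal A* sketch (the graph interface and the toy 1-D example are my own assumptions, not the course's code):

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Minimal A* sketch: expand nodes by cost-so-far plus heuristic.
    `neighbors(n)` yields (next_node, edge_cost) pairs and `h(n)` is an
    admissible estimate of the remaining cost (both are toy assumptions)."""
    frontier = [(h(start), 0, start, [start])]
    seen = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in seen:
            continue
        seen.add(node)
        for nxt, cost in neighbors(node):
            if nxt not in seen:
                heapq.heappush(frontier,
                               (g + cost + h(nxt), g + cost, nxt, path + [nxt]))
    return None, float("inf")

# Toy 1-D example: walk from 0 to 5, heuristic = distance remaining.
path, cost = a_star(0, 5, lambda n: [(n - 1, 1), (n + 1, 1)], lambda n: abs(5 - n))
print(path, cost)
```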

Next comes one of the most interesting parts, seen through the lens of Tic-Tac-Toe: adversarial search with minimax trees. Using zero-sum games, we see how game paths can be represented as trees, and how every node in the tree can be evaluated to see which path wins. We see extensions for non-zero-sum games and then get into chance/random players with expectimax trees. We also get a reference to how AlphaGo works. With the knowledge that these trees quickly become unmanageably large even with pruning, this sets up perfectly for the content in the next section.
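
Here is a minimal minimax sketch on a tiny made-up take-away game (far smaller than Tic-Tac-Toe, but the same zero-sum idea of evaluating the whole tree):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def minimax(stones, is_max):
    """Minimal minimax sketch on a made-up game: players alternately take
    1 or 2 stones, and whoever takes the last stone wins (+1 if that is the
    maximizer, -1 otherwise)."""
    if stones == 0:
        # The player who just moved took the last stone and won.
        return -1 if is_max else +1
    values = [minimax(stones - take, not is_max)
              for take in (1, 2) if take <= stones]
    # Maximizer picks the best outcome for itself, minimizer the worst.
    return max(values) if is_max else min(values)

print(minimax(4, True))   # with 4 stones the first player can force a win: +1
```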

Markov Decision Process and Reinforcement Learning

With the first slide being a Super Mario game played by the computer, attention was definitely captured. Skateboarding to class (or off the cliff) and the 2048 game were good too. The setup to the MDP, which defines states, actions, and transitions as a way to model the problem, was smooth. We see the example of Blackjack, the policies the player (agent) can use, the values you can assign to each state, and a reference back to expectimax for environment transitions. Slowly, things start to look like ML, built up from the basics. Discount factors as a way to keep infinite-horizon values finite lead to the Bellman equations for the value of a state, which in turn give a way to compute the optimal policy; watching that get proved via a contraction argument, of all things, unfold on the board was really amazing. We also saw policy iteration as a way to improve a given policy, moving it closer to the optimum.

The board that makes complete sense only if you attend classes.
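
To make the Bellman backup concrete, here is a minimal value iteration sketch on a made-up two-state MDP (transitions, rewards, and the discount factor are all invented for illustration); the repeated backup converges precisely because it is a contraction:

```python
import numpy as np

# P[s][a] is a list of (probability, next_state, reward) triples
# for a made-up 2-state, 2-action MDP.
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(1.0, 1, 0.5)]},
    1: {0: [(1.0, 0, 0.0)],                1: [(0.8, 1, 2.0), (0.2, 0, 0.0)]},
}
gamma = 0.9   # discount factor keeps the infinite-horizon value finite

V = np.zeros(2)
for _ in range(200):
    # Bellman optimality backup:
    # V(s) = max_a sum_s' P(s'|s,a) * (r + gamma * V(s'))
    V = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in P
    ])
print(V)   # the iterates converge because the backup is a max-norm contraction
```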

We also go into Monte Carlo utility estimation, temporal difference prediction, and exploration-versus-exploitation strategies, which lead us to the first main star of the show, Q-Learning. We derive the equations for estimating the values, try it out and voila! It works! But the Q-value table grows with the number of state-action pairs, which is usually exponential in the size of the problem. Instead of saving each value, generate it by storing a function that can estimate it. How can we do that? Instead of updating the values, update the function that estimates these values, like we did with distributions in CEM earlier, and throw in gradient descent. Boom, we have Deep Q-Learning. We have values now, but we are not using the trees that we learnt about previously. What about them, you ask? That's what the next section answers.
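
Here is a minimal tabular Q-learning sketch on a made-up five-state chain world (the environment, learning rate, and epsilon are all invented for illustration); Deep Q-Learning replaces the table below with a neural network trained by gradient descent:

```python
import random
from collections import defaultdict

# Made-up chain world: start at state 0, move left or right,
# reach state 4 for a reward of +1, which ends the episode.
N, GOAL = 5, 4
ACTIONS = (-1, +1)

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    done = (s2 == GOAL)
    return s2, (1.0 if done else 0.0), done

Q = defaultdict(float)                  # the table a deep network later replaces
alpha, gamma, eps = 0.1, 0.95, 0.1

for _ in range(500):                    # episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy: explore with small probability, else exploit Q.
        a = (random.choice(ACTIONS) if random.random() < eps
             else max(ACTIONS, key=lambda a: Q[(s, a)]))
        s2, r, done = step(s, a)
        # TD update toward the bootstrapped target r + gamma * max_a' Q(s', a').
        best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

print({s: max(Q[(s, a)] for a in ACTIONS) for s in range(N)})
```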

We start with the (multi-armed) bandit problem, where there are two (or more) biased coins (or slot machines) and the bandit needs to pick the one that wins the most money. Trying all of them many times loses money, while trying only one may mean missing the good one. Playing the game of minimizing regret, we see three choices:

  • Explore-Then-Commit: Play each option a fixed number of times, then pick the best you found and play it forever. You do waste a lot of turns on a bad coin, but playing for long enough, or picking the right number of exploration rounds, can make it okay.
  • Epsilon Greedy: With a small probability epsilon, play a random option so as to keep exploring; with the remaining probability, play the best option seen so far. This balances exploration and exploitation throughout the run rather than in one fixed block, as explore-then-commit does.
  • Upper Confidence Bound: Where epsilon greedy makes a probabilistic choice, turning that choice into a deterministic one gives us UCB. We always exploit according to the UCB score, which by construction encourages exploring options that have barely been tried and exploiting once they have been explored enough. This also brings in the concept of online learning, learning while playing the game: if the option we picked starts losing more, we switch over to the next best option (a small sketch follows this list).
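
As a concrete illustration, here is a minimal UCB1 sketch on two made-up biased coins (the arm probabilities and the number of rounds are invented for illustration):

```python
import math, random

def ucb1(arm_means, rounds=10_000):
    """Minimal UCB1 sketch on made-up Bernoulli arms: always play the arm with
    the highest optimistic estimate (empirical mean plus a confidence bonus)."""
    n_arms = len(arm_means)
    counts = [0] * n_arms        # times each arm has been played
    totals = [0.0] * n_arms      # total reward per arm
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1          # play every arm once to initialize estimates
        else:
            # The bonus shrinks as an arm is played more, so under-explored
            # arms look attractive and exploration happens on its own.
            arm = max(range(n_arms), key=lambda a:
                      totals[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

# Two biased coins: UCB ends up pulling the better one far more often.
print(ucb1([0.4, 0.6]))
```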

Plugging UCB back into Monte Carlo Tree Search (MCTS), we now have something that learns as it plays and balances exploration and exploitation without blowing up the regret, which is what AlphaGo does in its most naive form! We are already in the realm of advanced algorithms!
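
To show how the same UCB score plugs into a search tree, here is a minimal node-selection sketch (the child statistics and the exploration constant are made up, and this is only the selection step of MCTS, not the full rollout and backup loop):

```python
import math

def uct_select(children, c=1.4):
    """Minimal UCB-in-a-tree selection sketch: pick the child maximizing
    average value plus an exploration bonus. Each child is a hypothetical
    (total_value, visit_count) pair."""
    parent_visits = sum(v for _, v in children)
    return max(range(len(children)), key=lambda i:
               children[i][0] / children[i][1]
               + c * math.sqrt(math.log(parent_visits) / children[i][1]))

# Toy example: three children with (total value, visit count) statistics.
print(uct_select([(5.0, 10), (3.0, 4), (1.0, 1)]))
```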

Now we step into constraint-solving problems, in both discrete and continuous spaces, framed as search problems. This includes backtracking search trees that start with a guess, propagation under the constraints (arc consistency), and, once we hit a dead end, backtracking. Then we look at propositional logic and SAT solving as an application of the backtracking algorithm, the various forms of representation, and the ways of going about solving it.
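
As a rough illustration of the backtracking idea applied to SAT, here is a bare-bones sketch (my own toy example, without the unit propagation and clever variable ordering that real solvers use): guess a variable, simplify the clauses under the partial assignment, and backtrack when a clause becomes empty.

```python
def backtrack_sat(clauses, assignment=None):
    """Minimal backtracking SAT sketch. Clauses are lists of signed ints,
    e.g. [1, -2] means (x1 or not x2)."""
    if assignment is None:
        assignment = {}
    # Simplify: drop satisfied clauses, remove falsified literals.
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue                               # clause already satisfied
        remaining = [l for l in clause if abs(l) not in assignment]
        if not remaining:
            return None                            # dead end: backtrack
        simplified.append(remaining)
    if not simplified:
        return assignment                          # all clauses satisfied
    var = abs(simplified[0][0])                    # pick an unassigned variable
    for value in (True, False):                    # guess, then backtrack if needed
        result = backtrack_sat(simplified, {**assignment, var: value})
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(backtrack_sat([[1, 2], [-1, 3], [-2, -3]]))
```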

Conclusion

Saying that this was one of the best courses I have taken at UCSD would be an understatement. I feel like I learnt something I would otherwise never have looked towards. The homework questions attached to each of the topics were one of the best uses of homework I have ever seen: they make you learn, because if you do not understand the material, you really cannot solve the problems. And ChatGPT will not come close to helping you on this one, which showed in the policy of the homeworks and the final being not just open book but open internet. So, that was one successful course completion in my books.