1 00:00:02,360 --> 00:00:04,215 Take it away, Nick. 2 00:00:04,215 --> 00:00:05,819 I'll just go ahead and let you go. 3 00:00:05,819 --> 00:00:07,139 And I'm going to set my view 4 00:00:07,139 --> 00:00:08,790 to follow the speaker 5 00:00:08,790 --> 00:00:10,290 so that if the recording is 6 00:00:10,290 --> 00:00:11,909 keyed by what I have mindset to, 7 00:00:11,909 --> 00:00:14,189 you'll be the primary person on the screen. 8 00:00:14,189 --> 00:00:15,300 Please introduce yourself. 9 00:00:15,300 --> 00:00:16,649 Give us a little bit of background as 10 00:00:16,649 --> 00:00:18,990 well. Sure. 11 00:00:18,990 --> 00:00:22,470 Hey guys. My name is Nick Embry. 12 00:00:22,470 --> 00:00:24,599 I'm a recent graduate of 13 00:00:24,599 --> 00:00:28,440 the MS program and Computer Science at PSU. 14 00:00:28,440 --> 00:00:31,679 I also took a number of classes and 15 00:00:31,679 --> 00:00:33,930 the systems science department and have 16 00:00:33,930 --> 00:00:35,279 a certificate and computer 17 00:00:35,279 --> 00:00:36,585 modeling and simulation. 18 00:00:36,585 --> 00:00:38,995 A bunch of classes with Wayne. 19 00:00:38,995 --> 00:00:41,839 And stimulation is sort 20 00:00:41,839 --> 00:00:43,039 of been interesting to 21 00:00:43,039 --> 00:00:45,949 me for a long time 22 00:00:45,949 --> 00:00:47,585 and in many different ways. 23 00:00:47,585 --> 00:00:49,640 Today I'm going to be talking 24 00:00:49,640 --> 00:00:51,890 about an algorithm called Monte Carlo tree 25 00:00:51,890 --> 00:00:55,115 search that's been used 26 00:00:55,115 --> 00:00:57,964 in game AI a lot over the past decade or so. 27 00:00:57,964 --> 00:01:03,455 In particular, with Go players. 28 00:01:03,455 --> 00:01:07,459 I learned about the algorithm. 29 00:01:07,459 --> 00:01:09,544 I've been working on 30 00:01:09,544 --> 00:01:12,230 a project with Bart Massey in 31 00:01:12,230 --> 00:01:15,619 the computer science department to create 32 00:01:15,619 --> 00:01:18,619 a game AI player for a game that I like 33 00:01:18,619 --> 00:01:22,339 called dominion card game. 34 00:01:22,339 --> 00:01:25,220 And it's an algorithm 35 00:01:25,220 --> 00:01:27,229 that I don t think is taught anywhere. 36 00:01:27,229 --> 00:01:28,999 At PSU. It sort of slips 37 00:01:28,999 --> 00:01:30,170 through the cracks of AI. 38 00:01:30,170 --> 00:01:31,700 It's certainly not a machine-learning. 39 00:01:31,700 --> 00:01:33,709 It's not even really 40 00:01:33,709 --> 00:01:34,759 in reinforcement learning, 41 00:01:34,759 --> 00:01:36,559 but it seems like such a, 42 00:01:36,559 --> 00:01:42,380 such a core algorithm to a lot of 43 00:01:42,380 --> 00:01:47,089 the breakthroughs that we've 44 00:01:47,089 --> 00:01:49,955 seen in game AI over the past decade. 45 00:01:49,955 --> 00:01:51,830 And I just thought this would be 46 00:01:51,830 --> 00:01:55,320 a good venue to share what I learned. 47 00:01:56,050 --> 00:02:00,575 So before talking about the algorithm itself, 48 00:02:00,575 --> 00:02:02,269 I just wanted to quickly go 49 00:02:02,269 --> 00:02:04,894 over a brief timeline 50 00:02:04,894 --> 00:02:07,549 of some events in 51 00:02:07,549 --> 00:02:11,399 game AI over the past couple of decades. 52 00:02:12,100 --> 00:02:15,784 So 1997, 53 00:02:15,784 --> 00:02:20,009 Deep Blue defeated Garry Kasparov in chess. 54 00:02:21,430 --> 00:02:24,319 Deep blue was using 55 00:02:24,319 --> 00:02:29,180 a traditional search faith based approach, 56 00:02:29,180 --> 00:02:31,264 combined with some special sauce 57 00:02:31,264 --> 00:02:33,290 that was in the form of 58 00:02:33,290 --> 00:02:35,809 what's called an evaluation function 59 00:02:35,809 --> 00:02:38,119 designed by expert level chess players. 60 00:02:38,119 --> 00:02:39,560 And we'll get a little bit more into 61 00:02:39,560 --> 00:02:42,240 what an evaluation function is. 62 00:02:42,820 --> 00:02:45,139 The next year, 63 00:02:45,139 --> 00:02:47,299 reinforcement learning-based vacuum 64 00:02:47,299 --> 00:02:48,350 and player called TD 65 00:02:48,350 --> 00:02:50,360 Gammon played a series against one of 66 00:02:50,360 --> 00:02:51,620 the top backgammon players at 67 00:02:51,620 --> 00:02:54,034 the time, Malcolm Davis. 68 00:02:54,034 --> 00:02:59,419 This is an important note 69 00:02:59,419 --> 00:03:01,370 because TD Gammon is 70 00:03:01,370 --> 00:03:02,870 an example of reinforcement learning, 71 00:03:02,870 --> 00:03:03,980 which is not what I'm 72 00:03:03,980 --> 00:03:05,180 going to be talking about today. 73 00:03:05,180 --> 00:03:07,549 But I think any discussion 74 00:03:07,549 --> 00:03:09,200 of game AI has to give it a shout out 75 00:03:09,200 --> 00:03:10,730 because it's really kind of at 76 00:03:10,730 --> 00:03:14,639 the frontier of game AI today. 77 00:03:16,270 --> 00:03:19,639 In 2002, 78 00:03:19,639 --> 00:03:22,489 team working in operations 79 00:03:22,489 --> 00:03:24,889 research Ching et al, 80 00:03:24,889 --> 00:03:27,710 published an algorithm, what they 81 00:03:27,710 --> 00:03:30,814 called adaptive multistage sampling. 82 00:03:30,814 --> 00:03:33,334 But it contains all 83 00:03:33,334 --> 00:03:38,134 of the hallmarks of Monte Carlo tree search. 84 00:03:38,134 --> 00:03:42,230 And really the paper that coined 85 00:03:42,230 --> 00:03:45,364 the term Monte Carlo tree search referred 86 00:03:45,364 --> 00:03:50,119 back to this paper Chang at all, 87 00:03:50,119 --> 00:03:51,740 and called it a Monte Carlo 88 00:03:51,740 --> 00:03:52,759 tree search algorithms. 89 00:03:52,759 --> 00:03:54,814 So this is really the originary algorithm. 90 00:03:54,814 --> 00:03:56,450 And I wanted to mention 91 00:03:56,450 --> 00:03:58,160 that because I think the first 92 00:03:58,160 --> 00:03:59,750 time I was in one 93 00:03:59,750 --> 00:04:01,835 of the systems science seminars, 94 00:04:01,835 --> 00:04:04,310 I heard Marty say that one of 95 00:04:04,310 --> 00:04:07,249 the legacies of cybernetic system science, 96 00:04:07,249 --> 00:04:09,769 operations research, et cetera, 97 00:04:09,769 --> 00:04:11,479 is that all the good ideas 98 00:04:11,479 --> 00:04:14,059 get stolen by computer science. 99 00:04:14,059 --> 00:04:17,074 So I just wanted to 100 00:04:17,074 --> 00:04:18,634 validate that that is 101 00:04:18,634 --> 00:04:20,884 absolutely true in this case, 102 00:04:20,884 --> 00:04:22,789 Monte Carlo tree search. 103 00:04:22,789 --> 00:04:25,954 The phrase was popularized in a paper 104 00:04:25,954 --> 00:04:27,259 four years later and it 105 00:04:27,259 --> 00:04:29,915 exploded after that in computer science. 106 00:04:29,915 --> 00:04:32,749 But it was, it 107 00:04:32,749 --> 00:04:38,249 was discovered devised in 2002. 108 00:04:39,040 --> 00:04:41,329 So in 2007 and 109 00:04:41,329 --> 00:04:43,069 checkers player called Chin up, 110 00:04:43,069 --> 00:04:46,010 developed by a team led by Jonathan Shaffer, 111 00:04:46,010 --> 00:04:47,749 was demonstrated to solve 112 00:04:47,749 --> 00:04:48,680 the game of checkers. 113 00:04:48,680 --> 00:04:50,420 And what that means is that they published 114 00:04:50,420 --> 00:04:53,705 a proof that their AI player 115 00:04:53,705 --> 00:04:55,145 can force a draw, 116 00:04:55,145 --> 00:04:56,929 at least against 117 00:04:56,929 --> 00:04:59,879 an opponent that plays optimally. 118 00:05:00,820 --> 00:05:02,960 And you'll notice that there's 119 00:05:02,960 --> 00:05:06,769 a huge gap in 1997 to 2007 120 00:05:06,769 --> 00:05:10,609 in-between a chess player 121 00:05:10,609 --> 00:05:13,355 beating a strong master level human player. 122 00:05:13,355 --> 00:05:15,890 And this, this proof. 123 00:05:15,890 --> 00:05:17,569 I think that speaks to 124 00:05:17,569 --> 00:05:22,640 the depth between those two contexts, 125 00:05:22,640 --> 00:05:23,810 being able to play and 126 00:05:23,810 --> 00:05:25,890 being able to exhaustively. 127 00:05:25,890 --> 00:05:27,760 Searches space. 128 00:05:27,760 --> 00:05:32,169 And this also speaks to why it took 20 years 129 00:05:32,169 --> 00:05:36,535 after a computer player 130 00:05:36,535 --> 00:05:41,275 beat a master level chess player to win. 131 00:05:41,275 --> 00:05:42,699 A computer player first 132 00:05:42,699 --> 00:05:45,325 beat a master level Go player. 133 00:05:45,325 --> 00:05:49,119 In 2016. An AI player developed by 134 00:05:49,119 --> 00:05:51,550 DeepMind called AlphaGo defeat 135 00:05:51,550 --> 00:05:54,470 at least CDL and the game of Go. 136 00:05:54,540 --> 00:05:57,910 Least acetal was the second best Go player 137 00:05:57,910 --> 00:05:59,094 in the world at the time. 138 00:05:59,094 --> 00:06:01,390 And the creation of 139 00:06:01,390 --> 00:06:03,205 a top-level computer Go player was 140 00:06:03,205 --> 00:06:05,859 really major achievement at that time. 141 00:06:05,859 --> 00:06:10,340 It was, it was 142 00:06:10,340 --> 00:06:12,305 definitely a major breakthrough. 143 00:06:12,305 --> 00:06:17,210 A lot of people had said that we were decades 144 00:06:17,210 --> 00:06:20,839 away from being a game 145 00:06:20,839 --> 00:06:23,489 that a computer could beat a human adult. 146 00:06:23,530 --> 00:06:29,669 So Monte Carlo tree search in particular was, 147 00:06:29,669 --> 00:06:36,169 was behind the engine of the AlphaGo player. 148 00:06:36,169 --> 00:06:38,480 So I'm going to 149 00:06:38,480 --> 00:06:42,394 continue with the algorithm now. 150 00:06:42,394 --> 00:06:44,569 But I do think it's 151 00:06:44,569 --> 00:06:47,659 important to keep in mind and 152 00:06:47,659 --> 00:06:52,869 wonder about why there's 153 00:06:52,869 --> 00:06:55,075 this 20 year gap between beating 154 00:06:55,075 --> 00:06:57,339 a master chess player 155 00:06:57,339 --> 00:06:59,079 and beating and master Go player. 156 00:06:59,079 --> 00:07:01,405 And also what role 157 00:07:01,405 --> 00:07:03,145 Monte Carlo tree search 158 00:07:03,145 --> 00:07:05,899 plays in that transition. 159 00:07:09,720 --> 00:07:16,030 So the algorithm is essentially 160 00:07:16,030 --> 00:07:17,739 a combination of 161 00:07:17,739 --> 00:07:20,874 more traditional search space algorithms 162 00:07:20,874 --> 00:07:24,249 and a Monte Carlo algorithm. 163 00:07:24,249 --> 00:07:25,960 So I'm going to talk about each of 164 00:07:25,960 --> 00:07:27,790 those separately and then 165 00:07:27,790 --> 00:07:29,770 combine them at the end for 166 00:07:29,770 --> 00:07:31,630 the Monte Carlo tree search algorithms. 167 00:07:31,630 --> 00:07:34,549 So I'm going to start with search algorithms 168 00:07:34,549 --> 00:07:35,929 and then move on to 169 00:07:35,929 --> 00:07:38,239 explaining a simple Monte Carlo algorithm. 170 00:07:38,239 --> 00:07:40,099 And then talk about 171 00:07:40,099 --> 00:07:41,390 the combination and 172 00:07:41,390 --> 00:07:44,279 the Monte Carlo tree search algorithms. 173 00:07:49,810 --> 00:07:51,394 Cool. 174 00:07:51,394 --> 00:07:55,490 So in traditional search algorithms, 175 00:07:55,490 --> 00:07:59,119 you essentially search from 176 00:07:59,119 --> 00:08:01,580 whatever point you find yourself at a game, 177 00:08:01,580 --> 00:08:03,380 you search through all of 178 00:08:03,380 --> 00:08:05,389 the possible moves that you could make. 179 00:08:05,389 --> 00:08:07,159 And then all of the possible moves 180 00:08:07,159 --> 00:08:08,524 that your opponent could make. 181 00:08:08,524 --> 00:08:10,399 And then all of the possible moves that you 182 00:08:10,399 --> 00:08:13,169 could make after your opponent played. 183 00:08:14,430 --> 00:08:18,355 For a really simple game like Tic-Tac-Toe. 184 00:08:18,355 --> 00:08:21,699 We could do that exhaustively. 185 00:08:21,699 --> 00:08:24,220 We could literally look 186 00:08:24,220 --> 00:08:26,350 at all of the different states. 187 00:08:26,350 --> 00:08:30,669 There are about 1,000 states in tic-tac-toe. 188 00:08:30,669 --> 00:08:32,469 So it's definitely feasible 189 00:08:32,469 --> 00:08:35,199 on even a slow computer. 190 00:08:35,199 --> 00:08:38,200 But this approach of 191 00:08:38,200 --> 00:08:41,199 exhaustively searching and trying to find 192 00:08:41,199 --> 00:08:44,259 a route through the game state 193 00:08:44,259 --> 00:08:46,194 that will guarantee a win 194 00:08:46,194 --> 00:08:48,595 or at least urine TO draw. 195 00:08:48,595 --> 00:08:53,530 Kinda breaks down when you hit larger games. 196 00:08:53,530 --> 00:08:55,014 So like I said, 197 00:08:55,014 --> 00:08:56,739 Go has about 1,000 198 00:08:56,739 --> 00:08:59,209 legal states or Tic-Tac-Toe, 199 00:08:59,209 --> 00:09:01,220 It's about 1,000 illegal states. 200 00:09:01,220 --> 00:09:06,635 Checkers has about 100 billion legal states. 201 00:09:06,635 --> 00:09:08,509 Then. 202 00:09:08,509 --> 00:09:12,620 Chess has about 20 greater size 203 00:09:12,620 --> 00:09:15,050 of magnitude of checkers. 204 00:09:15,050 --> 00:09:18,784 Checkers has ten to the 20 legal states. 205 00:09:18,784 --> 00:09:22,534 Chess has ten to the 40 legal, legal states. 206 00:09:22,534 --> 00:09:24,110 And go, I believe 207 00:09:24,110 --> 00:09:26,285 is somewhere around ten to the eight. 208 00:09:26,285 --> 00:09:30,769 So it's just impossible and will always 209 00:09:30,769 --> 00:09:32,030 be impossible for some of 210 00:09:32,030 --> 00:09:35,315 these games to search the entire state space. 211 00:09:35,315 --> 00:09:40,259 Ten to the 84 go. I believe that's true. 212 00:09:42,730 --> 00:09:48,750 So what can we do instead? 213 00:09:50,080 --> 00:09:53,180 So here's a tree 214 00:09:53,180 --> 00:09:54,679 with a branching factor of three. 215 00:09:54,679 --> 00:09:56,299 Branching factor means that 216 00:09:56,299 --> 00:09:59,059 each node in the tree has three children. 217 00:09:59,059 --> 00:10:00,620 In a real game, this might be 218 00:10:00,620 --> 00:10:01,790 variable and sometimes you 219 00:10:01,790 --> 00:10:02,569 might have more mood, 220 00:10:02,569 --> 00:10:03,785 sometimes you might have less. 221 00:10:03,785 --> 00:10:08,419 But let's imagine that 222 00:10:08,419 --> 00:10:09,529 we have a tree with 223 00:10:09,529 --> 00:10:11,794 a branching factor of instead of three, 224 00:10:11,794 --> 00:10:13,490 we have a tree with a branching factor 225 00:10:13,490 --> 00:10:16,049 on average around 100. 226 00:10:16,180 --> 00:10:20,525 So each node has 100 children. 227 00:10:20,525 --> 00:10:24,575 Now, if we have limited resources, 228 00:10:24,575 --> 00:10:26,810 our searches quickly going to break down. 229 00:10:26,810 --> 00:10:28,025 We're just going to run out of time. 230 00:10:28,025 --> 00:10:29,675 We're gonna get stalled on 231 00:10:29,675 --> 00:10:32,525 trying to make a single move. 232 00:10:32,525 --> 00:10:35,255 So we need a way to 233 00:10:35,255 --> 00:10:38,269 focus our search on only the parts of 234 00:10:38,269 --> 00:10:40,939 the tree that we think are most promising in 235 00:10:40,939 --> 00:10:44,730 order to find the best move. 236 00:10:44,800 --> 00:10:49,384 So how and with our search, 237 00:10:49,384 --> 00:10:51,499 can we prioritize certain parts of 238 00:10:51,499 --> 00:10:53,509 the tree and prune off other parts of 239 00:10:53,509 --> 00:10:55,310 the tree so that we get to 240 00:10:55,310 --> 00:10:58,680 a search space that's tractable. 241 00:10:59,380 --> 00:11:02,190 Um, so the, the 242 00:11:02,190 --> 00:11:04,460 most popular way to do this is. 243 00:11:04,460 --> 00:11:06,875 Called a heuristic function. 244 00:11:06,875 --> 00:11:09,049 And you use 245 00:11:09,049 --> 00:11:12,305 a heuristic function to basically score 246 00:11:12,305 --> 00:11:17,989 different states using, using this function. 247 00:11:17,989 --> 00:11:19,700 And then you search 248 00:11:19,700 --> 00:11:21,650 states that look better first. 249 00:11:21,650 --> 00:11:23,209 And this is called best-first 250 00:11:23,209 --> 00:11:25,055 search because you tried to find 251 00:11:25,055 --> 00:11:26,779 what looks like from 252 00:11:26,779 --> 00:11:29,674 your current point of view, your best move. 253 00:11:29,674 --> 00:11:32,809 And then you look downstream of 254 00:11:32,809 --> 00:11:34,445 that move to make sure 255 00:11:34,445 --> 00:11:36,515 that it is in fact your best move. 256 00:11:36,515 --> 00:11:39,650 And there might be a little bit of trade-off. 257 00:11:39,650 --> 00:11:40,699 You might search your 258 00:11:40,699 --> 00:11:42,470 five best moves if you have a lot of 259 00:11:42,470 --> 00:11:43,850 moves to see if any of 260 00:11:43,850 --> 00:11:46,429 them see which one is in fact the best. 261 00:11:46,429 --> 00:11:49,234 But you use a heuristic function 262 00:11:49,234 --> 00:11:52,099 to sort of q un, 263 00:11:52,099 --> 00:11:56,099 to which moves might be the best. 264 00:11:56,500 --> 00:12:00,380 And these heuristic functions are 265 00:12:00,380 --> 00:12:04,054 often tuned by experts 266 00:12:04,054 --> 00:12:07,190 and whatever field the searches being 267 00:12:07,190 --> 00:12:11,284 done in four chest, It's pretty easy. 268 00:12:11,284 --> 00:12:14,239 Typically, one big part of 269 00:12:14,239 --> 00:12:16,655 the heuristic function is 270 00:12:16,655 --> 00:12:18,455 based on pieces captured. 271 00:12:18,455 --> 00:12:20,420 So just traditionally, 272 00:12:20,420 --> 00:12:22,504 ponds are worth one point. 273 00:12:22,504 --> 00:12:26,749 Bishops and nights or three points. 274 00:12:26,749 --> 00:12:28,894 Rocks are worth five points, etc. 275 00:12:28,894 --> 00:12:30,619 So if you have a move 276 00:12:30,619 --> 00:12:33,665 where you have that piece captured, 277 00:12:33,665 --> 00:12:38,895 then say it's a rook than that. 278 00:12:38,895 --> 00:12:43,320 That move, we'll have a plus five. 279 00:12:44,500 --> 00:12:49,070 So the way that 280 00:12:49,070 --> 00:12:50,930 we often handle this for 281 00:12:50,930 --> 00:12:53,164 nodes that represent opponent's moves. 282 00:12:53,164 --> 00:12:54,244 In case you were wondering, 283 00:12:54,244 --> 00:12:57,034 is that we use 284 00:12:57,034 --> 00:12:58,760 an algorithm called neck Emacs. 285 00:12:58,760 --> 00:13:01,549 And that just means essentially that we 286 00:13:01,549 --> 00:13:03,170 prefer nodes that are 287 00:13:03,170 --> 00:13:05,209 strongest for both players. 288 00:13:05,209 --> 00:13:07,640 So when it's your move, you look, 289 00:13:07,640 --> 00:13:09,470 you focus your search on 290 00:13:09,470 --> 00:13:10,849 nodes that have the highest 291 00:13:10,849 --> 00:13:12,485 heuristic value for you. 292 00:13:12,485 --> 00:13:14,059 And when it's your opponents 293 00:13:14,059 --> 00:13:15,259 move on the tree, 294 00:13:15,259 --> 00:13:17,329 you look at the nodes that have 295 00:13:17,329 --> 00:13:19,534 the highest heuristic, the value for them. 296 00:13:19,534 --> 00:13:21,050 So you work on 297 00:13:21,050 --> 00:13:23,060 the assumption that your opponent 298 00:13:23,060 --> 00:13:24,979 will also be playing good moods and 299 00:13:24,979 --> 00:13:27,870 you try to find the best move, given that. 300 00:13:28,690 --> 00:13:32,840 That is a brief rundown of 301 00:13:32,840 --> 00:13:34,940 sort of traditional search methods and 302 00:13:34,940 --> 00:13:37,235 in particular, best-first search. 303 00:13:37,235 --> 00:13:39,334 I want to continue to give 304 00:13:39,334 --> 00:13:41,120 just a brief explanation 305 00:13:41,120 --> 00:13:45,870 of Monte Carlo of corrections. 306 00:13:46,120 --> 00:13:49,445 So I guess just to 307 00:13:49,445 --> 00:13:53,450 motivate Monte Carlo methods, 308 00:13:53,450 --> 00:13:54,874 now that we've looked 309 00:13:54,874 --> 00:13:57,755 at traditional search methods, 310 00:13:57,755 --> 00:14:01,085 why would we want to use something different? 311 00:14:01,085 --> 00:14:03,229 Why not just continue to 312 00:14:03,229 --> 00:14:04,699 use the traditional search 313 00:14:04,699 --> 00:14:06,650 with heuristic function? 314 00:14:06,650 --> 00:14:11,210 These approaches are not at all irrelevant. 315 00:14:11,210 --> 00:14:13,235 They can be tuned up with a neural network 316 00:14:13,235 --> 00:14:17,549 and they're very effective in many cases. 317 00:14:18,790 --> 00:14:22,790 So the problem in 318 00:14:22,790 --> 00:14:24,980 particular with Go and a lot of 319 00:14:24,980 --> 00:14:27,559 other games is that can be very 320 00:14:27,559 --> 00:14:30,244 difficult to define good features. 321 00:14:30,244 --> 00:14:33,080 We don't have this sort of 322 00:14:33,080 --> 00:14:35,209 simple system of scoring 323 00:14:35,209 --> 00:14:36,439 that we have in chess. 324 00:14:36,439 --> 00:14:38,090 And in fact, her go, it's very, 325 00:14:38,090 --> 00:14:40,024 very difficult to find 326 00:14:40,024 --> 00:14:42,589 a simple function that will 327 00:14:42,589 --> 00:14:44,345 heuristically value 328 00:14:44,345 --> 00:14:46,890 certain positions on the board. 329 00:14:48,070 --> 00:14:50,389 And so if it's hard to 330 00:14:50,389 --> 00:14:52,084 build the quality heuristic function, 331 00:14:52,084 --> 00:14:53,120 then basically 332 00:14:53,120 --> 00:14:55,115 our best-first search is out the window. 333 00:14:55,115 --> 00:14:56,479 Or you're going to have a very 334 00:14:56,479 --> 00:14:58,220 weak player if you try to 335 00:14:58,220 --> 00:15:02,460 use very rough heuristics. 336 00:15:02,950 --> 00:15:08,450 So Monte Carlo methods for gamey, 337 00:15:08,450 --> 00:15:13,250 I basically get around this by just 338 00:15:13,250 --> 00:15:16,190 letting us try out different actions and 339 00:15:16,190 --> 00:15:17,659 seeing which one seems 340 00:15:17,659 --> 00:15:19,920 best based on trying them. 341 00:15:23,830 --> 00:15:27,949 So imagine that the roots face 342 00:15:27,949 --> 00:15:33,425 here is where you currently are in the game. 343 00:15:33,425 --> 00:15:34,669 And each of these arrows 344 00:15:34,669 --> 00:15:35,824 are pointing to a node 345 00:15:35,824 --> 00:15:38,584 that you might pick as it moves. 346 00:15:38,584 --> 00:15:40,490 So the one on 347 00:15:40,490 --> 00:15:43,115 the left could be moving upon forward. 348 00:15:43,115 --> 00:15:44,330 The one in the middle could 349 00:15:44,330 --> 00:15:45,439 be moving your night. 350 00:15:45,439 --> 00:15:47,240 And the one on 351 00:15:47,240 --> 00:15:49,620 the right could be moving your bishop. 352 00:15:51,790 --> 00:15:56,089 Monte Carlo approach. You just try each one. 353 00:15:56,089 --> 00:15:58,910 You adjust the state of 354 00:15:58,910 --> 00:16:01,594 the board given what that move would be. 355 00:16:01,594 --> 00:16:04,670 So you move your arm forward and then you 356 00:16:04,670 --> 00:16:05,840 play out the rest of 357 00:16:05,840 --> 00:16:08,359 the game from that point. 358 00:16:08,359 --> 00:16:10,535 So you simulate the rest of the game. 359 00:16:10,535 --> 00:16:13,115 And in the most simple approach, 360 00:16:13,115 --> 00:16:16,069 you just have both players play completely 361 00:16:16,069 --> 00:16:20,090 randomly for the entire rest of the game. 362 00:16:20,090 --> 00:16:23,915 You might think about that as sampling, 363 00:16:23,915 --> 00:16:26,119 sampling the state-space of 364 00:16:26,119 --> 00:16:28,325 that for that move. 365 00:16:28,325 --> 00:16:30,170 So you make a move and 366 00:16:30,170 --> 00:16:31,894 then you take a random, 367 00:16:31,894 --> 00:16:32,810 a completely random 368 00:16:32,810 --> 00:16:33,965 sample of the state-space. 369 00:16:33,965 --> 00:16:35,285 You just have both players. 370 00:16:35,285 --> 00:16:36,935 You can do a random walk 371 00:16:36,935 --> 00:16:38,930 through the rest of the state space. 372 00:16:38,930 --> 00:16:41,750 Both players are playing completely randomly. 373 00:16:41,750 --> 00:16:46,009 And if you do that over and over again, 374 00:16:46,009 --> 00:16:50,554 you get and you keep track 375 00:16:50,554 --> 00:16:52,460 of how each move 376 00:16:52,460 --> 00:16:55,715 performs based on these simulations. 377 00:16:55,715 --> 00:16:58,670 Then you have a sense 378 00:16:58,670 --> 00:17:00,785 of all other things being equal, 379 00:17:00,785 --> 00:17:03,689 which move is the best? 380 00:17:03,750 --> 00:17:08,079 Now, this stimulation of 381 00:17:08,079 --> 00:17:11,155 the rest of the game is called a rollout. 382 00:17:11,155 --> 00:17:13,870 And each time the result 383 00:17:13,870 --> 00:17:15,790 of the simulated game is computed, 384 00:17:15,790 --> 00:17:17,709 the node that represents the action 385 00:17:17,709 --> 00:17:20,875 taken before the rollout is updated. 386 00:17:20,875 --> 00:17:26,019 So this is each of 387 00:17:26,019 --> 00:17:27,850 our actions has been 388 00:17:27,850 --> 00:17:31,944 simulated ones in this picture. 389 00:17:31,944 --> 00:17:33,820 For the one on 390 00:17:33,820 --> 00:17:35,530 the left and the one in the middle, 391 00:17:35,530 --> 00:17:37,840 those are both games lost. 392 00:17:37,840 --> 00:17:39,219 And for the one on 393 00:17:39,219 --> 00:17:41,660 the right, that's a game one. 394 00:17:42,300 --> 00:17:45,220 So we took, we moved 395 00:17:45,220 --> 00:17:49,055 our pawn randomly played out the game lost. 396 00:17:49,055 --> 00:17:51,934 We moved our night. 397 00:17:51,934 --> 00:17:54,260 We'd randomly played out the game we lost. 398 00:17:54,260 --> 00:17:55,730 We moved our Bishop. 399 00:17:55,730 --> 00:17:58,979 We randomly played out the game we want. 400 00:17:59,080 --> 00:18:01,100 Okay, So we have 401 00:18:01,100 --> 00:18:03,409 a single sample of each, each action, 402 00:18:03,409 --> 00:18:06,110 and that's certainly not enough to give us 403 00:18:06,110 --> 00:18:07,759 competence that either the bishop 404 00:18:07,759 --> 00:18:10,055 is actually the correct move. 405 00:18:10,055 --> 00:18:15,304 So how do we decide how to sample 406 00:18:15,304 --> 00:18:18,919 these actions and how often to sample 407 00:18:18,919 --> 00:18:23,219 these actions before actually making a move. 408 00:18:23,650 --> 00:18:26,179 So the very simplest way 409 00:18:26,179 --> 00:18:27,349 would just be to continue 410 00:18:27,349 --> 00:18:28,954 performing rollouts on 411 00:18:28,954 --> 00:18:30,365 each of these uniformly. 412 00:18:30,365 --> 00:18:34,100 So let's say we have 30 rollouts to spend. 413 00:18:34,100 --> 00:18:35,690 We have enough time and 414 00:18:35,690 --> 00:18:38,434 compute to do 30 rollouts per move. 415 00:18:38,434 --> 00:18:40,310 Well, if we have three moves, 416 00:18:40,310 --> 00:18:41,495 then we just spent 417 00:18:41,495 --> 00:18:43,055 ten rollouts on each of them. 418 00:18:43,055 --> 00:18:45,814 So we'll simulate, each will take 419 00:18:45,814 --> 00:18:47,209 each move ten times and we'll 420 00:18:47,209 --> 00:18:49,265 simulate ten games for each month. 421 00:18:49,265 --> 00:18:50,750 And then we'll see which one 422 00:18:50,750 --> 00:18:53,100 looks the best after we do that. 423 00:18:56,050 --> 00:18:58,760 So there are some issues 424 00:18:58,760 --> 00:19:00,964 with this method though, 425 00:19:00,964 --> 00:19:05,009 which would be simple uniform sampling. 426 00:19:05,140 --> 00:19:10,879 So e.g. what if two of our three actions 427 00:19:10,879 --> 00:19:13,625 in the previous slide results in 428 00:19:13,625 --> 00:19:16,910 a 75 ish percent win rate. 429 00:19:16,910 --> 00:19:19,489 While it quickly becomes clear that 430 00:19:19,489 --> 00:19:22,609 one action results in a 0% win rate, 431 00:19:22,609 --> 00:19:25,835 we always lose when we take one action. 432 00:19:25,835 --> 00:19:29,134 Now, we would prefer 433 00:19:29,134 --> 00:19:32,570 not to keep sampling that action that is 434 00:19:32,570 --> 00:19:34,640 very clear is a bad action and we 435 00:19:34,640 --> 00:19:37,130 prefer to instead stopped 436 00:19:37,130 --> 00:19:40,835 spending our rollouts on that action and used 437 00:19:40,835 --> 00:19:44,075 the extra rollouts to determine which 438 00:19:44,075 --> 00:19:47,944 of the two good moves is actually the best. 439 00:19:47,944 --> 00:19:50,210 So between these two moves 440 00:19:50,210 --> 00:19:52,934 to where we're winning about 75%. 441 00:19:52,934 --> 00:19:55,555 Let's see if one of them is actually 442 00:19:55,555 --> 00:19:58,330 81 of them is actually 70%. 443 00:19:58,330 --> 00:20:01,100 If we spend five more rollouts. 444 00:20:03,750 --> 00:20:06,339 One way to achieve 445 00:20:06,339 --> 00:20:09,220 that and avoid using resources on 446 00:20:09,220 --> 00:20:11,874 actions that are clearly inferior is 447 00:20:11,874 --> 00:20:15,790 to use the epsilon greedy sampling algorithm. 448 00:20:15,790 --> 00:20:20,980 And epsilon greedy sampling algorithm is 449 00:20:20,980 --> 00:20:25,030 basically you take your sample 450 00:20:25,030 --> 00:20:27,009 the action with the highest win rate 451 00:20:27,009 --> 00:20:28,735 and most of the time. 452 00:20:28,735 --> 00:20:30,669 So whatever action looks 453 00:20:30,669 --> 00:20:32,514 best, you sample it again. 454 00:20:32,514 --> 00:20:37,089 But some proportion of 455 00:20:37,089 --> 00:20:40,010 the time you sample something random. 456 00:20:40,050 --> 00:20:43,554 And adding that randomness, 457 00:20:43,554 --> 00:20:44,619 which is in the form of 458 00:20:44,619 --> 00:20:46,059 a parameter called epsilon, 459 00:20:46,059 --> 00:20:47,559 which is why it's called epsilon. 460 00:20:47,559 --> 00:20:50,769 Greedy sampling allows you to 461 00:20:50,769 --> 00:20:52,600 make sure that you 462 00:20:52,600 --> 00:20:54,445 didn't get it wrong the first time. 463 00:20:54,445 --> 00:20:58,810 So say you try one action, it looks good. 464 00:20:58,810 --> 00:21:00,920 You keep doing it. 465 00:21:00,960 --> 00:21:04,150 But actually it's not 466 00:21:04,150 --> 00:21:05,604 quite as good as the others. 467 00:21:05,604 --> 00:21:08,245 Epsilon greedy gives you a way to focus 468 00:21:08,245 --> 00:21:11,484 your resources on the best moves, 469 00:21:11,484 --> 00:21:14,560 but also make sure 470 00:21:14,560 --> 00:21:17,210 that they are in fact the best moves. 471 00:21:19,290 --> 00:21:23,154 That's better than just uniform sampling. 472 00:21:23,154 --> 00:21:26,500 But if the number of samples that we have 473 00:21:26,500 --> 00:21:30,145 time to perform is limited, 474 00:21:30,145 --> 00:21:31,749 this is not the most 475 00:21:31,749 --> 00:21:33,790 efficient method of sampling. 476 00:21:33,790 --> 00:21:36,039 So if for some reason we're 477 00:21:36,039 --> 00:21:39,849 unlucky with the early samples and they 478 00:21:39,849 --> 00:21:41,979 are unrepresentative of how good 479 00:21:41,979 --> 00:21:44,800 the moves actually our than it may take 480 00:21:44,800 --> 00:21:48,459 awhile before the occasional random actions 481 00:21:48,459 --> 00:21:51,280 defined by r epsilon parameter, correct. 482 00:21:51,280 --> 00:21:54,580 Our value estimation and we settle 483 00:21:54,580 --> 00:21:56,350 upon what's actually 484 00:21:56,350 --> 00:21:58,940 appears to be the best move. 485 00:22:00,360 --> 00:22:04,315 One attractive approach to 486 00:22:04,315 --> 00:22:06,880 these problems are what 487 00:22:06,880 --> 00:22:10,519 are called upper confidence bound algorithms. 488 00:22:13,590 --> 00:22:20,050 And upper confidence bound algorithms operate 489 00:22:20,050 --> 00:22:23,319 on the upper confidence 490 00:22:23,319 --> 00:22:27,835 bound for how good a given action is. 491 00:22:27,835 --> 00:22:29,589 So let me explain 492 00:22:29,589 --> 00:22:30,744 that maybe a little bit more. 493 00:22:30,744 --> 00:22:32,620 Imagine a, B, C, 494 00:22:32,620 --> 00:22:33,849 and D are all 495 00:22:33,849 --> 00:22:35,410 different actions that can be 496 00:22:35,410 --> 00:22:37,704 taken from a given point. 497 00:22:37,704 --> 00:22:40,420 And Q, a and Q if P 498 00:22:40,420 --> 00:22:41,769 are the number of 499 00:22:41,769 --> 00:22:43,785 wins that we've gotten so far. 500 00:22:43,785 --> 00:22:45,860 So here we see 501 00:22:45,860 --> 00:22:50,510 that Q if P has the highest value estimation, 502 00:22:50,510 --> 00:22:54,575 but q of a has a larger competence interval. 503 00:22:54,575 --> 00:22:56,540 So Q of a is then sample plus, 504 00:22:56,540 --> 00:23:00,155 let's say we're less sure 505 00:23:00,155 --> 00:23:03,605 about what the actual value 506 00:23:03,605 --> 00:23:05,330 of that action is, 507 00:23:05,330 --> 00:23:07,055 how good it actually is. 508 00:23:07,055 --> 00:23:09,020 So it has a larger confidence bound, 509 00:23:09,020 --> 00:23:10,835 whereas q of b, we've 510 00:23:10,835 --> 00:23:12,605 sampled that a number of times already, 511 00:23:12,605 --> 00:23:13,909 we're pretty confident that 512 00:23:13,909 --> 00:23:19,084 its value is within a certain amount. 513 00:23:19,084 --> 00:23:23,555 So UCB algorithms, 514 00:23:23,555 --> 00:23:25,475 upper confidence bound algorithms, 515 00:23:25,475 --> 00:23:29,420 operate on the upper bound of the action. 516 00:23:29,420 --> 00:23:31,129 So in this case, 517 00:23:31,129 --> 00:23:32,810 the next action to be 518 00:23:32,810 --> 00:23:35,854 sampled would be the action. 519 00:23:35,854 --> 00:23:37,969 And we would continue to 520 00:23:37,969 --> 00:23:40,624 sample the action until 521 00:23:40,624 --> 00:23:46,970 either its value went down 522 00:23:46,970 --> 00:23:49,510 far enough that it's 523 00:23:49,510 --> 00:23:52,970 competence found was no longer the highest, 524 00:23:52,970 --> 00:23:57,575 or its competence bound shrink enough 525 00:23:57,575 --> 00:23:59,630 that another action had 526 00:23:59,630 --> 00:24:03,090 a higher competence interval. 527 00:24:03,520 --> 00:24:06,244 So again, 528 00:24:06,244 --> 00:24:09,139 even though B has the highest score, 529 00:24:09,139 --> 00:24:13,790 because we've sampled a so many fewer times, 530 00:24:13,790 --> 00:24:16,654 it has the highest upper confidence bound. 531 00:24:16,654 --> 00:24:18,769 So it could be the best. 532 00:24:18,769 --> 00:24:20,840 And we will choose it 533 00:24:20,840 --> 00:24:23,945 for our next simulation. 534 00:24:23,945 --> 00:24:27,544 So the point of using 535 00:24:27,544 --> 00:24:28,220 upper confidence 536 00:24:28,220 --> 00:24:30,589 bound algorithms is that they 537 00:24:30,589 --> 00:24:32,960 provide a nice compromise between 538 00:24:32,960 --> 00:24:36,815 sampling nodes that are promising, 539 00:24:36,815 --> 00:24:39,140 knows that look like they're good. 540 00:24:39,140 --> 00:24:43,084 And it continuing to explore those nodes and 541 00:24:43,084 --> 00:24:45,725 sampling nodes or actions 542 00:24:45,725 --> 00:24:48,244 that we haven't sampled before much. 543 00:24:48,244 --> 00:24:50,089 So if there's a move 544 00:24:50,089 --> 00:24:51,769 that we haven't looked at much, 545 00:24:51,769 --> 00:24:54,920 this algorithm will catch that move and make 546 00:24:54,920 --> 00:24:58,549 sure that we sample at enough so that we're, 547 00:24:58,549 --> 00:25:01,070 we're confident that it 548 00:25:01,070 --> 00:25:05,309 is either a good move for a bad move. 549 00:25:08,140 --> 00:25:10,519 I just wanted to include this. 550 00:25:10,519 --> 00:25:13,579 This is the classic formulation 551 00:25:13,579 --> 00:25:16,130 of upper confidence bound algorithms. 552 00:25:16,130 --> 00:25:18,419 It's called UCB one. 553 00:25:19,420 --> 00:25:23,209 A sub t is the action 554 00:25:23,209 --> 00:25:26,059 that we take at a given time in the game. 555 00:25:26,059 --> 00:25:28,099 So it's the, it's the first view. 556 00:25:28,099 --> 00:25:29,855 This would be a sub one. 557 00:25:29,855 --> 00:25:32,209 And that move could be moving 558 00:25:32,209 --> 00:25:33,229 upon or moving in 559 00:25:33,229 --> 00:25:35,910 recruiting the bishop or whatever. 560 00:25:36,160 --> 00:25:42,710 And the action that's chosen to be sampled 561 00:25:42,710 --> 00:25:49,369 is the one that maximizes the score so far. 562 00:25:49,369 --> 00:25:50,614 So the number of wins 563 00:25:50,614 --> 00:25:52,610 with that action so far. 564 00:25:52,610 --> 00:25:54,830 Plus a constant that 565 00:25:54,830 --> 00:25:56,449 can be set experimentally. 566 00:25:56,449 --> 00:25:58,429 People play around with this constant. 567 00:25:58,429 --> 00:26:00,410 Then we have this term that 568 00:26:00,410 --> 00:26:01,789 is basically 569 00:26:01,789 --> 00:26:03,845 defining the competence interval. 570 00:26:03,845 --> 00:26:06,244 It's in terms of 571 00:26:06,244 --> 00:26:10,129 how many times the node 572 00:26:10,129 --> 00:26:11,269 has been visited and 573 00:26:11,269 --> 00:26:13,260 how many times it's warm. 574 00:26:14,340 --> 00:26:19,914 So I'm just kinda hand-wavy. Passed the math. 575 00:26:19,914 --> 00:26:21,265 If you're interested. 576 00:26:21,265 --> 00:26:23,199 It's based on turn-off bounds. 577 00:26:23,199 --> 00:26:24,670 I don't fully understand it. 578 00:26:24,670 --> 00:26:27,054 I read the paper circle times. 579 00:26:27,054 --> 00:26:28,420 I'd be happy to send it to you. 580 00:26:28,420 --> 00:26:31,009 That happens to be your cup of tea. 581 00:26:32,670 --> 00:26:34,569 Okay, cool. 582 00:26:34,569 --> 00:26:37,045 So we've looked at 583 00:26:37,045 --> 00:26:40,059 simple Monte Carlo approach outlined above. 584 00:26:40,059 --> 00:26:43,165 We've looked at the simple search approach. 585 00:26:43,165 --> 00:26:46,254 Now, moving on to Monte Carlo tree search, 586 00:26:46,254 --> 00:26:49,930 the problem with the Monte Carlo approach 587 00:26:49,930 --> 00:26:51,339 outlined above is that it 588 00:26:51,339 --> 00:26:53,980 often fails in practice when it's applied 589 00:26:53,980 --> 00:26:56,719 to I'm combinatorial games, 590 00:26:56,719 --> 00:26:58,234 in particular other games. 591 00:26:58,234 --> 00:27:00,649 And for a fairly intuitive reason, 592 00:27:00,649 --> 00:27:03,050 especially large games. 593 00:27:03,050 --> 00:27:05,615 The approach above select 594 00:27:05,615 --> 00:27:08,660 the best action given enough time, 595 00:27:08,660 --> 00:27:10,129 given that the game will 596 00:27:10,129 --> 00:27:12,124 be played out randomly. 597 00:27:12,124 --> 00:27:14,599 So that is, given that 598 00:27:14,599 --> 00:27:15,680 the state-space did the game 599 00:27:15,680 --> 00:27:17,644 will be randomly sampled. 600 00:27:17,644 --> 00:27:19,789 But there are plenty of 601 00:27:19,789 --> 00:27:23,194 situations where this approach falls flat. 602 00:27:23,194 --> 00:27:26,419 Games are often played in 603 00:27:26,419 --> 00:27:28,460 a much smaller space than all of 604 00:27:28,460 --> 00:27:31,204 the legal moves would indicate. 605 00:27:31,204 --> 00:27:32,839 And if you sample 606 00:27:32,839 --> 00:27:34,130 all of those moves randomly, 607 00:27:34,130 --> 00:27:35,344 then you're going to end up with 608 00:27:35,344 --> 00:27:37,520 a skewed result. 609 00:27:37,520 --> 00:27:41,119 In addition, in many games, 610 00:27:41,119 --> 00:27:42,439 you have this idea of 611 00:27:42,439 --> 00:27:44,075 a tactical sequence of moves 612 00:27:44,075 --> 00:27:46,490 where no move would 613 00:27:46,490 --> 00:27:47,750 be really good without 614 00:27:47,750 --> 00:27:49,460 the other supporting it. 615 00:27:49,460 --> 00:27:53,059 So a simple Monte Carlo algorithm 616 00:27:53,059 --> 00:27:54,170 will usually 617 00:27:54,170 --> 00:27:56,780 fail to select a move that is very 618 00:27:56,780 --> 00:27:59,659 good when followed by one specific move. 619 00:27:59,659 --> 00:28:01,819 But if you don't make 620 00:28:01,819 --> 00:28:03,784 those two moves consecutively, 621 00:28:03,784 --> 00:28:06,059 the first mood is very bad. 622 00:28:08,380 --> 00:28:12,740 So as we saw with the state-space search, 623 00:28:12,740 --> 00:28:14,000 a good way of representing 624 00:28:14,000 --> 00:28:16,969 a series of nodes is a game tree. 625 00:28:16,969 --> 00:28:19,399 So you can think of 626 00:28:19,399 --> 00:28:21,289 Monte Carlo tree search as 627 00:28:21,289 --> 00:28:24,740 a algorithm that tries to 628 00:28:24,740 --> 00:28:25,790 combine the best 629 00:28:25,790 --> 00:28:27,364 of traditional search methods 630 00:28:27,364 --> 00:28:30,364 and Monte Carlo methods 631 00:28:30,364 --> 00:28:33,064 in a flexible way that still 632 00:28:33,064 --> 00:28:37,230 doesn't rely on heuristic function too much. 633 00:28:41,650 --> 00:28:46,609 So the Monte Carlo tree search algorithm 634 00:28:46,609 --> 00:28:49,144 can be divided up into four parts. 635 00:28:49,144 --> 00:28:51,905 Selection, Expansion, 636 00:28:51,905 --> 00:28:55,295 Simulation, and backpropagation. 637 00:28:55,295 --> 00:28:57,559 And I'm just gonna go over those four parts. 638 00:28:57,559 --> 00:29:00,000 It's a pretty simple algorithm. 639 00:29:01,870 --> 00:29:04,009 In the first case, 640 00:29:04,009 --> 00:29:05,825 it's basically just the same 641 00:29:05,825 --> 00:29:09,665 as the simple Monte Carlo approach. 642 00:29:09,665 --> 00:29:11,240 So you sample, you 643 00:29:11,240 --> 00:29:13,249 have these three actions available. 644 00:29:13,249 --> 00:29:14,510 You live your pond, you moved 645 00:29:14,510 --> 00:29:16,279 your roof, you move your bishop. 646 00:29:16,279 --> 00:29:17,870 You sample each of them, 647 00:29:17,870 --> 00:29:18,680 you play out of game 648 00:29:18,680 --> 00:29:20,760 randomly from each of them. 649 00:29:25,450 --> 00:29:29,210 After that initial sample. 650 00:29:29,210 --> 00:29:31,325 Just like before. 651 00:29:31,325 --> 00:29:33,199 The next node you picked 652 00:29:33,199 --> 00:29:35,270 her sampling is the one whose value is 653 00:29:35,270 --> 00:29:37,789 highest according to that UCP one equation 654 00:29:37,789 --> 00:29:39,859 that we saw a few slides ago. 655 00:29:39,859 --> 00:29:41,599 So I'm just going to hand wave through 656 00:29:41,599 --> 00:29:42,860 that and 657 00:29:42,860 --> 00:29:44,719 assume that the nodes likely to 658 00:29:44,719 --> 00:29:46,549 be sampled next is the rightmost ones, 659 00:29:46,549 --> 00:29:47,690 since they've all had 660 00:29:47,690 --> 00:29:49,805 the same number of samples taken. 661 00:29:49,805 --> 00:29:52,489 And the rightmost one is the best so far. 662 00:29:52,489 --> 00:29:53,929 So we would assume that it would 663 00:29:53,929 --> 00:29:56,700 have the highest upper competence. 664 00:29:58,030 --> 00:30:01,534 But rather than repeating 665 00:30:01,534 --> 00:30:02,929 the rollout from this node. 666 00:30:02,929 --> 00:30:04,340 So rather than simulating 667 00:30:04,340 --> 00:30:07,160 another game from exactly the same point, 668 00:30:07,160 --> 00:30:10,160 that action on the tree is then expanded 669 00:30:10,160 --> 00:30:11,570 to all possible actions 670 00:30:11,570 --> 00:30:13,565 that follow that action. 671 00:30:13,565 --> 00:30:16,084 So it's been stimulated once, 672 00:30:16,084 --> 00:30:17,299 after it's been simulated. 673 00:30:17,299 --> 00:30:21,530 Once you expand underneath that action, 674 00:30:21,530 --> 00:30:24,739 then you simulate each 675 00:30:24,739 --> 00:30:27,120 of those possible moves. 676 00:30:32,230 --> 00:30:36,630 So each of those are simulated. 677 00:30:39,100 --> 00:30:45,304 Importantly, the results are 678 00:30:45,304 --> 00:30:47,780 backpropagated to the parents. 679 00:30:47,780 --> 00:30:51,574 So you can see here the action on the left. 680 00:30:51,574 --> 00:30:53,915 We won. The action in the middle. 681 00:30:53,915 --> 00:30:57,659 We want the action on the right, we lost. 682 00:30:58,030 --> 00:31:01,835 But those statistics backpropagated 683 00:31:01,835 --> 00:31:04,114 up to their parents on the tree. 684 00:31:04,114 --> 00:31:06,830 So you can see the parent node 685 00:31:06,830 --> 00:31:10,115 originally had one when 686 00:31:10,115 --> 00:31:11,645 for its initial rollout. 687 00:31:11,645 --> 00:31:13,924 And now it has two more wins 688 00:31:13,924 --> 00:31:16,295 and one loss from its children. 689 00:31:16,295 --> 00:31:20,389 Now it has a score of three out of four. 690 00:31:20,389 --> 00:31:22,910 That score will be used in 691 00:31:22,910 --> 00:31:25,010 the next iteration to 692 00:31:25,010 --> 00:31:28,549 choose where on the tree to sample next. 693 00:31:28,549 --> 00:31:33,980 Again, using that idea of searching 694 00:31:33,980 --> 00:31:35,900 the most promising places 695 00:31:35,900 --> 00:31:37,459 on the tree and also 696 00:31:37,459 --> 00:31:40,234 searching the places that 697 00:31:40,234 --> 00:31:42,660 we haven't done much before. 698 00:31:43,120 --> 00:31:47,555 So that process continues on. 699 00:31:47,555 --> 00:31:49,819 Typically for some set period 700 00:31:49,819 --> 00:31:51,994 of time you have 3 s to make a move. 701 00:31:51,994 --> 00:31:54,560 Or for some set number of rollouts. 702 00:31:54,560 --> 00:31:56,599 Say you have 10,000 or 703 00:31:56,599 --> 00:31:59,240 1 million games that 704 00:31:59,240 --> 00:32:01,520 you simulate for each move. 705 00:32:01,520 --> 00:32:05,134 And then usually the 706 00:32:05,134 --> 00:32:06,799 direct child of the root. 707 00:32:06,799 --> 00:32:08,600 So the action that 708 00:32:08,600 --> 00:32:10,715 is available to take right now with 709 00:32:10,715 --> 00:32:12,590 the most overall visits or 710 00:32:12,590 --> 00:32:14,749 the highest score there usually the 711 00:32:14,749 --> 00:32:17,390 same same action is 712 00:32:17,390 --> 00:32:20,810 the one that is selected to actually take. 713 00:32:20,810 --> 00:32:25,025 Could I ask a quick question? Yeah. Go ahead. 714 00:32:25,025 --> 00:32:27,829 I'm wondering about the three out of four. 715 00:32:27,829 --> 00:32:30,694 I saw there was a zero the first time. 716 00:32:30,694 --> 00:32:33,049 Then there's two wins, 717 00:32:33,049 --> 00:32:35,409 wouldn't it be two out of four? 718 00:32:35,409 --> 00:32:38,089 There? Let's see. 719 00:32:38,089 --> 00:32:40,040 Isn't that first one? 720 00:32:40,040 --> 00:32:45,200 So I'm at the first result. 721 00:32:45,200 --> 00:32:46,865 There was a win. 722 00:32:46,865 --> 00:32:49,504 So the left one, 723 00:32:49,504 --> 00:32:51,019 we lost the center one, 724 00:32:51,019 --> 00:32:53,675 we lost the right one we want. 725 00:32:53,675 --> 00:32:55,804 So we do have three out of four. 726 00:32:55,804 --> 00:32:56,255 Okay. 727 00:32:56,255 --> 00:32:57,529 I just somehow I 728 00:32:57,529 --> 00:32:58,610 thought it was the other way around, 729 00:32:58,610 --> 00:33:00,560 but yeah, but I can see it clearly now. 730 00:33:00,560 --> 00:33:02,509 Thank you for it may have been may have been 731 00:33:02,509 --> 00:33:03,650 different on I had 732 00:33:03,650 --> 00:33:05,420 a very similar slide earlier. 733 00:33:05,420 --> 00:33:07,295 It's okay. Thank you. 734 00:33:07,295 --> 00:33:11,884 Yeah, sure. I'm cool. 735 00:33:11,884 --> 00:33:13,985 This is just wrapping up. 736 00:33:13,985 --> 00:33:17,450 This is all the steps at the same time. 737 00:33:17,450 --> 00:33:18,920 You can see that this is 738 00:33:18,920 --> 00:33:22,580 after, after six rollouts. 739 00:33:22,580 --> 00:33:24,289 So all the roll-outs are 740 00:33:24,289 --> 00:33:26,510 accounted for at the root node. 741 00:33:26,510 --> 00:33:29,059 And then underneath you can sort of 742 00:33:29,059 --> 00:33:32,179 see how the tree progresses. 743 00:33:32,179 --> 00:33:35,510 One thing that is kind of cool to watch if 744 00:33:35,510 --> 00:33:37,039 you're able to see 745 00:33:37,039 --> 00:33:38,525 this happening in real time, 746 00:33:38,525 --> 00:33:40,879 is that it will often be 747 00:33:40,879 --> 00:33:44,345 the case that one move. 748 00:33:44,345 --> 00:33:46,850 The algorithm starts exploring 749 00:33:46,850 --> 00:33:49,624 one particular move because it's the best. 750 00:33:49,624 --> 00:33:53,165 I'm from a myopic point of view. 751 00:33:53,165 --> 00:33:56,524 Then as the tree sort of circulates deeper, 752 00:33:56,524 --> 00:33:58,760 it will find a series 753 00:33:58,760 --> 00:34:00,544 of moves that is now the best. 754 00:34:00,544 --> 00:34:02,149 And you'll see the tree start 755 00:34:02,149 --> 00:34:04,880 to grow in that direction. 756 00:34:04,880 --> 00:34:08,839 Almost like a almost like an outdoor tree, 757 00:34:08,839 --> 00:34:11,309 like reaching towards the sunlight. 758 00:34:11,830 --> 00:34:13,925 Yeah. 759 00:34:13,925 --> 00:34:15,244 Cool. 760 00:34:15,244 --> 00:34:17,780 That is all I have prepared for today. 761 00:34:17,780 --> 00:34:19,640 I'm happy to answer any questions. 762 00:34:19,640 --> 00:34:21,440 Circle back to anything that was unclear. 763 00:34:21,440 --> 00:34:23,719 I'm happy to draw something, whatever. 764 00:34:23,719 --> 00:34:27,349 Thanks for letting me talk about this. 765 00:34:27,349 --> 00:34:29,255 It's been fun. 766 00:34:29,255 --> 00:34:31,939 Could you could you go ahead and just 767 00:34:31,939 --> 00:34:34,085 tell us about some of the big victories? 768 00:34:34,085 --> 00:34:35,419 You said there's a couple of 769 00:34:35,419 --> 00:34:40,100 classic results, milestones achieved. 770 00:34:40,100 --> 00:34:41,689 Sure. Could you tell us a little bit more 771 00:34:41,689 --> 00:34:44,330 about what this is 772 00:34:44,330 --> 00:34:45,919 opening up in terms of possibilities 773 00:34:45,919 --> 00:34:47,704 or however you would describe that. 774 00:34:47,704 --> 00:34:49,939 Yeah, I mean, so 775 00:34:49,939 --> 00:34:52,309 the Monte Carlo tree search algorithm 776 00:34:52,309 --> 00:34:54,199 from the get-go was 777 00:34:54,199 --> 00:34:56,390 an algorithm that was 778 00:34:56,390 --> 00:34:58,235 built to play Go because 779 00:34:58,235 --> 00:34:59,839 nothing could play go at 780 00:34:59,839 --> 00:35:02,780 a human level much less than master level. 781 00:35:02,780 --> 00:35:07,430 And people were looking for a method that 782 00:35:07,430 --> 00:35:09,755 would allow the Go board 783 00:35:09,755 --> 00:35:12,875 and go play to be tractable for a computer. 784 00:35:12,875 --> 00:35:22,430 And so this started in the late 20 2010s, 785 00:35:22,430 --> 00:35:25,009 the late, late 2000s. 786 00:35:25,009 --> 00:35:28,939 And then you have AlphaGo 787 00:35:28,939 --> 00:35:30,454 actually beat 788 00:35:30,454 --> 00:35:34,820 a master level player a few years later. 789 00:35:34,820 --> 00:35:37,370 And that, 790 00:35:37,370 --> 00:35:42,529 that player was consisted of two things, 791 00:35:42,529 --> 00:35:44,450 the Monte Carlo tree search algorithm 792 00:35:44,450 --> 00:35:45,769 and then a very strong 793 00:35:45,769 --> 00:35:47,929 neural network that helped 794 00:35:47,929 --> 00:35:49,444 to evaluate the nodes. 795 00:35:49,444 --> 00:35:51,379 So in some sense, 796 00:35:51,379 --> 00:35:55,189 the AlphaGo player was sort 797 00:35:55,189 --> 00:35:57,304 of using the flexibility 798 00:35:57,304 --> 00:35:59,405 of the Monte Carlo tree search algorithm, 799 00:35:59,405 --> 00:36:02,780 but also trying its best to add 800 00:36:02,780 --> 00:36:08,429 an evaluating heuristic function as well. 801 00:36:09,220 --> 00:36:12,769 In terms of just a question Is 802 00:36:12,769 --> 00:36:17,074 go a game where 803 00:36:17,074 --> 00:36:20,639 there's a time limit on each move. 804 00:36:21,430 --> 00:36:24,229 In professional play? 805 00:36:24,229 --> 00:36:26,029 I don't remember what being 806 00:36:26,029 --> 00:36:27,379 and it makes me wonder, 807 00:36:27,379 --> 00:36:29,059 does that mean the computer can take as 808 00:36:29,059 --> 00:36:30,829 long as it wanted and 809 00:36:30,829 --> 00:36:31,729 just not going to get 810 00:36:31,729 --> 00:36:33,020 tired like a human player 811 00:36:33,020 --> 00:36:36,470 is going to know. 812 00:36:36,470 --> 00:36:41,300 Certainly. People talk about 813 00:36:41,300 --> 00:36:43,369 these game AI players is if, 814 00:36:43,369 --> 00:36:44,989 you know, oh my god, 815 00:36:44,989 --> 00:36:47,615 a computer plays chess like now. 816 00:36:47,615 --> 00:36:49,415 It's just as smart as we are. 817 00:36:49,415 --> 00:36:51,260 A computer plays go 818 00:36:51,260 --> 00:36:52,609 Now it's just as smart as we are. 819 00:36:52,609 --> 00:36:55,430 But I think the critical thing to remember 820 00:36:55,430 --> 00:37:00,950 when you have these amazing results 821 00:37:00,950 --> 00:37:03,770 is that for instance, 822 00:37:03,770 --> 00:37:05,615 the AlphaGo player was running on 823 00:37:05,615 --> 00:37:07,699 2 million simulated games 824 00:37:07,699 --> 00:37:10,460 each time it made a single move. 825 00:37:10,460 --> 00:37:15,320 Um, and what's honestly 826 00:37:15,320 --> 00:37:17,869 to me really amazing is that a human 827 00:37:17,869 --> 00:37:19,670 can be competitive with this 828 00:37:19,670 --> 00:37:21,709 with 2 million simulated games 829 00:37:21,709 --> 00:37:27,350 each move without having the hardware. 830 00:37:27,350 --> 00:37:29,449 I think to answer 831 00:37:29,449 --> 00:37:30,170 his question about 832 00:37:30,170 --> 00:37:31,595 the time limit on go though, 833 00:37:31,595 --> 00:37:34,144 I think it's more of a social norm of 834 00:37:34,144 --> 00:37:37,295 U only expect your opponent and take so long. 835 00:37:37,295 --> 00:37:39,514 So you've got sort of a built-in kinda 836 00:37:39,514 --> 00:37:42,245 maybe 30 s or so to work with, you know. 837 00:37:42,245 --> 00:37:44,705 Yeah. I know that there are like 838 00:37:44,705 --> 00:37:46,249 I've seen stumper moves 839 00:37:46,249 --> 00:37:47,419 where people take a lot of time, 840 00:37:47,419 --> 00:37:48,319 but I'm not sure if there's 841 00:37:48,319 --> 00:37:49,790 a timer like in chess. 842 00:37:49,790 --> 00:37:51,275 Yeah. 843 00:37:51,275 --> 00:37:53,344 Yeah. 844 00:37:53,344 --> 00:37:56,165 So I have a question about 845 00:37:56,165 --> 00:37:57,769 the simulations that you 846 00:37:57,769 --> 00:37:59,104 play out after you make a move. 847 00:37:59,104 --> 00:38:01,025 You said that you then show you the game out 848 00:38:01,025 --> 00:38:03,439 randomly with a bunch of random loops. 849 00:38:03,439 --> 00:38:05,390 And I've been stuck 850 00:38:05,390 --> 00:38:07,160 on that ever since you mentioned that 851 00:38:07,160 --> 00:38:08,390 because I thought that just seemed like 852 00:38:08,390 --> 00:38:10,220 a really bad way 853 00:38:10,220 --> 00:38:12,380 to evaluate how well it worked, right? 854 00:38:12,380 --> 00:38:13,279 Because if you're just going 855 00:38:13,279 --> 00:38:14,209 to use random moves, 856 00:38:14,209 --> 00:38:15,515 like if you were playing chess, 857 00:38:15,515 --> 00:38:17,210 you'd want the simulation 858 00:38:17,210 --> 00:38:18,815 to actually try to win. 859 00:38:18,815 --> 00:38:21,740 But they're starting to realize that with Go, 860 00:38:21,740 --> 00:38:22,820 it is pretty different. 861 00:38:22,820 --> 00:38:24,079 It's really hard because 862 00:38:24,079 --> 00:38:25,459 of the heuristic problem, 863 00:38:25,459 --> 00:38:27,440 like you were saying, to actually 864 00:38:27,440 --> 00:38:29,915 evaluate any move as being good or not. 865 00:38:29,915 --> 00:38:32,255 So actually playing the game out randomly 866 00:38:32,255 --> 00:38:35,504 kinda make sense in a really bizarre way. 867 00:38:35,504 --> 00:38:37,089 Yeah. 868 00:38:37,089 --> 00:38:39,250 This is also strange to me. 869 00:38:39,250 --> 00:38:41,290 It has helped me to think of it as 870 00:38:41,290 --> 00:38:43,239 randomly sampling the search space, 871 00:38:43,239 --> 00:38:45,759 are randomly sampling the game state. 872 00:38:45,759 --> 00:38:47,140 So all other things being equal, 873 00:38:47,140 --> 00:38:49,510 how good is this? Is this space. 874 00:38:49,510 --> 00:38:52,629 Another thing that helps with thinking 875 00:38:52,629 --> 00:38:56,710 about the, in particular, 876 00:38:56,710 --> 00:38:58,509 like, why would you 877 00:38:58,509 --> 00:39:01,299 not have it play 878 00:39:01,299 --> 00:39:04,105 as well as it could during the simulations? 879 00:39:04,105 --> 00:39:06,175 And people do do this. 880 00:39:06,175 --> 00:39:08,559 I do this, have 881 00:39:08,559 --> 00:39:12,175 some light heuristics on the rollouts. 882 00:39:12,175 --> 00:39:15,024 For my, for my gamey iPlayer. 883 00:39:15,024 --> 00:39:17,819 The problem with doing that is that 884 00:39:17,819 --> 00:39:20,929 if you suspect that 885 00:39:20,929 --> 00:39:22,970 your heuristic is not that good, 886 00:39:22,970 --> 00:39:26,539 then you may be shoehorning 887 00:39:26,539 --> 00:39:30,559 your search or limiting your search to 888 00:39:30,559 --> 00:39:33,380 only spaces that your heuristic finds 889 00:39:33,380 --> 00:39:36,709 good and pruning out spaces 890 00:39:36,709 --> 00:39:39,050 that are good objectively. 891 00:39:39,050 --> 00:39:41,569 But your heuristic would value as bad. 892 00:39:41,569 --> 00:39:43,535 Because your blinders and the whole, 893 00:39:43,535 --> 00:39:45,080 the whole purpose of 894 00:39:45,080 --> 00:39:46,910 the Monte Carlo tree search algorithm 895 00:39:46,910 --> 00:39:48,079 is to be able to find 896 00:39:48,079 --> 00:39:51,350 those foods that expert heuristics 897 00:39:51,350 --> 00:39:52,654 might not be able to find? 898 00:39:52,654 --> 00:39:54,499 The Black Swan move. 899 00:39:54,499 --> 00:39:56,929 Yeah, exactly. 900 00:39:56,929 --> 00:39:58,970 No, that's totally true. 901 00:39:58,970 --> 00:40:00,209 Yeah. 902 00:40:00,209 --> 00:40:02,770 So how long an algorithm that will 903 00:40:02,770 --> 00:40:04,674 eventually find that move, right? 904 00:40:04,674 --> 00:40:05,214 Yeah. 905 00:40:05,214 --> 00:40:07,645 How are you doing with your game? 906 00:40:07,645 --> 00:40:08,994 It's good. 907 00:40:08,994 --> 00:40:09,924 It's good. 908 00:40:09,924 --> 00:40:11,350 It's wrapped up. 909 00:40:11,350 --> 00:40:13,059 No, I mean, what's the game called? 910 00:40:13,059 --> 00:40:15,504 The you're trying to dominion or something? 911 00:40:15,504 --> 00:40:17,304 It's called dominion, yeah. 912 00:40:17,304 --> 00:40:19,434 Yeah, it's been really fun. 913 00:40:19,434 --> 00:40:25,495 Is my AI is sort of at the lower end of 914 00:40:25,495 --> 00:40:28,749 its competitive with the easy level 915 00:40:28,749 --> 00:40:32,670 of a computer, computer game AI. 916 00:40:32,670 --> 00:40:33,849 It's very clunky. 917 00:40:33,849 --> 00:40:35,350 I have to enter all 918 00:40:35,350 --> 00:40:36,940 of the moves and the terminal and 919 00:40:36,940 --> 00:40:38,859 then play the move on 920 00:40:38,859 --> 00:40:40,060 the computer and then 921 00:40:40,060 --> 00:40:42,750 enter the moves and the terminal again. 922 00:40:42,750 --> 00:40:46,130 Yeah. It's, it's been a lot of fun 923 00:40:46,130 --> 00:40:48,979 and learned a lot and there's 924 00:40:48,979 --> 00:40:52,429 a lot more to the algorithm in terms 925 00:40:52,429 --> 00:40:54,980 of variations and details 926 00:40:54,980 --> 00:40:57,110 and things that you can try. 927 00:40:57,110 --> 00:40:59,869 That I didn't talk about, but yeah, 928 00:40:59,869 --> 00:41:00,950 it's been it's been really 929 00:41:00,950 --> 00:41:03,300 great to explore that. 930 00:41:04,690 --> 00:41:07,159 It's like there's, there's kind of 931 00:41:07,159 --> 00:41:08,780 a spectrum like you just like 932 00:41:08,780 --> 00:41:13,340 you to discuss going from Tic Tac Toe to go. 933 00:41:13,340 --> 00:41:17,989 Obvious example of this idea 934 00:41:17,989 --> 00:41:20,180 of using the heuristics and 935 00:41:20,180 --> 00:41:22,384 all of the state-space availability. 936 00:41:22,384 --> 00:41:23,480 Like you start off at 937 00:41:23,480 --> 00:41:24,935 this end of the spectrum where you know, 938 00:41:24,935 --> 00:41:26,690 every possible state and you know, 939 00:41:26,690 --> 00:41:28,415 you can calculate every single thing. 940 00:41:28,415 --> 00:41:29,599 And then he ended up at the other end 941 00:41:29,599 --> 00:41:30,710 of the spectrum where you've got ten 942 00:41:30,710 --> 00:41:31,760 to the 80 states and you 943 00:41:31,760 --> 00:41:33,875 can't calculate everything. 944 00:41:33,875 --> 00:41:36,229 I'm just trying to figure out this 945 00:41:36,229 --> 00:41:38,285 this use of heuristics in here. 946 00:41:38,285 --> 00:41:42,559 Like Yeah, Can you talk more about that, 947 00:41:42,559 --> 00:41:45,815 about using heuristics to yeah. 948 00:41:45,815 --> 00:41:48,140 So I would think 949 00:41:48,140 --> 00:41:49,804 about it as basically, you know, 950 00:41:49,804 --> 00:41:51,649 if you have a state-space 951 00:41:51,649 --> 00:41:54,109 that's too large to explore, all of, 952 00:41:54,109 --> 00:41:59,735 you can either go the heuristic route. 953 00:41:59,735 --> 00:42:02,779 So basically like look for moves that 954 00:42:02,779 --> 00:42:05,000 you follow a pattern 955 00:42:05,000 --> 00:42:07,520 that you already know that you think is good. 956 00:42:07,520 --> 00:42:09,694 Or you can go to the simulation route. 957 00:42:09,694 --> 00:42:11,105 You can just try things 958 00:42:11,105 --> 00:42:13,234 and see if they turn out well, 959 00:42:13,234 --> 00:42:15,499 to the extent that that's possible. 960 00:42:15,499 --> 00:42:19,069 And usually there's a mix. 961 00:42:19,069 --> 00:42:22,579 But again, I would say, 962 00:42:22,579 --> 00:42:25,069 the thing about chess 963 00:42:25,069 --> 00:42:26,659 is a large game Go is a large game, 964 00:42:26,659 --> 00:42:27,769 goes larger game, but they're 965 00:42:27,769 --> 00:42:30,260 both Huge Games, massive gains. 966 00:42:30,260 --> 00:42:32,569 And it just because 967 00:42:32,569 --> 00:42:33,679 of the rules of chess and 968 00:42:33,679 --> 00:42:34,850 because of the way it's played, 969 00:42:34,850 --> 00:42:38,255 it's much easier to take a heuristic route. 970 00:42:38,255 --> 00:42:39,349 For chess. 971 00:42:39,349 --> 00:42:41,930 It's much easier to 972 00:42:41,930 --> 00:42:43,369 come up with a series of 973 00:42:43,369 --> 00:42:45,214 rules from chess experts 974 00:42:45,214 --> 00:42:47,059 and be able to 975 00:42:47,059 --> 00:42:49,474 use those rules and tune them and 976 00:42:49,474 --> 00:42:52,399 have those rules generate a score 977 00:42:52,399 --> 00:42:55,610 for any possible game position. 978 00:42:55,610 --> 00:42:57,709 And then to just hone in 979 00:42:57,709 --> 00:42:58,789 on positions 980 00:42:58,789 --> 00:43:00,379 that look good according to that, 981 00:43:00,379 --> 00:43:03,140 that scoring rule for go, 982 00:43:03,140 --> 00:43:04,655 That's really hard, right? 983 00:43:04,655 --> 00:43:06,544 It's very hard to do that. 984 00:43:06,544 --> 00:43:08,149 What about opinion though? 985 00:43:08,149 --> 00:43:08,840 Wouldn't that be a 986 00:43:08,840 --> 00:43:10,744 little easier for dominion? 987 00:43:10,744 --> 00:43:15,049 Yes, it would be easier for dominion. 988 00:43:15,049 --> 00:43:17,465 And there are 989 00:43:17,465 --> 00:43:20,790 some pretty strong heuristics out there. 990 00:43:21,340 --> 00:43:23,779 But it is also 991 00:43:23,779 --> 00:43:26,779 the case that for any heuristic, 992 00:43:26,779 --> 00:43:30,919 if you use that heuristic in the roll-out. 993 00:43:30,919 --> 00:43:33,080 So rather than random, 994 00:43:33,080 --> 00:43:36,019 the Monte Carlo tree search algorithm will 995 00:43:36,019 --> 00:43:39,034 play at least as good as that heuristic. 996 00:43:39,034 --> 00:43:41,220 If not better. 997 00:43:41,320 --> 00:43:44,899 In probably, probably quite a bit better. 998 00:43:44,899 --> 00:43:46,579 So if you take any 999 00:43:46,579 --> 00:43:48,499 heuristic and you program it in and 1000 00:43:48,499 --> 00:43:51,290 instead of playing randomly 1001 00:43:51,290 --> 00:43:53,494 for the rollout you put in, 1002 00:43:53,494 --> 00:43:56,314 you just plug in your little heuristic. 1003 00:43:56,314 --> 00:43:58,399 Then you run 1004 00:43:58,399 --> 00:43:59,929 the Monte Carlo tree search algorithm 1005 00:43:59,929 --> 00:44:01,069 with all the rollouts using 1006 00:44:01,069 --> 00:44:02,750 that heuristic, it will. 1007 00:44:02,750 --> 00:44:03,949 Always played better than 1008 00:44:03,949 --> 00:44:05,479 that heuristic because it 1009 00:44:05,479 --> 00:44:06,980 has the intelligence to see 1010 00:44:06,980 --> 00:44:09,690 when that heuristic is winning and losing. 1011 00:44:11,380 --> 00:44:14,060 The downside to that is that then 1012 00:44:14,060 --> 00:44:16,040 your rollouts become much slower and you can, 1013 00:44:16,040 --> 00:44:18,560 you can do fewer of them. 1014 00:44:18,560 --> 00:44:20,899 Yeah. It's very fast to 1015 00:44:20,899 --> 00:44:24,229 randomly make moods. Thanks, Nick. 1016 00:44:24,229 --> 00:44:25,949 Yeah. 1017 00:44:29,980 --> 00:44:32,600 Steve, You need to unmute. 1018 00:44:32,600 --> 00:44:34,279 I see someone talking 1019 00:44:34,279 --> 00:44:36,659 but I don't I can't hear them. 1020 00:44:36,760 --> 00:44:39,304 And thank you again. 1021 00:44:39,304 --> 00:44:41,269 I'm back. 1022 00:44:41,269 --> 00:44:45,050 My first familiarity with game-playing. 1023 00:44:45,050 --> 00:44:47,525 It was back a while ago, 1024 00:44:47,525 --> 00:44:50,794 probably in the 60s when I discovered, 1025 00:44:50,794 --> 00:44:52,504 when I ran into 1026 00:44:52,504 --> 00:44:55,280 the military using game-playing, 1027 00:44:55,280 --> 00:44:57,079 course machines were slow. 1028 00:44:57,079 --> 00:44:58,700 It may, I think they 1029 00:44:58,700 --> 00:45:00,649 may have done this during the fifties. 1030 00:45:00,649 --> 00:45:03,155 But the question was, one of, 1031 00:45:03,155 --> 00:45:05,239 what do you do about a nuclear stand 1032 00:45:05,239 --> 00:45:07,010 down with the Soviet Union? 1033 00:45:07,010 --> 00:45:10,504 So they tried playing the game of doing that. 1034 00:45:10,504 --> 00:45:11,569 Course. 1035 00:45:11,569 --> 00:45:13,070 I was very happy that they 1036 00:45:13,070 --> 00:45:14,884 were using some methods 1037 00:45:14,884 --> 00:45:17,120 discovered because 1038 00:45:17,120 --> 00:45:18,710 of the limitations of computers, 1039 00:45:18,710 --> 00:45:20,809 space and time they did 1040 00:45:20,809 --> 00:45:24,240 about for simulations to make that choice. 1041 00:45:24,670 --> 00:45:28,099 So times have passed and I've heard 1042 00:45:28,099 --> 00:45:29,480 nothing more but the military 1043 00:45:29,480 --> 00:45:31,039 are using games. 1044 00:45:31,039 --> 00:45:33,570 Have you heard anything? 1045 00:45:35,290 --> 00:45:43,309 Well, certainly the approaches like 1046 00:45:43,309 --> 00:45:44,719 Monte Carlo tree search and really 1047 00:45:44,719 --> 00:45:47,570 any Monte Carlo method is 1048 00:45:47,570 --> 00:45:49,039 going to need a 1049 00:45:49,039 --> 00:45:50,824 super controlled environment, right? 1050 00:45:50,824 --> 00:45:54,049 Because you need to be able to 1051 00:45:54,049 --> 00:45:58,880 simulate the rest of the game somehow. 1052 00:45:58,880 --> 00:46:00,410 You need to be able to simulate 1053 00:46:00,410 --> 00:46:02,135 games over and over again. 1054 00:46:02,135 --> 00:46:05,150 And in order to do that with confidence, 1055 00:46:05,150 --> 00:46:09,140 and I'm not be confident that you don't 1056 00:46:09,140 --> 00:46:10,280 have some sort of 1057 00:46:10,280 --> 00:46:12,470 butterfly effect in 1058 00:46:12,470 --> 00:46:14,315 the real-world is going on. 1059 00:46:14,315 --> 00:46:16,715 You have to have very constrained, 1060 00:46:16,715 --> 00:46:19,079 a very constrained environment. 1061 00:46:19,150 --> 00:46:21,679 So I mean, I don't know, 1062 00:46:21,679 --> 00:46:22,909 but I would imagine 1063 00:46:22,909 --> 00:46:27,829 that there was a decision that the kinds 1064 00:46:27,829 --> 00:46:31,489 of game-playing that I'm working 1065 00:46:31,489 --> 00:46:33,589 on are just not that 1066 00:46:33,589 --> 00:46:36,840 useful for, say, war games. 1067 00:46:37,780 --> 00:46:44,675 And I do hear about sometimes more like, 1068 00:46:44,675 --> 00:46:45,950 more like war games, people 1069 00:46:45,950 --> 00:46:47,240 getting in a room and playing out 1070 00:46:47,240 --> 00:46:48,770 scenarios in the military 1071 00:46:48,770 --> 00:46:50,330 or State Department or whatever. 1072 00:46:50,330 --> 00:46:51,964 But yeah, 1073 00:46:51,964 --> 00:46:53,810 I don't I don't I haven't heard if the 1074 00:46:53,810 --> 00:46:55,475 military doing much of 1075 00:46:55,475 --> 00:46:58,080 this kind of game playing either. 1076 00:47:00,970 --> 00:47:03,965 I'm curious if you're familiar with 1077 00:47:03,965 --> 00:47:06,350 another similar sounding method 1078 00:47:06,350 --> 00:47:08,899 called Markov chain Monte Carlo. 1079 00:47:08,899 --> 00:47:10,820 And a Markov chain in a tree are 1080 00:47:10,820 --> 00:47:12,680 not all that different from each other. 1081 00:47:12,680 --> 00:47:14,419 So I just know, I just 1082 00:47:14,419 --> 00:47:18,695 wonder what I hear and I have not, 1083 00:47:18,695 --> 00:47:20,749 I have not been happy with 1084 00:47:20,749 --> 00:47:23,825 my attempts to use Markov Chain Monte Carlo, 1085 00:47:23,825 --> 00:47:25,729 but that's just me not having had 1086 00:47:25,729 --> 00:47:28,100 enough time to really dive into it deeply. 1087 00:47:28,100 --> 00:47:32,460 But it's asserted as being a, 1088 00:47:32,460 --> 00:47:37,795 an optimal sampling strategy 1089 00:47:37,795 --> 00:47:39,790 that makes it really attractive. 1090 00:47:39,790 --> 00:47:41,830 And I just wondered if the Markov chain 1091 00:47:41,830 --> 00:47:43,750 is more informal 1092 00:47:43,750 --> 00:47:45,789 or more formal and more 1093 00:47:45,789 --> 00:47:49,600 restrictive than than the tree? 1094 00:47:49,600 --> 00:47:51,520 Yeah. 1095 00:47:51,520 --> 00:47:53,019 You don't know. That's okay. 1096 00:47:53,019 --> 00:47:56,214 I was just curious. Oh, I don't I don't know. 1097 00:47:56,214 --> 00:47:58,240 I didn't think that would be the same. 1098 00:47:58,240 --> 00:48:01,899 I can't think from 1099 00:48:01,899 --> 00:48:03,174 what I remember of 1100 00:48:03,174 --> 00:48:04,990 Markov chains and Markov properties. 1101 00:48:04,990 --> 00:48:06,280 They don't infer 1102 00:48:06,280 --> 00:48:08,619 a combinatorial game where there's 1103 00:48:08,619 --> 00:48:11,089 no hidden information and 1104 00:48:12,120 --> 00:48:15,499 there's no chance involved. 1105 00:48:16,110 --> 00:48:18,370 I don't think there would be a difference, 1106 00:48:18,370 --> 00:48:20,979 but I will I am curious about that too now. 1107 00:48:20,979 --> 00:48:23,394 So I will look that up. 1108 00:48:23,394 --> 00:48:26,335 There are a bunch of variations on 1109 00:48:26,335 --> 00:48:29,574 Monte Carlo that use, 1110 00:48:29,574 --> 00:48:31,705 that aren't Monte Carlo tree search. 1111 00:48:31,705 --> 00:48:35,470 But they use other methods. 1112 00:48:35,470 --> 00:48:38,229 Part of the Markov chain Monte Carlo, 1113 00:48:38,229 --> 00:48:40,090 which seems like it's 1114 00:48:40,090 --> 00:48:41,860 probably important and powerful, 1115 00:48:41,860 --> 00:48:43,420 but which seems to create 1116 00:48:43,420 --> 00:48:45,429 artifacts that we were uncomfortable with, 1117 00:48:45,429 --> 00:48:47,320 is that when it finds 1118 00:48:47,320 --> 00:48:49,120 good results and then 1119 00:48:49,120 --> 00:48:51,085 it branches and finds a failure, 1120 00:48:51,085 --> 00:48:52,540 it resamples 1121 00:48:52,540 --> 00:48:54,789 without rerunning the previous one. 1122 00:48:54,789 --> 00:48:57,810 So you can end up with one particular sample 1123 00:48:57,810 --> 00:49:00,170 because it's better than 99% of them. 1124 00:49:00,170 --> 00:49:02,090 It gets reproduced 99 1125 00:49:02,090 --> 00:49:03,500 times in the synthetic 1126 00:49:03,500 --> 00:49:04,954 sample that is creating, 1127 00:49:04,954 --> 00:49:06,559 that creates really 1128 00:49:06,559 --> 00:49:09,605 weird-looking probability density functions. 1129 00:49:09,605 --> 00:49:11,330 That's weird. 1130 00:49:11,330 --> 00:49:13,730 Why would, why would it do that? 1131 00:49:13,730 --> 00:49:16,489 It strikes me as a little bit analogous 1132 00:49:16,489 --> 00:49:18,679 to backpropagation or something like that. 1133 00:49:18,679 --> 00:49:21,680 Because to me the big innovations are this, 1134 00:49:21,680 --> 00:49:23,914 exploring the upper confidence limit 1135 00:49:23,914 --> 00:49:25,295 as a preference, 1136 00:49:25,295 --> 00:49:26,809 and then the backpropagation are 1137 00:49:26,809 --> 00:49:28,564 the two important things that, 1138 00:49:28,564 --> 00:49:30,680 that change everything and hopefully what 1139 00:49:30,680 --> 00:49:31,699 you've shown and people 1140 00:49:31,699 --> 00:49:32,704 which makes it better, 1141 00:49:32,704 --> 00:49:35,179 it's a better, more likely 1142 00:49:35,179 --> 00:49:36,784 to move you through space 1143 00:49:36,784 --> 00:49:39,380 in an appropriate way. 1144 00:49:39,380 --> 00:49:42,604 Markov chain is also kind of in the game now, 1145 00:49:42,604 --> 00:49:43,820 right now it's really popular in 1146 00:49:43,820 --> 00:49:45,590 the system dynamics field 1147 00:49:45,590 --> 00:49:47,629 for exploring uncertainty effects. 1148 00:49:47,629 --> 00:49:50,569 And I'm just desk dying to learn more. 1149 00:49:50,569 --> 00:49:52,520 I just need to get to the point where I 1150 00:49:52,520 --> 00:49:53,329 can put the time 1151 00:49:53,329 --> 00:49:54,485 into really learn this stuff. 1152 00:49:54,485 --> 00:49:56,360 It does make your head spin sometimes. 1153 00:49:56,360 --> 00:49:58,879 So you look at these complicated algorithms. 1154 00:49:58,879 --> 00:50:01,324 Yeah, that's really interesting. 1155 00:50:01,324 --> 00:50:02,060 I will have to, 1156 00:50:02,060 --> 00:50:03,274 I haven't had my ear to the ground. 1157 00:50:03,274 --> 00:50:04,099 I'll have to do 1158 00:50:04,099 --> 00:50:06,839 some googling might be worth there. 1159 00:50:07,180 --> 00:50:10,200 Other questions? 1160 00:50:15,730 --> 00:50:18,740 Well, then I'm gonna go ahead and turn 1161 00:50:18,740 --> 00:50:20,914 the recording off. Yeah. 1162 00:50:20,914 --> 00:50:21,469 Thanks, Nick. 1163 00:50:21,469 --> 00:50:22,400 That was great. 1164 00:50:22,400 --> 00:50:24,749 Okay. Cool. Yeah. Thank you.