If it were true that this distribution held, that they were choosing Play with 80 percent probability when the coin is heads and 30 percent probability when it is tails, then yes, we could apply this reasoning. But the opponent's strategy can always change. They can always change adversarially against us. Yes? >> There's one thing that I've always been interested in, when you play the Nash or when you play against the opponent.
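To make that best-response reasoning concrete, here is a minimal sketch (my own illustration, not from the talk) of how player two would respond to that fixed distribution in the coin-toss game. The +1/-1 payoffs for a correct/incorrect guess are assumed for illustration.

```python
# Best response to a FIXED opponent model in the coin-toss guessing game.
# Assumed model: player one chooses Play with probability 0.8 on heads
# and 0.3 on tails (the distribution mentioned above). Payoffs to player
# two are illustrative: +1 for a correct guess, -1 for a wrong one.

p_heads, p_tails = 0.5, 0.5          # the coin is fair
play_given_heads, play_given_tails = 0.8, 0.3

# Posterior over the coin given that we observed "Play" (Bayes' rule).
p_play = p_heads * play_given_heads + p_tails * play_given_tails
post_heads = p_heads * play_given_heads / p_play
post_tails = 1.0 - post_heads

# Expected value to player two of each pure guess.
ev_guess_heads = post_heads * 1 + post_tails * -1
ev_guess_tails = post_tails * 1 + post_heads * -1

print(f"P(heads | Play) = {post_heads:.3f}")   # ~0.727, so guess heads
print(f"EV(heads) = {ev_guess_heads:.3f}, EV(tails) = {ev_guess_tails:.3f}")

# The catch: if the opponent shifts adversarially, e.g. only Plays on
# tails, this "best response" of guessing heads now loses money.
```

The point of the exchange above is exactly that this calculation is only valid while the model holds; an adversarial opponent can invalidate it at any time.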
It seems like they're not going to shift. Even if you're playing the wrong strategy, they won't exploit it immediately. They have to learn to exploit it. I guess it's safe to just model the Nash, but I am curious about this intermediate space where you'd play against how they've been playing in the past, recognizing that you need to shift in some way because they may shift as well. >> So, yeah, that's a great question. One of the interesting things about humans playing poker is that they're actually really good at exploiting. They are phenomenal at it, way better than computers currently are. We actually ran a competition against them in 2015, which we lost, and we would sometimes change the bot that they were playing against between days. Within 50 hands of play, they could figure out how the bot had changed. So yes, if you could make an AI that figures out how to do this better than humans, that might be valuable. But we were playing against really talented humans, and we didn't think we could beat them at that game. And besides, why bother playing that game? Why bother trying to play that mind game if we can just approximate a Nash equilibrium and guarantee that we're going to win?

So, I would argue that if you want to beat top humans in a two-player zero-sum game, the best approach is to just approximate the Nash equilibrium, because then, no matter what they do, you're going to beat them. Now, if your objectives are different, for example, if you really want to beat up on a weak player and exploit them, then you don't necessarily want to play the Nash equilibrium. You want to adapt to their weaknesses. This is challenging to do correctly, because if you try to adapt to a weak player's weaknesses, you never know if they're just fooling you. If you're playing rock, paper, scissors against somebody and they throw rock three times in a row, and you say, well, he's clearly an idiot who throws rock every single time, I'm going to throw paper next time, they could just throw scissors. So, except in special cases, there's no safe way to do that kind of opponent exploitation and still guarantee that you're going to beat top humans in expectation. I think that's an excellent avenue for future research, but in the two-player zero-sum setting, where we're just trying to beat top humans, I think this is the better way to go about it.

So, Unsafe Subgame Solving is very risky for this reason: if you make an assumption about how the opponent is playing, they can always shift to a different strategy and take advantage of it. Now, that said, in practice Unsafe Subgame Solving works unusually well in poker. It turns out that if you just approximate what the Nash equilibrium strategy is, assume the opponent is playing that, and apply subgame solving in this way, that actually works really well in this domain. But we have found situations where it does not work well, and I think in more general settings it would not do well. We actually used it in a few situations in Libratus, but in general I would not recommend it, unless the domain is specially structured so that it works.
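The rock-paper-scissors point can be made numerically. The sketch below is my own illustration (not from the talk): it contrasts the guaranteed value of the uniform Nash strategy with the exposure you take on when you try to exploit an apparently rock-heavy opponent.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player:
# rows = our action, columns = opponent's, order = (rock, paper, scissors).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

nash = np.array([1/3, 1/3, 1/3])     # uniform Nash equilibrium

def ev(ours, theirs):
    """Expected payoff to us when both players mix as given."""
    return ours @ A @ theirs

# Against ANY opponent strategy, Nash guarantees expected value 0.
for opp in [np.array([1, 0, 0]), np.array([0, 0, 1]), np.array([0.5, 0.2, 0.3])]:
    assert abs(ev(nash, opp)) < 1e-12

# Suppose we model the opponent as rock-heavy and "exploit" with paper.
exploit = np.array([0.0, 1.0, 0.0])
print(ev(exploit, np.array([1, 0, 0])))  # +1.0 if the model is right...
print(ev(exploit, np.array([0, 0, 1])))  # -1.0 if they switch to scissors
```

The moment you deviate from Nash to exploit, you create a counter-exploit of equal size, which is the trade-off described above.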
So, in Safe Subgame Solving, the idea is instead that we're just going to estimate the expected value of the opponent's actions for the different subgames, and use that information to determine the optimal strategy for the subgame that we're in. Now, this works if your expected values are perfect, but if they're not perfect, you're obviously not going to compute an exact Nash equilibrium. So, it turns out there's room for improvement here. By the way, this idea has been around for a while; it was first introduced in 2014. It was never really used in practice because it didn't actually give good results, since you don't have perfect estimates. But what we came up with is a way to dramatically improve the performance without giving up any theoretical guarantees, with this thing called Reach Subgame Solving.

So, here's an example of how this works. This is going to get a little tricky, so if you have any questions in the next few slides, please let me know. Let's say we have this slightly larger game now. It's still basically the same structure: there's a coin flip that only player one observes, player one takes a sequence of actions, and they eventually end up with this choice between selling the coin or choosing Play. If they choose Play, player two has to guess how the coin landed. Now, let's say our estimates are off in this game. Let's say we estimate that for choosing Sell, they will get an expected value of minus one regardless of which state they're in. Well, the best we can do is guess 50-50 between heads and tails and guarantee that they get an expected value of zero for choosing Play. But maybe we can use information about the earlier actions to improve upon this. Maybe there is this earlier action that player one could have chosen, if the coin landed heads, where they could have gotten an expected value of 0.5. That means that in the Nash equilibrium, they would choose that action and get an expected value of 0.5, and in the other case, they would come down here, choose Play, and get an expected value of zero. So, they're getting an average of 0.25 in this game. But we can now shift our strategy as player two to guess tails more often, which guarantees that player one gets negative 0.5 in the tails case. In the heads case, that means they will get 0.5 for choosing Play, but that doesn't really matter, because they're already getting 0.5 from this earlier deviate action. So, we're not really giving up anything in this situation; we're just making ourselves better off, because they would never get to the situation where they would receive that 0.5 anyway.

This seems really intuitive, but there's a problem with it, which is really subtle. I'm going to have to go to a bigger game, which is going to get even more complicated, to really illustrate it. So, here is this bigger game. It's still pretty similar: there's a coin that lands heads or tails with 50-50 probability. Player one, in both cases now, let's say, has this deviate action. In the heads case, they can get 0.5, and in the tails case, minus 0.5. Or they can choose to continue, in which case we run into this chance node. This chance node is public; both players observe its outcome. It just leads to two different subgames, two different situations that are strategically identical. It's an irrelevant chance node, but it is a chance node.
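Here is a small numeric sketch of the smaller game above (my own illustration; the values 0.5, minus one, and zero are the ones used in the example, and +1/-1 payoffs for a wrong/right guess by player two are assumed so that a 50-50 guess makes Play worth zero).

```python
# Worked numbers for the smaller Reach example above. All values are
# player one's expected values.

ev_deviate_heads = 0.5   # earlier action, available only when coin is heads
ev_sell = -1.0           # Sell is worth -1 in either state (our estimate)

def play_evs(p_guess_tails):
    """Player one's EV for Play in each state, given player two's mix."""
    ev_heads = p_guess_tails * 1 + (1 - p_guess_tails) * -1  # wrong guess pays P1
    ev_tails = (1 - p_guess_tails) * 1 + p_guess_tails * -1
    return ev_heads, ev_tails

# Baseline: player two guesses 50-50, so Play is worth 0 in both states.
# Player one deviates on heads (0.5) and Plays on tails (0.0).
h, t = play_evs(0.5)
print(0.5 * max(ev_deviate_heads, h, ev_sell) + 0.5 * max(t, ev_sell))  # 0.25

# Shifted: player two guesses tails 75% of the time. Play becomes 0.5 on
# heads and -0.5 on tails. On heads, player one already had 0.5 from the
# deviate action, so we give up nothing; on tails they drop to -0.5.
h, t = play_evs(0.75)
print(0.5 * max(ev_deviate_heads, h, ev_sell) + 0.5 * max(t, ev_sell))  # 0.0
```

Player one's overall value falls from 0.25 to 0, so the shift strictly helps player two, exactly as argued above.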
Then, after this chance node, player one, let's say, chooses Play, and we estimate the expected value of them choosing Play as zero. So, let's say we are player two in this game, and we observe player one choose Play. That means we are in one of two different situations: either the coin landed heads, player one chose to continue, we observed the chance node go left, and then they chose Play; or the coin landed tails, player one chose to continue, we observed the chance node go left, and they chose Play. So, we're in either this situation or this situation. Well, we observed that they have this deviate action where they could've gotten an expected value of 0.5 if the coin landed heads. So, maybe we would say, okay, we can increase the expected value for this action to one, and lower it to minus one in the other case, for example, by always guessing tails. That seems okay, because since this situation is only encountered 50 percent of the time, the expected value for this action is now just 0.5, which matches the deviate action, so we're not giving up anything. Does anybody see the problem with this?

All right. The problem is, if that chance node had gone the other way, if it had gone right, we would apply this same exact reasoning. We would say, okay, we can increase this expected value to one, because we're only encountering this situation half the time, so the expected value goes up to 0.5, the opponent is getting an expected value of zero, and we're not giving up anything. But if we apply this reasoning regardless of which way the chance node goes, then our strategy is to always guess tails in both situations. So, in reality, the expected value in this case is one and in this case is one, which means the expected value of continuing is actually one, not 0.5. Now player one would be better off choosing to continue instead of choosing the deviate action.

What this illustrates is that when you are doing this subgame solving, this real-time reasoning, you can't just look at the expected values of what we call the Blueprint Strategy, the pre-computed strategy. You have to think about what the expected values would have been if we had entered the other subgames and applied subgame solving there too. That makes things way more complicated. By the way, two prior papers had actually discussed this idea of saying, we're encountering this situation, so let's increase the expected value here because the opponent could have gotten a higher expected value earlier on, and they both missed this problem, that you have to consider all the subgames the opponent could end up in. So, two prior papers were published about this, and they both got it wrong. Our NIPS 2017 paper recognized this problem and came up with a fix that lets you do Reach subgame solving while still guaranteeing that your exploitability is not going to go up. The basic idea is to only increase the expected value in both of these situations by 0.5. The actual details get a bit more complicated, but they aren't too important for this talk. The idea is that you increase the expected values by less, depending on how many subgames the opponent could end up in.
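To pin the subtlety down numerically, here is a small sketch (my own simplified illustration; the actual NIPS 2017 fix is more involved, as noted above) contrasting the flawed per-subgame reasoning with splitting the 0.5 "gift" across the subgames the opponent could reach.

```python
# The public chance node leads to N strategically identical subgames,
# each reached with probability 1/N. N = 2 in the example above.
N = 2
gift = 0.5  # EV of the deviate action player one forgoes by continuing

# Flawed reasoning, applied independently in EACH subgame: raise the
# heads EV of Play to gift / (1/N) = 1.0, on the grounds that this
# subgame is only reached half the time.
flawed_ev_per_subgame = gift / (1 / N)       # 1.0
# But the SAME logic fires in every subgame, so player one's real EV
# for continuing on heads is 1.0, not 0.5 -- continuing beats deviating.
print(sum((1 / N) * flawed_ev_per_subgame for _ in range(N)))  # 1.0 > 0.5

# Fix (simplified): only increase the EV in each subgame by the gift
# itself, so the reach-weighted total can never exceed the gift.
fixed_ev_per_subgame = gift                  # 0.5
print(sum((1 / N) * fixed_ev_per_subgame for _ in range(N)))   # 0.5, safe
```

With the fix, continuing is worth exactly 0.5 on heads, no more than the deviate action, so exploitability does not increase.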