This Trick Makes Scraping Zillow a Piece of Cake

GitHub repo – https://github.com/cobalt-intelligence/fetch-requests-from-puppeteer

Use fetch requests from within Puppeteer to easily access hard-to-reach APIs.

Transcription:

Hi, I’m Jordan Hanson. I’m here with Cobalt Intelligence, talking about a cool little trick I like to use when you have a tough website that you’re trying to scrape and you’re kind of struggling. The example I’m going to use here today is Zillow. For those of you who haven’t scraped Zillow, it’s not very easy.

They do a pretty good job of protecting it. Well, not protecting it, but they do make it hard. Their data is their product, and they’re trying to make it harder to get. The example I’m going to show today is this map view. You can see this map view, and there’s a bunch of houses on it.

What I want to get is the list of houses, like all the Zillow data for this region, right? So let’s go over here. Zoom in, and one more. Here we go. See, we have all the Zestimates for all these properties, and that’s what I want. That looks cool. I want to get a list of all those properties and all those addresses.

I know, because they have this number here, that it must have come in somehow via Ajax. So I go down here and I see this thing right here. I’m like, oh, sweet. I see the preview. I go in here to cat1, I see searchResults, I see mapResults, I see 499. I’m thinking, that looks good to me. This is what I’m looking for.

So I’m going to copy this. I’m going to say “Copy as cURL (bash)” right there. I’m going to just throw it in Postman and see if we can easily do this, if this is going to be a piece of cake. Import, raw text, paste it in there, import, send it. What does that mean? Imports JSON, instead of anything written here? What is that?

I don’t know, but that didn’t work. Let’s just say this didn’t work. This is what happened when I tried, and I don’t know why this didn’t do anything. Is it because this is like a trial account I have going on here? I don’t know. It’s not a trial, it’s just not a real account. Whatever. Anyway, the point is, if you go and try this, a lot of times it’s not going to work. They’re going to block it.

They’re going to know this is a simple request directly to the API, and it doesn’t work. You can’t do it. It’s going to prompt for a CAPTCHA. Now, what you could do, obviously, is go in with Puppeteer and scrape the page, but trying to web scrape all this data right here, like on a map, that’s a nightmare.

You don’t want to do that. So a little trick I’ve used is this: I have Puppeteer navigate to that site, and then I make a request to any API I want within that site. You don’t even have to be on this page. Who cares, right? You go to zillow.com, you can make that request, and then you can get the data out and pull it back, because look, that’s what this is doing.

It’s making fetch requests. It has all the cookie data and all the session information in here in the browser. So why don’t I just let Puppeteer handle all that, and I’ll just make the request myself? That’s what we’re going to show here. So come over here. What I did is I took all these parameters right here, the searchQueryState.

See right here, these params: pagination, usersSearchTerm, my mapBounds, my mapZoom, all that stuff right here. This is different than this one, but we go over here like this, and then I have my wants. Let’s see, that matches that right there. I took out requests. I didn’t think I needed it. And what I do is I just have a normal Puppeteer one.

I launch a headful setup, and then I open it to this page. I’m just having it go to Atlanta, Georgia. I wonder what happens; we’ll try going to just zillow.com to see if that makes a difference. It goes to this page, and then we do this thing called evaluate. We go over here to evaluate, and we make a fetch request within there.

And so I’m going to my URL here, right? I just have this search request, and I pass in that stuff I just defined, because that’s what this is. See the request URL right here. You can barely see it, I know, but it’s all that data, pagination right here. It’s all stringified in these areas. So we stringify the params.

We stringify the wants, and we have this requestId, whatever that is, and then we put all this stuff in here. And then we just send the request in here. So what happens is it opens up the browser, and then we make the browser make the API request, and it pulls the result back for us and sends it out.
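The stringifying step described above can be sketched roughly like this. It is a minimal sketch, not the code from the repo: the endpoint placeholder, the `buildSearchUrl` helper, the map bounds and the `requestId` value are all assumptions for illustration. Copy the real URL and parameter values from the request you see in the browser.

```javascript
// Placeholder endpoint (assumption): replace with the request URL copied from the browser.
const SEARCH_ENDPOINT = "https://www.zillow.com/REPLACE-WITH-API-PATH";

// Hypothetical helper: serializes the objects to JSON and URL-encodes them.
function buildSearchUrl(searchQueryState, wants, requestId) {
  const query = new URLSearchParams({
    searchQueryState: JSON.stringify(searchQueryState),
    wants: JSON.stringify(wants),
    requestId: String(requestId),
  });
  return `${SEARCH_ENDPOINT}?${query.toString()}`;
}

// Illustrative values only; the bounds and zoom are made up.
const exampleQueryState = {
  pagination: {},
  usersSearchTerm: "Atlanta, GA",
  mapBounds: { west: -84.6, east: -84.2, south: 33.6, north: 33.9 },
  mapZoom: 12,
};
const exampleWants = { cat1: ["mapResults"] };

const searchUrl = buildSearchUrl(exampleQueryState, exampleWants, 1);
console.log(searchUrl);
```

Building the query with `URLSearchParams` takes care of the URL encoding, which is what makes the raw request URL so hard to read in the browser.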

So let’s run it and see what happens here, and then I’ll talk a little bit more about the details of how to do that. This is to make sure we get some kind of results. So it just pops open, goes to this, and bam, we have mapResults. I’ll browse, like, 170. We’ll check one or two. How about, let me see, mapResults 20, to make sure it’s a property, right?

I have to make sure you guys don’t think I’m lying to you. And this code is going to be on GitHub. I’ll have a little sample for you out there, so you can go through and edit it. Look, see, it has a property right there, all this stuff. It makes me so happy that this works. Okay. Now look, we have the params here.

We have the wants. What you do is you have this evaluate. Now, with evaluate, you can pass parameters into it. You can’t just reference outside variables inside here; the scope is different, because you have to think of anything inside evaluate as really just being run as if you’re evaluating JavaScript in the browser console.
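A small sketch of that scoping rule, assuming `page` is a Puppeteer `Page`. The helper name `summarizeSearchInPage` and the fields it reads are made up for illustration. Values reach the browser only as extra arguments to `page.evaluate`, and they need to be serializable.

```javascript
// `page` is assumed to be a Puppeteer Page; the helper name is hypothetical.
async function summarizeSearchInPage(page, params, wants) {
  // The callback runs in the browser, so it cannot see Node-side variables.
  // Everything it needs is passed as extra arguments after the function.
  return page.evaluate(
    (browserParams, browserWants) => {
      // This code executes in the page, as if typed into the browser console.
      return {
        term: browserParams.usersSearchTerm,
        wantedCategories: Object.keys(browserWants),
      };
    },
    params,
    wants
  );
}
```

A call would look like `const summary = await summarizeSearchInPage(page, params, wants);`, with `params` and `wants` defined on the Node side.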

Just pretend that’s what you’re doing. In fact, if you go over here like this. Okay, this will say, not browser, page.waitForTimeout. We’ll give it 10 seconds right here with this. Let me see, I actually opened another one. No, that’s not what we want. “Hey, Buckaroo”: that right there, this console.log inside the evaluate block, is not going to show up in this console.

This log right here, this one? No, it won’t show up here. Watch.

I just added the pause so we can see it.

Look, see: “Hey, Buckaroo.” You’re logging in Puppeteer’s DevTools. And then we come in here, but we have to make sure we return a promise. We need to get all this data out of here. We have to return with await, right? We have to, because it’s asynchronous, this thing. So we return a new promise.

We pass in these things so then we can do an await. So we return await new Promise right here, whatever, with all that stuff in there. And then we get the JSON back with a normal fetch request. So we make a fetch request from there, parse it to JSON, and then we resolve that out, which comes out into this JSON.
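Putting the pieces together, the flow described above might look like the sketch below. It is not the code from the repo: the helper names, the `API_URL` environment variable and the landing URL are assumptions. It uses an `async` callback instead of wrapping the fetch in `new Promise`; either way `page.evaluate` waits for the returned promise and hands the resolved value back to Node.

```javascript
// Runs fetch() inside the page so the browser's cookies and session are used.
async function fetchJsonInPage(page, apiUrl) {
  return page.evaluate(async (url) => {
    const response = await fetch(url); // executes in the browser, not in Node
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    return response.json(); // the result must be serializable to travel back to Node
  }, apiUrl);
}

async function runInBrowser(apiUrl) {
  // Loaded lazily so the helper above can be reused without launching a browser.
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    // Assumed landing page; per the talk, any page on the site should do.
    await page.goto("https://www.zillow.com/");
    const json = await fetchJsonInPage(page, apiUrl);
    console.log(Object.keys(json));
  } finally {
    await browser.close();
  }
}

// Only starts a browser when a URL is supplied through the (assumed) API_URL variable.
if (process.env.API_URL) {
  runInBrowser(process.env.API_URL).catch(console.error);
}
```

To try it, install Puppeteer and run the script with `API_URL` set to the request URL copied from the browser.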

And that is how we get it. So this is an awesome way. I do this a lot with sites that have credentials, where you have to log in. So I’ll log into the site. I’m just doing web automation here, and this saves me from having to manually go over to those pages and scrape all that data.

If they use an API, I just log in on the front end and then make all the API requests from there. Because I have all the credentials in there, I can just make fetch requests and get the data back rather than having to go through and scrape it all. So hopefully this is going to be helpful for you. I’ll have the GitHub repo in the comments, and maybe you can use it to your profit and benefit.
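For the logged-in case, a sketch might look like this. Everything site-specific here is hypothetical: the `loginThenFetch` name, the login URL, the selectors and the credentials are placeholders to replace for the site in question, and `page` is assumed to be a Puppeteer `Page`. It also assumes that submitting the login form triggers a page navigation.

```javascript
// Hypothetical helper: log in through the UI, then call the site's API from the page.
async function loginThenFetch(page, login, apiUrls) {
  await page.goto(login.url);
  await page.type(login.userSelector, login.username);
  await page.type(login.passwordSelector, login.password);
  // Start waiting for the navigation before clicking so it is not missed.
  await Promise.all([page.waitForNavigation(), page.click(login.submitSelector)]);

  const results = [];
  for (const apiUrl of apiUrls) {
    // Same-origin fetch calls send the session cookies set by the login.
    const json = await page.evaluate(async (url) => {
      const response = await fetch(url);
      return response.json();
    }, apiUrl);
    results.push(json);
  }
  return results;
}
```

The requests run one after another here to keep the sketch simple; nothing in the talk says how the original code paces them.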

Thanks.