Do THIS Before Web Scraping

Before you get hardcore into parsing some HTML like a real professional web scraper, you should do this.

Check for hidden APIs! They can make your web scraping life easy.

Transcription:

Wait, wait, don’t scrape yet. Hello, I’m Jordan from Cobalt Intelligence. Today I want to talk a little bit about looking for hidden APIs. We’ve talked a little bit before about how this works, but I’m just going to review. On any website, the way the web works is that you navigate to a page, right here.

Let’s see over here. This is the URL. You go to the URL. When you land there, you send out a request that says, hey, give me all the data for this page, and the server gives it back to you. Historically, and I say historically meaning when the web first started, though there are still some websites that do it today, the browser sends out one request and all the data is loaded right there, at the very first request.

Well, now there are things called hidden APIs. And they’re not really hidden. This is just how the web works now. What happens is the page loads in, and then the page loads in a little bit of JavaScript, and the JavaScript says, okay, now I’m going to get the rest of this data. That makes it so you can load stuff after the fact, and you don’t have to load everything at the beginning.

Plus, people click around. Here’s an example. Look, I click over here and the page is not refreshing. I click on this house, right, and look, the page didn’t refresh at all. It went to the back end and got more data.

So when you see that, you know, bam, there’s some kind of API hidden behind here. Now, you could try to go through and scrape all this. You could navigate to this page right here and try to scrape all this data manually, just by parsing the HTML. But the hard part is that the HTML is potentially going to change, and there could be different versions, with different pages having the data in different spots.

Zillow is famous for this. To get all this data, you have to push all these buttons and make sure they’re expanded and all in a good spot. Or you just do something like this, where you look over here and click the Network tab. I’m going to make this bigger for you. Click the Network tab.

And when you refresh the page, you look for something in here. Normally it’ll be here under the Fetch/XHR filter. Click over here, and then you look. Now, Zillow is a tough one because they do a lot of things in the background, but there’s going to be something in here that has all that information. All the information that was loaded will be available in here somewhere.

Let’s see if it’s this one. Not this one, not that one.

This is a different thing. I don’t remember which one it is, and they could even have changed it. Maybe they use a different one now. Maybe it’s this one. This is the map. But anyway, you can look in here and you can see all this data in there. There’s another good example that I want to talk about.

It’s over here. This one has all the search results, the list results. You can see all these things in here. Look at all the data they have in here. Look, that’s all of them. There’s the address. There’s the broker name. Interesting. These are the houses that are first. All the information is in there. And let’s go over here to the New York Secretary of State, which is a really good example.

So I’m going to search for something here. I’m going to search for, like, pizza, and hit enter. First things first: look, it sends out the request, and the request says, hey, they’re looking for all statuses, and the search value is pizza. By the way, what’s the plural of status? Is it statuses?

The plural is statuses. Okay, phew. I’ve questioned myself before, and I’m like, oh, look at them, they’re using statuses. Okay. Anyway, it returns the data here, and you can see all this stuff in here. Then you click on one, and this right here, bam, has all this information in it. So all the information that is on this page is in here.

It’s nice JSON right here in this API. This is incredibly common with any modern website, and this is what you should check first. The API is going to be able to handle more load, and it’s going to have the data in a better format. This is going to be what you’re looking for. Does every state, every website have this? Certainly not, and especially not any kind of older site.
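To make the “nice JSON” point a little more concrete, here is a minimal Python sketch of pulling a couple of fields out of an API response body. The sample body and its field names (`results`, `address`, `brokerName`) are made up for illustration. They are not the real Zillow or New York Secretary of State format, so swap in whatever you actually see in the Network tab.

```python
import json

# Hypothetical response body, shaped like the kind of JSON a search API
# might return. The field names are assumptions made for illustration.
SAMPLE_BODY = """
{
  "results": [
    {"address": "123 Main St", "brokerName": "Example Realty", "price": 450000},
    {"address": "456 Oak Ave", "brokerName": "Sample Homes", "price": 389000}
  ]
}
"""


def extract_listings(body: str) -> list[dict]:
    """Parse a JSON response body and keep only the fields we care about."""
    data = json.loads(body)
    listings = []
    # .get() with a default keeps this from crashing if "results" is missing.
    for item in data.get("results", []):
        listings.append(
            {
                "address": item.get("address"),
                "broker": item.get("brokerName"),
            }
        )
    return listings


if __name__ == "__main__":
    for listing in extract_listings(SAMPLE_BODY):
        print(listing["address"], "-", listing["broker"])
```

There are no CSS selectors and no buttons to expand here. The data is read by key name, which tends to be more stable than a position in the HTML.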

Here’s Oregon, for example. Look, you can already kind of tell from the site right here. You look at it and you’re like, oh, well, probably not. So we go over here, search for pizza, and you can see the page, the whole page, loads slowly, even. Look at it, it’s going to take its time over here like this.

And look, there’s some stuff in here, but this isn’t what we want. This is Google Analytics. So if it’s not here in the Fetch/XHR tab, it’s over here in the Doc tab. If you come over here and you see all this stuff, then you know.

Okay, this was loaded on the page load. Doc is the key. Fetch/XHR, those are the ones that are probably Ajax. So for this one you look over here. You can normally tell: if the data loads on the page load, when the page initially loads up, then this is it. Okay. So that’s what I wanted to talk about today.
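If you save off a response and want a script to make the same Doc versus Fetch/XHR call, one rough way is to look at the `Content-Type` header. This is only a sketch under that assumption. The browser groups requests by how they were made, not by content type, and the function name and labels here are made up for illustration.

```python
import json


def classify_response(content_type: str, body: str) -> str:
    """Return "json-api", "html-document", or "other" for a captured response."""
    # Content-Type can carry parameters, e.g. "application/json; charset=utf-8",
    # so keep only the media type before the first semicolon.
    media_type = content_type.split(";")[0].strip().lower()

    if media_type == "application/json" or media_type.endswith("+json"):
        try:
            json.loads(body)
            return "json-api"
        except ValueError:
            # The header claimed JSON but the body does not parse.
            return "other"
    if media_type == "text/html":
        return "html-document"
    return "other"


if __name__ == "__main__":
    print(classify_response("application/json; charset=utf-8", '{"ok": true}'))
    print(classify_response("text/html; charset=UTF-8", "<html></html>"))
```

A "json-api" result is the hidden API case. An "html-document" result is the Oregon case, where the data came with the page itself and you are back to parsing HTML.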

That’s it. Easy. Just look for hidden APIs, because most of the time that data is going to be better formatted and it’s going to be a lot easier to request. Like, look at this request. It’s simple. You send it over here, you set it up with this exact payload, all in JSON, and done. You get the data, and then you just parse out the JSON and you’re good to go.
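Here is roughly what copying that kind of request into Python could look like, using only the standard library. The URL and the payload field names (`searchValue`, `statuses`) are placeholders loosely modeled on the pizza search above, not the real endpoint. Copy the actual URL, headers, and payload from the request you find in the Network tab.

```python
import json
import urllib.request

# Hypothetical endpoint; replace it with the URL shown in the Network tab.
SEARCH_URL = "https://example.com/api/search"


def build_search_request(search_value: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a search term."""
    # Assumed payload shape; the real field names come from the site itself.
    payload = {"searchValue": search_value, "statuses": ["all"]}
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def run_search(search_value: str) -> dict:
    """Send the request and parse the JSON reply (needs a real endpoint)."""
    request = build_search_request(search_value)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the request, so this runs without network access.
    req = build_search_request("pizza")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Building the request is kept separate from sending it, so you can print the payload and compare it against what the browser sent before you hit the server.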

It’s a much easier way to web scrape. So do that. Do it.