Best Practices for Web Scraping Large Amounts of Data (Big Data!)

There are two things you should be doing when you are parsing a huge amount of data. And if you are a web scraper, you are probably parsing a huge amount of data.

1. Handle your exceptions!

Stop being stubborn and swallow your pride. The longer a function runs, the higher the likelihood that you are going to hit something unexpected. Put a try/catch around your function so one bad record doesn’t stop a script that’s been running forever and ruin your work.
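
Here’s a minimal sketch of what that looks like, assuming Node 18+ (for the built-in `fetch`) and a placeholder `scrapePage` helper standing in for whatever parsing you actually do. One bad page gets logged and skipped instead of killing the whole run.

```typescript
// Placeholder scraper: fetch one page and return its HTML.
// In a real project this is where your parsing logic (cheerio, puppeteer, etc.) lives.
async function scrapePage(url: string): Promise<string> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${url}`);
  }
  return response.text();
}

async function scrapeAll(urls: string[]) {
  const results: { url: string; html: string }[] = [];
  const failures: string[] = [];

  for (const url of urls) {
    try {
      // One weird page should not take down a run that has been going for hours.
      results.push({ url, html: await scrapePage(url) });
    } catch (error) {
      // Log it, remember it, and keep going.
      console.error(`Failed to scrape ${url}:`, error);
      failures.push(url);
    }
  }

  console.log(`Done: ${results.length} succeeded, ${failures.length} failed.`);
  return { results, failures };
}
```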

2. Chunkify

Break your long-running functions into smaller chunks. AWS Lambda forces this upon you with a 15-minute max time limit. This makes you move the state of your function outside of your code. Store it in a database, where it belongs. You’re a lot more protected from something breaking in your function this way. The database will keep track of where you are.
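
A rough sketch of the chunked version, assuming a hypothetical `ScrapeStateStore` with `getCursor`, `saveCursor`, and `saveResults` methods backed by whatever database you use. Each invocation, whether it’s triggered by Lambda, a cron job, or by hand, picks up where the last one left off, because the cursor lives in the database, not in the function.

```typescript
// Hypothetical persistence layer; swap in your actual database client.
interface ScrapeStateStore {
  getCursor(): Promise<number>;                     // index of the next page to scrape
  saveCursor(index: number): Promise<void>;         // persist progress
  saveResults(rows: { url: string; html: string }[]): Promise<void>;
}

const CHUNK_SIZE = 100; // small enough to finish well inside a 15-minute Lambda window

// One invocation = one chunk. Schedule this every 15 minutes (Lambda, cron, etc.)
// and it behaves like one long run, minus the risk of losing everything at once.
async function scrapeChunk(db: ScrapeStateStore, urls: string[]) {
  const start = await db.getCursor();
  const chunk = urls.slice(start, start + CHUNK_SIZE);

  const rows: { url: string; html: string }[] = [];
  for (const url of chunk) {
    try {
      const response = await fetch(url);
      rows.push({ url, html: await response.text() });
    } catch (error) {
      console.error(`Skipping ${url}:`, error);
    }
  }

  // The database, not the function, remembers where we are.
  await db.saveResults(rows);
  await db.saveCursor(start + chunk.length);
}
```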

Transcription:

Hello there. My name is Jordan Hanson. I’m from Cobalt Intelligence. I am a web scraper, and I’m here to talk about some best practices you should follow when you’re trying to web scrape large amounts of data. They go hand in hand, data and web scraping. You can almost consider yourself a big data guy, because you’re going to be using web data.

Lots of it. Most of you web scrapers are using a lot of data. So I want to go over some best practices for handling this huge amount of data when you’re web scraping. Normally that means you’re scraping a bunch of pages and you’ve got to save the data somewhere, whether that’s in a file or a database or whatever.

I just want to go over some of those things. There are two items I really want to touch on. Now, I’ve done this a million times, and if it’s just me that does this, I would be amazed. I am sure you do it too, so don’t pretend like you don’t. Item number one: you’re going through here, you’re building out this function, and you start to do the math.

Okay, it takes about five seconds per page, and I’ve got about 10,000 pages, so it’s going to be running for a while. In fact, let’s do the math. Say we have 10,000 pages and it takes five seconds each. How many seconds are in a day?

Alexa, how many seconds are in a day? One day is 86,400 seconds. Okay, 86,400. Actually, let’s say it takes 20 seconds per page, so it’s a little bit longer. 10,000 pages at 20 seconds each, that’s like two days, three days. Okay, so we are going to do this. You’re going to set it up, you’re going to start it running, and everything’s going to be sweet.
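
For reference, the back-of-envelope math works out like this, using the numbers from above:

```typescript
// Back-of-envelope runtime estimate using the numbers above.
const pages = 10_000;
const secondsPerPage = 20;
const secondsPerDay = 86_400;

const totalSeconds = pages * secondsPerPage; // 200,000 seconds
const days = totalSeconds / secondsPerDay;   // ≈ 2.3 days
console.log(`${totalSeconds} seconds ≈ ${days.toFixed(1)} days`);
```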

Now, anytime I see something that’s running longer than like an hour, I do something different than I used to. You’ve got to save that thing to a database or to a file or whatever instead of just letting it run. I have done this so many times where it’s running and I’m just like, I’ll just leave it.

I come back a couple of hours later and there’s some freaking error that broke it, and it makes me so freaking mad. Which brings me to my first item, which is: you need to handle exceptions. You’ve got to handle exceptions, try/catches, right here. See, look at this. I didn’t even mean to have this page open.

I wasn’t going to use this as an example; it just happened to be here. But look, try/catch. This right here is something that is not critical for my page. It is nice to have, but when you’re handling large amounts of data, it’s okay, at least most of the time, to skip one or two records.

If something’s weird with them and they break, what you don’t want is to put your whole application, your whole web scraping script, at risk. So you handle your exceptions and put try/catches around them. You don’t have to throw; just log it in the catch and then let it continue.

So you have your loop going forever, and it fails for some reason, maybe it’s just a connection issue, and you’ve handled the exception with the try/catch and you can continue on your merry way. Otherwise it’s going to hard fail like three hours in and you’re going to kick yourself because you just lost three hours.

The second thing: at first, when I used AWS Lambda, I hated the fact that it had a 15-minute time limit. It would limit you to 15 minutes and that felt like it sucked. I was like, I have scripts that run for hours. Why would I want to cut them down to 15 minutes? But it makes your code so much better.

If you have to chunk it out, it forces you to move the state of the scraper somewhere else. You can schedule it to run every 15 minutes, so it’s essentially running forever, but it forces you to move the state out of the function.

If you leave the state inside the function, you run the risk of losing everything when it eventually errors, and it probably will, unless you’re amazing at handling exceptions. With the state outside, it’s okay, because you can pick up right where you left off. You get a hundred thousand records of your million and it errors, but that’s okay, because they’re all saved in your database and you can go back and retrieve the rest.

So the second part is you should break it up into smaller chunks. Now, if this is on Lambda, that’s great; it’s going to be forced on you. You have to do it in 15 minutes. If it’s not Lambda, you should still be breaking this function into smaller chunks, whether you do it with a cron job or whatever, because otherwise you’re asking for trouble. The longer a function runs, the more likely something’s going to happen.

It’s like a law, call it Jordan’s law: the longer a function runs, the more likely something’s going to break. That’s it. That’s the law. I just made it the law. The law is: it’s going to break. So break it up into smaller chunks.

In the end, that forces you to save your data and keep the state outside of the function, so when it breaks you don’t have to keep track of where everything is yourself. The database should do that for you, not your function. Those were the two best practices, I think, for handling a large amount of data when you’re web scraping. I run into them time and time again, and I hope they help.

We’ll see you.