Two Reasons Why Your Web Scraper Isn’t Working on AWS Lambda

This is the classic “It works on my machine” problem. You’re building your web scraper locally on your development machine, but when you go to run it on AWS Lambda, it breaks.

This video goes over the top two things you should check that aren’t always apparent. They certainly have caught me a number of times.

1. Check your memory usage! This error is not always apparent, especially if you are trying to look through CloudWatch Logs. It’s a frustrating error because your function will suddenly “hang” with no apparent cause. So check your memory! If max memory used is the same as the memory allotted for the function, you gotta bump it up!

2. You need a proxy. When you web scrape from your local machine, you’re coming from a residential IP address. AWS Lambda runs from cloud IP addresses. Those IP ranges are known, and bot protection is a lot more aggressive against them. If your scraper isn’t working on AWS Lambda, throw in a proxy just to check if that’s the problem.
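As a rough sketch of the memory check in item 1: Lambda writes a REPORT line at the end of each invocation that includes Memory Size and Max Memory Used. The Python below parses one such line and flags when usage is close to the limit. The function name memory_headroom, the 90% threshold, and the sample line (with a placeholder request ID) are assumptions for illustration, not something from the video.

```python
import re

# Matches the memory fields of the REPORT line Lambda logs after each invocation.
REPORT_PATTERN = re.compile(r"Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB")


def memory_headroom(report_line, warn_ratio=0.9):
    """Return memory figures from a REPORT line, or None if the line has none.

    warn_ratio is an assumed threshold: usage at or above this fraction of
    the allotted memory is flagged as near the limit.
    """
    match = REPORT_PATTERN.search(report_line)
    if match is None:
        return None
    allotted = int(match.group(1))
    used = int(match.group(2))
    return {
        "allotted_mb": allotted,
        "used_mb": used,
        "near_limit": used >= allotted * warn_ratio,
    }


# Sample line with a placeholder request ID and the numbers from the video.
sample = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 3000.00 ms\tBilled Duration: 3000 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 121 MB"
)
print(memory_headroom(sample))
```

If near_limit comes back true for a function that hangs without an obvious error, raising the memory setting is the first thing to try.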

I personally use ScrapingBee and BrightData. Here are my affiliate links – https://www.scrapingbee.com?fpr=jordan-hansen52, https://brightdata.grsm.io/jordanhansen6276
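A minimal sketch of the proxy test from item 2, assuming Python with only the standard library. The PROXY_URL environment variable, the helper names, and the default target URL are hypothetical; substitute whatever proxy endpoint your provider gives you.

```python
import os
import urllib.error
import urllib.request


def proxy_settings(proxy_url=None):
    """Build the scheme-to-proxy mapping; an empty dict means no proxy."""
    if not proxy_url:
        return {}
    return {"http": proxy_url, "https": proxy_url}


def make_opener(proxy_url=None):
    """Return an opener that routes requests through proxy_url when given."""
    handler = urllib.request.ProxyHandler(proxy_settings(proxy_url))
    return urllib.request.build_opener(handler)


def fetch_status(url, proxy_url=None, timeout=15):
    """Fetch url and return the HTTP status code, or an error description."""
    opener = make_opener(proxy_url)
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 403 when bot protection blocks the request
    except urllib.error.URLError as err:
        return f"error: {err.reason}"


def handler(event, context):
    """Hypothetical Lambda entry point: compare direct and proxied requests."""
    url = event.get("url", "https://example.com/")
    proxy_url = os.environ.get("PROXY_URL")  # hypothetical variable name
    result = {"direct": fetch_status(url)}
    if proxy_url:
        result["proxied"] = fetch_status(url, proxy_url)
    return result
```

If the direct request is blocked on Lambda but the proxied one succeeds, the cloud IP address is the likely culprit rather than the scraper code.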

Transcription:

Hello there. My name is Jordan Hansen. I’m from Cobalt Intelligence. And right now I’m gonna talk about the top two reasons why your web scraper is not working on Lambda. Now, Lambda is AWS’s product, and it does a really good job for things like web scraping, because you can do it for a small amount.

It does the job you’re working on and it makes it really easy to do. You deploy your function to AWS, and then AWS will just run it in the cloud for you. This is really handy. I mean, some people are just doing their web scraping from their house. I understand that; I’ve done that a lot. But it’s really handy when you want something in the cloud.

Sometimes you’re going to go on vacation or whatever, and you want something there that you can even automate. Really powerful, but I’ve run into some things that really have been painful. And a lot of times when I’m running into them, I can’t figure out what the freak is going on. And these are probably the two most common ones that bug me, because it’s not easy to tell what the problem is.

So we’re going with the top two. Now, the first one or the second one, whichever one of them it is, I’m going to show here. This is one of my videos. It’s called Puppeteer on Lambda. I made it last year. Now this video, I know I can’t really zoom in on the video. I don’t know how to do that. At least, like, the zoom in, right.

It doesn’t do anything. Oh, nope. Maybe is that bigger? This is really weird doing this. Maybe it’s bigger right now. I don’t know. Okay. Look, it’s like we’re on mobile. Okay. Anyway, I’m sitting here debugging. You can’t hear it, I don’t have the audio piped in, but I’m sitting here trying to debug and I can’t figure it out.

I’m sitting here on this video and I’m like, oh, why isn’t this working? I’m getting this error here. And this was part of the problem. You don’t always notice this. And the worst thing is, look right here, memory right here. So I normally use 128 megabytes for most of the scraping I do. If you’re using large datasets, if you’re downloading any files and you’re parsing them, or if you’re using Puppeteer, some kind of headless browser, you’ve gotta be bigger than 128 megabytes.

And it says it right here. The problem is, in the logs, it doesn’t say it. Not in red, at least. You have to look at the bottom. Here, let me give an example of logs, what it looks like. Let’s go over, one second, one second. Go over here to CloudWatch. Let’s see. Okay. I can’t type.

Okay, well look, we’re going to CloudWatch to look at the logs. We’re gonna go over here and say Log groups and let’s just say whatever. We’ll see. Gulf war. So here’s a log thing right here. Oh, this is really small. Bigger. How about that? That’s visible, right? Can you, can you, I couldn’t be that a mobile area.

There we go. Go down to the bottom. It says END RequestId, even the one before that doesn’t even say it. Where is it? Come on. There, there, see right there, it says Memory Size 128, Max Memory Used 121. It’s not very clear. So a lot of the problem is your code will just stop. It doesn’t say anything.

So all of a sudden your code, you have all of these logs and you’re trying to debug, like, what the freak is going on, and it just hangs, it just sits there, and you’re like, what is going on here? Why isn’t this working? And it’s such an easy fix, because you just always double-check your max memory being used.

If your memory size is 128 and max memory used is 128, this right here, this is getting scary to be already 141. That’s kind of close. If it’s there, you need to just increase it. It’s so easy. It’s not hard to increase your memory, but it’s easy to miss. So number one item: check your max memory used. With the default one, the function could be working for weeks, and then suddenly the files are getting bigger or whatever, and then max memory used is going up.

So that’s item number one or two. Anyway, I think this is one of the biggest items, and as you get better at web scraping, you should think about this more often. But what’s going to happen is you’re going to go to code. You’re going to be coding locally. You’re going to be developing and testing it locally and it works great.

Everything comes out perfect. And then you send it up to Lambda and it doesn’t work. That’s super frustrating. And when you’re new to Lambda, all you’re going to think is, I hate you, Lambda, why the freak aren’t you working? So what we have to remember is that Lambda is coming from AWS. It’s a range of IP addresses that says, hey, this is a cloud service.

So it’s not uncommon for people to be more aggressive with anti-bot blocking against those IP addresses. So an easy test is to say, hey, if I put a proxy in here, does it work? That’s really what you should be doing. Your local development is going to be better because you have residential IP addresses.

Once you send it up to Lambda, you’re using cloud IP addresses and they get blocked easier. They can recognize it’s a bot easier. So I realize it’s easy to get frustrated here. You send it up and it doesn’t work. You think, oh, that kind of sucks. It really is just that the anti-bot protection is going to be more strict against it.

So test with the proxy. Pretty much, that’s always going to be the problem. If it works locally and you put it on Lambda and it doesn’t, check your proxy or check your max memory used. So there are proxies I use. I’ll have the affiliate links in the description. Use mine, don’t use mine, I don’t care. It really doesn’t give that much to me.

But a really useful one: ScrapingBee is probably my number one right now, and then Bright Data, and they both have goods and bads about them. ScrapingBee, I did a recent comparison video. It’s going to be the best, probably the fastest as well as the most reliable for error rates. Anyway, you can use them.

Those are the two top items that most web scrapers are having struggles with when they go to AWS Lambda. Hope that helps you.