Sometimes using a proxy as an API just doesn’t cut it. You need to submit cookies in your request, append them to your headers, and manage your app state. That’s incredibly difficult with a proxy as an API.
Fortunately, ScrapingBee has a proxy mode! https://www.scrapingbee.com/documentation/proxy-mode/
Using this mode, you can easily do web scraping with something like Puppeteer AND have all of the state carry along with you.
GitHub repo – https://github.com/cobalt-intelligence/using-scrapingbee-for-complicated-scrapes
ScrapingBee Affiliate link – https://www.scrapingbee.com?fpr=jordan-hansen52
Hello there. I’m Jordan Hansen from Cobalt Intelligence, and today we’re talking about ScrapingBee and using it for complicated scrapes. Now, there’s something very popular and very hyped right now, a really cool thing called a proxy as an API. And that’s what ScrapingBee is. It’s a proxy as an API, and these make it incredibly easy for people to scrape.
All you need to do is pass in your URL as a query parameter, and then it’s super easy to just continue your scrape. It gets more complicated, though, when you want to do any kind of form submission, or if you want to log in, because then you have to maintain the state yourself. When you log in, you pass in the headers, you pass in your credentials, and they send back something like a cookie.
And that cookie, you have to attach to the header, and you have to maintain that yourself. You have to keep it in your state. If the form is more complicated, like it has a view state or anything like that, using a proxy as an API can be kind of a pain. And that’s really why I’ve used Luminati: they make it easy to send your requests through a normal, standard proxy (proxy mode, or whatever you want to call it) rather than through a proxy as an API.
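To make "maintaining the cookie yourself" concrete, here is a minimal sketch of the bookkeeping involved. The header values and cookie names are made up for illustration; they are not taken from any real site or from ScrapingBee.

```typescript
// Minimal sketch: carry cookies from one response into the next request by hand.
// The header values below are made-up examples, not output from a real site.

/** Take "name=value" from each Set-Cookie header and join them into one Cookie header. */
function buildCookieHeader(setCookieHeaders: string[]): string {
  return setCookieHeaders
    .map((header) => (header.split(";")[0] ?? "").trim()) // drop attributes such as Path, HttpOnly
    .filter((pair) => pair.includes("="))
    .join("; ");
}

// Hypothetical Set-Cookie headers returned by a login response.
const loginResponseCookies = [
  "sessionId=abc123; Path=/; HttpOnly",
  "csrfToken=xyz789; Path=/; Secure",
];

// Headers you would have to attach to every later request yourself.
const nextRequestHeaders = {
  Cookie: buildCookieHeader(loginResponseCookies),
};

console.log(nextRequestHeaders.Cookie);
```

With a proxy as an API, something like this has to run after every response that sets or changes a cookie. A real browser behind a standard proxy does it for you.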
So this is an example here. This site right here is the Iowa Secretary of State. And you can see over here, we’ll make a submission. When we submit that page, look, it submits the data. But inside of it, it has this view state, and this view state changes as you move around in the pages.
And so every time, you have to go collect that, update your request, and make sure it validates and everything matches. It is a pain and it is hard to do, and that’s why I pretty much always use Puppeteer for it. But I can’t use a proxy as an API here, because if I do, I would have to maintain all that state myself.
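Here is a rough sketch of what collecting the view state by hand looks like. It assumes the page stores it in a hidden input named `__VIEWSTATE`, which is the ASP.NET WebForms convention; the HTML snippet and the `search` field are hypothetical and not copied from the Iowa site.

```typescript
// Minimal sketch: pull a hidden view-state field out of the HTML and resubmit it.
// "__VIEWSTATE" is the ASP.NET WebForms field name; it is an assumption here.

/** Return the value of the <input> with the given name, or null if it is missing. */
function extractHiddenField(html: string, fieldName: string): string | null {
  // Find the whole <input> tag first so attribute order does not matter.
  const tagPattern = new RegExp(`<input[^>]*name=["']${fieldName}["'][^>]*>`, "i");
  const tag = html.match(tagPattern);
  if (!tag) return null;
  const value = tag[0].match(/value=["']([^"']*)["']/i);
  return value?.[1] ?? null;
}

/** Encode fields as an application/x-www-form-urlencoded request body. */
function buildFormBody(fields: Record<string, string>): string {
  return Object.keys(fields)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(fields[key] ?? "")}`)
    .join("&");
}

// Hypothetical page HTML; a real page would be fetched before every submission.
const pageHtml =
  '<form><input type="hidden" name="__VIEWSTATE" value="dDwtMTIz/abc=" /><input name="search" /></form>';

const viewState = extractHiddenField(pageHtml, "__VIEWSTATE");
const body = buildFormBody({ __VIEWSTATE: viewState ?? "", search: "pizza" });

console.log(body);
```

Because the value changes from page to page, this extraction would have to be repeated after every response, which is exactly the work a browser-driven scrape avoids.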
And so I never do it. I always use a proxy that can do this. Fortunately, ScrapingBee has something called proxy mode. I’ve been using ScrapingBee more often. I have a video coming up that’s going to compare and contrast all these different proxies, including ScrapingBee, Luminati (Bright Data, whatever), ScraperAPI, and ProxyCrawl.
And ScrapingBee is definitely up there. It’s reliable. Anyway, you’ll see the results. But for now, let’s show this right here. So if we go like this and run it without a proxy,
you’ll see, bam. So there’s my IP: this is Idaho. This is where I am. Good. And now if I go over here and turn on the proxy and sign in, we’re just going to make sure it works and make sure this is actually going to change our IP address. So we’re going to go over here. It should pop up. It said Idaho previously, and now it says Chicago.
So it’s going through the proxy. That’s great. Okay. Now, this is kind of weird to me. Port 8886 is supposed to be for HTTP addresses. You can ignore HTTPS errors and use it with 8886, but I would expect to be able to use 8887 for HTTPS. I didn’t have luck with 8887.
It kept failing. I’m not sure why. I have a question in to support about that. In the example they showed, they just ignore HTTPS errors, right here. I don’t know. I don’t really care, I guess, because we’re scraping. If you’re doing any kind of logging in, though, then you could definitely get your credentials plucked.
Now, I don’t know if there really is a big concern there, but I wanted to do it the right way, which seems like I should use 8887 and use HTTPS how it should be. But whatever. Now let’s see if it works. We should be able to open up this page. You’re going to see, bam, right there.
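For reference, the proxy setup described here can be sketched roughly as follows. This is not the code from the repo. The host name `proxy.scrapingbee.com` and the `render_js=False` password are assumptions to check against the proxy mode documentation linked at the top, and `buildProxySettings` is a hypothetical helper. Only ports 8886 and 8887 and the ignore-HTTPS-errors setting come from the walkthrough.

```typescript
// Minimal sketch of the settings a Puppeteer script needs for a proxy-mode connection.
// Host name and password format are assumptions; check the proxy mode docs.

interface ProxySettings {
  launchOptions: { args: string[]; ignoreHTTPSErrors: boolean };
  credentials: { username: string; password: string };
}

/** Build launch options and proxy credentials. 8886 is the HTTP port, 8887 the HTTPS one. */
function buildProxySettings(apiKey: string, port: 8886 | 8887 = 8886): ProxySettings {
  const scheme = port === 8887 ? "https" : "http";
  return {
    launchOptions: {
      // Chromium flag that routes all browser traffic through the proxy.
      args: [`--proxy-server=${scheme}://${"proxy.scrapingbee.com"}:${port}`],
      // Needed on the HTTP port; this turns off certificate checks, so avoid it for logins.
      ignoreHTTPSErrors: true,
    },
    // Assumed format: the API key is the username, request options are the password.
    credentials: { username: apiKey, password: "render_js=False" },
  };
}

const settings = buildProxySettings("YOUR-API-KEY");
console.log(settings.launchOptions.args[0]);

// With Puppeteer installed, the settings would be used roughly like this:
//   const browser = await puppeteer.launch(settings.launchOptions);
//   const page = await browser.newPage();
//   await page.authenticate(settings.credentials);
//   await page.goto("https://example.com");
```

Newer Puppeteer releases may use a different name for the ignore-HTTPS-errors option, so check the launch options for the version you have installed.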
Come on. It’s going. It’s a little bit slower with this. I don’t know if it’s because of the HTTPS errors or what. You can see it says “Not secure” there because it’s connected over HTTP. We may have to extend this timeout because it’s taking a little bit longer than normal.
Proxies are normally slower. Max concurrency limit? Oh, okay. Wait, I apparently had one still going.
I don’t know. My plan only allows a concurrency of one at a time. I didn’t think I had any others going. We’ll see what happens here.
I wonder if it’s still loading the previous one.
Yeah, that works. There we go. Bam. See, it worked like a charm and went through all those different steps. About that max concurrency: if you’re on the basic plan, they give you something like a hundred thousand requests, which is quite a bit if you’re not doing a huge amount of scraping. I just use this for part of my scraping, and the max concurrency is one.
And if it’s still loading one from the other run, then maybe it fails. Let’s try it one more time. Let’s do “admins”,
and the speed’s not too bad. It’s slower, but it’s not crazy slow. I think it’s probably going to be less expensive than Luminati, but I think Luminati will probably do a better job with this kind of scraping, judging on the speed. Got “pizza” in there. Yeah. Okay. I don’t know what happened with the other one, but there you go.
That’s how we do it. Not so bad. It’s proxy mode with ScrapingBee. It’s pretty cool that you can do that. I haven’t seen this in some of those other proxy APIs; I’m not sure if it exists there. I’ll have the GitHub example in there, so you can go ahead and use it and try it out. It’s super easy to try out, and hopefully it can help you with some of your more complicated scrapes where you’re submitting forms or logging in.