Scraping the Delaware Secretary of State – Part 1

This is part 1 of attemping to scrape the Delaware Secretary of State. The goal is to be able to submit a business name and then return the business details, all programmatically.

This video goes over a lot of great things but then ends up in frustration when something is going wrong with the captcha solver. The awesome thing is, shortly after the video I figured out the problem and my next video will show the fix!

Puppeteer extra recaptcha module – https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-recaptcha

2captcha link – https://2captcha.com?from=7390140

Transcript:

Okay. All right. Well, we had to take them live in production. As we said, this technical, we’re going to Delaware. Delaware’s up and ready to go. We have a business. Same somewhere. You want most movers? I think right. Type that we just hit do stuff right now. We’re going to be using the capture stuff. That’s fine.

Anything? Yeah. Right there, baby. Sweet. Business name. I said, Delaware, I think recapture is going to be the biggest part of it. And then we’re going to see what we can do from there. Um, now I’m going to be copying some stuff. We’re going to be using this cool package called puppeteer extra over here. It’s pretty great.

And we’re going to say this. Like that I think. And then we’re going to call this puppetry extra like that. Now I can go into how to do all this and I’m going to do right now. And then I’m gonna go over here.

Yeah. Like that. And then where he’s at function, we’re out. This function called async function. Get. Search business. I’m gonna call it search businesses. You gonna have a business name here, string, and we’re gonna have the browser. This is coming from puppet here. And then we’re going to say constantly Warrell and we’re going to go to this place right here.

And then we are going to say on this page equals await browser go to, uh, is that right now browser? That new page

always when we start this, actually I wouldn’t have a browser yet, but we’ll get one. Um, and then we go await page that go to URL. All right. We started this as quick as I said. We’re going to be handling Delaware. Delaware has recapture. We’re gonna have to handle that too. The goal with this is to kind of handle this.

No, not this, this, Nope. Not this either. How do I get home? Nice. There we go. I’m gonna do it here. There’s API. Um, where now we can scrape a bunch of different states, Delaware. It’s a big one. Very business friendly. We’re transcribing. Let’s go. So we searched businesses. We’re trying to find all the list of businesses, right?

So we’re going to go over here, went away to go to that page. And I want to just wait to see what we see here. This is where we check to see if the catcher pops up. And then we say await search for businesses and we business name, and we say, browser would need a browser. We haven’t, I haven’t yet. And it’s going to look like at this const puppeteer, this thing right here is what I want.

Which is kind of puppet your extra as like a puppeteer helper, Atlas false.

Yeah.

Crews in here are kind of thrown off by my start. I started here, let me say it started here. And I was trying to do something a different way to select. So now I’m trying to do it the better way the right way. Now, here we go. All right. So now. Let’s see what happens here. Oh wait. We should always close our browser at the end of our thing.

We’re just gonna see what we get. It’s pop-up puppet puppeteer. See what happens here.

This is what happens. All right. We’ve got something wrong with our cannot find module. Didn’t I saw this, apparently I did not need install this stuff.

Installed it, what happened? Puppeteer, puppeteer, extra puppet, extra plug and recapture. Always at the stealth one to get rid of it. I kind of right.

That’s what it is though. Huh? It’s failure and stuff. Yeah. Okay. So NPM muscle save up the tier. Extra login stuff.

We install and stuff. Let’s talk, let’s talk. And I’ll be running

now because I said had this false right here. It should pop open. So then broke already. That’s not good.

What function is it? What. No boy, I’m afraid about this documentation updates. Give me the seven or eight. Yeah,

I’m a reinstall puppeteer. I don’t want versus 10. What is it?

This is what I’m looking for. And it’s going to say Virgin. Okay. 10 with tens. I have it on the list. Why is this showing up? I there’s no documentation for it. Oh, that’s where the surprise. Okay. No new page now to have the type mode. What is going on? Okay. Well we have the type here. Oh, maybe it’s my pup, the extra thing, huh, bro.

And one spot. The problem here. This is a promise. I’m not waiting. Nice.

All right, here we go.

I pop up at a puppeteer, the search. We see what we see. Okay. So far, so good. No problem. Now we’re going to go through and find our deal. This is what we’re looking for. We’re in look for this field. It has a nice ID for us. So we’re going to say, wait, page dot,

where I really want to do a sit tight, right? I just want to type in their type of business name. I’m going to say, wait, page, click. And here’s where the thing happens. Maybe. I mean, it happens a lot when I’m just doing myself, so it’s going to happen now. Let’s try again. What’s this just working like 10 seconds.

Okay. See, I entered it there.

Did I? I didn’t see anything click did you, does it take to, okay. Look at this cache I a you think I’m a robot? I probably get worse. I think they get worse. I think there’s more. Okay. So I say, ums movers,

click it in there. Hit search,

click. It didn’t look like it clicked it. Right. You know, sometimes I see clicks not work very well for some reason. I don’t know why it’s something that’s been happening anyway. You can evaluate it. We’ll see. We’ll go. We’ll see what happens here.

pops open, enters it in there. No cookie, I didn’t see cookie. I know you didn’t see clicky. Okay. So that means we’re going to use the other, I’m just not going to fight this. I’m going fast puppet here. Quick, not working. And so we’re going to do is find this L they shouldn’t this evaluate option here was definitely there don’t even tell me it’s not there.

It’s not a viewport thing, whatever like this. I think, I think this is why, why wait, page dot evaluate, and it’ll be my selector right. Then I want my selector here. Or maybe I could just do evil. I wonder maybe this. And I say I element element.click.

No. All right. I’m making stuff up.

Let’s go back to version seven, we’ll say, evaluate, how did this page function like? Oh, I do I see this whole thing? I do, John. I give me an example here. It really fast.

Oh, interesting. This is what I would have thought.

Passive function in there. Hold on. I’ve done this before, but I’m so curious why it’s not working. Why isn’t it click? I mean, it didn’t even say couldn’t find it. It’s there content placeholder. Am I in the wrong button? Check that again.

I think that button.

I mean, what if I said, uh, I know this is teeny tiny. What if I said just, uh, type, submit. I mean, how many summits are there any one? Alright, let’s try that just for fun. Now. I know I can do it this way and it will keep looking why this goes over here.

Hey, there we go. And then capture showed up. How often is that going to happen? Let’s check our click work better with this. So Sylmar with my selector know about that.

okay. Let’s try it again. Now I go like this search. Awesome. Okay. Now we get to this page and now what we do is once we click here, we look for this I-frame is going to be something that holds it. I know it’s small, sorry, team to make it even bigger. Right? That’s what I’m looking for.

It’s the same blue thing right here. I’ll be like this. Now what happens if it’s not there? We’ve gotta remember that. What we do. If there is no recapture

right now, that’ll fail. We’d probably be like a try-catch here, but I don’t want to wait a full five seconds if it’s not there.

I’ll think about that. And then we go like this

and then it’s going to solve it.

And then after it’s solved, I think we’ve got to click this button cause this button is not part of it. We cook it again. Same, same type of thing. How many of these are there? Damn. I just want right. All right. Here we go team.

But I think it didn’t save my search queries. Right. Didn’t it. Well, we’ll see. Yeah.

Oh, look, that means white isn’t purple. I mean, solving. Wait, what we need to log here, I guess.

Okay. Catch it is salt

before click

after click.

Let’s try it again. Now we just like clicked super fast. It says somewhere purple violet means detective green equals salt. It definitely was not solved yet.

so you look at it, clicked it so fast. Like it did this, but it wasn’t done.

Okay, I’ve done this before. Obviously we’re going to go back to the, one of the experts on this right here, the JavaScript web scraping. Okay. Let’s see. Puppeteer avoid being blocked. My, this, there we go.

What did I do over here? I was in Zillow and I went over here. I detect the capture.

Hap right here. Where’s our salt right here. Wait for selector. We make sure it’s there and then we solve it. Okay. Then I do some like this. Huh? It doesn’t have a gate. It again. No, let’s see what this is. So let’s say solves. It goes to the capture. It solves it. All right. Well, we’ll do what it says. Let’s try that.

What did we say? I mean, it worked there, right? Why don’t we believe a little bit here. Have some faith, Jordan, come on. Let’s see what happens, right? We’ll give it a chance. Maybe it’ll work.

you guys waiting? So anxiously. Okay, look, that’s going. Oh, did automatically go. That’s good.

Now we’re back. It worked just okay now, but the problem is it didn’t resubmit our thing. It just took us back to the original forum. So what do we do here? It’s like, we need to call

this thing again.

What if I say he ain’t go return

the same thing, wait to search for business, business name. Roger. If I know if we can accomplish, right. Which essentially just want to start over. It’s going to do a new page, that color shouldn’t color a cookie. Our browser should be there. I think we want to kill our page though.

I’m not close that page. I just want to do this stuff again. No, I just want to do this again. Is that, well, I have this thing. I ain’t changed that. Okay. Now what if we do that? They don’t want to go like this again. Now the question is if a pops another capture, I mean, if we can put a loop or wait to Recurse anyway, this is like, assuming that will happen at once.

Let’s see now, um, I am seriously worried when this doesn’t show up, but let’s assume it’s gonna show up every time we’re going to be coming from a, some thing. That’s obviously a robot solving it, solving, solving. That’s pretty fast too, by the way. Good job capsule solvers.

Okay. I didn’t hit submit. It’s going to go in there and type it in and then not do anything.

so let’s take a little longer. It’s okay. It’s violet. It’s detected, it’s working calm. I know I’m thinking 30 to 45 seconds,

but we’ve all seen the sketches. They can be pretty crummy. They’re probably getting the ones with like the crosswalks and stuff. Like maybe the bikes on it. Maybe it’s not, maybe the bike is part of the crosswalk. We don’t know. They have to click the buttons. Who knows? Boats, fire hydrants. We’ve all been there.

Oh, timed out. I didn’t look at this air over here. What happened? 30 seconds. Okay. Okay. All right. I can increase that time out. Can say, can I say, wait until, yeah, maybe look at this and I say, I’m going to say, let’s make it 45 seconds. I mean, oh, I didn’t know what that meant. Right. That’s a long time. Yeah.

And then I want to take, I want to sit, save the submit button.

I how long it was timed down. I’m just like, wait for it. Okay. Well where maybe it wasn’t working at all. Okay.

Where you went back to the same thing.

You see that and went back to the same thing. That’s what I was worried about. If it happens more than once, then how many times would we solve it? If we solve it twice,

that’s awesome.

Solve it twice. Went back there. We clicked the button and wouldn’t happen again. So we had to be able to Recurse here. So we have to be. I think all this has to be its own function. This

like this, my mind knows what it wants, even if like my body knows what this needs to be. Even if something like this watch is the only thing at this refactor, I’m going to say Nope, module scope, and we’re going to call it a handle. Search and

come on certain capture. It’s like almost like search and rescue. Okay. It’s not that good. Now we’re going to go through here and it’s going to go to the page. That’s going to type in the stuff there should be.

Yeah. I want to import it from there. Okay, it’s in go to the page. It’s gonna type it in. It’s gonna click submit. It’s going to wait to see if there’s a capture there. Now I need to be able to, we need to be able to handle the alternate pass. I think this is going to be critical to it. Well, what if we go look for the positive on it?

So, okay. I’m not a robot. Oh gosh. How bad it’s going to chimneys that wasn’t bad.

All where are you at miss movers? I bet that gets me again. Watch.

Okay. Hey didn’t okay, so this right here was this here before? Ooh, I don’t know. Capture div, come on bro. Cable results.

If we get a result back now, what happens if it doesn’t find anything weapons? I go like this,

it’ll say no records found. So we have options here. Free pass. Can’t find results,

results. Workers found.

Right. And I want to look for that air

and then we’re going to call here

like this now what’s our second path is another happy path. It’ll say.

Capture traffic lights crosswalks to that. Okay. Oh, I do want to look at this table. Results is always there. There’s always a body is always a row. Maybe we look for what type of results.

T R N a type two, because look, there’s no two right here. Write down results.

That’d be something like this.

And I put this in

the show that this is a selector. Um, will exist. Now we should look to see if this exists and it’s just hidden or anything. We want to make sure it is there, but there’s no stuff in it or is it hidden? What’s the thousand, this baby it’s puppet here. Pick it up if it’s just hidden.

Oh, look at that. I’m trying to work in this tiny space here. A plug. I don’t have four monitors. Okay.

It’s there. And I say, display look, display type. It is display. Block is teeny. So it’s going to be there all the time.

I’m going to say, well, we can check with content to see something’s in there. Right. So we say, Hey, is there, is there a text in there? Okay. It found results or captions displayed three.

Okay. Now let’s try them. So if I go here, I type in my, I wish this auto had some auto bumpy in here, so it’s a type this every during time.

Okay. So we got path number two for this. This is happy path. This is most when we hope the most happens. So they click the button and I go like this, and I say, um,

what if I just look for the link? Yeah, yeah. Cable Rus. I say await page dot dollar, dollar, dollar dollar. And I just say like this, I’d say if.

Can I also go like this? What if we do capture check first,

but I feel like I should follow this, which, um, would that be? Niagen put them do whatever. Want this SI one, two. All right. There we go. So first, when I hear now, we’re going to say if table rows dot length is greater than one. Cause the one is the header row. The first row is the header row

got results. Yeah.

Best ever. I think it’s good to applaud ourselves and we found it. Alright. So table rose leaves one, if it’s greater than one. Awesome. Now the other thing is else. If now this would be just like, Hmm, okay. Now we go like this. We could say can’t find results. And we go like this. Um, Air message equals await, paged up a Val.

I’m going to grab my little selector here.

Want to say element, element dot text content. Now, if it equals empty string ELLs, if there’s no error message, because it’s an empty string.

I’m going to say, what if I sit like this?

It’s funny that it doesn’t clear that to me, I’ll even call it trim error message dot trim equals equals equals no records found.

No, no capture, but we didn’t find anything search.

Okay. I guess I gotta just put that in the comments. Is that I feel like that’s a common, this could say I’m going to escape. Cause I’m just stubborn about my single posts. I wouldn’t like, I would like to know our search query is missing. Yeah. Right. Yeah. Let’s see what we got. And then I would say a third.

We look for this, this

capture. We’ll wait. Now we can’t avail this. If it’s on the page, if we go to the capture where this whole fail. Watch we’ll do it. Um, cause this will probably happen. No, uh, it just means we can’t do that shortcut. The was kind kinda bad for that.

No, I say Ellis. If captia

it’s solving time.

I can put my thing in there, whatever.

Um,

now could there be a third path from her? A fourth? I’m not handling it. I have no one else here. All right. What happens if, what happens if it’s of those three,

go, go like this for a capture,

but then this is where we call it again. I always say return await search because our handle search and capture. Searching. Yeah. Gotcha. And what did we call this other one?

The page page want, huh? Yeah. Like that. Okay. That makes sense. Let’s go through a step through it again. All right. We’re in ticket. Check out three ones. We have table row. We took her table rows. This is going to air. If we do this in a pile, they are assuming we’re. If I could see right now, Because as soon as it does plays a capture, this is going to fail because after clicks anyway, we’ll check that.

And so if it hits the capture, it solves it and then it goes here again, it calls us wind and it’s going to head down a few paths again. And so that means it can handle the capture again. Okay. Let’s go apply. Fail every time. Wait, will this happen? There’s no way to stop it. We probably need to keep it anytime you’re recursing we’ve gotta have a way to exit out, like, okay, we have failed five.

We don’t want to sit here solving capsules into eternity.

Damn good luck. It didn’t work because I tried to find this and it couldn’t. So we got this error message. We’re going to like this

because that element is going to be there and I’ll have to get the text out of element. I can’t remember this. Look over here. I had a normal way to do this as say, if I go like just dogs then yeah. That

the air methods has to exist. How do I do that? I want to see if this exists, this will not exist. So I was going to check here first, but wait one to find any

how oh, because. This will be undefined or empty array or something because I tried to search it and find me. So this is going to try to find this, and now we’re going to say error message dot. Now we get an element. It returns an element handle, so, and defined as it possibly defined. Let’s go back. Where’s the return.

This returns.

I don’t know what that question mark means. Oh, no element over eternal. So now we can say if error message and now we have to get the text out of there. So does element handle, we figure out we get the text out of here.

Um,

well we, don’t what we do. If we want to get it from the exact same one. Kinda evaluate here.

Yeah. Like that. That’s why I want an air message dot evaluate,

sure. Element element in her texts. She gave me some stuff here.

Why don’t you like about this? All right, I’ll wait, this,

I’ll take this. Yeah, this is what I don’t like doing. Okay. So now we’re going to evaluate, okay. Let’s do happen, so, okay. Does that make sense? We have our element handle. Now we’re going to evaluate this stuff and hear the inner text and we’re gonna see if he equals. Eric, no work is found and we prepare for the eternal capsule solving.

Are you ready? Okay. It failed. It’s good. Nice. Very nice.

yeah, we navigate did which one died?

Okay. Help me cousin navigate, uh,

No, unless it didn’t wait

for navigation.

I just say time out. I just say like, uh, the only thing, I mean, it could be other things, right? I’m not all knowledgeable here, but it clicked it. And then it said execution and context with the story. What’s almost made me think it started collecting this stuff. So it probably does, you know, has to Navy navigate.

This is what happens if you put a one second time out now, one second time outs are bad. They’re bad. Why are they bad? Because why would you want a hard-coded time? What if only takes half a second? You don’t want to wait an extra half, a second

to worked. Okay. Oh boy, terminal capsule solving time

has never got to trust me.

It’s discouraging to hear. Yeah, this is just what I need to solve captured, like, you know, picks like what a minute per request.

No. Okay. So we’re not making progress here, obviously. Oh, that’s frustrating.

Okay, let’s do this. Let’s make sure we’re properly

like go wait a full 45 seconds. Like let’s make sure it’s supposed to turn green when it’s solved. And so let’s make sure it’s properly solving. What if were hitting? What if we hit the submit button before?

Mountains or Hills, like what just happened?

What.

wait, what? My hitting the what’s going. Wait, what

so confused right now.

didn’t I say, wait for time out forty-five seconds, but it just like instantly progressed.

What happened?

It should hit for a second for 45 seconds then before it does anything.

You come over here and start doing stuff and it just carries on.

Yeah, just sits here. Then you hit submit then refreshes.

Best for a second time out.

Now does this mean we didn’t solve it properly? All right. I’m going to have to dig a little bit and think about this. So we have, here’s the problem right now? We don’t know is our capsule solver working. I mean, I’ve, I’ve used it quite a bit. It’s always been really reliable in the past. I’m not a hundred percent.

I don’t know. So I had to test that. Let’s see. Is it actually solving the capture? I mean, it says right here, it’s going to turn green.

Yeah. Wait, do I have visual feedback on? Yeah. So it should turn green when it’s solved.

Yeah, sure. To an agreement soft. Um, I didn’t see it turn green after 45 seconds. What if it going? We give it 90 seconds. We’ll stay here for a minute and a half. That’d be very exciting. Anyway, this, this is one look at, I’m going to try it out for 90 seconds. See if it’s actually solving stuff, if it’s not, is the problem of solving, but somehow am you, can’t push the button.

Can you, I guess that’s why you need to test, like what happened to the victim over here? Oh, what just happened?

It’s the freaking work.

Yeah. Whatever, maybe actually solved at that time because it waited so long. So to say it search. You bring me to cash. I hit submit. Oh, look back. That’s what I’m worried about is happening here. It’s waiting for navigation. Okay. I gave it a hard time out. It doesn’t know it’s finished.

Okay. I think I got, so I think it just, it hasn’t finished it. Well, the problem is.

How’s it going to know? How does it know when it’s finished?

okay. Wait, hold on. It’s hitting the button before it’s finished solving.

what was over here? Searches. It’s the capture it’s solving now, but like it’s waiting now. So how does it know navigation? If it wasn’t finished it wasn’t finished there. Like, in fact, it’s still spinning up here.

As soon as it completes complete, it hits the button that moves us over. Yeah, that’s why it’s happening. Okay, good. I need to work on this. I’m turning off a pause and this is part one. We knew this was gonna be a trial slogging through let’s go.