Long polling on AWS Lambda and API Gateway

AWS API Gateway has a 29-second timeout.

Getting Secretary of State data from some states can take longer than 29 seconds. This video shows the basics of how we use long polling to still get data from those states.

I use AWS’ Lambda, API Gateway, and DynamoDB here.

The example used here is with the Secretary of State API – https://cobaltintelligence.com/secretary-of-state

Transcript:

All right, here we go. We're talking about long polling with AWS Lambda and API Gateway. Let me give some background. I'm Jordan, Jordan Hanson, from Cobalt Intelligence. Part of the services we offer is a Secretary of State API, which is essentially a big scraper for all the Secretary of State sites.

You can just type in a business name, like this one right here for Idaho, and try it. It goes out and scrapes that site, which is pretty awesome. Everything goes through AWS: the request hits API Gateway, then goes to Lambda, and Lambda runs the scraper for that state.

Now, I had a problem with states like Delaware, because Delaware takes so long. Delaware, Texas, anywhere with a CAPTCHA is going to take 30 seconds or more, and API Gateway, as I found out, times out after 29 seconds. That's the longest a request can go. So what I built here is long polling, and it works pretty well.

In fact, I'm really happy with it. I still have some more stuff I want to add, but anyway, I want to walk through how this works. I built a table in DynamoDB that I called SOS search long poll. The way I have this built, one Lambda function queries all the other state functions.

So you send in a request: okay, I want to search for this business name in Delaware. It comes in here, gets the state, gets the search query, and then makes a call to the Lambda function for that state, passing the search query. So I have, like, a Delaware function that goes and does the work and returns the data here, and then we return it back to the user. That works really well. This one is the API Gateway part, and the other one is just Lambda.

Now the problem comes in when we have something that lasts longer than 29 seconds, which is a lot of them, because it has to go scrape.

I'm using Puppeteer to go over there, solve the CAPTCHA, and get the data, and Delaware can take like a minute and a half sometimes. So that's where long polling comes in, and these are the steps we take. First, I add a tag on these states in Lambda. Let's go to Delaware right here.

I went to the configuration, added a tag right here, and called it needs retry, set to true. So the code gets the function data right here and then checks: hey, is there a needs retry tag in the tags? If there is, it starts an asynchronous Lambda invocation.
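The tag check described here can be sketched as a small pure function. This is only an illustration: the tag name `needsRetry` and the helper `chooseInvocationMode` are assumptions, since the transcript does not show the exact identifiers. The Lambda GetFunction response does include a `Tags` map, which is the kind of object the function below expects.

```typescript
// Tag map as returned alongside a Lambda function's configuration.
type FunctionTags = { [tagName: string]: string | undefined };

// Assumed tag name; the video only shows a "needs retry" tag set to true.
const NEEDS_RETRY_TAG = "needsRetry";

type InvocationMode = "RequestResponse" | "Event";

// Slow states carry the tag and are started asynchronously ("Event").
// Every other state is invoked and awaited normally ("RequestResponse").
function chooseInvocationMode(tags: FunctionTags): InvocationMode {
  return tags[NEEDS_RETRY_TAG] === "true" ? "Event" : "RequestResponse";
}

// Usage: a tagged state versus an untagged one.
const delawareMode = chooseInvocationMode({ needsRetry: "true" });
const idahoMode = chooseInvocationMode({});
```

Keeping the decision in a tag means a state can be switched to long polling from the Lambda console without changing the router code.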

Now this was weird to me, because with most of them we hit await, which means it's going to wait for the function to finish. But apparently if you pass this invocation type, Event, it doesn't wait anymore. It just starts the function, returns really quickly, and then the code continues, which worked perfectly for me.
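As a rough sketch of what that call looks like, the block below builds the parameter object for an asynchronous invoke. `FunctionName`, `InvocationType`, and `Payload` are real parameters of the Lambda Invoke API. The function name `sos-delaware`, the payload fields, and the helper `buildAsyncInvoke` are hypothetical.

```typescript
// Subset of the Lambda Invoke parameters this sketch needs.
interface InvokeParams {
  FunctionName: string;
  InvocationType: "RequestResponse" | "Event";
  Payload: string;
}

// Build the parameters for a fire-and-forget invocation of a state scraper.
function buildAsyncInvoke(functionName: string, retryId: string, searchQuery: string): InvokeParams {
  return {
    FunctionName: functionName,
    // "Event" queues the invocation and returns right away (HTTP 202).
    // "RequestResponse" would hold the call open until the scraper finished.
    InvocationType: "Event",
    Payload: JSON.stringify({ retryId: retryId, searchQuery: searchQuery }),
  };
}

// Hypothetical function name and search values.
const delawareInvoke = buildAsyncInvoke("sos-delaware", "retry-1", "superior building services");

// With the AWS SDK for JavaScript v2 this object would be passed as
// `await lambda.invoke(delawareInvoke).promise()`. The await then resolves
// once Lambda has accepted the event, not when the scrape completes.
```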

So what I do is start the function. It says, okay, Delaware, start working. In the meantime, I make an entry in my database with a retry ID, and then I return that retry ID to the user.

I just realized this needs to be returned as an API response like this. Otherwise it's not going to work for anyone calling from outside. I'm going to update that right now. Well, whatever, I think that takes too long here, but I'll go and update it. So it returns this retry ID to the user.
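A minimal sketch of this step, using an in-memory object as a stand-in for the DynamoDB table so that it runs on its own. The field names, the ID scheme, and `startLongSearch` are assumptions for illustration. The order (start the worker, then write the entry) follows the description in the video.

```typescript
// One row in the long-poll table. Field names are assumptions for this sketch.
interface SearchEntry {
  retryId: string;
  state: string;
  searchQuery: string;
  complete: boolean;
  result?: unknown;
}

// In-memory stand-in for the DynamoDB table, keyed by retry ID.
type SearchTable = { [retryId: string]: SearchEntry | undefined };

let idCounter = 0;

// Hypothetical ID scheme; any value that is unique per search would do.
function newRetryId(): string {
  idCounter += 1;
  return "retry-" + Date.now() + "-" + idCounter;
}

// Start the slow state's worker, record a pending entry, and hand back the ID.
function startLongSearch(
  table: SearchTable,
  state: string,
  searchQuery: string,
  startWorker: (retryId: string) => void
): { retryId: string; message: string } {
  const retryId = newRetryId();
  startWorker(retryId); // in production: the fire-and-forget Lambda invocation
  table[retryId] = { retryId: retryId, state: state, searchQuery: searchQuery, complete: false };
  return { retryId: retryId, message: state + " requires a longer time to acquire the data." };
}

// Usage with a fake worker that only records which IDs it was started with.
const pendingTable: SearchTable = {};
const startedIds: string[] = [];
const started = startLongSearch(pendingTable, "Delaware", "superior building services", (id) => {
  startedIds.push(id);
});
```

Writing the entry before starting the worker would also work and removes any chance of the worker finishing before the entry exists, although with scrapes that take 30 seconds or more that is unlikely to matter.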

Meanwhile, Delaware keeps working, and when Delaware finishes, it does something like this: it updates that table with these items. It says, okay, here's the business I found. Let's actually start this so we can see it. We're going to search for this business right here.
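The worker's final step might look roughly like the following. The table is again an in-memory stand-in for DynamoDB, and `completeSearch`, the entry fields, and the sample business details are hypothetical.

```typescript
// Minimal entry shape for this sketch.
interface SearchEntry {
  retryId: string;
  complete: boolean;
  result?: unknown;
}

// In-memory stand-in for the DynamoDB table, keyed by retry ID.
type SearchTable = { [retryId: string]: SearchEntry | undefined };

// Called by the state worker once scraping is done. Returns false when no
// pending entry exists for the ID, in which case nothing is written.
function completeSearch(table: SearchTable, retryId: string, businessDetails: unknown): boolean {
  const entry = table[retryId];
  if (!entry) {
    return false;
  }
  table[retryId] = { retryId: entry.retryId, complete: true, result: businessDetails };
  return true;
}

// Usage: one pending entry, one update that matches it, one that does not.
const workerTable: SearchTable = { "retry-1": { retryId: "retry-1", complete: false } };
const updated = completeSearch(workerTable, "retry-1", { title: "Superior Building Services" });
const ignored = completeSearch(workerTable, "retry-unknown", {});
```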

You can barely see it there: Superior Building Services. We hit test, and it'll be really fast. It responds in like 800 milliseconds and gives us a retry ID, which is awesome. The message says Delaware requires a longer time to acquire the data. Now we'll use that retry ID. Meanwhile, Delaware is working right now, which is awesome. We go over here and send the request again with the retry ID.

And save, or does it matter? Hit test. It'll just say, okay, item not complete. Now we go over here to our database and look for this entry right here. When Delaware finishes, it's going to search for this entry. If it can find it, it's going to update that entry with all the business details.

And then that's what will be returned when the user calls again. It's that same function that starts the search. It checks to see whether it got a retry ID. If it gets a retry ID, it goes ahead and gets the entry. If it's complete, it deletes the entry and returns the data we want, and if it's not complete, it tells the user to retry: hey, item not complete, just like we saw over here.

Now Delaware, like I said, can take like a minute and a half, so it's likely still running. It takes a while. In the meantime, I'm going to build that out, because it needs to have a proper response like this.
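The retry branch described above can be sketched as follows, again with an in-memory table standing in for DynamoDB. `handleRetry` and the field names are assumptions, and the "Unknown retry ID" case is an addition the transcript does not cover.

```typescript
// Minimal entry shape for this sketch.
interface SearchEntry {
  retryId: string;
  complete: boolean;
  result?: unknown;
}

// In-memory stand-in for the DynamoDB table, keyed by retry ID.
type SearchTable = { [retryId: string]: SearchEntry | undefined };

type RetryOutcome = { done: true; data: unknown } | { done: false; message: string };

// Handle a request that arrives with a retry ID.
function handleRetry(table: SearchTable, retryId: string): RetryOutcome {
  const entry = table[retryId];
  if (!entry) {
    // Assumption: this case is not covered in the video.
    return { done: false, message: "Unknown retry ID" };
  }
  if (!entry.complete) {
    return { done: false, message: "Item not complete" };
  }
  // Complete: remove the entry so the table only holds pending searches.
  delete table[retryId];
  return { done: true, data: entry.result };
}

// Usage: one search still running, one already finished.
const retryTable: SearchTable = {
  "retry-1": { retryId: "retry-1", complete: false },
  "retry-2": { retryId: "retry-2", complete: true, result: { title: "Superior Building Services" } },
};
const stillRunning = handleRetry(retryTable, "retry-1");
const finished = handleRetry(retryTable, "retry-2");
```

Because the entry is deleted on the first successful read, a second request with the same retry ID falls into the unknown-ID branch.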

So I'm going to go over here and set up the response. I want it to respond like this, and we're going to do that in all the other spots. Meantime, yeah, see, it's still going. It takes a while, like I said. Now we need to do the same kind of thing with the other responses. Okay, it's 200 by default. So I'm going to go back down here to the bottom like this, and we want this in all those other spots. Like right here. Just like that. And the same thing right here.
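For context, a Lambda function behind an API Gateway Lambda proxy integration has to return an object with a `statusCode` and a string `body`. The helper below is a sketch of such a wrapper, assuming the function in the video uses proxy integration. `jsonResponse` and the sample payload are hypothetical.

```typescript
// Response shape expected from a Lambda proxy integration.
interface ProxyResponse {
  statusCode: number;
  headers: { [header: string]: string };
  body: string;
}

// Wrap any payload as a JSON response; the status code defaults to 200.
function jsonResponse(payload: unknown, statusCode: number = 200): ProxyResponse {
  return {
    statusCode: statusCode,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload), // body must be a string, not an object
  };
}

// Usage: the same helper serves the retry ID reply and the final data reply.
const pendingResponse = jsonResponse({ retryId: "retry-1", message: "Item not complete" });
```

Routing every return path through one helper is one way to avoid a path that returns a bare object, which is the problem noticed earlier in the video.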

And instead of this, we'll have this. How are we doing here? Oh, there we go. It did it. So now we have our data, because hey, it found it. It says, okay, here's all the information, and we just delete that entry. So that's it: long polling, pretty easy and pretty cool, I think. Now, the lame thing is you have to keep trying with your retry ID, but at least we now have a method to get this data.
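On the caller's side, "keep trying with your retry ID" amounts to a loop like the one sketched below. To stay self-contained it takes the request and the pause as injected functions and runs synchronously. A real client would await an HTTP call and a timer instead. `pollUntilDone` and the reply shape are assumptions.

```typescript
type PollReply = { done: true; data: unknown } | { done: false; message: string };

// Call checkOnce until it reports done or maxAttempts is reached,
// pausing between attempts. Returns the last reply received.
function pollUntilDone(checkOnce: () => PollReply, waitBetween: () => void, maxAttempts: number): PollReply {
  let reply = checkOnce();
  let attempts = 1;
  while (!reply.done && attempts < maxAttempts) {
    waitBetween();
    reply = checkOnce();
    attempts += 1;
  }
  return reply;
}

// Fake endpoint: not complete on the first two checks, complete on the third.
let checks = 0;
let waits = 0;
const finalReply = pollUntilDone(
  (): PollReply => {
    checks += 1;
    if (checks < 3) {
      return { done: false, message: "Item not complete" };
    }
    return { done: true, data: { found: true } };
  },
  (): void => {
    waits += 1;
  },
  10
);
```

Capping the number of attempts keeps a client from polling forever if a worker fails and never marks its entry complete.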

It takes a lot longer to get Delaware, Texas, whatever. So remember the steps again. First, we start the function with that invocation type, Event, right there. This is important. That makes it not wait for the function to finish; it just returns immediately. So now Delaware starts going. Then we make an entry in our database with the retry ID, so we can check back on it later. Then, when the retry ID comes in, we just keep checking the database for it. That's it: long polling with AWS and API Gateway, with cool Secretary of State data.

There we go.

