Today, Puppeteer goes to AWS Lambda.
There are a few challenges when it comes to getting Puppeteer to work properly on AWS Lambda, and we will address all of them in this post.
But first let’s introduce both Puppeteer and AWS Lambda.
Puppeteer is a Node.js library for controlling a headless browser. It’s a piece of open source software that is developed and supported by Google’s developer tools team. It allows you to simulate user interaction through a simple API.
This is very helpful for doing things like automated tests or, my personal use case, web scraping.
A picture’s worth a thousand words. How much is a gif worth? With the little bit of code shown in this gif I am able to log in to a Google account. You can click, enter text, paginate, and scrape anything you see on the screen. (Sample code)
Alt: Gif showing Puppeteer logging in to Google
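The kind of interaction in the gif boils down to a handful of `page` calls. Here is a minimal sketch of such a login flow; the URL and the selectors (`#email`, `#password`, the submit button) are hypothetical placeholders, not the real Google ones:

```javascript
// Sketch of a Puppeteer login flow. The URL and selectors below are
// illustrative placeholders -- swap in the real ones for your target page.
async function logIn(page, email, password) {
  await page.goto('https://example.com/login'); // navigate to the login page
  await page.type('#email', email);             // type into the email field
  await page.type('#password', password);       // type into the password field
  await page.click('button[type="submit"]');    // click the submit button
  await page.waitForNavigation();               // wait for the post-login page
}

module.exports = { logIn };
```

Here `page` is the standard Puppeteer `Page` object you get back from `browser.newPage()`.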
AWS Lambda is what Amazon describes as a way to “run code without thinking about servers or clusters”. You create a function on Lambda and then execute it. It’s that easy.
Things I do on AWS Lambda: everything. Okay, not everything, but almost. I scrape thousands of web pages every night with AWS Lambda functions. I insert data into databases. All of my backend API routes, including the ones users hit when they sign up for my services, are hosted on AWS Lambda.
Getting started is simple and inexpensive. You only pay for what you use and they have a generous free tier.
Problem #1 – Puppeteer is too big to push to Lambda
AWS Lambda has a 50 MB limit on the zip file that you push directly to it. Because the Puppeteer package downloads a full Chromium build when it installs, it is significantly larger than that.
This 50 MB limit doesn’t apply when you load the function from S3, however! See the documentation here.
Alt: AWS Lambda quotas can be tight for Puppeteer
Uploading from an S3 bucket gets around the 50 MB direct-upload limit (the 250 MB unzipped limit still applies). So I create a bucket in S3, use a node script to upload the zip to S3, and then update my Lambda code from that bucket. The npm scripts look something like this:
```json
"zip": "npm run build && 7z a -r function.zip ./dist/* node_modules/",
"sendToLambda": "npm run zip && aws s3 cp function.zip s3://chrome-aws && rm function.zip && aws lambda update-function-code --function-name puppeteer-examples --s3-bucket chrome-aws --s3-key function.zip"
```
Problem #2 – Puppeteer on AWS Lambda doesn’t work
By default, Linux (and this includes the environment AWS Lambda runs on) doesn’t include the shared libraries that Chromium needs, so Puppeteer can’t launch a browser out of the box.
Fortunately, there already exists a package of chromium that is built for AWS Lambda. You can find it here.
You’ll need to install it, along with puppeteer-core, in the function you are sending to Lambda. The normal Puppeteer package won’t be needed, and it would count against your 250 MB limit.
`npm i --save chrome-aws-lambda puppeteer-core`
And then when you’re setting it up to launch a browser from Puppeteer, it’ll look like this:

```javascript
const chromium = require('chrome-aws-lambda');

const browser = await chromium.puppeteer.launch({
  args: chromium.args,
  executablePath: await chromium.executablePath,
  headless: chromium.headless,
});
```
Puppeteer requires more memory than a normal script, so keep an eye on your max memory used. I’d recommend at least 512 MB on your AWS Lambda function when using Puppeteer.
Also, don’t forget to run `await browser.close()` at the end of your script! Otherwise your function may run until it times out for no reason, because the browser is still alive and waiting for commands.
Alt: Jordan Hansen headshot
Jordan Hansen is a professional web scraper who lives in Eagle, ID in the United States. His company, Cobalt Intelligence, gets Secretary of State business data for banks and business lenders via API.