Hugo Alliaume

Generate PDFs on Amazon AWS with PHP and Puppeteer


EDIT: 21st April 2020

This article originally compared three solutions and described solution #1.

On 21st April 2020, a new solution was added, and it's definitely the best one (see solution #4).

Some context

Over the last few months at work, for a big new feature in our CMS, we had to figure out how to generate a lot of PDFs (~1000, and more in the future) in a really short amount of time. Our servers are great, but they weren't powerful or scalable enough to generate that many PDFs without hurting performance. That's why we went for Amazon AWS, using Amazon Simple Queue Service and AWS Lambda.

I assume you already have some knowledge of AWS SQS/Lambda and the Symfony Messenger Component. For more info, see Symfony Messenger on AWS Lambda.

This is the plan:

The lambda

To handle messages from the queue, the lambda has to run PHP and Symfony, because the current Messenger component only supports Symfony apps (read and vote for the RFC "Improve Messenger to support other app consuming/receiving the message").

We will use Bref to run PHP on our lambda. Bref is a Serverless plugin, and Serverless is a framework to build and operate serverless applications. Here is a simplified version of our Serverless configuration file:

# serverless.yml
service: app

provider:
    name: aws
    runtime: provided
    region: eu-west-2
    stage: ${opt:stage,'dev'} # we had two stages "dev" (default) and "prod"
    environment:
        APP_ENV: ${self:provider.stage}

plugins:
    # Include Bref plugin
    - ./vendor/bref/bref

package:
  exclude:
    # Excluding those files/directories will reduce deploy time and lambda size a lot
    - bin/.phpunit/**
    - vendor/bin/.phpunit/**
    - var/log/**
    - var/storage/**
    - var/cache/**
    - "!var/cache/${opt:stage,'dev'}/**" # include cache of targeted stage
    - var/cache/*/profiler/**

functions:
  generate_pdf:
    handler: bin/consume-generate-pdf
    reservedConcurrency: 50 # 50 lambda invocations at the same time
    timeout: 60
    layers:
      # Use the Bref layer, see https://bref.sh/docs/runtimes
      # Note: a layer must live in the same region as the function (provider.region),
      # so pick the layer ARN published for your region.
      - 'arn:aws:lambda:us-west-1:209497400698:layer:php-74:1'
    events:
      - sqs:
          arn: <arn SQS>
          # We tell Amazon SQS to send only 1 message from the queue to the function,
          # otherwise if we send more than 1 message and one of them fails, then ALL messages are put again in the queue.
          batchSize: 1

How to generate a PDF?

We didn't want to use wkhtmltopdf/KnpLabs/KnpSnappyBundle, because we had enough issues installing and using it in the past (missing shared Linux libraries, crashes on SSL errors, rendering that is unpredictable and can differ from what Chrome renders...).

Instead, we thought about using Puppeteer and Browsershot. Puppeteer is a Node.js library which provides an API to control Chrome, and Browsershot is a nice PHP wrapper around Puppeteer.

$ yarn add puppeteer
$ composer require spatie/browsershot

But Puppeteer won't work because the lambda doesn't have Node.js support yet. To fix this, we used a layer provided by lambci/node-custom-lambda:

serverless.yml
# ...

functions:
  generate_pdf:
    handler: bin/consume-generate-pdf
    # ...
    layers:
      - 'arn:aws:lambda:<region>:553035198032:layer:nodejs12:21'
      # Use the Bref layer, see https://bref.sh/docs/runtimes
      - 'arn:aws:lambda:us-west-1:209497400698:layer:php-74:1'
    # ...

Then run serverless deploy and... uh? the lambda size is too big?

Yup, it's too big because of the Chrome binary that was downloaded when installing puppeteer:

➜  puppeteer-deps l node_modules/puppeteer/.local-chromium/linux-706915/chrome-linux
total 279M
drwxr-xr-x 7 kocal kocal 4,0K janv.  2 10:09 .
drwxr-xr-x 3 kocal kocal 4,0K janv.  2 10:09 ..
-rwxr-xr-x 1 kocal kocal 229M janv.  2 10:09 chrome
-rw-r--r-- 1 kocal kocal 1,2M janv.  2 10:09 chrome_100_percent.pak
-rw-r--r-- 1 kocal kocal 1,5M janv.  2 10:09 chrome_200_percent.pak
-rwxr-xr-x 1 kocal kocal 326K janv.  2 10:09 chrome_sandbox
-rwxr-xr-x 1 kocal kocal 5,0K janv.  2 10:09 chrome-wrapper
drwxr-xr-x 3 kocal kocal 4,0K janv.  2 10:09 ClearKeyCdm
-rwxr-xr-x 1 kocal kocal 1,5M janv.  2 10:09 crashpad_handler
-rw-r--r-- 1 kocal kocal  10M janv.  2 10:09 icudtl.dat
-rwxr-xr-x 1 kocal kocal 345K janv.  2 10:09 libEGL.so
-rwxr-xr-x 1 kocal kocal  12M janv.  2 10:09 libGLESv2.so
drwxr-xr-x 2 kocal kocal 4,0K janv.  2 10:09 locales
drwxr-xr-x 2 kocal kocal 4,0K janv.  2 10:09 MEIPreload
-rwxr-xr-x 1 kocal kocal 4,3M janv.  2 10:09 nacl_helper
-rwxr-xr-x 1 kocal kocal 9,5K janv.  2 10:09 nacl_helper_bootstrap
-rwxr-xr-x 1 kocal kocal 3,7M janv.  2 10:09 nacl_helper_nonsfi
-rwxr-xr-x 1 kocal kocal 3,7M janv.  2 10:09 nacl_irt_x86_64.nexe
-rw-r--r-- 1 kocal kocal    1 janv.  2 10:09 natives_blob.bin
-rw-r--r-- 1 kocal kocal 2,5K janv.  2 10:09 product_logo_48.png
drwxr-xr-x 3 kocal kocal 4,0K janv.  2 10:09 resources
-rw-r--r-- 1 kocal kocal  12M janv.  2 10:09 resources.pak
drwxr-xr-x 2 kocal kocal 4,0K janv.  2 10:09 swiftshader
-rw-r--r-- 1 kocal kocal 619K janv.  2 10:09 v8_context_snapshot.bin
-rwxr-xr-x 1 kocal kocal  37K janv.  2 10:09 xdg-mime
-rwxr-xr-x 1 kocal kocal  33K janv.  2 10:09 xdg-settings
➜  puppeteer-deps

According to the AWS Lambda limits page, the deployment package size limit is:

  • 50 MB (zipped)
  • 250 MB (unzipped)

But when we zip the Chrome binary and its libraries, the archive weighs about 100 MB, which exceeds the 50 MB limit, so the deployment fails:

➜  puppeteer-deps l node_modules/puppeteer/.local-chromium/linux-706915
total 106M
drwxr-xr-x 3 kocal kocal 4,0K janv.  2 10:13 .
drwxr-xr-x 3 kocal kocal 4,0K janv.  2 10:09 ..
drwxr-xr-x 7 kocal kocal 4,0K janv.  2 10:09 chrome-linux
-rw-r--r-- 1 kocal kocal 106M janv.  2 10:14 chrome-linux.zip
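
To make the arithmetic explicit, here is a minimal sketch that compares a package's size with the two limits quoted above. The function fitsLambdaLimits and the constants are illustrative names of mine, not part of any AWS SDK, and I assume 1 MB = 1024 × 1024 bytes.

```php
<?php declare(strict_types=1);

// Limits quoted from the AWS Lambda limits page, converted to bytes
// (assumption: 1 MB = 1024 * 1024 bytes).
const LAMBDA_MAX_ZIPPED_BYTES   = 50 * 1024 * 1024;
const LAMBDA_MAX_UNZIPPED_BYTES = 250 * 1024 * 1024;

/**
 * Hypothetical helper: tells whether a deployment package fits both limits.
 */
function fitsLambdaLimits(int $zippedBytes, int $unzippedBytes): bool
{
    return $zippedBytes <= LAMBDA_MAX_ZIPPED_BYTES
        && $unzippedBytes <= LAMBDA_MAX_UNZIPPED_BYTES;
}

// Puppeteer's Chrome alone: ~106 MB zipped, ~279 MB unzipped (sizes from the listings above).
var_dump(fitsLambdaLimits(106 * 1024 * 1024, 279 * 1024 * 1024)); // bool(false)
```

Both limits are exceeded by Chrome alone, before counting the application code and its vendors.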

What can we do?

Use a Brotli-fied Chrome

While researching how to run Chrome on AWS Lambda, I found chrome-aws-lambda, a Node.js package that:

  • ships a Brotli-fied Chrome (~ 36MB) which can run on AWS Lambda (see bin/ directory)
  • provides a small wrapper around Puppeteer which uncompresses Chrome on-the-fly

Okay great, we have a Chrome that can be used on AWS Lambda, but we now have several options.

Solution #1

Download the brotlified Chrome, commit it to our project, and write some PHP to uncompress Chrome at runtime.

Pros:

  • Fastest solution
  • We have total control over the Chrome binaries

Cons:

  • Chrome updates must be applied manually

Solution #2

(I thought of this solution while writing this article, not while working on the lambda 3-4 months ago.)

Install the package chrome-aws-lambda and write some PHP to uncompress Chrome at runtime.

Pros:

  • Chrome updates are automatically applied

Cons:

  • The binaries are hidden inside chrome-aws-lambda, which means you can't rely on them without using the provided wrapper. Between v1.20.1 and v1.20.2, the bin/ directory structure changed and the shared libraries are now archived with tar. If we had installed chrome-aws-lambda without pinning an exact version (e.g. 1.20.1), PDF generation could have failed, which would have been really critical for us.

Solution #3

Fork chrome-aws-lambda, write a PHP wrapper, and open a pull request.

Pros:

  • The PHP wrapper would have been available to more users

Cons:

  • Unknown waiting time before a potential merge, and we had a deadline for our big new feature
  • The PR might have been rejected
  • Two wrappers to maintain and test

EDIT 21/04/2020: Solution #4

I've found a better solution by using chrome-aws-lambda in a bridge.

Pros:

  • No manual updates
  • No need to handle uncompressing the Chrome binaries ourselves

Cons:

  • I haven't found any yet

Please read the article "Generate PDFs on Amazon AWS with PHP and Puppeteer: The Best Way" to learn more.

Use Chrome, Browsershot and Puppeteer on Amazon AWS

We went with Solution #1 for its stability and because of a lack of time.

Don't deploy Puppeteer's Chrome binary

Since the Chrome binary shipped with the puppeteer package is too large, we could replace puppeteer with puppeteer-core (the same as puppeteer, but without the Chrome binary). However, Browsershot is only compatible with puppeteer.

A solution is to configure Serverless to exclude Puppeteer's Chrome binary folder like this:

serverless.yml
#...

package:
  exclude:
    # ...
    - node_modules/puppeteer/.local-chromium/** # we will ship a brotli-compressed Chrome binary

#...

Download brotlified Chrome binary

When working on the lambda, the latest version of chrome-aws-lambda was 1.20.1 (see binary files).

We created a chromium directory, downloaded the .br files and laid them out like this:

➜  the-lambda git:(master) tree chromium
chromium
├── chromium-78.0.3882.0.br
└── swiftshader
    ├── libEGL.so.br
    └── libGLESv2.so.br

1 directory, 3 files

Uncompress Chrome binary on-the-fly

Install Brotli binary

We use vdechenaux/brotli-bin-amd64 to download the brotli binary.

composer require vdechenaux/brotli-bin-amd64

The file bin/brotli-amd64 should now exist.

Create a Chromium class

I prefer to manipulate an object rather than a scalar value. If we later need to store the Chrome version, for example, using an object will make things easier.

src/Chromium/Chromium.php
<?php declare(strict_types=1);

namespace App\Chromium;

class Chromium
{
    private $path;

    public function __construct(string $path)
    {
        $this->path = $path;
    }

    public function getPath(): string
    {
        return $this->path;
    }
}

Create a ChromiumFactory class

This is the class that uncompresses Chrome at runtime into the /tmp/chromium folder.

We profiled this part of the code: it takes ~2-3 seconds on a fresh lambda, but it is much faster when the lambda is reused (/tmp is not cleared, so the uncompressed Chrome is still there).

src/Chromium/Factory/ChromiumFactory.php
<?php declare(strict_types=1);

namespace App\Chromium\Factory;

use App\Chromium\Chromium;
use Symfony\Component\Finder\Finder;
use Symfony\Component\Finder\SplFileInfo;
use Symfony\Component\Process\Exception\ProcessFailedException;
use Symfony\Component\Process\Process;

class ChromiumFactory
{
    private $binDir;
    private $tmpDir;
    private $chromiumDir;

    public function __construct(string $binDir, string $tmpDir, string $chromiumDir)
    {
        $this->binDir      = $binDir;
        $this->tmpDir      = $tmpDir;
        $this->chromiumDir = $chromiumDir;
    }

    public function initialize(): Chromium
    {
        $finder = new Finder();
        $finder->name('chromium-*')->files()->in($this->chromiumDir);

        // Keep the first file matching "chromium-*" (there should be only one)
        foreach ($finder as $chromiumFile) {
            break;
        }

        if (!isset($chromiumFile) || !($chromiumFile instanceof SplFileInfo)) {
            throw new \RuntimeException(sprintf(
                'Unable to find Chromium binary in "%s" directory.',
                $this->chromiumDir
            ));
        }

        $this->inflate($chromiumFile->getFilename());
        $this->inflate('swiftshader/libEGL.so.br');
        $this->inflate('swiftshader/libGLESv2.so.br');

        // The inflated binary has the same name as the archive, minus ".br"
        $chromiumPath = $this->tmpDir.'/'.$chromiumFile->getFilenameWithoutExtension();

        $this->markAsExecutable($chromiumPath);

        return new Chromium($chromiumPath);
    }

    protected function inflate(string $filename): void
    {
        $extension = '.br';
        $extensionLength = strlen($extension);

        if (substr($filename, -$extensionLength) !== $extension) {
            throw new \InvalidArgumentException('Not a brotli file.');
        }

        $outputFilename = $this->tmpDir.'/'.substr($filename, 0, -$extensionLength);
        @mkdir(dirname($outputFilename), 0777, true);

        // Inflate file only if output file does not exist
        if (!file_exists($outputFilename)) {
            $process = new Process(["{$this->binDir}/brotli-amd64", '-d', "{$this->chromiumDir}/{$filename}", '-o', $outputFilename]);
            $process->run();

            if (!$process->isSuccessful()) {
                throw new ProcessFailedException($process);
            }
        }
    }

    protected function markAsExecutable(string $filename): void
    {
        $process = new Process(['chmod', '+x', $filename]);
        $process->run();

        if (!$process->isSuccessful()) {
            throw new ProcessFailedException($process);
        }
    }
}
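
The speed-up on a re-used lambda comes down to a "produce only if missing" check. Here is a minimal standalone sketch of that idea; produceOnce and the fake inflate callback are hypothetical names for illustration, and the callback only stands in for the real Brotli call.

```php
<?php declare(strict_types=1);

/**
 * Hypothetical helper: produce $target with $producer only when it is missing.
 * Returns true when the work was done (cold start), false when the file left
 * by a previous invocation was reused (warm start).
 */
function produceOnce(string $target, callable $producer): bool
{
    if (file_exists($target)) {
        return false;
    }

    $dir = dirname($target);
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new \RuntimeException(sprintf('Unable to create "%s".', $dir));
    }

    $producer($target);

    return true;
}

// Stand-in for the slow Brotli decompression.
$fakeInflate = static function (string $target): void {
    file_put_contents($target, 'binary content');
};

$target = sys_get_temp_dir().'/chromium-demo-'.uniqid('', true).'/chromium';

var_dump(produceOnce($target, $fakeInflate)); // bool(true): the first call does the work
var_dump(produceOnce($target, $fakeInflate)); // bool(false): the second call reuses the file
```

This is the same kind of check as the file_exists() test in ChromiumFactory::inflate(): as long as /tmp survives between invocations, the ChromiumFactory skips the decompression.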

Then configure the ChromiumFactory like this:

config/services.yaml
services:
  # default configuration for services in *this* file
  _defaults:
    autowire: true # Automatically injects dependencies in your services.
    autoconfigure: true # Automatically registers your services as commands, event subscribers, etc.

  # ... your Symfony services ...

  App\Chromium\Factory\ChromiumFactory:
    arguments:
      $tmpDir: '/tmp/chromium' # it is probably better to use `sys_get_temp_dir()`
      $binDir: '%kernel.project_dir%/bin'
      $chromiumDir: '%kernel.project_dir%/chromium'
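
As the comment on $tmpDir suggests, the directory could be derived from sys_get_temp_dir() instead of being hard-coded. A minimal sketch, where chromiumTmpDir is a hypothetical helper name and not something from the project:

```php
<?php declare(strict_types=1);

/**
 * Hypothetical helper: build the Chromium working directory from the system
 * temporary directory instead of hard-coding "/tmp".
 */
function chromiumTmpDir(string $subDir = 'chromium'): string
{
    return rtrim(sys_get_temp_dir(), DIRECTORY_SEPARATOR).DIRECTORY_SEPARATOR.$subDir;
}

echo chromiumTmpDir(), PHP_EOL; // e.g. "/tmp/chromium" on a typical Linux system
```

The returned value could then be passed as the $tmpDir argument of the ChromiumFactory, for example from a small factory service; how you wire it into the container is up to you.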

Use Browsershot with the ChromiumFactory

This is an example of how to use Browsershot and the ChromiumFactory inside a Message handler (specific to Symfony Messenger Component), but you can use them anywhere you want.

I've used league/flysystem-bundle and configured a Scaleway filesystem adapter in order to save my PDF on Scaleway.

src/MessageHandler/GeneratePdfMessageHandler.php
<?php declare(strict_types=1);

namespace App\MessageHandler;

use App\Chromium\Factory\ChromiumFactory;
use App\Message\GeneratePdfMessage;
use League\Flysystem\FilesystemInterface;
use Psr\Log\LoggerAwareInterface;
use Psr\Log\LoggerAwareTrait;
use Spatie\Browsershot\Browsershot;
use Symfony\Component\Messenger\Handler\MessageHandlerInterface;

class GeneratePdfMessageHandler implements MessageHandlerInterface, LoggerAwareInterface
{
    use LoggerAwareTrait;

    private $chromiumFactory;
    private $s3Storage;

    public function __construct(ChromiumFactory $chromiumFactory, FilesystemInterface $s3Storage)
    {
        $this->chromiumFactory = $chromiumFactory;
        $this->s3Storage = $s3Storage;
    }

    public function __invoke(GeneratePdfMessage $message): void
    {
        $pdf = $this->getBrowsershot()
            ->setHtml('My html...')
            ->pdf();
        // $pdf contains binary file content

        // Let's save it on Scaleway!
        $this->s3Storage->put('my-file.pdf', $pdf);
    }

    protected function getBrowsershot(): Browsershot
    {
        $chromium = $this->chromiumFactory->initialize();

        $browsershot = (new Browsershot())
            ->setChromePath($chromium->getPath())

            // recommended arguments
            ->addChromiumArguments([
                'disable-dev-shm-usage', // https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md#tips
                'disable-gpu',
                'single-process',
                'no-sandbox',
            ])

            // we needed those options in our lambda to prevent issues, but you can ignore them
            ->ignoreHttpsErrors()
            // once `DOMContentLoaded` fires, we don't wait for external resources that take long to load (or time out after 2 min)
            ->setOption('waitUntil', 'domcontentloaded')
        ;

        return $browsershot;
    }
}

And voilà! When this code runs, a PDF is generated with Browsershot and Puppeteer and saved on Scaleway.
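
The GeneratePdfMessage class is not shown in this article. As a rough sketch only, it could be a small immutable object carrying what the handler needs. The two fields below (the HTML to render and the target path) are my assumptions, not the real class; in the application it would live in the App\Message namespace, which is omitted here to keep the snippet standalone.

```php
<?php declare(strict_types=1);

/**
 * Hypothetical message: the real GeneratePdfMessage may carry different data.
 */
final class GeneratePdfMessage
{
    private string $html;
    private string $targetPath;

    public function __construct(string $html, string $targetPath)
    {
        $this->html = $html;
        $this->targetPath = $targetPath;
    }

    public function getHtml(): string
    {
        return $this->html;
    }

    public function getTargetPath(): string
    {
        return $this->targetPath;
    }
}

$message = new GeneratePdfMessage('<h1>Invoice</h1>', 'invoices/42.pdf');
echo $message->getTargetPath(), PHP_EOL; // invoices/42.pdf
```

With such a message, the handler could call setHtml($message->getHtml()) and store the result under $message->getTargetPath() instead of using the hard-coded values.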