WhatsApp integration – part 2: whatsapp-web.js in Lambda

Series overview

  1. Part 1: the beginnings
  2. Part 2: whatsapp-web.js in Lambda
  3. Part 3: Baileys

Introduction

In part 1 we went through our decision-making process regarding the architecture of the solution. Now we will describe the first solution and the problems that came with it.

Solution description

The solution was to be built using the whatsapp-web.js library. It works by running a headless browser in the background, in which the native WhatsApp Web application is running. Since the latter is an official application, this gave us hope of good stability and high coverage of functionality.

As a reminder, we develop the application using the Serverless framework and run it on AWS, so that was a constraint we had. I will use AWS service names freely in this article, but Google should provide all the information needed if any of them are unfamiliar to you.

The recommended way is to run the library continuously and listen to events, but for budget and scalability reasons we chose to run it periodically in Lambda and only load a diff of new messages. This way we would also have flexibility in managing costs: if users are okay with a 10-minute delay in message visibility, we could simply set the sync interval to that value and reduce the cost to a small fraction.

The library supports storing the session, which is vital for this kind of solution. However, it doesn't have any kind of compact representation. It simply uses the Chromium profile folder, which can blow up to a significant size. As for WhatsApp, all of the data, including authentication keys and the account data, is stored in IndexedDB. We stored this folder as a zip archive in S3.
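A minimal sketch of what such a save step can look like, assuming the archiver package and the AWS SDK v3 (the bucket name, key and local paths are illustrative, not our actual values):

```js
const fs = require('fs');
const archiver = require('archiver');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({});

// Compress the Chromium profile folder and upload it to S3 so the next run can restore it.
async function saveSession(accountId, sessionDir = '/tmp/session') {
  const zipPath = `/tmp/${accountId}-session.zip`;

  await new Promise((resolve, reject) => {
    const output = fs.createWriteStream(zipPath);
    const archive = archiver('zip', { zlib: { level: 9 } });
    output.on('close', resolve);
    archive.on('error', reject);
    archive.pipe(output);
    archive.directory(sessionDir, false); // the whole profile folder goes into the archive
    archive.finalize();
  });

  await s3.send(new PutObjectCommand({
    Bucket: 'whatsapp-sessions',              // illustrative bucket name
    Key: `sessions/${accountId}.zip`,
    Body: fs.createReadStream(zipPath),
    ContentLength: fs.statSync(zipPath).size, // required when uploading from a stream
  }));
}
```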

We also used DynamoDB for some locking and temporary state management.
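For the locking part, here is a sketch of a typical DynamoDB locking pattern based on a conditional write (the table and attribute names are assumptions, not our actual schema):

```js
const {
  DynamoDBClient,
  PutItemCommand,
  DeleteItemCommand,
} = require('@aws-sdk/client-dynamodb');

const ddb = new DynamoDBClient({});
const TABLE = 'whatsapp-sync-locks'; // assumed table with "accountId" as the partition key

// The conditional write succeeds only if no other job currently holds the lock.
async function acquireLock(accountId) {
  try {
    await ddb.send(new PutItemCommand({
      TableName: TABLE,
      Item: {
        accountId: { S: accountId },
        expiresAt: { N: String(Math.floor(Date.now() / 1000) + 15 * 60) }, // TTL as a safety net
      },
      ConditionExpression: 'attribute_not_exists(accountId)',
    }));
    return true;
  } catch (err) {
    if (err.name === 'ConditionalCheckFailedException') return false; // someone else has it
    throw err;
  }
}

async function releaseLock(accountId) {
  await ddb.send(new DeleteItemCommand({
    TableName: TABLE,
    Key: { accountId: { S: accountId } },
  }));
}
```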

The overall flow went something like this:
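Roughly, each periodic sync run amounted to the steps in the sketch below. It reuses the saveSession and acquireLock helpers sketched above; restoreSession and syncNewMessages are illustrative placeholders, and error handling is omitted:

```js
const { Client, LocalAuth } = require('whatsapp-web.js');

exports.handler = async (event) => {
  const { accountId } = event;

  if (!(await acquireLock(accountId))) return; // another job for this account is already running

  try {
    await restoreSession(accountId, '/tmp/session'); // download the zip from S3 and extract it

    const client = new Client({
      authStrategy: new LocalAuth({ dataPath: '/tmp/session' }),
      puppeteer: { args: ['--no-sandbox'] },
    });

    const ready = new Promise((resolve) => client.once('ready', resolve));
    await client.initialize();
    await ready;

    await syncNewMessages(client, accountId); // load only the diff since the last run

    await client.destroy();                       // the client must be destroyed before archiving
    await saveSession(accountId, '/tmp/session'); // zip the profile folder and upload it back
  } finally {
    await releaseLock(accountId);
  }
};
```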

Problems

Even though the solution looks nice on paper, several problems arose when it was confronted with reality. Some of them were just inconveniences; others were worse. We were able to apply fixes or workarounds to some of them, but others were beyond our control.

Long startup time

Since Lambdas are stateless functions, you need to start the client on each invocation. With the default settings this took around 30 seconds, which is merely an increased cost for background jobs, but outright UX suicide for things like sending messages. By tuning the Lambda settings to have more memory, this time dropped to between 10 and 15 seconds, which is still a lot, but much better.

Given the Lambda lifecycle, we also considered letting the client live between jobs, which is possible and would make the startup unnecessary (if a hot run is triggered). This could be used to our advantage if it were possible to warm up the send-message Lambda when it was expected to be used, like when a thread is open.
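The execution context reuse we had in mind follows the usual pattern of caching at module scope, sketched below with illustrative helpers (this is the idea we considered, not code we shipped):

```js
// Anything stored at module scope survives warm starts of the same Lambda runtime.
let cachedClient = null;

exports.handler = async (event) => {
  if (!cachedClient) {
    // Cold start: restore the session and pay the full client startup cost.
    cachedClient = await createClient(event.accountId); // illustrative helper
  }
  // Warm start: reuse the already initialized client and skip the startup entirely.
  return sendMessage(cachedClient, event); // illustrative helper
};
```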

Unfortunately, while the client is alive, the session folder is not stable, and archiving it leads to an unrecoverable session. That means you always have to destroy the client before saving the session. And since this caching mechanism isn't guaranteed (another Lambda may run, concurrency may create a new runtime, or enough time may pass between runs), it could only be used as a fallback, not as a default mechanism; and since it affects the main saving procedure, it was unusable.

Unfortunately, we didn't find a satisfying solution for this in Lambda. In a server environment, where the library would run all the time, this wouldn't be a problem, but for various reasons we didn't want to go that way yet.

Fetching messages stuck

When connecting an account in our application, the initial process was to load all the historical data.

The code for that would look something like this:
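(A simplified, reconstructed sketch; the limit and the persistence helper are illustrative.)

```js
// Iterate over all chats and page through their history.
const chats = await client.getChats();

for (const chat of chats) {
  // fetchMessages loads the chat history; the limit here is illustrative.
  const messages = await chat.fetchMessages({ limit: Infinity });

  for (const message of messages) {
    await storeMessage(chat, message); // illustrative persistence helper
  }
}
```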

Unfortunately, on some occasions the fetchMessages call entered an infinite loop and never returned. That led to the history sync process never completing and also to missing messages.

It was a bug in the library, and given the open source spirit we have in the company, we submitted a pull request, which was soon merged.

Blocking media download

Part of loading messages was also downloading their media. We started to notice a pattern of jobs suddenly being stopped because of timeouts.

For a little context: Lambdas have a timeout setting which cannot be exceeded; the maximum is 15 minutes. This may seem like it has to suffice for anything, but reality is never as nice as theory. So it's good practice to split long-running jobs into small units, stop before the timeout, do some cleanup and possibly restart the Lambda.
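In practice that means checking the remaining time and bailing out early, roughly like this (a sketch with illustrative helpers; the safety margin is arbitrary):

```js
const SAFETY_MARGIN_MS = 60 * 1000; // leave a minute for cleanup; illustrative value

exports.handler = async (event, context) => {
  for (const unit of await listPendingWorkUnits(event)) { // illustrative helper
    if (context.getRemainingTimeInMillis() < SAFETY_MARGIN_MS) {
      await scheduleContinuation(event); // illustrative: re-invoke the Lambda to continue later
      break;
    }
    await processWorkUnit(unit); // illustrative helper
  }
  // Cleanup still happens within the time budget: destroy the client, save the session.
  await cleanup(); // illustrative helper
};
```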

The problem arises when you have a blocking call somewhere and it takes a long time. Because of how the Node.js execution engine works, no event will be processed until a synchronous function has completed. And from digging into this we found out that media downloads are implemented in a blocking way and often take more than 5 minutes.
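A standalone demonstration of the effect (not the library code):

```js
// While a synchronous call is running, no timers or events are processed.
setTimeout(() => console.log('timer fired'), 1000);

const start = Date.now();
while (Date.now() - start < 5000) {
  // busy work standing in for a long synchronous operation
  // (in the library's case, encoding a large media file)
}
console.log('synchronous work finished');
// 'timer fired' is only printed now, roughly 4 seconds late; the same applies
// to any timeout or abort logic that relies on the event loop.
```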

Now you may ask: how can a download be implemented in a blocking way? It has to do with Puppeteer, a library that wraps headless browsers and allows interaction with them; whatsapp-web.js uses it internally. The Puppeteer protocol only allows communication between the parent process and the browser via basic JavaScript types like strings. So it goes something like this: download the media in the browser (getting a Buffer or something similar), encode it as BASE64, transfer that to the parent process and return it to the library caller.

So the culprit was BASE64 encoding.

And because it blocks the Node.js event loop, not even timeouts are handled and there is simply no way of stopping it. This is a problem specific to the Lambda environment, but it may also make the library unresponsive to events like new messages, so it would be good to reimplement the encoding to be asynchronous.

Given the open source spirit we have in the company, we once again submitted a pull request, which was soon merged.

Concurrency issues

We don't know the exact inner workings of WhatsApp, but after working with various integrations and inspecting its behaviour we have at least some insight into it. If my understanding is correct, every session is stateful and the WhatsApp server will only send you the events it thinks it hasn't sent you yet. And there is no way to reload everything. This means that if an event is not handled or is lost somewhere, it's lost forever.

Given the architecture we used, this could happen quite easily. Let's say a periodic sync job is running and, during that, you decide to send a message. That spawns a new send-message Lambda, which creates a duplicate client. Now if any event happens, it may be sent to one of them but not to both. Since the session representation is not easily manipulable, there is no way to merge two sessions, and it's a pure lottery which job saves the session later. If it's the one that hasn't received the event, the event is lost and can't be recovered in any way.

And while for background jobs some locking may do the job, sending messages is a user operation which should be as quick as possible, so potentially waiting minutes before it can even start would be unacceptable.

Unfortunately, we couldn’t find any good workaround for this.

Send message issues

This is how the code for sending a message might look:
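(A simplified, reconstructed sketch; error paths and the id check are omitted.)

```js
const { MessageAck } = require('whatsapp-web.js');

// client is an initialized whatsapp-web.js Client; chatId and text come from the caller.
async function send(client, chatId, text) {
  // sendMessage resolves once the message is stored in the internal state,
  // not when it is actually delivered.
  const sent = await client.sendMessage(chatId, text);

  // Delivery has to be confirmed via acknowledgement events.
  client.on('message_ack', (message, ack) => {
    // You should also compare message.id with sent.id; omitted here for simplicity.
    if (ack >= MessageAck.ACK_SERVER) {
      // the server accepted the message
    }
    if (ack >= MessageAck.ACK_DEVICE) {
      // a recipient device reports having received it
    }
  });

  return sent;
}
```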

The problem is that the sendMessage call doesn't resolve when a message is delivered, but when it has been sent and stored in the internal state. You then have to listen to an ACK event to be sure it was really delivered (you should also check the id, but that was omitted for simplicity). In reality this wasn't enough. We also tried waiting for the ACK_DEVICE event, which should mean that an actual device received the message. Unfortunately, even this wasn't enough in some cases, and sometimes it was not triggered even when we saw the message on a device.

So we ended up with messages stuck in an unclear delivery state and couldn't do much about them.

We tried applying some arbitrary timeouts after sending, which helped in many cases, but it wasn't foolproof and certainly no developer would be happy with such a solution.

Session size

Another problem we encountered was the session size. We observed that it got bigger and bigger every time a job ran. For some big accounts it reached sizes as absurd as 2 GB.

Since we had to download and extract this on each run (and then compress and upload it at the end), this added further overhead. Startup was thus even slower than usual, and the cleanup often wasn't completed in time, cut short by the Lambda running out of time.

The main cause of this was Chromium's IndexedDB space allocation policy. It seems to use sparse files, so even if the real data size stayed the same, the non-compact storage led to the folder size expanding.

Since at this point the account became unusable due to the long session restoration times, we can't know at what size it would have stopped. However, Lambda disk space is limited by the ephemeral storage setting, which can be at most 10 GB, so if that were hit, no download/extraction optimisation could help us.

To be continued…

The library wasn't well suited to our architecture and caused us a lot of problems. We fixed some, but others seemed impossible to tackle, so at this point we started looking at alternatives. Fortunately, one library had released a new, much more stable version in the meantime, so we set our sights on it, and after playing with it a little, our expectations were high.

We will talk about that in the last part of the series.
