In a previous blog post I wrote about how we use a message queue server in a Message Bus design. Since that post was written, we’ve moved from ActiveMQ to RabbitMQ, and now we are moving to Amazon Web Services Simple Queue Service (SQS). I’m going to cover:
- What is SQS?
- The Message Bus design pattern
- How we’re using SQS to integrate with outside systems.
- Bonus: I’ll also include how we’re looking at using SNS for notifications.
Intro to AWS SQS
Amazon Web Services (AWS) is a suite of hosted services run by Amazon and available on a pay-per-use basis. SQS (short for Simple Queue Service) is its message queue service. SQS operates much like other queue servers. You can:
- Create queues
- Place messages on queues
- Read messages off of queues
One important difference with SQS is that a standard queue makes no guarantee about the order in which messages are delivered. Most other standalone queue servers deliver messages in a First-In, First-Out (FIFO) manner.
One major reason that messages can complete out of order with SQS is a feature called message visibility. When a client reads a message from SQS, the message is hidden from other clients for a period of time (the visibility timeout). This helps prevent multiple clients from getting the same message to process. But unlike some other queues, reading a message from the queue does not remove it. A client must explicitly delete the message once it has finished with it, which acts as the acknowledgement. If a client reads a message but doesn’t delete it, the message becomes visible again when the timeout expires. While that message is hidden, other messages can be read and completed before it becomes visible again.
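To make the visibility behavior concrete, here is a minimal in-memory sketch. It is not the AWS API: `ToyQueue`, its method names, and the 30-second timeout are all assumptions made for illustration, and time is passed in explicitly so the example is deterministic.

```python
import itertools


class ToyQueue:
    """In-memory stand-in for an SQS-like queue (hypothetical, not the AWS API)."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self._messages = []  # each entry: [receipt_id, body, visible_at]
        self._ids = itertools.count(1)

    def send(self, body):
        self._messages.append([next(self._ids), body, 0])

    def receive(self, now):
        """Return (receipt_id, body) for the oldest visible message, or None."""
        for entry in self._messages:
            if entry[2] <= now:
                # Reading hides the message for a while; it does not remove it.
                entry[2] = now + self.visibility_timeout
                return entry[0], entry[1]
        return None

    def delete(self, receipt_id):
        """Explicit removal, done by the consumer once it has finished."""
        self._messages = [m for m in self._messages if m[0] != receipt_id]


queue = ToyQueue(visibility_timeout=30)
for body in ("first", "second", "third"):
    queue.send(body)

completed = []

# A consumer reads "first" at t=0 but fails before deleting it.
queue.receive(now=0)

# Other consumers read and delete the remaining messages while "first" is hidden.
for t in (1, 2):
    receipt_id, body = queue.receive(now=t)
    queue.delete(receipt_id)
    completed.append(body)

# Once the timeout has passed, "first" is visible again and is finally completed.
receipt_id, body = queue.receive(now=31)
queue.delete(receipt_id)
completed.append(body)

print(completed)  # ['second', 'third', 'first']
```

The message that was sent first is completed last, purely because it was read once without being deleted.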
What is a Message Bus
A Message Bus is a software design pattern that allows message senders and receivers to be decoupled. This means that senders and receivers can be added and removed without affecting the other senders and receivers attached to the bus.
Although we started with a message bus in our design for using SQS as a queue, we are not currently taking full advantage of the Message Bus design pattern: we use one sender class and one process that reads messages and distributes them to the correct subscribers.
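As a rough illustration of the pattern in general (not our production code), a bus can be sketched as a table of handlers keyed by message type. The class name `MessageBus`, the handler functions, and the `sku` field are all made up for this example.

```python
from collections import defaultdict


class MessageBus:
    """Minimal in-process bus: senders and receivers only know message types."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, message_type, handler):
        self._subscribers[message_type].append(handler)

    def unsubscribe(self, message_type, handler):
        self._subscribers[message_type].remove(handler)

    def publish(self, message_type, payload):
        """Deliver the payload to every current subscriber; return how many got it."""
        handlers = list(self._subscribers.get(message_type, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = MessageBus()
inventory_log, audit_log = [], []


def update_inventory(payload):
    inventory_log.append(payload["sku"])


def write_audit(payload):
    audit_log.append(payload["sku"])


bus.subscribe("Item", update_inventory)
bus.subscribe("Item", write_audit)
bus.publish("Item", {"sku": "A-1"})

# Removing one receiver does not affect the sender or the other receiver.
bus.unsubscribe("Item", write_audit)
bus.publish("Item", {"sku": "B-2"})

print(inventory_log)  # ['A-1', 'B-2']
print(audit_log)      # ['A-1']
```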
How we are using SQS
Now that we’ve covered SQS and the basic concept of a message bus, let’s look at how we put it all together at MerchantOS.
We have many different message types that are generated and placed on our queue (Item, Customer, integration-specific messages, etc.). Since we generate all of these messages within our own application, we have consolidated our message generation code into one class that is used across our system. Using one central message producer reduces the overall architectural complexity of generating messages and placing them on the queue. It also leaves room for us to integrate other message generators in the future as we need or want them.
On the receiving side we have used a few techniques to improve the reliability of receiving and dispatching messages.
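A central producer along these lines might look like the sketch below. The envelope fields (`id`, `type`, `account_id`, `created_at`, `payload`) and the class name are assumptions for illustration, not our actual message format. The `send` argument is any callable that accepts the serialized body; with boto3 it could wrap the SQS client's `send_message(QueueUrl=..., MessageBody=...)` call, while here a plain list stands in for the queue.

```python
import json
import time
import uuid


class MessageProducer:
    """Single entry point for building messages and placing them on a queue."""

    def __init__(self, send, clock=time.time):
        self._send = send    # callable that accepts the serialized message body
        self._clock = clock  # injectable so timestamps can be controlled in tests

    def publish(self, message_type, account_id, payload):
        # Hypothetical envelope: every message gets the same outer shape.
        envelope = {
            "id": str(uuid.uuid4()),
            "type": message_type,
            "account_id": account_id,
            "created_at": self._clock(),
            "payload": payload,
        }
        self._send(json.dumps(envelope))
        return envelope["id"]


sent_bodies = []  # stand-in for the real queue
producer = MessageProducer(send=sent_bodies.append)
producer.publish("Item", account_id=42, payload={"item_id": 7, "action": "updated"})
producer.publish("Customer", account_id=42, payload={"customer_id": 3, "action": "created"})

print(len(sent_bodies))  # 2
```

Because every caller goes through `publish`, changes to the envelope or to the queue client happen in one place.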
Using a CRON to receive messages
To receive messages we use one CRON process that wakes up and forks child processes. Each child process is responsible for taking one message off of the queue and forwarding it on to its appropriate destination. We previously used a separate process for each endpoint that needed a message. The downside to that technique was, again, the complexity of monitoring each process and diagnosing issues when they arose.
Using one parent process that creates worker processes makes the parent a potential single point of failure, but we have mitigated that risk by reducing the coupling between the worker processes and the endpoints they call. One technique we used was to have each worker call the endpoint (typically an RPC endpoint on our API), post its message payload, and disconnect without waiting for a response. Each endpoint becomes responsible for logging its own success or failure. By keeping the workers short-lived we reduce the overall impact of one worker failing.
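The parent-and-workers arrangement can be sketched as follows. This is a simplified illustration for Unix-like systems, not our actual implementation: the function names and message shape are hypothetical, the messages are passed in rather than read from SQS, and `post` is an injected callable standing in for the fire-and-forget call to an endpoint.

```python
import os


def handle_one(message, post):
    """Worker body: hand the payload to the endpoint and return immediately."""
    post(message["endpoint"], message["payload"])


def run_once(messages, post):
    """Parent: fork one short-lived child per message, then reap them all."""
    pids = []
    for message in messages:
        pid = os.fork()
        if pid == 0:  # child process
            status = 0
            try:
                handle_one(message, post)
            except Exception:
                status = 1  # a failing worker only affects its own message
            os._exit(status)  # never fall back into the parent's loop
        pids.append(pid)
    # Collect exit codes in fork order so the parent can report failures.
    return [os.WEXITSTATUS(os.waitpid(pid, 0)[1]) for pid in pids]


def fake_post(endpoint, payload):
    """Stand-in for the real endpoint call; one endpoint is made to fail."""
    if endpoint == "broken-endpoint":
        raise RuntimeError("endpoint unreachable")


exit_codes = run_once(
    [
        {"endpoint": "item-endpoint", "payload": "a"},
        {"endpoint": "broken-endpoint", "payload": "b"},
    ],
    fake_post,
)
print(exit_codes)  # [0, 1]
```

The second worker fails, but the first is unaffected and the parent carries on, which is the property the short-lived workers are meant to provide.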
Another way that our queue usage has evolved is by consolidating messages from all accounts into shared queues. Previously we used per-account queues, which let us segregate accounts that generate many queue messages from those that generate only a few. In the long run we found that this failed to give us any practical benefit.
We now use a two-queue system where all messages are first placed on a “priority queue”. These messages are read off and dealt with by the worker processes described above. If one account is generating many messages, or if we have determined that the endpoint being called is saturated and cannot process any more messages, the affected messages are re-routed to a delayed queue. Messages on the delayed queue are read once the message volume has dropped or the endpoint is ready to receive messages again.
This arrangement of a limited number of message producers, sending to a limited number of queues, read by a limited number of consumers, has helped to reduce our queue complexity, improve the transparency of the message flow through the system, and give us better monitoring of our queue status.
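The routing decision can be sketched in memory like this. How we actually decide that an account is too busy or an endpoint is saturated is not shown here; the per-account limit, the `saturated_endpoints` set, and all names below are assumptions made for the example.

```python
from collections import Counter, deque


class TwoQueueRouter:
    """Toy model of a priority queue backed by a delayed queue."""

    def __init__(self, per_account_limit, saturated_endpoints=()):
        self.priority = deque()
        self.delayed = deque()
        self.per_account_limit = per_account_limit            # assumed threshold
        self.saturated_endpoints = set(saturated_endpoints)   # assumed health signal
        self._handled = Counter()

    def submit(self, message):
        self.priority.append(message)  # everything starts on the priority queue

    def process_next(self, deliver):
        """Handle one priority message; return 'delivered', 'delayed', or None."""
        if not self.priority:
            return None
        message = self.priority.popleft()
        too_busy = self._handled[message["account"]] >= self.per_account_limit
        if too_busy or message["endpoint"] in self.saturated_endpoints:
            self.delayed.append(message)
            return "delayed"
        self._handled[message["account"]] += 1
        deliver(message)
        return "delivered"

    def drain_delayed(self, deliver):
        """Run once volume has dropped: deliver delayed messages whose endpoint is ready."""
        still_waiting = deque()
        while self.delayed:
            message = self.delayed.popleft()
            if message["endpoint"] in self.saturated_endpoints:
                still_waiting.append(message)
            else:
                deliver(message)
        self.delayed = still_waiting


router = TwoQueueRouter(per_account_limit=2, saturated_endpoints={"customers"})
for name in ("a1", "a2", "a3"):
    router.submit({"id": name, "account": "a", "endpoint": "items"})
router.submit({"id": "b1", "account": "b", "endpoint": "customers"})

delivered = []
outcomes = [router.process_next(lambda m: delivered.append(m["id"])) for _ in range(4)]
print(outcomes)   # ['delivered', 'delivered', 'delayed', 'delayed']

router.saturated_endpoints.discard("customers")  # endpoint is ready again
router.drain_delayed(lambda m: delivered.append(m["id"]))
print(delivered)  # ['a1', 'a2', 'a3', 'b1']
```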
Our next step with SQS and our queue development is to add webhook callbacks so that outside systems can be notified about messages on the queue. We are currently evaluating Amazon SNS, among other options, as a way to generate these notifications. A general notification framework will allow for more complete workflows that include MerchantOS as one component, potentially including websockets for a richer client interface.
Our queueing infrastructure has evolved over time to help us create decoupled integrations while reducing our application complexity. Amazon SQS is an integral piece of that effort, and it also allows us to scale as our user base grows and the number of messages per user increases. We’ve got some exciting changes coming to our queueing infrastructure that will allow for richer interactions with API clients. Watch our blog for more posts as we release these changes.