Using Amazon SQS as a Messaging Bus

In a previous blog post I wrote about how we use a message queue server in a Message Bus design. Since that post was written we’ve moved from ActiveMQ to RabbitMQ, and now we are moving to the Amazon Web Services Simple Queue Service (SQS). I’m going to cover:

  • What is SQS?
  • The Message Bus design pattern
  • How we’re using SQS to integrate with outside systems.
  • Bonus: I’ll also include how we’re looking at using SNS for notifications.

Intro to AWS SQS

Amazon Web Services is a suite of hosted services run by Amazon that are available on a pay-per-use basis. SQS (short for Simple Queue Service) is their message queue service. SQS operates much like other queue servers. You can:

  • Create queues
  • Place messages on queues
  • Read messages off of queues.

One important difference with SQS is that there is no guarantee about the order in which messages are delivered. Most standalone queue servers deliver messages in a First-In, First-Out (FIFO) manner. The primary reason that messages can be delivered out of order with SQS is a feature called message visibility. When a client reads a message from SQS, the message is hidden from other clients for a period called the visibility timeout. This prevents multiple clients from getting the same message to process. But unlike some other queues, reading a message from the queue does not remove it. A client must explicitly delete the message, which acts as the acknowledgement, for it to be removed from the queue. If a client reads a message but doesn’t delete it, the message becomes visible again once the visibility timeout expires. While the message is hidden, other messages can be processed and completed before it becomes visible again.
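The visibility behaviour described above can be sketched with a small in-memory model. This is only an illustration and does not use the AWS SDK: the `VisibilityQueue` class, its `send`, `receive` and `remove` methods, and the 30 second timeout are all hypothetical names and values chosen for the example.

```javascript
// In-memory model of an SQS-style queue (illustration only, not the AWS SDK).
class VisibilityQueue {
  constructor(visibilityTimeoutMs) {
    this.visibilityTimeoutMs = visibilityTimeoutMs;
    this.messages = []; // each entry: { id, body, invisibleUntil }
    this.nextId = 1;
  }

  send(body) {
    this.messages.push({ id: this.nextId++, body: body, invisibleUntil: 0 });
  }

  // Return the first visible message and hide it for the timeout period.
  // The message stays on the queue until remove() is called.
  receive(now) {
    const msg = this.messages.find((m) => m.invisibleUntil <= now);
    if (!msg) return null;
    msg.invisibleUntil = now + this.visibilityTimeoutMs;
    return { id: msg.id, body: msg.body };
  }

  // Acknowledge a message by removing it from the queue.
  remove(id) {
    this.messages = this.messages.filter((m) => m.id !== id);
  }
}

// A consumer reads "A" but never acknowledges it, so "B" completes first.
function demoOutOfOrder() {
  const queue = new VisibilityQueue(30000);
  queue.send('A');
  queue.send('B');
  const completed = [];

  const first = queue.receive(0);      // "A" is read but not acknowledged
  const second = queue.receive(1000);  // "A" is hidden, so "B" is delivered
  completed.push(second.body);
  queue.remove(second.id);

  const retry = queue.receive(31000);  // timeout expired, "A" is visible again
  completed.push(retry.body);
  queue.remove(retry.id);

  return { firstRead: first.body, completed: completed };
}
```

In this sketch the messages were sent as A then B, but they complete as B then A because the first read of A was never acknowledged.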

What is a Message Bus

A Message Bus is a software design pattern that allows message senders and receivers to be decoupled. This means that senders and receivers can be added and removed without affecting the other senders and receivers attached to the bus.
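As a rough illustration of that decoupling, here is a minimal in-process bus. It is a sketch of the general pattern, not our production code: the `MessageBus` class and its `subscribe`, `unsubscribe` and `publish` methods are hypothetical names.

```javascript
// Minimal in-process message bus (illustration only).
class MessageBus {
  constructor() {
    this.subscribers = new Map(); // message type -> array of handler functions
  }

  subscribe(type, handler) {
    if (!this.subscribers.has(type)) this.subscribers.set(type, []);
    this.subscribers.get(type).push(handler);
  }

  unsubscribe(type, handler) {
    const handlers = this.subscribers.get(type) || [];
    this.subscribers.set(type, handlers.filter((h) => h !== handler));
  }

  // Deliver a message to every handler registered for its type.
  // Returns the number of handlers that received it.
  publish(type, payload) {
    const handlers = this.subscribers.get(type) || [];
    handlers.forEach((handler) => handler(payload));
    return handlers.length;
  }
}
```

A sender only calls `publish('Item', payload)` and never needs to know which receivers exist, so a receiver can be added or removed with `subscribe` or `unsubscribe` without touching any sender.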

While we started with a message bus in our design for using SQS as a queue, we are not currently taking full advantage of the Message Bus design pattern as we use one sender class and one process for reading and distributing messages to the correct subscribers.

How we are using SQS

Now that we’ve covered SQS and the basic concept of a message bus, let’s look at how we put it all together at MerchantOS.

Generating Messages

We have many different message types that are generated and placed on our queue (Item, Customer, integration-specific messages, etc.). Since we generate all of these messages within our own application, we have simplified our message generation code to one class that is used across our system. Using one central message producer reduces the overall architectural complexity of generating messages and placing them on the queue. It also allows us to integrate other message generators in the future as we need or want them.

Receiving Messages

On the receiving side we have used a few techniques to improve the reliability of receiving and dispatching messages.

Using a CRON to receive messages

To receive messages we use one CRON process that wakes up and forks child processes. Each child process is responsible for taking one message off the queue and forwarding it on to its appropriate destination. We previously used a separate process for each endpoint that needed a message. The downside of that technique was, again, the complexity of monitoring each process and diagnosing issues when they arose.

Reducing Coupling

While using one parent process that creates worker processes has potential pitfalls as a point of failure, we have mitigated those by reducing the coupling of the worker processes to the endpoints that they call. One technique we used was to have the workers call the endpoint (typically an RPC endpoint on our API), post its message payload and disconnect without waiting for a response. Each endpoint becomes responsible for logging its own success or failure. By keeping the workers short-lived we reduce the overall impact of one worker failing.

Routing Messages

Another way that our queue usage has evolved is by consolidating all messages into one queue. Previously we used per account queues, which allowed for segregating accounts that might be generating many queue messages from those that only generate a few. In the long run we found that it failed to give us any practical benefit.

We now use a two queue system where all messages are first placed on a “priority queue”. These messages are read off and dealt with by the worker processes described above. If one account is generating many messages, or if we have determined that the endpoint they are calling is saturated and cannot process any more messages, they are re-routed to a delayed queue. Messages on the delayed queue are read once the message volume has reduced or the endpoint is ready to receive messages again.
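The re-routing decision a worker makes can be sketched roughly as follows. This is a simplified illustration under assumed inputs, not our actual worker code: the `routeMessage` function, the message fields `accountId` and `endpoint`, and the `maxPerAccount` threshold are all hypothetical.

```javascript
// Decide what a worker should do with a message read from the priority queue
// (illustration only).
// accountCounts: Map of account id -> messages seen in the current window
// saturatedEndpoints: Set of endpoint names that cannot take more work
function routeMessage(message, accountCounts, saturatedEndpoints, maxPerAccount) {
  const seen = (accountCounts.get(message.accountId) || 0) + 1;
  accountCounts.set(message.accountId, seen);

  // The endpoint cannot process any more messages right now.
  if (saturatedEndpoints.has(message.endpoint)) return 'delay';
  // This account is generating more than its share of messages.
  if (seen > maxPerAccount) return 'delay';
  return 'dispatch';
}
```

A result of `'delay'` would mean placing the message on the delayed queue, while `'dispatch'` would mean forwarding it to its endpoint straight away.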

This arrangement of a limited number of message producers, sending to a limited number of queues, read by a limited number of consumers, has helped to reduce our queue complexity, improve the transparency of the message flow through the system, and give us better monitoring of our queue status.

Integrating SNS

Our next step with SQS and our queue development is to add webhook callbacks that can be notified about messages on the queue. We are currently evaluating Amazon SNS, among other options, as a way to generate notifications. A general notification framework will allow for more complete workflows that include MerchantOS as one component, potentially including websockets for a richer client interface.

Wrap-up

Our queueing infrastructure has evolved over time to help us create decoupled integrations while reducing our application complexity. Amazon SQS is an integral piece of that, and it also allows us to scale as our user base grows and the number of messages per user increases. We’ve got some exciting changes coming to our queueing infrastructure that will allow for richer interactions with API clients. Watch our blog for more posts as we release these changes.

AWS DynamoDB For Session Redundancy And Failover

This article will go over what DynamoDB is and how we use it to back up our session data. It allows us to fail over from one datacenter (AWS region) to another without losing session data and logging people out of our system.

Amazon Web Services (AWS) DynamoDB

DynamoDB is a cloud NoSQL database hosted by Amazon. You simply create a table and set the read/write capacity you want and Amazon takes care of the rest. No servers to manage or scale. Pretty awesome. Actually it’s so awesome that Amazon DynamoDB is the fastest growing new service in the history of AWS.

What is DynamoDB good at?

  • No hassle data store that is performant, scalable, and reliable.
  • Quick reads/writes. You decide the performance you need and pay for that.
  • Relatively cheap.
  • Access data via a key.
  • Store data that is in the 1-10KB range per item (you can store more but it gets expensive).

What does DynamoDB suck at?

  • It’s not a relational database. You can’t query it with joins or complex selects.
  • It’s not as fast as memcache (in our experience).
  • For large items or extremely high throughput it can get expensive (compared to running your own memcache service for example).
  • There is no simple way to back up your data (they do have a process by which you can get data to S3 but it’s pretty complicated and involves two other totally separate services from AWS).

Using DynamoDB For Session Backup

The blog post Scalable Session Handling in PHP Using Amazon DynamoDB covers how to implement session handling for PHP using DynamoDB. I experimented with this, but I found it was too slow for our needs. Our session reads with DynamoDB were taking 20+ ms. Our memcache session reads are an order of magnitude faster than that. We also already have memcache session handling implemented and working beautifully.

Why Do We Need To Back Up Our Sessions?

For some applications it might be acceptable to log everyone out if you failed over from one datacenter to another (heck, lots of applications run in only one datacenter and don’t have any failover). Our application is mission critical for our customers: if we’re down they’re losing money. We run completely redundant setups in two different AWS regions on EC2.

If there’s a problem we have to switch customers from one datacenter to the other. Not having the sessions backed up in a way that both datacenters can access would mean everyone using our system would be logged out. This is a real hassle for our users because sometimes the stores are operating with a manager login and associates just have PINs to switch to their profile. If they get logged out they have to have the manager come by and log back in. What if the manager is out on lunch? Out of luck.

We previously stored our session backup in our MySQL database. But this became unscalable as the number of concurrent sessions grew. A few months ago we turned off our MySQL session storage and have been running just on memcache sessions. It improved performance, but it meant we might log everyone out if we had to switch datacenters.

DynamoDB To The Rescue

I looked at a number of different NoSQL solutions for our session backup. DynamoDB made it to the top of the list because of its ease of management, scalability and price.

The basic concept is:

  • Sessions are stored in and read from Memcache (every page hit)
  • Every 15 minutes we write the session to DynamoDB (each session stores its time since last DynamoDB write)
  • If we can’t find a session in Memcache (datacenter failure, or Memcache reboot) we look for it in DynamoDB.
  • Result: Users aren’t logged out if we switch datacenters or reboot Memcache.
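The steps above can be sketched independently of the PHP session handler. The following is only an illustration: plain `Map` objects stand in for Memcache and DynamoDB, and the `writeSession` and `readSession` functions and the `lastBackup` field are hypothetical names, not part of our code.

```javascript
const BACKUP_INTERVAL_MS = 15 * 60 * 1000; // 15 minutes, as in the list above

// cache and backup are Map objects standing in for Memcache and DynamoDB.
function writeSession(cache, backup, id, data, now) {
  const existing = cache.get(id);
  // null means this session has never been written to the backup store.
  let lastBackup = existing ? existing.lastBackup : null;
  if (lastBackup === null || now - lastBackup >= BACKUP_INTERVAL_MS) {
    backup.set(id, data);
    lastBackup = now;
  }
  cache.set(id, { data: data, lastBackup: lastBackup });
}

function readSession(cache, backup, id, now) {
  const cached = cache.get(id);
  if (cached) return cached.data;
  if (!backup.has(id)) return null;
  // Cache miss (failover or cache restart): restore from the backup store
  // and repopulate the cache so later reads are fast again.
  const data = backup.get(id);
  cache.set(id, { data: data, lastBackup: now });
  return data;
}
```

The trade-off in this sketch is that a session restored from the backup store can be up to 15 minutes stale, in exchange for far fewer writes to the slower store.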

DynamoDB PHP Code Samples

I gleaned a lot of this code from the AWS blog post on PHP DynamoDB sessions.

readDynamoDB

This reads our session data out of DynamoDB when we need it. We call this from our custom session reading function (see session_set_save_handler).

function readDynamoDB($ses_id)
{
  $this->initDynamoDB();
  $result = '';
  $response = $this->_dynamodb->get_item(
    array( 'TableName' => self::DYNAMODB_TABLE,
           'Key' => array('HashKeyElement' => $this->_dynamodb->attribute($ses_id)),
           'ConsistentRead' => true, )
         );

  $node_name = 'Item';
  if ($response->isOK())
  {
    $item = array();
    // Get the data from the DynamoDB response
    if ($response->body->{$node_name})
    {
      foreach ($response->body->{$node_name}->children() as $key => $value)
      {
        $item[$key] = (string) current($value);
      }
    }
    if (isset($item['expires']) && isset($item['data']))
    {
      // Check the expiration date before using
      if ($item['expires'] > time())
      {
        $result = $item['data'];
      }
      else
      {
        $this->deleteDynamoDB($ses_id);
      }
    }
  }
  return $result;
}

writeDynamoDB

This writes our session data to DynamoDB. We call this from our custom session writing function (see session_set_save_handler).

function writeDynamoDB($ses_id,$data,$expire_minutes)
{
  $this->initDynamoDB();
  // Write the session data to DynamoDB
  $response = $this->_dynamodb->put_item(
    array( 'TableName' => self::DYNAMODB_TABLE,
           'Item' => $this->_dynamodb->attributes(
             array( self::DYNAMODB_HASH => $ses_id,
                    'expires' => time() + ($expire_minutes*60),
                    'data' => $data,
                   )
            ),
    )
  );
  return $response->isOK();
}

deleteDynamoDB

This deletes our session data from DynamoDB when we are done with it. We call this from our custom session destroy and gc functions (see session_set_save_handler).

function deleteDynamoDB($ses_id)
{
  $this->initDynamoDB();
  $delete_options = array( 'TableName' => self::DYNAMODB_TABLE,
                           'Key' => array('HashKeyElement' => $this->_dynamodb->attribute($ses_id)),
                         );
  // Send the delete request to DynamoDB
  $response = $this->_dynamodb->delete_item($delete_options);
  return $response->isOK();
}

Pick Up A Great Website For Cheap

We’ve all seen those websites that show up in the SERPs just because the domain has been around for years and has managed to pick up a few links along the way. They’re listed in dmoz, PR 4 or 5, but the content is really old. What if there was a way to dig through the list of sites in your niche and find the few that have been abandoned by their owners? Don’t you think some of those owners would be willing to sell you their websites far below their true value? They are! You just need to find them.

Old, Authoritative, Ranking Websites Are Waiting For You

I built a tool that will help you find sites you can buy that are old and authoritative in your niche. There are websites in almost every niche that have built up authority over years but have been abandoned by their owners. These absentee owners are often happy to unload their old websites on you for very little. The trick is finding these hidden gems amongst all the active websites. You can spend hours going through directories and SERPs looking for websites that are abandoned or you can use the tool I built to find them quickly and automatically.

Sitefinder301 – Find A Killer Deal On An Old Website.

Sitefinder301 uses the dmoz Open Directory and the Internet Archive to find sites that are relevant to your niche that haven’t been updated in a long time. If you’re lucky you can find a great site with an owner that is willing to sell it for very little.

How It Works

You supply a dmoz category URL and Sitefinder301 examines all the sites that are in it and finds the ones that haven’t been updated in a long time. It gives you PageRank and Whois information on the sites so you can quickly determine if it’s worth buying and who to contact. It’s a quick and easy way to come up with a short list of acquisition targets out of hundreds of possible websites.
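The core filtering step can be pictured with a short sketch. This is not the Sitefinder301 source (which is PHP); it is an illustration that assumes you already have a last-changed date for each site, and the `findStaleSites` function and its input shape are hypothetical.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// sites: array of { url, lastChanged } where lastChanged is a Date
// Returns the sites idle for at least minIdleDays, longest-idle first.
function findStaleSites(sites, now, minIdleDays) {
  return sites
    .map((site) => ({
      url: site.url,
      idleDays: Math.floor((now - site.lastChanged) / MS_PER_DAY),
    }))
    .filter((site) => site.idleDays >= minIdleDays)
    .sort((a, b) => b.idleDays - a.idleDays);
}
```

Sorting by idle time puts the most neglected sites at the top of the short list.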

Enough Already Where’s The Goods?

OK, here it is, the product of late-night PHP hacking:

Find Killer Deals On Old Websites: Sitefinder301

Benefiting From Your Acquisition

Here are a couple of ideas for what to do with your new site:

  • Put up new content and rank quickly for valuable terms. Build upon the old site’s authority and create something of real value.
  • Put up a page that promotes your product or service and funnels visitors to your current website.
  • Redirect the old site to your current site and benefit from its authority and traffic.

Share And Share Alike

Sitefinder301 is open source. You can install it on your SEO tools website or hack it for your own purposes. Download the source here: /makebeta/sitefinder301/sitefinder301code.zip
All I ask is you pass some link friendship back my direction :) .

The Spy Is Dead – Spyjax Offline

The End Of Spyjax

I will no longer be hosting Spyjax.

It’s been fun and very interesting, but it’s time to call it quits. Spyjax is just a side project but it’s eating up server power so I’ve decided to turn it off. I kicked around the idea of turning it into a real service / business but I’m just not that interested in it.

Thank You For Commenting and Linking!

It’s been really fun reading everyone’s comments here and on all the blogs that wrote about Spyjax. There were some great discussions around privacy and ethics. Some of the best comments were on the SEO Book blog: Spy on Visitor Browsing History for Competitive Research.

Full Source Code Available

Because I’ve decided to shut down the service I’m giving away all the code as open source. Anyone can install the Spyjax service on their server. Someone else could even start the service up and let others use it. You can download the source here: Spyjax Source Code.
Spyjax uses PHP and MySQL so most web servers should be able to run it. You’ll need to do a little bit of configuration, mainly in the config.php file. There might be a place or two where you need to put your email address for sign up and feedback emails to work correctly. Also the urls.php file needs your MySQL database host, user name, and password. The service assumes that it’s using a database named “spyjax”. The schema for the database can be found in spyjax.sql. If you have any questions about how to install it, contact me using the form over here.

4,808,202 URLs Were Found

While Spyjax was running it tracked 282,542 visitors and collected 4,802,202 URLs from their browser history. Here’s the top 50 URLs found by Spyjax:

  1. http://www.google.com – 59927
  2. http://www.yahoo.com – 45591
  3. http://www.myspace.com – 29869
  4. http://www.youtube.com – 27057
  5. http://www.msn.com – 22045
  6. http://www.hotmail.com – 16969
  7. http://www.ebay.com – 16123
  8. http://www.facebook.com – 13855
  9. http://www.mapquest.com – 11689
  10. http://www.cnn.com – 9285
  11. http://www.weather.com – 8634
  12. http://www.amazon.com – 7852
  13. http://www.wikipedia.org – 7401
  14. http://www.aol.com – 7381
  15. http://www.imdb.com – 6023
  16. http://www.walmart.com – 5826
  17. http://www.gmail.com – 5740
  18. http://www.apple.com – 5549
  19. http://www.flickr.com – 5197
  20. http://www.chase.com – 5053
  21. http://www.ask.com – 4919
  22. http://www.google.co.uk – 4876
  23. http://www.bestbuy.com – 4682
  24. http://www.usps.com – 4607
  25. http://www.craigslist.org – 4555
  26. http://www.download.com – 4474
  27. http://www.digg.com – 4389
  28. http://www.paypal.com – 4038
  29. http://www.dell.com – 3944
  30. http://www.bankofamerica.com – 3846
  31. http://www.bbc.co.uk – 3798
  32. http://www.southwest.com – 3753
  33. http://www.adobe.com – 3704
  34. http://www.comcast.net – 3657
  35. http://www.t-mobile.com – 3636
  36. http://www.yellowpages.com – 3584
  37. http://www.monster.com – 3578
  38. http://www.nytimes.com – 3556
  39. http://www.live.com – 3550
  40. http://www.hp.com – 472
  41. http://www.orbitz.com – 3456
  42. http://www.whitepages.com – 3339
  43. http://www.microsoft.com – 3310
  44. http://www.capitalone.com – 3251
  45. http://www.ticketmaster.com – 3228
  46. http://www.target.com – 3157
  47. http://www.realtor.com – 3114
  48. http://www.ebay.co.uk – 3064
  49. http://www.kbb.com – 3036
  50. http://www.blogger.com – 3031

Spyjax – Your browser history is not private!

If you’re like most web users, you assume that your browser history is private. For example, if you visit a point of sale software company, you assume they can’t see whether you’ve been looking at their competitor. Just a few weeks ago I assumed this was the case. Guess what?

Your browser history is not private!

In fact, with a few well-crafted lines of JavaScript, websites can examine your browser history and record what pages you have been to. Keep reading and I’ll tell you exactly how it’s done and introduce you to a service that any webmaster can put on their site to see what pages their users have visited. I’ll also tell you exactly what type of information can be retrieved, and how you can protect yourself.

How JavaScript Can Be Used To Steal Your Browser History:

With CSS website designers can make links a different color if they have been visited by the user. For example this link should be colored differently than this other link. The first link you have been to before (it’s the page you are on right now) while the second link you have never visited (because it is fictitious). Now you’re thinking “but how can this be used to steal my history?”. Let’s dive a little deeper.

JavaScript Can Examine The Color Of Your Links = Steal Your Browsing History

JavaScript can examine the rendered state of an HTML document, called the DOM. One of the properties available through the DOM is the current set of CSS attributes of a node (nodes are HTML tags, one of which is the <a> or link tag).
All a website has to do to see what pages you’ve been to is place a list of links on the page and examine the color of those links. Ajax can be used to retrieve a list of links to test and also to send the results back to the server without the user ever knowing.
The code to do this examination can be a little tricky due to cross-browser issues. Here is a snippet of JavaScript that can do the evaluation (based on the Hey you! Where have you been? blog post by Peter van der Graaf and script from Jeremiah Grossman and Robert Cabri):

 

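function hasLinkBeenVisited(url) {
  // Create a throwaway link so the browser applies its visited styling to it
  var link = document.createElement('a');
  link.href = url;
  document.body.appendChild(link);
  var visited = false;
  if (link.currentStyle) {
    // Older Internet Explorer: currentStyle reports the color as written in the CSS
    visited = (link.currentStyle.color == '#ff0000');
  } else {
    // Other browsers: getComputedStyle reports the color in rgb() form
    var computed_style = document.defaultView.getComputedStyle(link, null);
    if (computed_style) {
      visited = (computed_style.color == 'rgb(255, 0, 0)');
    }
  }
  // Remove the test link so checked links don't pile up in the page
  document.body.removeChild(link);
  return visited;
}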

The code above assumes that CSS rules are making links that have been visited red (#ff0000) and new links a different color.

Ajax Can Be Used To Examine Thousands Of Links Dynamically

A clever web developer can use Ajax to dynamically load a list of links for each new visitor. A couple hundred links can be grabbed at a time and examined without slowing down the page noticeably. If you spend just a few seconds on a website thousands of URLs can be checked.
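The batching idea can be sketched as below. This is an illustration, not the Spyjax code: the `checkInBatches` function is a hypothetical helper, and `isVisited` is whatever check you plug in (in a browser it could be a function like hasLinkBeenVisited). The sketch runs synchronously; spreading the batches out over time and fetching each list via Ajax is left out.

```javascript
// Check candidate URLs in fixed-size batches (illustration only).
// isVisited is any function that takes a URL and returns true or false.
function checkInBatches(urls, batchSize, isVisited) {
  if (batchSize < 1) throw new RangeError('batchSize must be at least 1');
  const results = [];
  for (let start = 0; start < urls.length; start += batchSize) {
    const batch = urls.slice(start, start + batchSize);
    // Each batch result could be sent back to the server on its own.
    results.push(batch.filter((url) => isVisited(url)));
  }
  return results;
}
```

Keeping each batch to a couple of hundred links is what lets the page stay responsive while the checks run.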

The Limitations

This technique does not allow sites to read your entire browser history. It only allows a site to test a predefined list of URLs to see if you have visited any of them. It’s like the card game “go fish”: you can’t see the other players’ cards, but you can ask them if they have a particular card. Most likely this technology would be used to examine a list of competing URLs. This could give a website valuable information on who their competitors really are and what information on those sites is being looked at.

How To Stop People From Spying On Your Browser History

There are two surefire ways to stop people from stealing your browser history.

  1. The nuclear option is to disable JavaScript within your browser. In Firefox you’d just go to Tools -> Options -> Content tab and then uncheck “Enable JavaScript”. This method is very limiting because you probably enjoy all the JavaScript goodness on the web.
  2. Limit your browser history. The less browser history you store the fewer URLs someone can steal from that history. In Firefox you can change the amount of browser history by going to Tools -> Options -> Privacy and then either uncheck the “Remember visited pages” checkbox or change the number of days that history is stored for.

UPDATE: Spyjax Has Been Turned Off

I will no longer be hosting Spyjax. It’s been fun and very interesting, but it’s time to call it quits. Read more here.

Introducing Spyjax

One Line Of JavaScript And You Can Start Spying

OK, now that I’ve explained how this all works and how you can protect yourself, I want to introduce you to a small piece of code I wrote that makes it super easy for you to spy on your website visitors. It’s called Spyjax and here’s how it works.

  1. Sign Up For An Account

    All that’s required is your email address and a password of your choosing. I promise I will not send you any unwanted email or give your email address away to anyone else. Sign Up For Spyjax

  2. Add URLs To Look For

    You can add custom URLs, the top 12 Google results for any search, or just look for the home page of the top 10,000 sites on the web.

  3. Put One Line Of Code At The Bottom Of Your Pages

    A simple <script> tag will insert all the JavaScript needed to spy on your visitors as well as communicate with the Spyjax service to record the results.

  4. Optionally Add A Spyjax Widget To Your Site

    If you just want to have some fun and show people that you’re spying on them you can put one of three Spyjax widgets on your website. There’s one on this site on the right sidebar.

Update: Spyjax Only Gives You Anonymous Data

There have been some concerns raised since I first published this article and released Spyjax. So I just wanted to point out that the service does not link specific websites with identifiable user data. It simply tells you things like 36% of your visitors have been to http://www.google.com/. Basically all the data collected by Spyjax is anonymous and shown in aggregate form. Obviously this same technology could be used to track a specific user’s history, especially if you’re on a site that records your identity in some way. In my humble opinion it’s much better to debate these issues in the open than to have this sort of technology floating around without people knowing about it.

So You Just Want The Code?

Well I’m not greedy, so I’m giving it away for free. You can do anything you want with it, just don’t blame me if it breaks or gets you in trouble.

You can download the code here: Spyjax Code. It’s got an open source Attribution Assurance License attached to it.

Check out these services by my company MerchantOS:

  • POS Software – A point of sale and inventory control system for small retailers.
  • Bike Shop Point of Sale – A full POS solution specifically designed for independent bicycle retailers.