Before I get started talking about our experience with Gearmand over the last year, you should read up on what it does and how it does it if you do not already know. I first learned about Gearmand at an Atlanta PHP meeting on 2012-11-01. Brian Moon from Dealnews.com gave a great presentation (video linked) on how they use Gearmand at Deal News. He is the maintainer of the Net_Gearman PEAR package and GearmanManager, both great libraries that make it much easier to work with Gearmand when getting started. A big thanks to Brian Moon for giving the talk and answering a few questions for me along the way.

Prior to learning about Gearmand we had developed a set of CRON jobs, or rather looping CRON jobs, that handled all of our image processing when a user uploaded their photos. This worked pretty well, but there were small issues from time to time. As anyone who has ever created long-running PHP-based CRON jobs knows, it's never without its issues. We had built in things like automatically dying after a set period of time to help with memory cleanup, a maximum number of photos to process before the script should die, and automatic sleeping when there were no new photos to process. These all worked well enough, but we were never thrilled with how efficient they were. We were always striving to have photos online as fast as possible, yet there was still a gap between when a photo was received and when we started processing it. Sometimes that gap existed simply because of the decision to use a cron job that sleeps and loops. Needless to say, when we were introduced to Gearmand we saw just how perfect a fit it was for our workload.

Photos were not the only thing we had running through a CRON job. Here is a short list of some of the things we have cut over from CRON jobs to jobs that Gearmand manages.

  • Photo resizing
  • Photo color profile changes
  • Photo watermarking
  • Email sending
  • New user signup tasks
  • Cropping of event/album covers
  • Report generation

When we first started implementing Gearmand we ran into issues that hardly anyone was talking about online. We run 100% on AWS and utilize spot instances for a good amount of our processing workload, so we started seeing problems when a worker node was taken away from us. For those on AWS who do not know about spot instances, read up on them, they will save you tons of money! The catch with spot instances is that they can be taken from you at any time, effectively without notice. The issue was that, at the time, GearmanManager did not tell Gearmand that the work should be reissued to another worker if a worker died without reporting an error. After figuring out that this was the problem and submitting a pull request to GearmanManager, we were off to the races.

We now have just about 30 jobs running through Gearmand on an ongoing basis. The first piece that we cut over was the sending of emails. It was nice to get back to having emails feel like they were sent immediately, especially when a user is resetting their password, for example. Each time an email needs to be sent we save it to disk and then dispatch a job. The job gets picked up by a worker and processed very quickly. Another thing we found was that sometimes we had a transaction that would take, say, 1-2 seconds to complete and commit. If a job was dispatched as part of the work inside the transaction, sometimes the worker would be looking for rows in the database that did not yet exist. Part of our job setup in the worker is to check whether the needed information is available; if not, it sleeps for an increasing amount of time to allow the transaction to finish.
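
Here is a minimal sketch of that dispatch-and-retry pattern using the pecl/gearman extension. The job name, payload shape, and the findEmailRecord()/deliverEmail() helpers are made up for illustration and are not our actual code.

// client side: queue the email job right after the message is written to disk
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('send_email', json_encode(array('email_id' => 123)));

// worker side: back off with increasing sleeps if the row the job refers to
// has not been committed yet
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('send_email', function (GearmanJob $job) {
    $payload = json_decode($job->workload(), true);

    for ($attempt = 0, $wait = 1; $attempt < 5; $attempt++, $wait *= 2) {
        $email = findEmailRecord($payload['email_id']); // hypothetical DB lookup
        if ($email !== null) {
            deliverEmail($email); // hypothetical send helper
            return;
        }
        sleep($wait); // sleep 1, 2, 4, 8... seconds while the transaction finishes
    }

    $job->sendFail(); // give up so Gearmand can hand the job to another worker
});

while ($worker->work());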

[Figure: jobs]

Gearmand is quick, very quick! The average time it took a photo to get processed through the old cron-based queue and become available on the site for viewing was around 6-8 seconds. That included all watermarking, resizing, color profile changes, etc. Typically a user would not even notice this because they were uploading other photos at the same time. We have always told users who asked that we try our best to make sure all photos are online within 60 seconds. With the move to Gearmand we now average around 2-4 seconds.

[Figure: queue_time]

Our worker nodes scale up and down automatically as the overall needs of the workload change. If a job dies or does not finish for some reason, it immediately gets re-queued and another worker picks it up.

From an architecture standpoint, we have two job servers. The job server is chosen at random when work is submitted by the client. While these two machines are basically idle all of the time, having both gives us redundancy in case of an issue or when we need to do maintenance. One of our favorite things about the way Gearmand works is that we can shut down all worker nodes and the work simply pools up on the job servers. Once we turn the worker nodes back on, the backlog gets worked through.
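
As a rough illustration of the client side of that setup, this sketch assumes the pecl/gearman extension and uses placeholder hostnames for the two job servers:

// the two job server hostnames here are placeholders
$jobServers = array('gearman1.internal:4730', 'gearman2.internal:4730');

$client = new GearmanClient();
// pick one of the two job servers at random for this submission
$client->addServers($jobServers[array_rand($jobServers)]);
$client->doBackground('resize_photo', json_encode(array('photo_id' => 456)));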

[Figure: gearmand_layout]

As we go forward with adding new features to ShootProof we will be sure to utilize the Gearmand cluster in any way we can. It has proven itself to be a very stable piece of software that we have come to rely on.

So we code poets over at ShootProof have been hard at work upgrading our server architecture, and one of the issues we ran across was how to always know which app servers are active and should be considered part of the memcached pool. The goal was to have it totally automated so that we can bring machines up and down and never have to think about updating a list of IPs. The solution we came up with is to have a cron running on all of the app servers that updates a record in SimpleDB. This cron runs each minute and gathers some information; mainly it updates a timestamp in its SimpleDB record saying that it is alive. Near the end of the script it runs a query against SimpleDB to get a list of all of the active servers. It takes that list, builds an array of IP addresses, and puts that in the local memcache instance. The web application can then do a lookup to get the list of IPs and build the cluster at run time. Below is a copy of the script; feel free to pick it apart in the comments.

require_once('lib/sdb.php');
 
define('AWS_ACCESS_KEY_ID', 'XXXXXXXXXXXXXXX');
define('AWS_SECRET_ACCESS_KEY', 'YYYYYYYYYYYYYYY');
 
// machine specific information
$amiId = file_get_contents('http://169.254.169.254/latest/meta-data/ami-id');
$machineData = array(
    'ami_id' => array('value' => $amiId),
    'ip' => array('value' => file_get_contents('http://169.254.169.254/latest/meta-data/local-ipv4')),
    'hostname' => array('value' => file_get_contents('http://169.254.169.254/latest/meta-data/local-hostname')),
    'availability_zone' => array('value' => file_get_contents('http://169.254.169.254/latest/meta-data/placement/availability-zone')),
    'last_updated' => array('value' => time())
);
 
 
$sdb = new SimpleDB(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY);
$domain = 'simple_db_domain_name';
 
$expectedMachineData = null;
 
$objectInfo = $sdb->select($domain, "select * from `" . $domain . "` where `ami_id` = '" . $amiId . "'");
 
if (count($objectInfo)) {
    $attributes = $objectInfo[0]['Attributes'];
 
    $tmpMachineData = array();
 
    foreach ($machineData as $key => $data) {
        $data['replace'] = 'true';
        $tmpMachineData[$key] = $data;
    }   
 
    $machineData = $tmpMachineData;
 
    $expectedMachineData = array(
        'last_updated' => array(
            'value' => $attributes['last_updated']
        )   
    );  
}
 
// put the object into simple db
$sdb->putAttributes($domain, $amiId, $machineData, $expectedMachineData);
 
// find any old records to clean up
$recordsToDelete = $sdb->select($domain, "select * from `" . $domain . "` where `last_updated` <= '" . (time() - 300) . "'");
 
foreach ($recordsToDelete as $toDelete) {
    // if this record belongs to the machine this script is running on, skip the
    // delete; its timestamp was already refreshed by the update above
    if ($toDelete['Name'] == $amiId) {
        continue;
    }   
 
    // let's delete the object
    $sdb->deleteAttributes($domain, $toDelete['Name']);
}
 
// find any active servers
$activeMachines = $sdb->select($domain, "select * from `" . $domain . "` where `last_updated` > '" . (time() - 300) . "'");
 
$ips = array();
 
// make the list of all ips
foreach ($activeMachines as $activeMachine) {
    if (trim($activeMachine['Attributes']['ip']) == '') {
        continue;
    }   
 
    $ips[] = trim($activeMachine['Attributes']['ip']);
}
 
$memcache = new Memcache();
$memcache->addServer('127.0.0.1', '11212');
$memcache->set('cluster-ip-list', $ips, 0);

It is also a good idea to make a separate user in AWS for these scripts so that you can get a new access key and secret that can be locked down to just that domain in SimpleDB with only the needed permissions.
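
For completeness, here is a minimal sketch of the app-side lookup described above. It assumes ext/memcache, the same local instance and key the cron script writes to, and that the shared pool listens on the default memcached port:

$local = new Memcache();
$local->addServer('127.0.0.1', 11212);

// the cron script above keeps this key up to date every minute
$ips = $local->get('cluster-ip-list');
if (!is_array($ips) || !count($ips)) {
    $ips = array('127.0.0.1'); // fall back to the local node if the list is missing
}

// build the shared memcached pool from the discovered app servers
$cluster = new Memcache();
foreach ($ips as $ip) {
    $cluster->addServer($ip, 11211);
}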

Currently the general rule when using SSL is that you need one IP address for each hostname you want to secure. This will change as SNI (Server Name Indication) becomes widely adopted. For the time being, if you are lucky enough to only need to secure multiple subdomains of the same domain with a wildcard SSL cert, then keep reading below.

1. Ensure that your Apache config includes:

NameVirtualHost *:443

2. Your vhosts:

<VirtualHost *:443>
ServerName subdomain1.example.com
……
SSLEngine on
SSLProtocol all -SSLv2
SSLCipherSuite ALL:!ADH:!EXPORT:!SSLv2:RC4+RSA:+HIGH:+MEDIUM:+LOW
SSLCertificateFile /path/to/your/ssl.crt
SSLCertificateKeyFile /path/to/your/ssl.key
……
</VirtualHost>
<VirtualHost *:443>
ServerName subdomain2.example.com
……
SSLEngine on
SSLProtocol all -SSLv2
SSLCipherSuite ALL:!ADH:!EXPORT:!SSLv2:RC4+RSA:+HIGH:+MEDIUM:+LOW
SSLCertificateFile /path/to/your/ssl.crt
SSLCertificateKeyFile /path/to/your/ssl.key
……
</VirtualHost>
If my understanding of Apache is correct, it will enter the first SSL virtual host it finds in this case and use the certificate details there to decrypt the request. If the hostname does not match at that point, it will move along to the next virtual host it can match and try there.

I am currently working on a new startup called ShootProof. ShootProof utilizes many of Amazon's web services. We have recently been hearing sporadic feedback from our beta testers that sometimes their uploads are slower than they think they should be. Currently the way we accept uploads is that we send each file up via XMLHttpRequest to our EC2 instances, do some quick inspection of the file, and then store it in an upload bucket. A few moments later a resizer batch job comes along, does resizing/watermarking/other stuff to the photo, and moves it into place.

After we started to investigate why some beta testers were sometimes getting slower than ideal upload speeds, we decided to test the ability to do our uploads directly to S3. Amazon S3 supports HTTP POST uploads, which is great as it takes us out of the middle of all of that traffic. Essentially this means that users of ShootProof should never be limited upload-wise by our EC2 instances. We also will not need to constantly spool EC2 instances up and down to handle load spikes. After each upload is completely sent to S3, we will fire off a small notification call that lets us know we have a new photo to take care of. Upload traffic to our EC2 instances will drop by at least 99%. To be sure that we never miss a new photo placed into our S3 upload bucket, we will also monitor the contents of the upload bucket to ensure they match what we are expecting. All photos uploaded to S3 by the user are given a private ACL, so the bucket essentially acts as a dropbox.
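
For reference, here is a minimal sketch of signing a browser-based S3 POST upload policy (AWS signature version 2). The bucket name, key prefix, expiration window, and size limit are placeholders rather than our production values:

define('AWS_ACCESS_KEY_ID', 'XXXXXXXXXXXXXXX');
define('AWS_SECRET_ACCESS_KEY', 'YYYYYYYYYYYYYYY');

$bucket = 'example-upload-bucket'; // placeholder bucket name

// policy document: what the browser is allowed to POST
$policy = array(
    'expiration' => gmdate('Y-m-d\TH:i:s\Z', time() + 3600), // form is valid for 1 hour
    'conditions' => array(
        array('bucket' => $bucket),
        array('acl' => 'private'),                  // uploads land as a private dropbox
        array('starts-with', '$key', 'uploads/'),   // restrict keys to one prefix
        array('content-length-range', 0, 52428800), // cap uploads at 50MB
    ),
);

$policyB64 = base64_encode(json_encode($policy));
$signature = base64_encode(hash_hmac('sha1', $policyB64, AWS_SECRET_ACCESS_KEY, true));

// these become hidden fields on the form that POSTs to the bucket endpoint;
// the "file" input must be the last field in the form
$formFields = array(
    'key'            => 'uploads/${filename}',
    'AWSAccessKeyId' => AWS_ACCESS_KEY_ID,
    'acl'            => 'private',
    'policy'         => $policyB64,
    'signature'      => $signature,
);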

Below is a table showing the tests we ran to reach our conclusion to post directly to S3. The file used for the test was a 13.1MB JPEG, and all uploads were done over an internet connection with a full 10Mbit of upstream bandwidth. All times are in seconds.

XMLHttpRequest Post (EC2 -> S3)    HTTPS S3 Post    HTTP S3 Post
13.2                               20.6             9.5
14                                 19.1             9.6
13.7                               22.7             10.3
14.3                               24.9             9.3
13.7                               15.3             9.4
13.9                               18.4             9.2
24.1                               24.2             9.7
13.7                               17.5             9.4
13.8                               17.3             9.9
15.2                               17.4             9.4
Average: 14.96 sec                 20.04 sec        9.57 sec

With AT&T’s new A-List feature for accounts above a certain threshold comes a new problem: which numbers to include in the list. While this might be easy for some, it wasn’t that simple for me, so I wrote a little script to compute what my optimal A-List would be. If you are an AT&T wireless customer, give this a shot; it might save you some minutes or add to your rollover balance! If you have any questions about this script or find a bug, you can find my contact information on the about page.
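
The original script is not reproduced in this excerpt, but the idea boils down to tallying minutes per number from your call history and picking the most-called numbers. A rough sketch, assuming a CSV export with number,minutes rows and five A-List slots:

$minutesByNumber = array();

// call_log.csv is an assumed export format: one "number,minutes" row per call
$handle = fopen('call_log.csv', 'r');
while (($row = fgetcsv($handle)) !== false) {
    list($number, $minutes) = $row;
    if (!isset($minutesByNumber[$number])) {
        $minutesByNumber[$number] = 0;
    }
    $minutesByNumber[$number] += (float) $minutes;
}
fclose($handle);

// most-called numbers first; the top five are the optimal A-List picks
arsort($minutesByNumber);
$aList = array_slice(array_keys($minutesByNumber), 0, 5);

print_r($aList);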