Recounting a Year of Overhauling an E-commerce Solution

New Relic chart
In 2017, our Magento application's response time was below 140 ms. Before the end of 2015, it was still hovering around 1000 ms.

So far, I consider what I did during my first year at Paradox Interactive to be my greatest accomplishment. Over that year, I reduced our Magento application's response time from 1000 ms to 140 ms. I also increased its reliability, paid back some technical debt and took ownership of the entire stack. I deployed the biggest improvements at the beginning of 2016. For that whole year, compared to the year before, the conversion rate of our Magento store increased by 59%. Revenue also doubled.

As I've since moved away from Magento development, I thought I'd close out this chapter of my career by recounting some of the memorable challenges from that eventful year.

Integration Woes

Our e-commerce solution uses Adyen to handle payments. While we only sell digital products today, back then we also sold physical products in the form of merchandise. Our own API backend delivers the digital orders, while a solution called Shipwire fulfills the physical orders.

Adyen Critical Bugs

Adyen logo

We integrated with Adyen through their Magento plugin, which wraps Adyen's API. The primary job of that plugin is to set orders to complete upon successful payment. However, every now and then we would come across orders that got stuck and never progressed to complete. The culprit was a race condition in how the plugin handled callbacks from Adyen. When a callback says a payment was successful, the plugin updates the corresponding order object. As a callback is an HTTP request that spawns a new Apache process, there is a window where the new process has already handled the callback while the original process is still updating the order object, so the callback's changes can be overwritten.
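
To make the hazard concrete, here's a minimal sketch of the read-modify-write pattern involved. It's illustrative rather than the plugin's actual code, and only assumes the standard Magento 1 order API.

$incrementId = '100000123'; // illustrative order number

// Process A: the original checkout request that created the order.
$order = Mage::getModel('sales/order')->loadByIncrementId($incrementId);
// ... slow work: totals, emails, third-party calls ...
$order->save(); // writes back the stale, pre-callback data

// Process B: the Adyen callback arriving in parallel.
$order = Mage::getModel('sales/order')->loadByIncrementId($incrementId);
$order->setState(Mage_Sales_Model_Order::STATE_PROCESSING, true, 'Payment authorised');
$order->save(); // this update is lost if process A saves afterwards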

Adyen released a new version of their Magento plugin, fixing, among other things, this particular issue. As this version of the plugin seemed to contain large amounts of refactored code, I thoroughly tested it and discovered a critical bug: orders that only contained digital products would never progress to the complete state. While not evident at first, this was because the plugin didn't take into account that in Magento an order object can lack a shipping address entirely.

Another problem, relevant to us, was how the plugin addressed the race condition. Instead of processing callbacks immediately, the plugin stores them in the database. A cron job then runs every minute and processes callback events older than 5 minutes, which added a delay to what we deliver to our customers. As I couldn't see a better quick solution, I patched the threshold down to 1 minute.
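
Roughly, the pattern looks like the sketch below. The class, model and column names are made up for illustration; the point is the age threshold, which is what I patched from 5 minutes down to 1.

class Acme_Adyen_Model_Cron
{
    // The threshold I patched: originally 5 minutes, changed to 1.
    const MIN_AGE_MINUTES = 1;

    public function processNotifications()
    {
        $cutoff = date('Y-m-d H:i:s', time() - self::MIN_AGE_MINUTES * 60);

        // Stored callback events old enough to be considered settled.
        $events = Mage::getModel('acme_adyen/event')->getCollection()
            ->addFieldToFilter('created_at', array('lt' => $cutoff));

        foreach ($events as $event) {
            $this->_processEvent($event);
            $event->delete();
        }
    }

    protected function _processEvent($event)
    {
        $order = Mage::getModel('sales/order')->loadByIncrementId($event->getIncrementId());
        // ... apply the payment result to the order ...
        $order->save();
    }
}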

At a later time, we needed to upgrade the plugin again. While everything seemed fine, something odd happened as orders poured in when we released an expansion for one of our games. For some reason, the number of stuck orders kept piling up. Only after two hours of debugging did I understand what had happened:

if ($order->getIsVirtual()) {
    $this->_debugData[$this->_count]['_setPaymentAuthorized virtual'] = 'Product is a virtual product';
    $virtual_status = $this->_getConfigData('payment_authorized_virtual');
    if ($virtual_status != "") {
        $status = $virtual_status;

        // set the state to complete
        $order->setState(Mage_Sales_Model_Order::STATE_COMPLETE);
    }
}
Magento will throw an exception if you try to set an order's state to complete like this

When processing callbacks for orders containing only digital products, the plugin executes a line of code that sets the state of the corresponding order to complete. In Magento, the order object is a state machine, and directly changing the state, to complete in particular, will throw an exception. This block of code also seemed unnecessary: the order is already complete before it's executed.
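
If applying the configured status is still desired, one way to do it without tripping the state machine might be something like the sketch below, which only adds a status history entry and leaves the protected state alone (assuming Magento 1's addStatusHistoryComment() behaves as I remember).

if ($order->getIsVirtual()) {
    $virtualStatus = $this->_getConfigData('payment_authorized_virtual');
    if ($virtualStatus != "") {
        // Apply the configured status without calling setState() on a protected state.
        $order->addStatusHistoryComment('Adyen: payment authorised', $virtualStatus);
    }
}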

Orders piled up because the cron job could only process one successful order per minute: as it runs, it loops through each callback and corresponding order, but crashes after the first iteration. I didn't spot this bug while testing because I never made enough orders in quick succession to notice that something was wrong. It was also hard to immediately understand what was going on, as exceptions from cron jobs triggered by Magento don't end up where exceptions usually go, but in a table called cron_schedule in the database.
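
In hindsight, a more defensive cron loop would have contained the damage. A sketch, continuing the made-up names from the cron example above:

foreach ($events as $event) {
    try {
        $this->_processEvent($event);
        $event->delete();
    } catch (Exception $e) {
        // One bad order no longer blocks the queue, and the exception ends up
        // in var/log/exception.log instead of only in cron_schedule.
        Mage::logException($e);
    }
}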

While I find Adyen to be a superb payment provider, I learned something important. Coinciding with what I observed while working for a large e-commerce firm, e-commerce is still dominated by physical products. If you sell digital products, you have to be extra careful with plugins: they tend to be poorly tested for digital products (as was evident in our case) and work under assumptions that may not hold for them. The 5 minute delay used to work around the race condition illustrates this as well. If you sell physical products, a 5 minute delay before an integration can pick an order up for shipment doesn't have as adverse an effect on user experience as it does for digital products.

Shipwire Order Fetching Logic

Shipwire warehouse
Shipwire handles inventory and fulfills orders

The selling point of Shipwire is that they handle your physical inventory in their warehouses and fulfill orders for you. While Shipwire, like Adyen, offers an API, we were using a Magento integration they had built. You fill in the credentials of an API user of your store, allowing Shipwire to poll for unfulfilled orders every now and then.

On occasion, it would miss picking up orders. In contrast to Adyen's Magento plugin, the code Shipwire runs is invisible to us, making it hard to debug. To complicate things, Shipwire communicates through SOAP rather than a REST API, and you can't manually trigger a polling attempt.

In the end, I added a snippet of logging code to a method that all Magento API calls pass through. After examining which endpoints Shipwire called and with what payloads, I realized the flaw. As you'd expect, Shipwire fetches all physical orders that are paid for but not yet shipped. But the request also applies a filter, fetching only orders with an updated_at timestamp later than that of the last order Shipwire picked up. While this filter is sensible, it doesn't take into account that newer orders can be further along in their progression than older ones. Some forms of payment take longer than others, and customer service might update an order a day or two later.
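
If memory serves, all of Magento 1's SOAP/XML-RPC calls funnel through the API server handler's call() method, so a class rewrite of that handler can log what an integration is doing. A sketch of the kind of logging I added (the exact handler class to rewrite depends on which API version the integration uses; treat the names below as assumptions):

class Acme_Api_Model_Server_Handler extends Mage_Api_Model_Server_Handler
{
    public function call($sessionId, $apiPath, $args = array())
    {
        // Log which API resource was requested and with what arguments.
        Mage::log(array('resource' => $apiPath, 'args' => $args), null, 'api_calls.log');

        return parent::call($sessionId, $apiPath, $args);
    }
}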

As it was clear that Shipwire's support doesn't handle technical issues of such detail, I solved the problem by overriding the method that all Magento API calls pass through. The overriding code intercepts every request from Shipwire that tries to fetch orders and subtracts 30 days from the updated_at filter.
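
Conceptually, the workaround was as simple as the sketch below, added to the same overridden handler as above. The resource name and the exact shape of the filter arguments are from memory and should be treated as assumptions rather than the code we actually shipped.

public function call($sessionId, $apiPath, $args = array())
{
    // Intercept order listing calls and widen the updated_at window by 30 days,
    // so orders that progressed "late" are still picked up.
    if ($apiPath == 'sales_order.list' && isset($args[0]['updated_at']['from'])) {
        $from = strtotime($args[0]['updated_at']['from']);
        $args[0]['updated_at']['from'] = date('Y-m-d H:i:s', $from - 30 * 24 * 3600);
    }

    return parent::call($sessionId, $apiPath, $args);
}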

Another solution would have been to write our own integration directly against Shipwire's API. An opinion I've formed is that, as your e-commerce solution matures, you should strongly consider moving away from platform-specific plugins and integrations and instead write your own integrations against a solution's "core" API. The core API is by necessity much more tested and stable. While platform-specific plugins let you get started quickly, they tend to carry two major drawbacks: they are bloated because they need to cover a wide range of use cases, and they are developed by people who are knowledgeable about either the core API or the platform, but not both.

API Backend 504 Gateway Timeouts

This was without a doubt one of the most elusive bugs that I've encountered.

Our store uses our own backend API for a number of things, the most important being account integration, order fetching, and delivering Steam product keys. On rare occasions, calls to our backend failed, which could result in a customer not getting their product keys. By adding better logging to our Magento codebase, I found out that these failures occurred across all endpoints. Each failure resulted in a response with an empty body and a status line of "HTTP/1.1 504 GATEWAY_TIMEOUT".
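
The logging itself was nothing fancy. A simplified sketch of the idea, using plain cURL (our actual client code differed, and $backendUrl stands in for one of our API endpoints):

$ch = curl_init($backendUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$body   = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status != 200) {
    // For these elusive failures, this logged an empty body and a 504 status.
    Mage::log(
        sprintf('Backend call to %s failed: HTTP %d, body "%s"', $backendUrl, $status, $body),
        Zend_Log::ERR,
        'backend_errors.log'
    );
}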

Besides being hard to reproduce, each request passes through a large number of servers and services. Our backend is an extensive Amazon Web Services stack where requests go through NGINX, Elastic Load Balancer, Elastic Beanstalk, Apache and Tomcat before reaching our Scala codebase. The response from our backend then has to pass through NGINX, Varnish and Apache on the Magento side. After ruling out Magento, my colleague who works on our backend did a series of investigations.

He tweaked timeout and KeepAlive settings, to no avail. He analyzed our logs and found that the number of 504 Gateway Timeouts correlated with the number of requests, but not with latency or load.

In the end, my colleague discovered what had haunted us for almost a year. As I lack in-depth knowledge about our backend, here's how I understood it: our backend nodes have Apache in front of them, and Apache was configured to rotate its logs every minute. Whenever that happened, Apache reloaded, thereby dropping all in-flight requests.

Managing a Managed Host

A consultancy company used to manage our e-commerce solution. They deployed our store on a managed host operated by a hosting provider, which meant that neither the consultancy company nor we had root access to the server.

feelsbadman

If you have some experience managing servers, this is a frustrating situation to be in. Part of that frustration was that everything amounted to a lot of communication time. We couldn't perform trivial tasks such as setting up New Relic, adding a virtual host or changing a configuration file without going through the hosting provider. The user we finally had them create for us had so few permissions at first that we couldn't even read our application's log files.

Another part of that frustration was that both the solution they were using and their sysadmins were lacking. They lacked transparency and weren't following best practices (to the limited extent of my knowledge). We had to hold their hands too often, and when a problem occurred they didn't attempt to understand the root cause and take measures to prevent it from happening again. To cut them some slack, the vast majority of those who use a managed host are non-technical. Their other clients are thus less likely to see their shortcomings, meaning they can get away with a poorer level of service.

For instance, I recall three incidents that highlighted the challenges:

Backups to the Same Disk

One time, I backed up our production database before a deploy, intending to remove the backup the next day. That night, I received a flood of alerts from New Relic. To my horror, I realized Magento was returning 503 errors because our server was out of disk space! Our hosting provider answered my email and freed up space, and the following morning I realized what had happened: their solution performs nightly backups, but saves them to the same disk! The same backups were also causing our application to hang every midnight.

For this particular incident, I was also at fault, as I shouldn't have backed up the entire database. I should've just backed up the specific tables of interest. That way, the nightly backup wouldn't have used up as much additional space.

Varnish 503 Service Unavailable

Our hosting provider was using an unnecessarily complex server setup: HAProxy > Varnish > Apache. Varnish was not configured to do anything, and we didn't need load balancing as we were on a single, powerful and underutilized server.

On four occasions over the course of two months, all customers ended up getting 503 errors from Varnish when they tried to log in or make a purchase. This was odd, as it had never happened before. It was also hard for me to debug, as I had access to neither Varnish nor HAProxy, and the little access I had to Apache was restricted to our DocumentRoot directories. To make matters more frustrating, every time I asked our hosting provider to troubleshoot, the problem somehow disappeared. They would then drop the investigation, only for the same problem to resurface a week or two later.

In a desperate attempt the fourth time it happened, I asked our hosting provider if they had checked /var/log/ on our server. It was then that they found "zend_mm_heap corrupted" at the end of the Apache error log, the key clue that solved the mystery: our hosting provider periodically upgrades packages on their managed hosts, and this time we ended up with a combination of PHP and OPcache versions that could cause segmentation faults. These faults tended to trigger only after Apache had run non-stop for several days. Hence, whenever we contacted our hosting provider to troubleshoot, they would inadvertently fix the problem by making random tweaks to the configuration and restarting.

What surprised me the most was that they hadn't even looked in one of the first places you would look. Throughout the whole process, they also never raised the possibility that the package upgrades could've been behind this critical bug. Going back to the server setup, they would also have been less confused if HAProxy and Varnish hadn't been used at all.

HAProxy Misconfiguration

As a final example, there was an incident when we asked them to swap our wildcard certificate for an EV certificate. While carrying out the change, they messed up X-Forwarded-Proto in HAProxy so that it had a value of "https https", allegedly due to a bug in the control panel they were using. This caused our store to become unavailable, as users ended up in a redirect loop. While mistakes do happen, this particular one took them 30 minutes to rectify: they simply hadn't backed up the configuration file, so they had trouble even spotting the problem.

The Successful Migration

During the second half of that first year, I had gained a good enough understanding of our e-commerce solution to pull the trigger on migrating it. The goal was to gain better control and to stop letting a hosting provider distract us. It also gave us the opportunity to use PHP 7, which had just become available.

The migration project involved several phases: picking a hosting provider, setting up servers, testing our solution on PHP 7, writing bash scripts for the migration, and performing test migrations.

A couple of days before the migration, we lowered the TTL of our domain's A records. I deployed our codebase and moved over all the media assets. On migration day, I put both our old store and our new store in maintenance mode while our IT manager updated the A records. A bash script then migrated the MySQL database as well as the Redis database. Once it completed, I took our new store out of maintenance mode. The downtime ended up being no more than 15 minutes. (Had I performed the migration today, I would've taken advantage of replication.)

A challenge with the migration was the amount of communication and coordination required. I decided the exact date and time of the migration together with Marketing and Sales. This was then communicated to other parts of our organization, as well as to both our old and our new hosting provider. I also tasked our old hosting provider with forwarding all traffic to the new server.

Honorable Mentions

Screenshot of in-game store
The minimalistic store with its base theme and made-up product catalog

Besides the integration and hosting provider woes, there were a few other memorable challenges.

The original codebase of our store wasn't in the best shape, which was something I improved over time. A prominent problem was that almost all the code used for the integrations was crammed into a God class. While troubleshooting integration problems and implementing new features, I broke this class down into several classes, each with a single responsibility. I also reduced tight coupling and removed needless dependencies. For instance, one requirement is that if a customer changes their address during checkout, the new values need to be synced to our backend. Much of this requirement was implemented in the frontend by sprinkling jQuery into the checkout templates, which becomes an unnecessary distraction whenever you redesign your checkout.
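
Today I would implement that requirement with an event observer instead. Here's a rough sketch of the idea; the class names, the event wiring and the backend client are assumptions for illustration, not our actual code.

// Wired up through config.xml to listen for quote address saves during checkout.
class Acme_Integration_Model_Observer
{
    public function syncAddressToBackend(Varien_Event_Observer $observer)
    {
        // The saved address (using the generic data_object key set on model save events).
        $address = $observer->getEvent()->getDataObject();

        // Hypothetical client for our own backend API.
        Mage::getModel('acme_integration/client')->updateAddress(array(
            'email'   => $address->getEmail(),
            'street'  => $address->getStreetFull(),
            'city'    => $address->getCity(),
            'country' => $address->getCountryId(),
        ));
    }
}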

I also built a store view for Magento, intended for selling expansions and DLC inside our games through an in-game browser. The hard part was making the store fast and minimalistic; to do that, you have to have a good understanding of how Magento, and particularly its checkout, works. In addition, I created acceptance tests in Selenium covering the entire purchase flow. In the end, this store never launched, as it clashed with Valve's interests more than we had anticipated. This was understandable, as Steam players purchasing through this store would deprive Valve of their 30% share.

Reflections

Looking back at the successful year, I feel an overwhelming amount of gratitude. Much of the success was made possible by what I learned at my previous job (a leading Magento consultancy company). My former colleagues inspired and challenged me to learn more about software development, and particularly about PHP, Magento and object-oriented programming. One colleague taught me something that will stick with me for a long time: you shouldn't just blindly learn how to do something. You need to go beyond that, and seek to understand how things work behind the layers of abstraction.

The successful year was also made possible by my manager and closest colleagues, who gave me a lot of freedom to improve our e-commerce solution. It also illustrates the importance of continuous product improvement. While we tend to get lost focusing on new features and how many of them we can ship, it's important not to lose sight of the core features of a product. For those core features, we need to endlessly ask ourselves whether they can be improved, and then carry out those improvements.