Welcome to the engineering blog of Wanelo, featuring technical tales of triumph, daring and woe. Sometimes cats. We are definitely hiring. Please email play AT wanelo.com if you're curious!
First posted on Friday, 20 Jun 2014
One of the most important things people do on Wanelo is save products. Counting how many people have saved a given product is therefore an important operation that we have to perform very often.
When we first launched our rewrite of Wanelo in Rails, displaying counts was simple. To display the number of saves a user has in the view, we just call @user.saves.count, right? The counts displayed are accurate and update in real time on each page refresh. This works for a while, but then our traffic grows, and we get more data and more users. Our database starts to slow down, and while investigating we notice thousands of slow count queries executing all the time. We know we need to address this or our site will eventually reach a breaking point and crash.
Rails Counter Caches
A well-known solution to this problem is provided by Rails in the form of the counter cache feature in ActiveRecord. You add the counter cache column, tell Rails what it is, and the rest is taken care of.
So we drop the configuration into the Save model and deploy with a migration that pre-fills all counter_cache values on users.
class Save < ActiveRecord::Base
  belongs_to :user, counter_cache: true
end
After this, we change our views to reference @user.saves_count and our database load drops dramatically. We're free to work on features again, woo! But not for long.
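To make the trade-off concrete, here is a minimal plain-Ruby sketch (no Rails or database involved) of the difference between counting rows on every read and maintaining a cached counter on every write. The `User` struct and `SaveStore` class are hypothetical stand-ins for illustration; they are not ActiveRecord's implementation or Wanelo's models.

```ruby
# Hypothetical stand-ins for illustration; not ActiveRecord and not Wanelo's code.
User = Struct.new(:id, :saves_count)

class SaveStore
  def initialize
    @rows = [] # each row is [user_id, product_id]
  end

  # Like `@user.saves.count`: scans the rows on every call.
  def count_for(user)
    @rows.count { |user_id, _product_id| user_id == user.id }
  end

  # Like a counter cache: do a little extra work on write so reads are cheap.
  def create(user, product_id)
    @rows << [user.id, product_id]
    user.saves_count += 1
  end
end

user = User.new(1067, 0)
store = SaveStore.new
[10, 11, 12].each { |product_id| store.create(user, product_id) }

puts store.count_for(user) # counted from the rows
puts user.saves_count      # read from the cached value
```

The cached read avoids the scan, but every new save now also writes to the same user record, so the cost moves from reads to writes.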
Our traffic keeps growing and we start to notice occasional deadlocks in our database looking like this:
Deadlock found when trying to get lock; try restarting transaction: UPDATE `users` SET `saves_count` = COALESCE(`saves_count`, 0) + 1 WHERE (`id` = 1067)
What now?⟹ Full Post
First posted on Wednesday, 11 Jun 2014
After some recent changes to autovacuum settings on our main PostgreSQL databases, we’ve encountered regular significant replication delay on our four streaming replicas. Why this is happening is an interesting subject for another blog post, but it reminded me of some assumptions built into our codebase, as well as some interesting complications of API design.
One of the key values of our engineering culture is "knowledge": we want to know as much as possible about what's going on with our production infrastructure. Replication delay is no exception. We track it with several tools: Nagios, for which we wrote a custom plugin for PostgreSQL replication delay that alerts us when replication falls too far behind, and Circonus, one of our vendor tools, which we use to graph the delay and display it on a dashboard.
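As a rough illustration of what such an alerting check involves, here is a minimal sketch that maps a measured delay to the standard Nagios plugin exit codes. This is not Wanelo's plugin: the thresholds and the `check_replication_delay` name are assumptions made for the example, and the delay value would have to be measured on the replica (one common approach is querying `now() - pg_last_xact_replay_timestamp()`).

```ruby
# Standard Nagios plugin exit codes.
NAGIOS_OK = 0
NAGIOS_WARNING = 1
NAGIOS_CRITICAL = 2

# Hypothetical thresholds, in seconds of replication delay.
WARNING_SECONDS = 60
CRITICAL_SECONDS = 300

# Map a measured delay to a Nagios status code and a one-line message.
def check_replication_delay(delay_seconds)
  if delay_seconds >= CRITICAL_SECONDS
    [NAGIOS_CRITICAL, "CRITICAL - replication delay #{delay_seconds}s"]
  elsif delay_seconds >= WARNING_SECONDS
    [NAGIOS_WARNING, "WARNING - replication delay #{delay_seconds}s"]
  else
    [NAGIOS_OK, "OK - replication delay #{delay_seconds}s"]
  end
end

status, message = check_replication_delay(120)
puts message
# A real plugin would finish with `exit status` so Nagios can read the result.
```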
The graph below is an example of how we track replication delay across four separate replicas on a logarithmic scale, overlaid on the rate of errors coming from the web application and, separately, from background jobs. You can see two spikes in replication delay, with the second spike also correlating with a minor spike in site errors. Both spikes are caused by PostgreSQL deliberately pausing replication on one or more replicas to allow a particular query to finish running; this behavior is configurable.
⟹ Full Post
First posted on Tuesday, 27 May 2014
Keeping your sprout-wrap recipes and cookbooks up to date is a good thing to do! It keeps your software secure (because bugs like Heartbleed happen), and it allows your developers to enjoy improving their workstation workflow and processes with the most recent software and tools.
Sprout-wrap has been a moving target lately, and has undergone some recent growing pains. Many cookbooks have moved, and as a result your old recipes and cookbooks will grow stale unless you change your soloistrc file to point to the newest cookbooks and recipes. The good news is, this should be a one-time "upgrade." After pointing towards the newest cookbooks, a simple librarian-chef update will suffice for keeping up to date. This blog post will hold your hand while you perform this one-time upgrade, as I found it to be a bit tricky!
Upgrade sprout-osx-base to sprout-base first
If you're using the sprout-osx-base cookbook, it has been renamed, and this proves to be tricky when there are dependencies from external cookbooks.
Add sprout-base to your Cheffile, and keep the old sprout-osx-base cookbook there for now.
cookbook 'sprout-base', :git => 'git://github.com/pivotal-sprout/sprout-base.git'
cookbook 'sprout-osx-base', :git => 'git://github.com/pivotal-sprout/sprout-base.git'
Having both entries in your Cheffile will allow your cookbooks to be backwards compatible. If old cookbooks reference sprout-osx-base, their dependencies will resolve properly. Similarly, when new cookbooks reference sprout-base, they will resolve to those same recipes. We'll remove the sprout-osx-base cookbook at the end.
Try running a librarian-chef install. If the cookbooks have already been extracted to separate git repositories, you'll see an error message like this: ⟹ Full Post
First posted on Thursday, 10 Apr 2014
This week was arguably one of the worst weeks to work in systems operations in the history of the Internet. The revelation of what has been called Heartbleed (CVE-2014-0160), a bug in OpenSSL that allows attackers to read memory from vulnerable servers (and potentially retrieve memory from vulnerable clients) has had many administrators scrambling. This bug makes it trivial for hackers to obtain the private keys to a site's SSL certificate, as well as private data that might be in-process such as usernames and passwords.
While there is a huge potential for multiple blog posts regarding our learnings from this week, in this post I'll focus on the current state of affairs, as well as a timeline of events.
tl;dr — wanelo.com was affected by Heartbleed. As of 1am April 8, the public-facing parts of Wanelo were no longer vulnerable. Through the rest of this week we have followed up to ensure that internal components are also secure. This afternoon we deployed new SSL certificates and revoked our old ones. We have no indication that our site was hacked, but there is no way to be certain.⟹ Full Post
First posted on Monday, 31 Mar 2014
Capistrano has been around for almost as long as Rails, perhaps short by just a year or so. Back in the early days it introduced much-needed sanity into the world of deployment automation, including documenting in code some of the best practices for application deployment, such as a directory layout with a 'releases' folder that makes rollbacks possible and a 'shared' folder that maintains continuity from release to release. Capistrano was built upon the concept of having roles for application servers. Finally, being written in Ruby, Capistrano always offered remarkable levels of flexibility and customization. So it should not come as a surprise that it became highly popular, and that subsequent infrastructure automation tools like Chef and Puppet include Capistrano-like deployment automation recipes.
These days it is not uncommon to bump into Python, Java, or Scala applications that are deployed to production using Capistrano (which itself is written in Ruby). This is because a lot of the assumptions that Capistrano makes are not language or framework specific.
It's worth noting that in its entire history, Capistrano has never before had an upgrade so dramatically different from the previous version that it requires rewiring some of your brain neurons to grasp new concepts, new callbacks, and the new mappings between roles and servers, for example.
This blog post represents a typical tale of "We upgraded from version X to version Y. It was hard! But here's what we learned." And amazingly, despite Capistrano 3 having been released more than 4 months ago, there is still a massive shortage of quality Capistrano 3 documentation (or upgrade paths) online. With this post I am hoping to bridge this gap a tiny bit, and perhaps help a few folks out there upgrading their deployment scripts.⟹ Full Post
First posted on Friday, 21 Mar 2014
On Tuesday night this week Wanelo hosted a monthly meeting of SFPUG, the San Francisco PostgreSQL User Group, and I gave a talk summarizing Wanelo's performance journey to date. The presentation ended up being much longer than I originally anticipated, and went on for an hour and a half. Whoops! With over a dozen questions near the end, it felt good to share the tips and tricks that we learned while scaling our app.
The presentation was recorded on video, but unfortunately the quality is not very good.
In the meantime, you can see the slides for it :)⟹ Full Post
First posted on Monday, 10 Mar 2014
This week the entire Wanelo crew packed up and went up to Tahoe City, a small town on the shore of beautiful Lake Tahoe. We've done a hackathon before, but never outside of our HQ in San Francisco.
On Sunday after dinner everyone pitched their ideas and tried to get a team assembled to work on a project. There were a total of 19 project submissions, and given that we have 15 engineers, I would call this a huge success.⟹ Full Post
First posted on Monday, 27 Jan 2014
When Wanelo gets a brand new workstation, the first thing we install on it is Sprout. Sprout is a collection of OS X-specific recipes that let you install common utilities and applications that every Ruby developer will appreciate.⟹ Full Post
First posted on Wednesday, 18 Dec 2013
Deploying at Wanelo tends to be high-frequency and low-stress, since we have most aspects of our systems performance graphed in real time. We can roll out new code to a percentage of app servers, monitor app server and db performance, check error rates, and then finish up the deploy.
On the other hand, many sites are moving more and more functionality client-side these days, so it’s becoming increasingly important to know when there are problems in the browser.⟹ Full Post
First posted on Wednesday, 11 Dec 2013
MRI Ruby has a global interpreter lock (GIL), meaning that even in multi-threaded Ruby code only a single thread is on-CPU at any point in time. Other distributions of Ruby have done away with the GIL, but even in MRI threads can be useful. The Sidekiq background worker gem takes advantage of this, running multiple workers in separate threads within a single process.
If the workload of a job blocks on I/O, Ruby can context-switch to other threads and do other work until the I/O finishes. This could happen when the workload reaches out to an external API, shells out to another command, or accesses the file system.
If the workload of a process does not block on I/O, it will not benefit from thread switching under a GIL, as it will instead be CPU-bound. In this case, multiple processes will be more efficient, and will be able to take better advantage of multi-core systems.
So… why not skip threads and just deal with processes? A number of reasons.⟹ Full Post
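The I/O-bound case described above can be sketched in a few lines. This is a minimal illustration, not Sidekiq's implementation: `io_bound_job` and `elapsed_seconds` are hypothetical helpers, and `sleep` stands in for a blocking call such as an external API request, during which MRI lets other threads run.

```ruby
# Simulate a job that blocks on I/O; `sleep` stands in for a network call.
def io_bound_job
  sleep 0.2
end

# Measure how long the given block takes, using a monotonic clock.
def elapsed_seconds
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

JOBS = 4

# Run the jobs one after another in a single thread.
sequential = elapsed_seconds { JOBS.times { io_bound_job } }

# Run each job in its own thread and wait for all of them to finish.
threaded = elapsed_seconds do
  JOBS.times.map { Thread.new { io_bound_job } }.each(&:join)
end

puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```

Because the waits overlap, the threaded run should finish in roughly the time of one job rather than four. Replacing `sleep` with a CPU-heavy loop would be expected to remove most of that gain under a GIL.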
First posted on Friday, 18 Oct 2013
A few months back, one of our engineers, Atasay Gokkaya, published a fantastic overview of how we at Wanelo use Manta, Joyent's innovative new object store, for a massively parallelized user retention analysis, using just a few lines of basic UNIX commands in combination with the map/reduce paradigm.⟹ Full Post
First posted on Monday, 05 Aug 2013
With any Rails app that evolves along with substantial user growth and active feature development, pretty soon a moment comes when there appears to be a decent amount of tangled logic, AKA "technical debt."⟹ Full Post
First posted on Tuesday, 23 Jul 2013
We recently ran into a tricky situation with a custom SMF service we maintain on our Joyent SmartOS hosts. The namespace for the service instance (defined in upstream code) had changed, which meant that as our Chef automation upgraded the service instances to the latest code, we ended up with a lot of duplicate service instances that each had a unique namespace.⟹ Full Post
First posted on Friday, 28 Jun 2013
With millions of people now using Wanelo across various platforms, collecting and analyzing user actions and events becomes a pretty fun problem to solve. While in most services user actions generate some aggregated records in database systems and keeping those actions non-aggregated is not explicitly required for the product itself, it is critical for other reasons such as user history, behavioral analytics, spam detection and ad hoc querying.⟹ Full Post
First posted on Saturday, 25 May 2013
We recently gave a talk at the SFRoR Meetup here in San Francisco about how we scaled our Rails app to 200K RPM in six months. There were a lot of excellent questions at the meetup, so we decided to put the slides up on SlideShare.⟹ Full Post
First posted on Wednesday, 13 Feb 2013
At Wanelo we are pretty ardent fans of the PostgreSQL database server, but try not to be dogmatic about it.
I have personally used PostgreSQL since version 7.4, dating back to some time in 2003 or 2004. I was always impressed with how easy it was to get PostgreSQL installed on a UNIX system, how quick it was to configure (only two config files to edit), and how simple it was to create and authenticate users.⟹ Full Post
First posted on Wednesday, 06 Feb 2013
This past weekend a number of us were focused on a really important annual prime time television event (the Puppy Bowl, of course). Turns out other people out there were watching some other sporting event, which leads to the rest of this story.⟹ Full Post
First posted on Tuesday, 05 Feb 2013
Wanelo's recent surge in popularity rewarded our engineers with a healthy stream of scaling problems to solve.
Among the many performance initiatives launched over the last few weeks, vertical sharding has been the most impactful and interesting so far.⟹ Full Post
First posted on Friday, 14 Sep 2012
The Wanelo you see today is a completely different website than the one that existed a few months ago. It’s been rewritten and rebuilt from the ground up, as part of a process that took about two months. We thought we’d share the details of what we did and what we learned, in case someone out there ever finds themselves in a similar situation, weighing the risks of either working with a legacy stack or going full steam ahead with a rewrite.⟹ Full Post