30 July 2006

OSCON: LiveJournal's talk

I attended the LiveJournal session on the last day of OSCON.  I had read their earlier slide decks on scaling issues and on their tools, and since I've been thinking about memcached (like everyone else, I guess) and MogileFS, it made sense to go.

The presentation was mostly an overview of the main open source tools they provide: Perlbal, MogileFS, and memcached (see the code.sixapart.com page for info).  Perlbal is their very configurable load balancer; MogileFS is a file store, not a file system; and memcached is a caching system.  They did not get into setup details on any of them.  They did cover some general aspects of LiveJournal's setup (it sounds like VOX is similar) and a few points of interest: some relatively common ideas, and some more detailed and useful ones (more on this below).

MogileFS is quite interesting to me.  It is very similar to Amazon's S3 in general use, in that you store things via keys, not via normal file system directories and files.  What's cool about this is that you can have an effectively unbounded system where you don't worry about limits on directory contents, file names, disk space, and so on.  Further, MogileFS has a notion of a "class", which you configure to say how many redundant copies you want.  The typical example is photo storage: you'd have an "originals" class with 3-4 copies, and a "thumbnails" class with maybe one or two (since you can regenerate these).  There are also namespaces, so you can partition data that way.  You must use a client library with MogileFS, but libraries are available in many languages.
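
To make the class idea concrete, here is a toy in-memory sketch in Python of a key-addressed store where each class sets how many copies are kept.  This is not the MogileFS client API; the names (ToyFileStore, add_class, store, fetch) and the host names are made up for illustration, and the "hosts" are plain dictionaries.

```python
# Toy in-memory model of a key-addressed file store with replication classes.
# All names here (ToyFileStore, add_class, store, fetch) are hypothetical;
# this is not the MogileFS client API.
import random


class ToyFileStore:
    def __init__(self, hosts):
        # Each "host" is just a dict mapping (namespace, key) -> bytes.
        self.hosts = {name: {} for name in hosts}
        self.classes = {}  # (namespace, class name) -> copies wanted

    def add_class(self, namespace, class_name, copies):
        if copies > len(self.hosts):
            raise ValueError("not enough hosts for that many copies")
        self.classes[(namespace, class_name)] = copies

    def store(self, namespace, class_name, key, data):
        copies = self.classes[(namespace, class_name)]
        # Pick distinct hosts so the copies do not share a machine.
        chosen = random.sample(sorted(self.hosts), copies)
        for name in chosen:
            self.hosts[name][(namespace, key)] = data
        return chosen

    def fetch(self, namespace, key):
        # Any host holding a copy can serve the read.
        for contents in self.hosts.values():
            if (namespace, key) in contents:
                return contents[(namespace, key)]
        raise KeyError(key)


store = ToyFileStore(["host1", "host2", "host3", "host4"])
store.add_class("photos", "originals", copies=3)
store.add_class("photos", "thumbnails", copies=1)

original_hosts = store.store("photos", "originals", "user42/pic1", b"full-size bytes")
thumb_hosts = store.store("photos", "thumbnails", "user42/pic1/thumb", b"small bytes")
print(len(original_hosts), len(thumb_hosts))
```

The point is that the caller only ever names a namespace, a class, and a key; where the copies land is the store's business.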

As a general recommendation with their tools, they noted that you would likely start out with your normal system of some web servers and DB server(s).  Then, you'd add in memcached, next Perlbal in front, and finally add MogileFS (which essentially requires at least two machines, if not three or more depending on how you configure the three pieces involved for MogileFS).

They also noted that they use Gearman for job queues where losing a job doesn't matter: either the job is done within 10 seconds or it's gone.  They then use TheSchwartz for queues of durable tasks, the ones you care about.  You can submit the same task multiple times, and the duplicates get coalesced so that the task is only done once.
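
The coalescing behavior can be sketched roughly as follows.  This is a toy in-memory queue in Python, so it is not durable and it is not how their tools are implemented; CoalescingQueue, submit, run_all, and the "reindex:..." keys are hypothetical names I picked for the example.

```python
# Toy sketch of a job queue that coalesces duplicate tasks.
# CoalescingQueue and its methods are hypothetical names, not a real API.
from collections import OrderedDict


class CoalescingQueue:
    def __init__(self):
        # Pending jobs keyed by a caller-supplied coalesce key, in arrival order.
        self.pending = OrderedDict()

    def submit(self, coalesce_key, payload):
        # A second submit with the same key is dropped while the first is pending.
        if coalesce_key in self.pending:
            return False
        self.pending[coalesce_key] = payload
        return True

    def run_all(self, worker):
        # Run each pending job once, oldest first, and collect the results.
        results = []
        while self.pending:
            key, payload = self.pending.popitem(last=False)
            results.append(worker(key, payload))
        return results


queue = CoalescingQueue()
queue.submit("reindex:user42", {"user": 42})
queue.submit("reindex:user42", {"user": 42})  # duplicate, coalesced
queue.submit("reindex:user7", {"user": 7})

done = queue.run_all(lambda key, payload: key)
print(done)
```

Three submits go in, but the worker only runs twice, once per distinct key.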

Some of the interesting points I noted were:

  • LiveJournal started on a shared hosting setup, then grew to dedicated, and so on until where they are today.  They are still in a single data center, although they are looking to either go to multiple, or move (their data center is in SF, and they would like to be in a less environmentally disastrous area).  They used to have a data center in Japan, but it wasn't yielding advantages.  This was info I gathered when I talked to them after the session.
  • They use lots of cheap SATA drives, and they found that, at least today, the 250GB disks are better than the 500GB disks, because they are less susceptible to heat and can be used to capacity, whereas the 500GB disks could only be used up to about 350GB.
  • Further, they do not use RAID (RAID 5, anyway).  They do use RAID 0 or RAID 10 for, say, database systems, but not for the file systems.  With MogileFS they don't need RAID, and RAID doesn't handle all problems anyway, like power failures and so on.  Also, fsck doesn't run in parallel, so it can take a long time.
  • Continuing this, use lots of small machines, so you spray the disk writes across them, as opposed to big honkin' systems.
  • Make sure MySQL is not a SPOF (single point of failure)!  Single big MySQL boxes are not good, better to have smaller ones that are faster and cheaper and where MySQL can perform better (better/faster IO).
  • They think they'll get rid of Apache within a year or so (Perlbal is enough).  They have BigIP boxes in front of the Perlbals for simple load balancing; the BigIPs are nice boxes, but they don't know how busy the Apaches really are.
  • MySQL is the only thing in their system that blocks.
  • They're moving to dual, dual-core machines in 1U boxes to cut down on heat.  4GB of RAM per box.
  • I didn't get the exact config of MySQL, but they said they have about 25 boxes running it, with varying configs for different things.
  • They don't explicitly talk to the Odeo guys, but do see them regularly.  I had asked about this because Odeo is doing various Ruby versions of some of their tools (memcached and MogileFS I believe).

For an older, seemingly slightly out-of-date slide deck, check out their presentation from last year's OSCON.  I have yet to find this year's presentation online.
