ThinkingSphinx Rails plugin fork

I’ve been working on a fork of the ThinkingSphinx Rails plugin that has now been in production for over a week without issue.  The Freelancing Gods have done a great job writing easy-to-read code that’s fairly easy to extend.  Most areas of the code are solid, but we needed to squeeze more out of the plugin to be able to use it on our environment.  In Validclick, we deal with a massive database of keywords that need indexing very frequently.  The table in question is nearly 50mil rows and replicated out to dozens of MySQL servers, therefore schema changes are best avoided.

The original delta indexing method used in the TS plugin requires you add a tinyint column on all tables requiring delta indexing.  The code also automatically fires the reindexing process on each model update.  To address these 2 issues, I forked the original plugin and came up with some solutions.  So far they’ve worked great in production, your mileage may vary.

Complex Delta’ing

For the column issue, I stole some ideas from Evan Weaver’s UltraSphinx plugin.  While Evan has definitely contributed some amazing code (much of which I use daily), trying to extend his Sphinx plugin was a nightmare.  Maybe he’s just too smart for me, but that code gives me a headache when I read it!  The one great thing about his plugin is that you can specify a delta on any column, instead of just a boolean.  I stole that idea and put it in TS.  So now, instead of adding a tinyint/boolean column to your table and adding this in your model:

class User < ActiveRecord::Base
  define_index do
    indexes :name
    set_property :delta
  end
end

You could do this, which will use the table’s existing updated_on field to create the delta index of anything that has been changed in the past day (requiring no altering of your table):

class User < ActiveRecord::Base
  define_index do
    indexes :name
    set_property :delta => {:field => :updated_on, :threshold => 1.day}
  end
end

The code only supports datetime fields now, but that could likely be extended. The original boolean-based deltas are still supported.

Offline Indexing

Having the Sphinx indexing process fire off on each record update just doesn’t scale.  If you have an API or a multi-record edit interface, you could ostensibly incur thousands of (nearly) simultaneous updates which would obviously cause issue on your Sphinx server.
Even just simply having multiple mongrel process and (even worse) multiple servers running could cause indexing collisions or high server load.  So I added configuration setting that allows you to override the default delta indexing functionality.  Just slap this in your environment.rb and TS won’t call the indexer every time a model is changed:

ThinkingSphinx.offline_indexing = true

Note that this requires you to run the meta indexing on your own.  We run it through cron every 20 minutes.

Extra rake tasks

Another snag from the UltraSphinx codebase was some rake tasks.  They’re a bit different than in US, but the same general idea:

rake thinking_sphinx:index         # Index data for all or 1 Sphinx indexes.
rake thinking_sphinx:index:all     # Index data for all Sphinx indexes.
rake thinking_sphinx:index:delta   # Index data for all or 1 Sphinx deltas.
rake thinking_sphinx:index:merge   # Merges the core and delta indexes for all or 1 Sphinx deltas.

All of the tasks except for index:all allow you to set a MODEL environment variable to denote which model you want to operate on.  The default is all pertinent indexes.  To reindex just the Account data, you could issue the following:

MODEL=account rake thinking_sphinx:index

I haven’t found any bugs yet, but there is surely room for improvement in the code.  I haven’t pulled in all of Freelancing Gods’ latest changes, so that’s definitely on the list.  If you feel like following or contributing, my fork is on GitHub:

git clone git://github.com/bassnode/thinking-sphinx.git

Mongrel memory usage

We run around 100 mongrels between all our servers. 90% of them serve hundreds of XML requests a second, around the clock. We’ve been following the ’scale-out’ methodology, but it can only go so far before you have to look at squeezing more performance out of your hardware and software before buying that next $3,000 server.

The problem we were having was, without a restart, 5 mongrels will eat all the memory (2GB) on any of our quad-core, 64bit servers in about 30 hours. Note that this is not necessarily a generalization about mongrel but in my current application of it. The mongrels in question are running rails code that creates around 1GB of text in the production.log per hour. The bigger the production.log gets, the more memory mongrel/rails eats. Clear the log, mongrel gets back its memory - no restart needed. This doesn’t happen on our front-end GUI mongrels, just the ones creating all the logs.
Graph of available memory over the course of 1 day (Ruby 1.8.4 / Mongrel 1.1.1)

Rails production.log grows to around 1GB in size. production.log only truncated at 1AM:

free_mem_ruby184_mongrel111-no_rolloff.png

Rails production.log truncated every hour:

free_mem_ruby184_mongrel111-with_rolloff.png

My first thought was that it was something in my code that was obviously leaking memory. That was until I realized that just truncating the production.log freed the memory. That leads me to believe the issue is either in rails and/or the Logger class.

UPDATE (2008/03/20): Tom Werner discovered leaks in Logger

I’ve tried different versions of things: the situation looks only slightly better on ruby 1.8.6 than 1.8.4. I’ve also tried older versions of mongrel without any luck. I recently added the rails patches by the guys at Pluron which helped out quite a bit with CPU usage (25% drop!) but not much with memory.

I’m mostly posting this for historical purposes. I hope to have some time soon to get down and dirty with Valgrind or something similar to see where the memory is really going. Hopefully I’ll have an update soon about how I solved the problem! Any insight is welcome.

I am a workaholic

The passionate worker doesn’t show up because she’s afraid of getting in trouble, she shows up because it’s a hobby that pays.
I agree completely.  I try to explain this to my wife and she’s hard-pressed to understand.  But it’s true - I love what I do.

An intro of sorts…

This blog has more or less languished over the past couple years.  I never really got into the daily blogging rhythm, so there’s not much here to speak of.  That said, I’ve obtained a lot of experience and learned a lot over the past 2 years at my current job.  I figure I might as well try to blog about it since it could prove useful to someone else.  And, if not, it will at least be a way for me to supplement my horrible memory.

So, a bit of history.  Over the past (almost) 2 years, I’ve worked for Litmus Media as a ruby developer.  We are a web company specializing in click fraud protection.  We were acquired by Think Partnership about the time I came onboard and not much has changed.  I was brought on to write an web-based advertising platform that would allow advertisers to create marketing campaigns to be distributed through our click-protected network.  The platform is called DirectAds and includes a web UI, XML service (for serving the ads) and click processor. 

I choose the Ruby on Rails framework for various reasons.  The biggest reason was that I was tired of my current tools (perl, PHP, and tons of scattered libraries) and Rails looked like it was sizing up to be a great framework.  Plus, ruby just looked so sexy - how could I deny it?  So, 2 years later I can’t complain at all.  Although, like I mentioned, there were certainly hurdles to overcome when using this new, fancy-pants technology.  I plan to expand on my experiences in future posts.

Sebastian Tellier vs. Radiohead

A friend sent me a Sebastian Tellier song the other day that I quite liked. Though, immediately upon hearing it I realized it sounded a lot like something off the new Radiohead album. You be the judge:

Sebastian Tellier - La Ritournelle (Sessions, 2005)
YouTube
flash version w/ drums

Radiohead - Reckoner (In Rainbows, 2007)
YouTube

Not a direct rip either way, but very similar. Also, I’m not sure who wrote their’s first. I’ve read reports online saying Radiohead played a version of Reckoner as far back as 2001. At any rate, it’s a good riff!

Carlton breaks it down

Oh my goodness
A Carlton breakdance instructional? I need it!

Yes, like the Carlton from The Fresh Prince of Bel-Air.

Fox news: Trees are harming our environment

What can you say? These people are purveyors of crap.
Video here

MIDI rules!

Nothing like reading Pops off the 128-number address of your favorite jam to get the party bumpin’
Oh yeah: YouTube link