I’ve been working on a fork of the ThinkingSphinx Rails plugin that has now been in production for over a week without issue. The Freelancing Gods have done a great job writing easy-to-read code that’s fairly easy to extend. Most areas of the code are solid, but we needed to squeeze more out of the plugin to use it in our environment. At Validclick, we deal with a massive database of keywords that need indexing very frequently. The table in question has nearly 50 million rows and is replicated out to dozens of MySQL servers, so schema changes are best avoided.
The original delta indexing method used in the TS plugin requires you to add a tinyint column to every table that needs delta indexing. The code also automatically fires the reindexing process on each model update. To address these two issues, I forked the original plugin and came up with some solutions. So far they’ve worked great in production, but your mileage may vary.
Complex Delta’ing
For the column issue, I stole some ideas from Evan Weaver’s UltraSphinx plugin. While Evan has definitely contributed some amazing code (much of which I use daily), trying to extend his Sphinx plugin was a nightmare. Maybe he’s just too smart for me, but that code gives me a headache when I read it! The one great thing about his plugin is that you can specify a delta on any column, instead of just a boolean. I stole that idea and put it in TS. So now, instead of adding a tinyint/boolean column to your table and adding this in your model:
class User < ActiveRecord::Base
  define_index do
    indexes :name
    set_property :delta => true
  end
end
You could do this, which will use the table’s existing updated_on field to create the delta index of anything that has been changed in the past day (requiring no altering of your table):
class User < ActiveRecord::Base
  define_index do
    indexes :name
    set_property :delta => {:field => :updated_on, :threshold => 1.day}
  end
end
The code currently supports only datetime fields, but that could likely be extended. The original boolean-based deltas are still supported.
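To make the idea concrete, here is a rough sketch of how a datetime field and a threshold could be turned into the WHERE condition for a delta index's SQL query. This is not the fork's actual code: the delta_clause helper is hypothetical, and I'm assuming MySQL's DATE_SUB so the cutoff is evaluated each time the indexer runs rather than when the config is generated.

```ruby
# Hypothetical helper, for illustration only (not part of the plugin).
# Builds a MySQL condition matching rows whose datetime column changed
# within the last `threshold_seconds` seconds.
def delta_clause(table, field, threshold_seconds)
  seconds = Integer(threshold_seconds) # e.g. 1.day is 86400 seconds
  "`#{table}`.`#{field}` > DATE_SUB(NOW(), INTERVAL #{seconds} SECOND)"
end

puts delta_clause("users", "updated_on", 86_400)
```

With a one-day threshold this prints a condition comparing users.updated_on against NOW() minus 86400 seconds, which is the kind of filter a delta source would need in place of the boolean column check.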
Offline Indexing
Having the Sphinx indexing process fire off on each record update just doesn’t scale. If you have an API or a multi-record edit interface, you could conceivably incur thousands of (nearly) simultaneous updates, which would obviously cause problems on your Sphinx server.
Even just having multiple Mongrel processes and (even worse) multiple servers running could cause indexing collisions or high server load. So I added a configuration setting that allows you to override the default delta indexing functionality. Just slap this in your environment.rb and TS won’t call the indexer every time a model is changed:
ThinkingSphinx.offline_indexing = true
Note that this requires you to run the meta indexing on your own. We run it through cron every 20 minutes.
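For anyone curious what a switch like this amounts to, here is a stripped-down sketch of the gating logic. It is a toy stand-in rather than the plugin's real code: the DeltaSketch module, after_save, and index_delta are all made-up names, and the "indexer" just records which index it would have rebuilt.

```ruby
# Toy stand-in for the plugin; every name except offline_indexing is invented.
module DeltaSketch
  @offline_indexing = false
  @indexed = []

  class << self
    attr_accessor :offline_indexing
    attr_reader :indexed

    # Stand-in for shelling out to the Sphinx indexer.
    def index_delta(model_name)
      @indexed << "#{model_name}_delta"
    end

    # Roughly what a save callback might do: skip indexing when offline.
    def after_save(model_name)
      return :skipped if offline_indexing
      index_delta(model_name)
      :indexed
    end
  end
end

DeltaSketch.after_save("user")        # indexer runs
DeltaSketch.offline_indexing = true
DeltaSketch.after_save("user")        # indexer is skipped
puts DeltaSketch.indexed.inspect
```

Only the first save triggers the indexer here; once the flag is on, saves return immediately and indexing is left to whatever scheduled job you set up.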
Extra rake tasks
Another thing I borrowed from the UltraSphinx codebase is a set of rake tasks. They’re a bit different from the ones in UltraSphinx, but the general idea is the same:
rake thinking_sphinx:index # Index data for all or 1 Sphinx indexes.
rake thinking_sphinx:index:all # Index data for all Sphinx indexes.
rake thinking_sphinx:index:delta # Index data for all or 1 Sphinx deltas.
rake thinking_sphinx:index:merge # Merges the core and delta indexes for all or 1 Sphinx deltas.
All of the tasks except for index:all allow you to set a MODEL environment variable to denote which model you want to operate on. The default is all pertinent indexes. To reindex just the Account data, you could issue the following:
MODEL=account rake thinking_sphinx:index
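As a rough illustration of what a task like this has to do with the MODEL variable, here is a sketch that resolves it to a list of models and builds the Sphinx merge commands. It is not lifted from the fork: the ALL_MODELS list, the helper names, and the config path are hypothetical, and I'm assuming indexes are named with _core and _delta suffixes. The indexer flags themselves (--config, --merge, --rotate) are standard Sphinx options.

```ruby
# Hypothetical list of indexed models; a real task would discover these.
ALL_MODELS = %w[account user].freeze

# Returns the single requested model, or every model when MODEL is unset.
def target_models(model_env, all_models = ALL_MODELS)
  name = model_env.to_s.strip.downcase
  name.empty? ? all_models : [name]
end

# Builds one `indexer --merge` command per targeted model.
# Assumes indexes follow the <model>_core / <model>_delta naming scheme.
def merge_commands(model_env, config_path)
  target_models(model_env).map do |m|
    "indexer --config #{config_path} --merge #{m}_core #{m}_delta --rotate"
  end
end

# config/sphinx.conf is a placeholder path.
puts merge_commands(ENV["MODEL"], "config/sphinx.conf")
```

Run with MODEL=account, this prints a single merge command for the account indexes; with MODEL unset, it prints one command per model in the list.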
I haven’t found any bugs yet, but there is surely room for improvement in the code. I haven’t pulled in all of Freelancing Gods’ latest changes, so that’s definitely on the list. If you feel like following or contributing, my fork is on GitHub:
git clone git://github.com/bassnode/thinking-sphinx.git