ThinkingSphinx Rails plugin fork

I’ve been working on a fork of the ThinkingSphinx Rails plugin that has now been in production for over a week without issue. The Freelancing Gods have done a great job writing easy-to-read code that’s fairly easy to extend. Most areas of the code are solid, but we needed to squeeze more out of the plugin to be able to use it in our environment. At Validclick, we deal with a massive database of keywords that need indexing very frequently. The table in question is nearly 50 million rows and is replicated out to dozens of MySQL servers, so schema changes are best avoided.

The original delta indexing method used in the TS plugin requires you to add a tinyint column to every table that needs delta indexing. The code also automatically fires the reindexing process on each model update. To address these two issues, I forked the original plugin and came up with some solutions. So far they’ve worked great in production, but your mileage may vary.

Complex Delta’ing

For the column issue, I stole some ideas from Evan Weaver’s UltraSphinx plugin.  While Evan has definitely contributed some amazing code (much of which I use daily), trying to extend his Sphinx plugin was a nightmare.  Maybe he’s just too smart for me, but that code gives me a headache when I read it!  The one great thing about his plugin is that you can specify a delta on any column, instead of just a boolean.  I stole that idea and put it in TS.  So now, instead of adding a tinyint/boolean column to your table and adding this in your model:

class User < ActiveRecord::Base
  define_index do
    indexes :name
    set_property :delta
  end
end

You could do this, which will use the table’s existing updated_on field to create the delta index of anything that has been changed in the past day (requiring no altering of your table):

class User < ActiveRecord::Base
  define_index do
    indexes :name
    set_property :delta => {:field => :updated_on, :threshold => 1.day}
  end
end

The code only supports datetime fields now, but that could likely be extended. The original boolean-based deltas are still supported.
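
As a rough illustration of the idea (not the fork’s actual code), a timestamp-based delta boils down to a cutoff comparison on the chosen column. The helper name delta_condition and the shape of the generated SQL are assumptions made for this sketch; the threshold is given in plain seconds so it runs without ActiveSupport.

```ruby
# Hypothetical helper: builds the WHERE fragment that selects rows changed
# within the threshold. The name and the SQL shape are illustrative
# assumptions, not the plugin's real implementation.
def delta_condition(table, field, threshold_seconds, now = Time.now.utc)
  cutoff = now - threshold_seconds
  "`#{table}`.`#{field}` > '#{cutoff.strftime('%Y-%m-%d %H:%M:%S')}'"
end

one_day = 24 * 60 * 60 # the same number of seconds as 1.day in Rails
puts delta_condition('users', :updated_on, one_day)
```

Anything whose updated_on is newer than the cutoff would land in the delta index; everything older stays in the core index until the next full reindex or merge.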

Offline Indexing

Having the Sphinx indexing process fire off on each record update just doesn’t scale. If you have an API or a multi-record edit interface, you could potentially incur thousands of (nearly) simultaneous updates, which would obviously cause issues on your Sphinx server. Even just having multiple mongrel processes and (even worse) multiple servers running could cause indexing collisions or high server load. So I added a configuration setting that allows you to override the default delta indexing functionality. Just slap this in your environment.rb and TS won’t call the indexer every time a model is changed:

ThinkingSphinx.offline_indexing = true

Note that this requires you to run the delta indexing on your own. We run it through cron every 20 minutes.
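
To make the behaviour concrete, here is a minimal sketch of how such a flag could gate the per-save indexing. Only the offline_indexing name comes from the setting above; the DeltaSketch module, its after_save method and the indexer lambda are hypothetical stand-ins, not the plugin’s real internals.

```ruby
# Hypothetical stand-in for the plugin's module. Only the offline_indexing
# flag name comes from the post; everything else is illustrative.
module DeltaSketch
  class << self
    attr_accessor :offline_indexing
  end
  self.offline_indexing = false

  # Called after a record is saved. Returns true if the indexer was run.
  def self.after_save(record, indexer)
    return false if offline_indexing # leave the work to cron
    indexer.call(record)
    true
  end
end

calls = []
indexer = ->(record) { calls << record }

DeltaSketch.after_save(:user_1, indexer) # indexed immediately
DeltaSketch.offline_indexing = true
DeltaSketch.after_save(:user_2, indexer) # skipped, picked up by the next scheduled run
p calls
```

With the flag on, saves stay cheap and the cost of indexing is paid once per scheduled run instead of once per update.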

Extra rake tasks

Something else I snagged from the UltraSphinx codebase was a set of rake tasks. They’re a bit different from the ones in US, but the general idea is the same:

rake thinking_sphinx:index         # Index data for all or 1 Sphinx indexes.
rake thinking_sphinx:index:all     # Index data for all Sphinx indexes.
rake thinking_sphinx:index:delta   # Index data for all or 1 Sphinx deltas.
rake thinking_sphinx:index:merge   # Merges the core and delta indexes for all or 1 Sphinx deltas.

All of the tasks except for index:all allow you to set a MODEL environment variable to denote which model you want to operate on.  The default is all pertinent indexes.  To reindex just the Account data, you could issue the following:

MODEL=account rake thinking_sphinx:index
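
For the curious, resolving the MODEL variable could look roughly like the sketch below. The indexes_for helper is hypothetical, and the “<model>_core” / “<model>_delta” index naming is an assumption made for illustration rather than a description of the fork’s task code.

```ruby
# Hypothetical helper: maps the MODEL environment variable to index names.
# The "<model>_core" / "<model>_delta" naming is assumed for this sketch.
def indexes_for(env, all_models, suffix: 'core')
  model = env['MODEL'].to_s.strip.downcase
  models = model.empty? ? all_models : [model] # no MODEL means every model
  models.map { |m| "#{m}_#{suffix}" }
end

all_models = %w[account user]
p indexes_for({ 'MODEL' => 'account' }, all_models)
p indexes_for({}, all_models, suffix: 'delta')
```

In a real rake task you would pass ENV instead of a plain hash; taking the environment as an argument just keeps the sketch easy to try out.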

I haven’t found any bugs yet, but there is surely room for improvement in the code.  I haven’t pulled in all of Freelancing Gods’ latest changes, so that’s definitely on the list.  If you feel like following or contributing, my fork is on GitHub:

git clone git://github.com/bassnode/thinking-sphinx.git

18 comments so far

I started using ThinkingSphinx because it’s really easy to use, but the delta feature wasn’t really my thing.

I also looked into the code to change the functionality to only index the records recently updated, but had trouble finding my way around the code :)

So thanks a lot for these adjustments, really handy!

joren
October 28th, 2008 at 9:31 am

Glad to hear the code is useful to you!

Ed
October 28th, 2008 at 10:50 am

I do seem to have the problem that it doesn’t recognize the ‘enable_star’ configuration in the sphinx.yml file.

Is this a known issue? And is there a way around it? Or is it just me doing something wrong? :)

joren
November 6th, 2008 at 4:44 am

The issue could be that the option to set that is actually named “allow_star”, not “enable_star”.

When I put allow_star: true into my config/sphinx.yml and rebuild the config, the enable_star code is injected into the conf file.

Ed
November 6th, 2008 at 9:22 am

When I do that, I get an error saying that my fields are prefixes and infixes at the same time:

“ERROR: index ‘order_core’: field ‘contact_name’ is marked for both infix and prefix indexing.”

joren
November 12th, 2008 at 5:21 am

Got it fixed by setting the infix length to nil.

joren
November 12th, 2008 at 5:25 am

Hi,

I’ve got a problem putting it into production. I keep getting this error:

“rake aborted!
Problem rotating indexes!
Look in frontend/db/sphinx/production for files with ‘new’ in their name - they shouldn’t be there! You may need to reindex.”

any ideas?

joren
December 2nd, 2008 at 4:12 am

That’s the check_rotate task running which tries to discern if the index rotation completed successfully. There shouldn’t be any *.new files left behind after rotation. The issue you’re talking about I’ve seen in production as well - with larger indices. I’m pretty sure the issue is related to timing. The *.new files go away, but just not before check_rotate gets run. I already sleep(5) in the method to side-step this, but for large indices, it doesn’t seem to be enough. I’ll look at ways around this.

To make sure your indices are ok - do what the exception says and look in frontend/db/sphinx/production for *.new. They shouldn’t be there by the time you log on and look. Is that correct?

Ed
December 2nd, 2008 at 11:49 am

Hi there,

I set ThinkingSphinx.offline_indexing = true in my environment.rb.

The data in my system updates quite frequently, so I plan to perform “rake thinking_sphinx:index:delta” every 10 seconds and “rake thinking_sphinx:index” every 30 minutes. But I have two problems in doing so.

First of all, I’m having trouble creating a cron job that runs every 10 seconds. It seems that cron jobs can only be scheduled in minutes.

Secondly, do I need to do anything to trigger the cron? It doesn’t seem to be working. I created the cron file as “/etc/cron.d/sphinx_index.cron”. And the following is the content of the cron file:

SHELL=/bin/bash
PATH=//workspace/CA/BETA_3/EconveyancePro
MAILTO=
HOME=/

# build sphinx index every 30 minutes
*/30 * * * * rake thinking_sphinx:index.

Thank you very much in advance.

Cheers

Canvas

Canvas
December 19th, 2008 at 3:48 pm

Hi there Ed,

One more question: it seems to me that “rake thinking_sphinx:configure”, “rake thinking_sphinx:delta” and “rake thinking_sphinx:index” are enough to build the index. Why do we need “rake thinking_sphinx:index:all” and “rake thinking_sphinx:merge”? When should I use them and for what purpose? Thank you very much.

Cheers

Canvas

Canvas
December 19th, 2008 at 3:55 pm

Canvas,
For your cron issue - there could be many reasons it’s not working. Assuming this is running under root: can you check root’s email on that box? That’s where the output of all crons will go unless redirected somewhere else. Seeing the output of the task should point you in the right direction. The issue could be path-related. I don’t use .cron files myself - I do everything via crontab, so I’m not sure if the globals you set will work or not - but that’d be what I’d check.

As for getting it to run more than once a minute: cron can’t do that (to my knowledge). Furthermore, you probably don’t actually want to re-index more than once a minute. Unless you have really tiny indexes, the runs will likely step on one another. I’d benchmark the runtime of the indexing process to see. If it only takes a couple of seconds and you have to have it run every 10 seconds, write a daemonized script to do so. Check out the daemon gem - it has everything you’d need.

For your second question: The thinking_sphinx:index:all task just runs thinking_sphinx:index. It’s there just to be thorough. thinking_sphinx:merge merges the delta indexes into the main indexes. According to the Sphinx docs, you should do this every once in a while to keep up performance. We do it once a week on our ~2GB of indexes.

Ed
December 22nd, 2008 at 7:15 pm

Hi there Ed,

Thank you very much for your information. I tried the daemon gem and it works fine.

In the daemon I run “rake thinking_sphinx:index:delta” every 10 seconds. And I then ran into another interesting problem. The delta index built on the first call of “rake thinking_sphinx:index:delta” will be gone if the rake command is called a second time after some time interval. I tried it manually and got the same result. If the rake command can not be called repeatedly, what’s the point of having it? What else do I have to do besides calling the rake command in the Daemon to make it right?

This problem really conflicts with the delta index concept in my mind. I thought the delta index keeps accumulating when “rake thinking_sphinx:index:delta” is repeatedly called until a “rake thinking_sphinx:index” or “rake thinking_sphinx:index:merge” is called.

One more question: does “rake thinking_sphinx:index” clear the delta index? What about “rake thinking_sphinx:index:merge”?

Thanks in advance.

Cheers

Canvas
December 30th, 2008 at 7:37 pm

Hi Ed,

To work around the delta index problem, I am trying to modify thinking_sphinx_tasks.rb by adding the function call “ts_merge(…)” to the rake task “thinking_sphinx:index:delta”. But I am not sure whether this is the right approach or not.

One more question: should I just run “rake thinking_sphinx:index” once when searchd is started and schedule “rake thinking_sphinx:index:delta” and “rake thinking_sphinx:index:merge” in background daemons to run repeatedly?

Thanks in advance. Any suggestion is appreciated.

Cheers

Canvas
January 2nd, 2009 at 1:08 pm

Canvas,
I don’t know why Sphinx would be removing the deltas upon a second run of the delta rake task. I run the same task hourly on our system and the deltas just keep getting recreated. The deltas should only disappear after a merge action (thinking_sphinx:merge) or a complete re-index (thinking_sphinx:index).

Check out the docs for more details on the ideas behind indexing:
http://www.sphinxsearch.com/docs/current.html#indexing

Ed
January 3rd, 2009 at 1:09 pm

ActionView::TemplateError (undefined method `per_page’ for #)

How did you resolve this one?

baldrailers
June 17th, 2009 at 3:37 am

I haven’t gotten that error. That looks like something outside of ThinkingSphinx. Can you tell me how to reproduce it?

Ed
June 17th, 2009 at 7:34 am

Hi Ed,

I have been using your fork (version 0.9.9, rails 2.0.2, sphinx 0.9.8.1) with timestamp support in delta index. And it has been working really well. Thank you a lot for the hard work.

I am now upgrading rails to version 2.3.4, sphinx to 0.9.9 and thinking-sphinx to 1.3.9. I am wondering whether you have a similar fork for the latest thinking-sphinx.

Best wishes,

Canvas

Canvas
December 11th, 2009 at 7:21 pm

Canvas,
I haven’t updated the code but I know that the latest TS gem has rolled in the timestamp delta feature. I’m not sure it’s implemented the exact same way as I did it, but it’s likely better :) Check out the GitHub page.

Ed
December 12th, 2009 at 11:02 am
