I recently wanted to add full-text search to an application. I’ve used Lucene, Solr, and a few others in past lives, but this time I wanted something just as functional and a little more lightweight. After looking around I settled on Sphinx, and so far it’s worked great. Sphinx is not hard to use by itself, but since I’m in Rails, I figured someone must have written a gem or plugin for it. Sure enough, I found Thinking Sphinx, which makes it really simple.
Let’s get things installed.
To install Sphinx on Linux (see the docs for other platforms):
- Download Sphinx 0.9.8
- tar xzvf sphinx-0.9.8.tar.gz
- cd sphinx-0.9.8
- ./configure
- make
- sudo make install
To install Thinking Sphinx:
First, install the gem. There is a plugin available, but I prefer the gem.
sudo gem install freelancing-god-thinking-sphinx --source http://gems.github.com
Add to your config/environment.rb:
config.gem( 'freelancing-god-thinking-sphinx', :lib => 'thinking_sphinx', :version => '1.1.12' )
Finally, to make all the rake tasks available to your app, add the following to your Rakefile:
require 'thinking_sphinx/tasks'
Now we need to use it, but first a brief introduction to some Sphinx terms is in order. Sphinx builds an index from fields and attributes. Fields are the actual content of your search index, and they are always strings. If you want to find content by keyword, it must be a field. Attributes are part of the index too, but they are used only for sorting, grouping, and filtering. Attributes are ignored for keyword searches, but they are very powerful when you want to limit a search. Unlike fields, attributes support multiple types: integers, floats, datetimes (stored as Unix timestamps, and thus integers anyway), booleans, and strings. Note that string attributes are converted to ordinal integers, which is useful for sorting but not much else.
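To make the "ordinal integers" point concrete, here is a toy sketch in plain Ruby. It is not Sphinx's implementation: the helper name string_ordinals and the zero-based numbering are my own assumptions. It only shows why an ordinal keeps the sort order while losing the text itself.

```ruby
# Toy illustration only: replace each string with its rank among the sorted
# distinct values. The name string_ordinals and the zero-based numbering are
# assumptions for this sketch, not Sphinx internals.
def string_ordinals(values)
  ranks = values.uniq.sort.each_with_index.to_h
  values.map { |value| ranks[value] }
end

states   = ["Texas", "Ohio", "Utah", "Ohio"]
ordinals = string_ordinals(states)

# Sorting by the ordinal gives the same order as sorting by the string...
by_string  = states.sort
by_ordinal = states.zip(ordinals).sort_by { |_, ordinal| ordinal }.map(&:first)

# ...but the ordinal alone says nothing about the original text, so it cannot
# be used for keyword matching.
```

If you need to match on the text itself, index the column as a field as well as (or instead of) an attribute.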
Thinking Sphinx adds the ability to index any one of your models. To set up an index, you simply add a define_index block. For example:
class Company < ActiveRecord::Base
  define_index do
    indexes [:name, sym], :as => :name, :sortable => true
    indexes description
    indexes city
    indexes state
    indexes country
    indexes area_code
    indexes url
    indexes [industry1, industry2, industry3], :as => :industry
    indexes [subindustry1, subindustry2, subindustry3], :as => :subindustry

    has fortune_rank, created_at, updated_at, vendor_updated_at, employee_bucket, revenue_bucket
    has "reviewed_at IS NULL", :as => :unreviewed, :type => :boolean

    set_property :delta => WorklingDelta
  end
end
Most of this should be pretty self-explanatory. To index content (fields), you use the “indexes” method. As you can see, you can build a compound field by passing an array of columns. Note that :name and :id must be written as symbols or Thinking Sphinx will get confused. You can also use some SQL in your indexes statement.
To add attributes, you use the “has” method. Thinking Sphinx is pretty good about determining the type of an attribute, but sometimes you need to tell it explicitly using :type.
I will explain the set_property :delta => WorklingDelta later.
To build your index, simply run:
rake thinking_sphinx:index
After processing each model, you will see a message like the one below. Ignore it. Everything is working fine. Really.
distributed index 'company' can not be directly indexed; skipping.
If you have made structural changes to your index (anything other than adding new data to the database tables), you’ll need to stop Sphinx, re-index, and then restart Sphinx. All three steps can be done with a single rake call:
rake thinking_sphinx:rebuild
Once you have your index set up, searching is really easy.
Company.search "International Business Machines"
This will perform a keyword search across all the indexed fields for Company. If you want to limit your search to a specific field, use :conditions.
Company.search :conditions => { :description => "computers" }
To filter on your attributes, use :with.
Company.search :conditions => { :description => "computers" }, :with => { :employee_bucket => 2 }
The :with option also accepts arrays and ranges. See the docs for more information.
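Here is a rough sketch of what those three forms of filter mean, written in plain Ruby so it runs without Sphinx. The helper matches_with? and the sample companies are made up for illustration. In the real thing you would pass the same hashes to Company.search, for example :with => { :employee_bucket => [2, 3] }, and Sphinx would do the filtering.

```ruby
# Hypothetical helper, not part of Thinking Sphinx: returns true when a
# record's attributes satisfy every entry of a :with-style hash.
def matches_with?(attributes, with)
  with.all? do |name, expected|
    actual = attributes[name]
    case expected
    when Range then expected.cover?(actual)   # e.g. :fortune_rank => 1..500
    when Array then expected.include?(actual) # any one of the listed values
    else expected == actual                   # exact match on a single value
    end
  end
end

# Made-up sample data standing in for indexed attributes.
companies = [
  { :name => "Acme",    :employee_bucket => 1, :fortune_rank => 750 },
  { :name => "Globex",  :employee_bucket => 2, :fortune_rank => 40 },
  { :name => "Initech", :employee_bucket => 3, :fortune_rank => 310 }
]

by_value = companies.select { |c| matches_with?(c, :employee_bucket => 2) }
by_array = companies.select { |c| matches_with?(c, :employee_bucket => [2, 3]) }
by_range = companies.select { |c| matches_with?(c, :fortune_rank => 1..500) }
```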
Back to the set_property above. One issue with Sphinx compared to Solr or Lucene is that the Sphinx index is fixed. If you update your model, the change will not be reflected in the index until you rebuild the entire index. To get around this, Sphinx supports delta indexes. A delta index allows you to make a change and have it show up in searches without rebuilding the entire index. That said, rebuilding an index is not a big deal with Sphinx. For example, I can rebuild the Company index defined here (1.6 million records) in under 2 minutes.
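If the core-plus-delta idea is new to you, this toy model in plain Ruby may help. It is only a sketch of the concept, not how Sphinx stores anything: the class ToySearchIndex and its methods are invented for this example.

```ruby
# Toy model of a core index plus a delta index. Invented for illustration;
# it does not reflect Sphinx's storage or API.
class ToySearchIndex
  def initialize(documents)
    rebuild(documents)
  end

  # Full reindex: snapshot every document into the core index, empty the delta.
  def rebuild(documents)
    @core  = documents.dup
    @delta = {}
    @stale = []
  end

  # Delta step: index just the changed document and flag its core copy as
  # stale so the old text no longer matches.
  def index_delta(id, text)
    @delta[id] = text
    @stale << id unless @stale.include?(id)
  end

  # Search the live core entries together with the delta entries.
  def search(word)
    live = @core.reject { |id, _| @stale.include?(id) }.merge(@delta)
    live.select { |_, text| text.downcase.split.include?(word.downcase) }.keys.sort
  end
end

documents = { 1 => "computer hardware", 2 => "office furniture" }
index = ToySearchIndex.new(documents)

documents[2] = "computer software"        # the record changes in the database
before_delta = index.search("computer")   # the core index still has the old text

index.index_delta(2, documents[2])        # cheap: only one document is indexed
after_delta = index.search("computer")    # the change is now searchable
stale_hit   = index.search("furniture")   # the old core copy is hidden
```

A full rebuild folds the delta back into the core index and starts over with an empty delta.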
What does set_property :delta => WorklingDelta do? It adds an after_save callback to your model that uses WorklingDelta to perform the delta index step. Given that Workling is in the name, you’re probably guessing that I hooked this up to Workling so that delta indexing happens asynchronously.
Add lib/workling_delta.rb:
class WorklingDelta < ThinkingSphinx::Deltas::DefaultDelta
  def index(model, instance = nil)
    return true unless ThinkingSphinx.updates_enabled? && ThinkingSphinx.deltas_enabled?
    return true if instance && !toggled(instance)

    doc_id = instance ? instance.sphinx_document_id : nil
    WorklingDeltaWorker.asynch_index(
      :delta_index_name => delta_index_name(model),
      :core_index_name  => core_index_name(model),
      :document_id      => doc_id
    )

    return true
  end
end
Add app/workers/workling_delta_worker.rb:
class WorklingDeltaWorker < Workling::Base
  def index(options = {})
    logger.info("WorklingDeltaWorker#index: #{options.inspect}")
    ThinkingSphinx::Deltas::DeltaJob.new(options[:delta_index_name]).perform
    if options[:document_id]
      ThinkingSphinx::Deltas::FlagAsDeletedJob.new(options[:core_index_name], options[:document_id]).perform
    end
    return true
  end
end
Now, whenever a Company object is created, updated, or destroyed, the WorklingDeltaWorker will be called to update the delta index.
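The point of going through a worker is that the save returns right away while the indexing happens elsewhere. Here is a minimal in-process sketch of that hand-off using Ruby's built-in Queue and Thread. Workling itself hands the job to a separate worker process through a queue backend, so treat the names below (jobs, enqueue_delta, the :stop marker) as stand-ins I made up for illustration.

```ruby
# In-process stand-in for the asynchronous hand-off. Every name here is an
# assumption for illustration; Workling runs its workers in a separate process.
jobs    = Queue.new
indexed = []

# The "worker": pulls jobs off the queue and does the slow indexing step.
worker = Thread.new do
  while (job = jobs.pop) != :stop
    indexed << job[:document_id]   # a real worker would run the delta index here
  end
end

# The "after_save" side: enqueue the job and return true immediately, just as
# WorklingDelta#index does after calling asynch_index.
enqueue_delta = lambda do |document_id|
  jobs << { :document_id => document_id }
  true
end

saved = [101, 102].map { |id| enqueue_delta.call(id) }

jobs << :stop   # tell the worker to finish once the queue is drained
worker.join
```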
If you need to perform powerful searches over hundreds of thousands (or even millions) of records, give Sphinx and Thinking Sphinx a try. There are some minor feature omissions, but I think the trade-offs more than make up for them in most applications. BTW, scale is not one of the omissions. The largest Sphinx installation, boardreader.com, uses Sphinx to index over 2 billion records. Craigslist.org is probably the busiest, with 50 million queries per day.