model_dataset_method_design.rdoc

doc/model_dataset_method_design.rdoc
Last Update: 2020-04-01 09:49:55 -0700

Model Dataset Method Design Guide

How you design your model dataset methods can significantly affect the flexibility of your API for your model classes, as well as the performance. The goal of this guide is to provide an example of how to design your model dataset methods for maximum flexibility and performance.

Flexibility: Use Single Method Per Task

In general, it is recommended that you have a single method per task for maximum flexibility. For example, let’s say you need to retrieve all albums released in a given year, ordered by number of units sold descending, and only care about the id, name and number of units sold. One way to do this is in your application code (outside the model), you can call the dataset methods directly:

Album.
  select(:id, :name, :copies_sold).
  where(release_year: params[:year].to_i).
  order(Sequel.desc(:copies_sold)).
  all

One issue with this design is that it ties you to your current database schema, and will make it necessary to change your application code if your schema changes. In general, it is better to encapsulate your code into a dataset method (or a class method, but a dataset method is more flexible):

class Album < Sequel::Model
  dataset_module do
    def all_albums_released_in_year(year)
      select(:id, :name, :copies_sold).
      where(release_year: year).
      order(Sequel.desc(:copies_sold)).
      all
    end
  end
end

Then your application code just needs to call your dataset method:

Album.all_albums_released_in_year(params[:year].to_i)

The advantage of this approach is that you can change your schema at any point in the future, and you should only need to change your model code, you should never need to change other application code.

Performance

After designing your dataset methods for flexibility, stop. Don’t worry about performance until you need to worry about performance. However, assuming you have profiled your application and profiling shows you can benefit from optimizing the above method, you can then consider the performance impact of future design choices.

First, considering that the root cause of the performance issue may not be at the Sequel level, it may be at the database itself. Use EXPLAIN or the equivalent to analyze the query plan for the query in use, and see if there is something you can do to optimize it, such as adding an appropriate index.

Second, assuming the performance issue is at the Sequel level, you need to understand that one of the best ways to improve performance in most ruby code is to reduce the number of objects allocated. Here is the above code with comments showing datasets allocated:

def all_albums_released_in_year(year)
  select(:id, :name, :copies_sold). # new dataset allocated
  where(release_year: year).        # new dataset allocated
  order(Sequel.desc(:copies_sold)). # new dataset allocated
  all
end

Third, you need to understand that Sequel has optimizations specifically designed to reduce the number of objects allocated, by caching intermediate datasets. Unfortunately, those optimizations do not apply in this case. The reason for this is that select, where, and order can potentially receive arbitrary arguments, and enabling caching for them could easily lead to unbounded cache size (denial of service due to memory exhaustion).

To allow intermediate dataset caching to work, you need to signal to Sequel that particular arguments to these methods should be cached, and you can do that by calling methods inside dataset_module blocks such as select and order. These methods will add dataset methods to the model that can cache the returned dataset to optimize performance. Here is an example using these methods:

class Album < Sequel::Model
  dataset_module do
    select :with_name_and_units, :id, :name, :copies_sold
    order :by_units_sold, Sequel.desc(:copies_sold)

    def all_albums_released_in_year(year)
      with_name_and_units.
      by_units_sold.
      where(release_year: year).
      all
    end
  end
end

Performance aside, this does provide a slightly nicer and more readable internal API, though naming such methods can be problematic.

By calling select and order here, Sequel expects that the created dataset methods may be called more than once on the same dataset, and it knows that the arguments to the underlying select and order methods are fixed, so it can cache the resulting datasets. Let’s comment the above example with dataset allocations:

def all_albums_released_in_year(year)
  with_name_and_units.        # cached dataset returned
  by_units_sold.              # cached dataset returned
  where(release_year: year).  # new dataset allocated
  all
end

Note that the order of methods here is important. If you instead change the method chain to filter the dataset first, then no caching happens:

def all_albums_released_in_year(year)
  where(release_year: year).  # new dataset allocated
  with_name_and_units.        # new dataset allocated
  by_units_sold.              # new dataset allocated
  all
end

This is because any time a new, uncached dataset is returned by a dataset method, all subsequent methods in the method chain cannot benefit from caching.

Usually, when you are designing methods to process data based on user input, the user input affects the rows selected, and not the columns selected or the order in which the rows are returned. Sequel is aware of this and has dataset methods that specifically take user input (arguments), interpret them as a filter condition and either:

  • Return all matching rows in an array (where_all)

  • Iterate over all matching rows (where_each)

  • Return first matching row (first)

  • Return first column in first matching row, assumes only a single column is selected (where_single_value)

After calling these methods on a cached dataset a number of times (currently 3), Sequel will automatically build an optimized loader, cache it, and use it for future loads. So the above example changes to:

def all_albums_released_in_year(year)
  with_name_and_units.           # cached dataset returned
  by_units_sold.                 # cached dataset returned
  where_all(release_year: year)  # cached loader used
end

This can significantly improve performance, up to 3x for complex method chains that only return a few rows.

So the general advice on designing dataset methods for performance is:

  • Use dataset_module methods to create named dataset methods that return cached datasets

  • If any filtering is to be done, have it done last using where_all, where_each, first, or where_single_value.

By following this advice, you can significantly increase the performance of your model dataset code.

Further Increasing Performance

The best way to further increase performance at the Sequel level is to switch to using prepared statements. This does require more significant changes to the API. Here’s an example using prepared statements:

class Album < Sequel::Model
  ALBUMS_RELEASED_IN_YEAR = select(:id, :name, :copies_sold).
    where(release_year: :$year).
    order(Sequel.desc(:copies_sold)).
    prepare(:all, :all_albums_released_in_year)

  def self.all_albums_released_in_year(year)
    ALBUMS_RELEASED_IN_YEAR.call(year: year)
  end
end

Note that when using prepared statements, you need to use a class method instead of a dataset method, as the SQL for the prepared statement must be fixed for the class. This limits the flexibility of the method, since you can no longer call it on arbitrary datasets on the class.