Archive for the ‘programming’ category

Designing web application systems that ease constraints and increase request throughput

March 8th, 2012

In 2012, I never really expected to be having conversations about why storing static file data in a database within a web app is generally a bad idea, but I have. As I laid out my reasoning, it became clear that I could blog about it too. Comments most welcome.

The goals of software development within a multi-tiered system

To me, the aims of software development within a multi-tiered system are to constantly simplify the system and to reduce the overhead on its bottleneck parts. Web apps in particular have so many interdependent parts that it’s essential that a developer doesn’t add to the complexity. On top of this, web apps generally have 2 bottlenecks (the application server and the database server), which together govern the throughput of requests through the entire system (impacting performance and cost, because we have to scale as these bottlenecks become blocked). So it’s our aim to constantly take load off these parts, freeing them up to focus on the stuff that only they can do. When we do this, we gain immediate performance benefits, plus we bake scalability and stability into the design.

We’ve all seen this with the use of page and fragment caching. We use these techniques because we understand that serving content from the web-server and file system is super fast, plus easy to make faster.

Why storing static files in a database is against the goal

It’s these inherent rules and our experience that scare us when we hear about moving activity from non-bottleneck parts of the system (the filesystem and web server) to the bottleneck parts. It’s so instinctive that we barely need to dig any deeper, because it just makes sense: if you start overloading the bottleneck parts of the system, you are working against the overall goal of the architecture, which is to increase throughput without increasing costs. Also, when you take this step away from what’s normally considered the aim of the design, you end up in new territory. How will you resolve performance issues when they arise (and they will)? You can no longer use the techniques designed to take load off the bottlenecks, because you’ve designed the system to overload the bottlenecks. It’s anti-performance by design.

So, when considering serving files to end users’ browsers, certain technologies have become the norm because they offer a low-risk, high-performance solution, and are now known to be the least constrained parts within the system (on average). Serving files from a file system via Apache or some other web server will almost always be the fastest and simplest solution, and the attention within the industry to make this faster is ongoing. The industry really cares about making this stuff fast. Also, serving content from the file system puts 0 overhead on the application, meaning that doing so frees up the system and reduces the need to add tin as more users upload and download files from your app. The beauty here is that it no longer really matters how many files have been uploaded to your web app, nor does it even matter much how many people want to concurrently download those files. Why? Because the file system and web server are not system bottlenecks; they really can handle the traffic and both are easy to scale if you need to. Now imagine moving those same files to the database, with each download streamed via the application server (because every byte would need to be streamed through the application). What do you think will happen to the bottlenecks? What will happen to overall performance as more users upload and download files? All of a sudden, you have landed in a really challenging place, where the only really obvious solution is to add tin (be it virtual or physical). Your database would need to be scaled (increasing costs), as would your application server (further increasing costs). Plus, unless you have bottomless pockets, you’ve effectively baked a ceiling into your app. At some point, you are going to have to start bouncing requests, because the cost of serving them is going to be too high.
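As a minimal sketch of the file system approach described above: the upload is copied into a directory that the web server serves directly, and only the resulting public path would be kept in the database. The helper name store_upload, the uploads/ layout and the use of a SHA-256 digest as the file name are assumptions made for illustration, not part of any framework.

```ruby
require "digest"
require "fileutils"

# Hypothetical helper: copies an uploaded file into a directory that the web
# server (Apache or similar) serves directly, and returns the public path.
# Only this short path would be stored in the database, never the file bytes.
def store_upload(source_path, public_root, original_name)
  digest = Digest::SHA256.file(source_path).hexdigest
  # Spread files over sub-directories so no single directory grows too large.
  relative = File.join("uploads", digest[0, 2], "#{digest}#{File.extname(original_name)}")
  target = File.join(public_root, relative)
  FileUtils.mkdir_p(File.dirname(target))
  FileUtils.cp(source_path, target)
  "/#{relative}"
end
```

With something like this in place, a download never touches the application or the database: the browser requests the returned path and the web server reads the file straight from disk.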

So, are there any situations when storing files in a database on a web app makes sense? The only one I can find on Google is when you’re uploading images and you need absolute integrity of the metadata. Storing everything in the database is the only way to achieve this, because there can be some discrepancy between file system metadata and the data of the linked file in the database (such as the created time stamp).

Other than this case, it just doesn’t make sense to put strain on the bottleneck parts of the system.

Things to consider if you choose to store static files in Postgres

When storing file data in a Postgres database, you have a few other things to consider:

  • When using bytea columns, you need to be careful about storing “large” files.
    A “large” file means one that exceeds the amount of memory available to the database. This is because you have to assemble the whole SQL INSERT command with all the file data, which may consume a lot of memory (and the need to escape the BYTEA data makes this even worse). For example, in PHP (because that’s the example I have to hand), there is a memory_limit option, often set to 8MB on older installations (which is quite a reasonable size), and you’ll exceed this limit with a 2MB file (you have to escape the data before using it in INSERT). You may increase the memory_limit value a little, but on a shared server the admin won’t be very happy about that (and I completely understand their reasons).
  • When using blobs, you need to be careful about blob integrity.
    You may delete a BLOB that is still referenced from a table. You may create AFTER UPDATE and AFTER DELETE triggers to remove orphaned BLOBs, or use the “lo” contrib package. Either way, none of these solutions enforces referential integrity.
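To get a feel for why escaping makes the bytea case worse, here is a rough estimate of how long the escaped data becomes inside a SQL string literal. It assumes the legacy “escape” format with the per-byte costs listed in the comments, and the method name escaped_bytea_length is made up for this sketch. Under that model, uniformly random binary data grows to roughly 3.5 times its size, which is how a 2MB file plus its escaped copy can pass an 8MB limit.

```ruby
# Rough model, assuming the legacy bytea "escape" format in a SQL string literal:
#   backslash (92)                      -> \\\\   (4 characters)
#   single quote (39)                   -> ''     (2 characters)
#   other printable octets (32..126)    -> itself (1 character)
#   non-printable octets (0-31, 127+)   -> \\ooo  (5 characters)
def escaped_bytea_length(data)
  data.each_byte.sum do |byte|
    case byte
    when 92 then 4
    when 39 then 2
    when 32..126 then 1
    else 5
    end
  end
end
```

Binding the file data as a query parameter instead of building the literal by hand avoids this particular blow-up, although the whole file still has to pass through the application and the database.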

Summary

All in all, I feel that a system designer needs to focus solidly on request throughput within a system. I don’t believe in early system optimisations, because such optimisations can’t be considered optimisations at all until you know what your bottlenecks are. Instead, I feel that following a few simple rules is the way to go:

  • Keep the design simple – if it *feels* wrong and less-than-simple, it probably is.
  • Use the “best tool for the job”. To enable access to uploaded files, file systems and webservers are simply great at serving static content and reduce overhead on your application and database. It just “makes sense”.

Running Ruby 1.8.7 and 1.9.2, RVM 1.10.3, OS X 10.7.3 and Xcode 4.3

February 24th, 2012

I recently updated to OS X 10.7.3 and RVM 1.10.3. The OS X update seems to have confused rvm by changing to a different compiler, resulting in ‘rvm install’ failing with this error:

The provided compiler ‘/usr/bin/gcc’ is LLVM based, it is not yet fully supported by ruby and gems, please read `rvm requirements`.

The solution that worked for me to install 1.8.7 was to tell rvm which compiler to use:

CC=/usr/bin/gcc-4.2 rvm install 1.8.7

This worked as expected.

However, trying to install 1.9.2 or above using the same method failed. So, I tried reinstalling Xcode, to no avail. I then uninstalled Xcode devtools (sudo /Developer/Library/uninstall-devtools) and installed osx-gcc-installer. After this, I was able to install 1.9.2 using ‘CC=/usr/bin/gcc rvm install 1.9.2’, but it reported errors:

Ruby ‘ruby-1.9.2-p290′ was build using clang – but it’s not (fully) supported, expect errors

Let’s see how well this works out. If I get errors, there must be a different compiler I can force rvm to use.

Well, the lack of Xcode devtools resulted in this error when installing rmagick (2.13.1) with native extensions:

checking for stdint.h… *** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of
necessary libraries and/or headers. Check the mkmf.log file for more
details. You may need configuration options.

So, I re-installed devtools and tried again. Let’s see how well this works.

Update

1.9.2 is working fine, but 1.8.7 is raising this error:

~/.rvm/rubies/ruby-1.8.7-p358/lib/ruby/1.8/timeout.rb:60: [BUG] Segmentation fault

Time to investigate what’s going on here!

1. rvm uninstall 1.8.7 && CC=/usr/bin/gcc-4.2 rvm install 1.8.7

Same error: ~/.rvm/rubies/ruby-1.8.7-p358/lib/ruby/1.8/timeout.rb:60: [BUG] Segmentation fault

2. rvm uninstall 1.8.7 && CC=/usr/bin/gcc-4.2 rvm install ruby-1.8.7 --force

Same error: ~/.rvm/rubies/ruby-1.8.7-p358/lib/ruby/1.8/timeout.rb:60: [BUG] Segmentation fault

3. rvm install ree

Worked!

So, it seems that this setup requires REE (Ruby Enterprise Edition) for the 1.8.7 series.
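When juggling compilers like this, it can help to ask each installed Ruby which compiler it was actually built with. The snippet below is a small sketch using the standard rbconfig library; the method name build_compiler is just for illustration.

```ruby
require "rbconfig"

# Returns the compiler command recorded when this Ruby was built
# (an empty string if the build did not record one).
def build_compiler
  RbConfig::CONFIG["CC"].to_s
end

puts "ruby #{RUBY_VERSION} was built with: #{build_compiler}"
```

Running it under each rvm-managed Ruby in turn shows whether a given install picked up gcc-4.2 or the LLVM-based compiler.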

Ensuring that Heroku does not install gems within the Bundler :development group

January 7th, 2012

Tonight, I ran into an issue with Heroku, where it was failing when installing a gem that I have within my :development bundler group:

group :development, :test, :cucumber do
  gem 'ruby-debug19'
end

As the clear solution was to prevent heroku installing gems that it didn’t need, I found this handy heroku command:

heroku config:add BUNDLE_WITHOUT="development:test:cucumber"

Running this has told heroku to ignore gems that it doesn’t need, meaning the deploy worked fine. Further details: http://devcenter.heroku.com/articles/bundler

This resolves the heroku error:

Installing linecache19 (0.5.12) with native extensions /usr/ruby1.9.2/lib/ruby/1.9.1/rubygems/installer.rb:483:in `rescue in block in build_extensions': ERROR: Failed to build gem native extension. (Gem::Installer::ExtensionBuildError)
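To illustrate what the BUNDLE_WITHOUT setting does, here is a simplified model in plain Ruby. It is not Bundler’s implementation: the GEM_GROUPS table and the gems_to_install method are made up for this sketch, which only assumes that the value is a colon-separated list of group names and that a gem is skipped when every group it belongs to is excluded.

```ruby
# Illustration only, not Bundler's own code.
# Maps each gem to the Gemfile groups it is declared in.
GEM_GROUPS = {
  "rails"        => [:default],
  "ruby-debug19" => [:development, :test, :cucumber],
}

# Returns the gems that would still be installed for a given BUNDLE_WITHOUT value.
def gems_to_install(gem_groups, bundle_without)
  excluded = bundle_without.to_s.split(":").map(&:to_sym)
  # A gem is skipped only when all of its groups are excluded.
  gem_groups.reject { |_name, groups| (groups - excluded).empty? }.keys
end
```

Under this model, excluding only development would still install ruby-debug19 (it is also in test and cucumber), which is why all three groups are listed in the heroku command above.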

Testing Paperclip generated expiring S3 urls with RSpec, Cucumber and Timecop

December 21st, 2011

The need

I have a Rails app that is using Paperclip to generate expiring urls for files stored in S3. The urls are set to expire after 1 minute. As much as I trust Paperclip and Amazon, I need tests that prove that these generated urls do in fact expire on time, and that visitors to those files after they’ve expired are prevented from accessing the file.

Disclaimer

This has been a bit of a rush, so no doubt I’ll refactor and tidy the code and this post later today / in the week.

The solution

I’ve used RSpec and Cucumber to check that the expiring urls the system generates do expire successfully. RSpec simply checks that a generated url includes the Expires parameter and that its value is set to exactly 60 seconds from now. Cucumber goes further than this by uploading files and checking if they are accessible before and after expiration.

RSpec to simply test that the expiration time generated for a link is correctly set to 1 minute

This test simply asks the model containing the attachment (in this case an “Asset” model), how many seconds from now remain before the attachment expires.

Spec

describe Asset do
  it "should return an attachment link that expires within 1 minute" do
    asset = Factory.build(:asset)
    asset.seconds_until_attachment_expires.should == 60
  end
end

This depends on a few new methods in the Asset model class, which take care of extracting the Expires param from the expiring url, and comparing to Time.now.

Asset Model Class

First, we create an instance helper method that returns the number of seconds an object’s url has left before it expires

def seconds_until_attachment_expires
  Asset.seconds_until_attachment_expires(expiring_attachment_url)
end

I decided to pass the responsibility of calculating this number to a class method. I did this because the Cucumber tests need to request the same calculation for urls that were generated in the past. If they interacted with an instance of the Asset class, by default it would return a new url each time it was asked. So, rather than clutter up the instance method with a decision about whether to issue a new url or return an existing one, I simply passed the responsibility to the class. That seems to work for now, although I might refactor it later.

Next, we create the class level method that calculates time left until expiration. This accepts a url, meaning we can test urls generated now or in the past

def self.seconds_until_attachment_expires(url)
  # Both values are whole seconds since the epoch
  seconds = attachment_expiration_in_seconds_from_epoch(url) - Time.now.to_i
  seconds.round
end

This method simply strips the time from the generated url (via the attachment_expiration_in_seconds_from_epoch method) and rounds the value.

def self.attachment_expiration_in_seconds_from_epoch(url)
  # Assumes Expires is the second "&"-separated parameter in the url
  url.split("&").second.split("=").last.to_i
end

Clearly, this is tightly coupled to the format of the generated url string, so a cleaner way should be sought. However, for now, this method is only used in the tests and it does work, so it’ll do for the moment.
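One possible cleaner way, sketched below with Ruby’s standard uri library, is to look the Expires parameter up by name rather than by position. The method names and the example url are hypothetical; the only assumption carried over from the tests above is that the generated url has an Expires query parameter holding seconds since the epoch.

```ruby
require "uri"

# Reads the Expires query parameter (seconds since the epoch) by name,
# so the order of the other parameters no longer matters.
def expiration_epoch(url)
  query = URI.parse(url).query.to_s
  params = Hash[URI.decode_www_form(query)]
  Integer(params.fetch("Expires"))
end

# Seconds remaining before the url expires; negative once it has expired.
# The current time can be passed in, which makes the method easy to test.
def seconds_until_expiry(url, now = Time.now)
  expiration_epoch(url) - now.to_i
end

# Hypothetical url, shaped like an expiring S3 link:
example = "https://example-bucket.s3.amazonaws.com/assets/1/notes.txt" \
          "?AWSAccessKeyId=EXAMPLEKEY&Expires=1324469460&Signature=abc%3D"
puts seconds_until_expiry(example, Time.at(1324469400)) # prints 60
```

If the Expires parameter is missing, fetch raises a KeyError instead of quietly returning 0, which is usually what you want in a test helper.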

Finally, to ensure that RSpec, Cucumber and the app all interact with a url that expires exactly 60 seconds after it is generated, we create a model instance method that generates the link. All requests for the link call this method.

def expiring_attachment_url
  attachment.expiring_url(60)
end

Cucumber to test actual file access via the browser

Cucumber takes things 1 step further. It interacts with all the same methods that we created on the Asset model, but also goes off and uploads attachments and then tries to access them before and after they’ve expired. We use Timecop to create expired urls, and a Cucumber before hook to ensure all scenarios run from the current time by default.

Scenarios

@selenium
Scenario: Viewing an active attachment on an object

Given some object has been created and a plain text file attached
When I visit the object’s attachment url
Then I should see the contents of the uploaded attachment
And I should not see “Request has expired”

@selenium
Scenario: Viewing an expired attachment on an object

Given some object has been created and a plain text file attached
When I visit the object’s attachment url after it has expired
Then I should not see the contents of the uploaded attachment
And I should see “Request has expired”

features/support/hooks.rb

Before do
  Timecop.return
end

NB: For the sake of completeness (even though we’re not calling Timecop from our RSpec specs), to be completely satisfied that Timecop isn’t affecting our specs in any unexpected way, we add the same to spec_helper.rb too:

spec/spec_helper.rb

config.before do
  Timecop.return
end

Steps

This is where Timecop offers a wonderfully simple way of generating expired urls.

When /^I visit the object’s attachment url after it has expired$/ do
  # First, go back in time 2 minutes and generate the expiring url, and make sure it’s set to expire in 1 minute
  Timecop.freeze(Time.now - 2.minutes) do
    @url = current_object.asset.expiring_attachment_url
    Asset.seconds_until_attachment_expires(@url).should == 60
  end

  # Next, return to the current time and make sure the previously generated expiring url has now been expired for 1 minute
  Timecop.return
  Asset.seconds_until_attachment_expires(@url).should == -60

  # Finally, go visit the expired url
  visit @url
end

When run, cucumber correctly reports that expired urls result in the user seeing the message “Request has expired”, and non-expired urls correctly provide access to the uploaded file.

Summary

Although a rough and ready solution, and most likely needing refactoring, it does provide us with a way to test expiration of uploads to S3.

I hope you found this useful.

Getting up and running with Git and Rails on EC2

November 6th, 2011

Here’s a short list of things to do to get Git and Rails running on EC2:

Install Git

  • sudo yum install -y git

Install Rails

  • sudo yum install -y rubygems ruby-devel gcc libxml2 libxml2-devel libxslt libxslt-devel mysql mysql-devel
  • sudo gem update --system
  • sudo gem install rails