Wednesday, October 05, 2016

RubyConf 2015 - How to Performance by Eileen Uchitelle


I hope you enjoyed what just happened
before this. (chuckles)
(audience laughs)
Today we're gonna talk about
how to performance.
This talk is going to focus on tools and techniques
to help speed up your Ruby code.
The things we're gonna talk about today
came out of work that I did with Aaron
to speed up integration tests in Rails.
I'm Eileen Uchitelle, and I'm a programmer
at Basecamp, formerly known as 37signals.
I'm on the security and performance team,
so I make sure things run fast and safely,
and you can find me anywhere online
at eileencodes.
I'm also on the Rails committers team
which means I'm allowed to push directly to master,
and on the Rails security team.
I'm not on Rails Core, which seems to mean
I'm not allowed to Tweet from the Rails Twitter account,
and I'm not sure if that makes you feel
more or less secure.
(audience laughs)
This is my dog Arya.
She is of course named after Arya Stark
from Game of Thrones.
(audience laughs)
She has her own Twitter account, @aryadog,
and she would love it if you followed her.
Be warned, though, she likes to talk about poop a lot.
(audience laughs)
I'm from upstate New York, from a city called Kingston.
It was the first capital of New York in 1777,
but then the British burned it down twice.
And so they moved the capital to Albany,
because we couldn't be trusted with it.
(audience laughs)
Where's my British?
I'm really happy to be speaking at RubyConf this year.
Not only is this my first time speaking at RubyConf,
this is my first RubyConf ever.
I always like to mention something that I find
interesting about the city that I'm currently speaking in,
and since this is my first time in San Antonio,
I think it's really interesting that you have
this Riverwalk thing,
where you have a really, really narrow walkway,
with a river in the middle,
and you have to cross bridges
to get to the other side to go to the restaurants
you actually want to go to.
Do you people fall in a lot?
- [Voiceover] Woo!
(audience laughs)
- [Voiceover] 500 a year, I heard.
- That's crazy.
So perhaps you're wondering why
we wanted to make integration tests faster,
other than the obvious, faster is better.
And why I'm giving this talk.
While integration tests and controller tests
both processor out and make assertions
about the response,
they are two completely separate code bases in Rails.
We figured that if we make integration tests
as fast as controller tests, then we can just
combine the two code bases.
If we didn't speed up first, everybody would notice
their test speed got suddenly got slower in Rails 5
and then they would complain to me.
This has the added bonus of when we
consolidated the code bases,
nobody has to change their code.
There's no deprecation notice,
you can just continue writing tests as you were,
but for us, it's faster.
The purpose of this talk is to show you
the techniques and tools that Aaron and I used
to speed up integration tests.
We're going to go over what worked and what didn't,
and ultimately how successful we were.
Before we can talk about speeding up integration tests,
we first need to go over the difference
between benchmarking and profiling.
Benchmarking measures how slow our code is.
This is all about collecting data
on where your code currently stands.
Benchmarking tools are also instrumental
in proving we made our code faster.
We can use benchmarking tools
to measure our code before and after we make changes
so that we can be sure that we've improved the speed.
Profiling tells us what parts of our code are slow.
We can use profiling to pinpoint the exact methods
that are slowing down our code.
Improving performance is a lot harder, if not impossible,
if you don't use these two tools together.
Let's get started.
So back in January, David said to me,
Hey, I like fast things like race cars.
(audience and Eileen laugh)
And I really, really hate it when things are slow.
I want to move Basecamp's controller tests
to integration tests, but right now
they're 20 to 40% slower than controller tests,
and that's just not going to do.
Can you see what you can do about speeding up
integration tests in Rails
so they're as fast as race cars?
Of course I said yes, because this was a good opportunity,
(audience laughs)
good opportunity to improve Rails for everyone.
So if we're going improve performance,
what's the first thing we need to do?
We need to get a baseline.
This gives us something to measure against
so we know we're actually improving
the speed of our code, instead of just thinking
or hoping we are.
To get a good baseline, I decided that
it would be a lot easier to not change Basecamp code,
and rather build an application specifically for testing
the speed of integration tests.
Building a new application has many benefits.
It's public on GitHub, so you can see
all the different kinds of test I wrote
to measure the integration test performance,
and two, it removes anything that can be
getting in the way of helping me build a solid baseline,
like log in code, and all that other stuff,
because that's part of your application,
not part of the integration test code.
Of course hello world examples
won't show me where all of the slowdown is,
but you need to start with the bare minimum,
and build up from there to find the underlying issues.
Controller tests and integration tests
don't look that different.
They both define a test, process or route,
and assert something about the response.
Here, the index tests for both
are getting the index route and asserting
the response was a 200.
That's it.
To build a baseline, I started with Ruby 2.1.5
and Rails 4.2.0, because that's what Basecamp
was using at the time.
I wanted to be sure that integration tests
were slow in isolation, using the same technology
as Basecamp.
First I double-checked that David's assertion
that the integration tests were 20 to 40% slower
than controller tests was true.
He was basing this off the time that it took
to execute, so I first looked at that,
knowing that wasn't going to be my real baseline.
I ran the single controller test
that we looked at earlier,
and found that it showed 1.625 seconds of real time.
I ran the corresponding integration test,
and it took 2.12 seconds of real time.
This calculated to integration test being 23% slower
than controller tests.
That's actually a huge difference.
It would make running your entire test suite
on integration test only, quite a bit slower.
I then decided to run each of those tests
five more times, and got some interesting results.
The time a single test took to run varied a lot,
and after a few runs, it appeared
that there was no difference in speed
between controller tests and integration tests.
But I knew that this was false.
And it points to the fact that time
is not a good baseline for the type of performance work
that needed to be done here.
This is because using time only runs one iteration.
There are a lot of factors at play,
like garbage collection, memory and caching.
To use time we'd have to calculate
the average of all of the runs, and figure out
how many runs got a good average.
Time's not a good baseline, because it varies a lot,
and would have required a lot more work
than I was willing to do to prove that I made it faster.
If time isn't a good baseline,
then how do we get a baseline?
We can't start improving our code without one,
because we don't know where we stand.
A good gem for benchmarking a Ruby code
is benchmark-ips.
It was written by Evan Phoenix,
and the gem measures the number of times
the specified code can run in one second.
That means the more iterations, the faster your code is.
It completely takes the guesswork
out of figuring out the number of runs
you need to do to get a good average.
It also provides the percent of standard deviation,
so you know how far spread your results are out
from the average.
Benchmark-ips allow you to focus on the results
rather than the data points when benchmarking your code.
Here is an example of the benchmark-ips script
I used to measure the difference in speed between
integration tests and controller tests.
First we define the controller test.
It's a simple Git, and we sort the response's success.
Then we do the same for the integration test.
Again, the index route and assert the response
was a success.
Then invoke the benchmark-ips gem.
Inside the benchmark block you report
on each of the tests using many tests run one method.
We can call each of the tests inside the code
and label them integration test and functional test
so they're easy to recognize.
We then use the compare bang method
from the benchmark-ips gem to compare
how many times slower the integration tests are
than the controller tests.
I really like this method because it means
I don't have to do math.
We can run this like any other test.
And it tells us we were able to get
393 iterations of integration tests per second,
give or take 8%.
In comparison, we were able to run
920 controller tests in one second,
give or take 7%.
The data shows the integration tests
were 2.434 times slower than controller tests.
That's a huge difference, and it is definitely going to be
noticeable if you're running your test suite
on integration tests.
So now that I proved the integration tests
were, in fact slower than the controller tests,
and it had nothing to do specifically
with Basecamp's code,
I wanted to double-check that this was still true
on Rails Master, which at this time, and the latest Ruby,
which at this time was Ruby 2.2.0.
Improvements to speed are being made all the time,
so it's a good idea to be sure there is still a problem
on the latest versions.
You don't want to be re-doing work
that someone has already done on Master.
I ran the same benchmark script again,
with Ruby 2.2.0 and Rails Master.
They were about the same, a little bit slower.
If you check the iterations, they're basically the same,
and have a slightly higher percent of standard deviation.
With Ruby 2.2.0 and Rails Master, integration tests
were 2.53 times slower than controller tests.
Here, integration tests are in pink,
and controller tests are in black.
When we compare the difference
between Rails 4.2 and Rails Master
the two versions were really close
in the number of iterations, but the newest
Ruby and Rails was slightly slower.
For integration tests, we are getting about
395 iterations in one second,
while we're getting 998 iterations of controller tests
on Rails Master.
This demonstrates really how much slower
integration tests are than controller tests.
We weren't seeing that when we were using
time as a baseline.
So now we have a solid baseline,
and we can start improving our code.
Having the benchmark script means that
we can be confident that we've improved
the speed of our code, and prove that to others as well.
The next step is to find what is slowing down our code.
We can't just go changing things
without knowing where the culprits are.
That's just gonna end up with us being really frustrated.
The first gem I used to profile the integration test,
and find out what was slowing down Rails was RubyProf.
RubyProf requires Ruby 1.9.3 or higher,
and is faster than other Ruby profilers,
because it's a C extension,
and it's been around a long time,
first being released eight years ago.
The first script I wrote,
this is the first script I wrote to profile
the integration test, and find where
it was slowing down Rails.
To write this, I followed along with the first
RubyProf example on GitHub,
as one does when they use a brand new gem.
First I set up the documents integration test
so I could specifically profile the integration test stack.
Here I'm using the RubyProf API to run
the index test in a block.
This will capture the integration test stack trace.
And finally I set the printer to the default flat printer
and output to standard out.
I then run the RubyProf script in the terminal,
and it returns the following output.
RubyProf has given us a lot of information here.
I'm not sure about you, when I first saw this
I was pretty confused.
I wasn't sure what I was even looking at,
and I had no idea how to use this
to speed up integration tests.
It felt like some kind of secret code
telling me where the slowest methods were.
Let's break down what we're looking at.
First we have percentage of time spent in this method.
This is calculated by,
from time spent in the method divided by total time.
Then we have the total time spent in this method
and its children.
Next we have the time spent in this particular method,
the wait time spent in this method,
time spent in this method's children,
the number of times this method was called,
and finally the method name that was called.
Even after breaking this down,
I wasn't entirely sure what this data
was telling me to do with it.
I felt overwhelmed and frustrated.
I started reading the documentation more closely,
and playing with available options and printers.
There are a lot of options, so if you
decide to try out RubyProf I encourage you
to just play with it for a while,
until you get data that you like.
First I found a new type of printer called
GraphHTML Printer.
There are no screenshots of how these work,
so you have to like play around a little bit with them.
Instead of printing to standard out,
this code'll print to a specified file,
and when I opened that file in my browser
this is what I saw.
It's a bit more readable, but still super confusing
to understand where the real problems are.
At first glance, if you don't know how to read this,
it seems like Global Num Method is the problem,
because we're spending 100% of our time there.
But that's definitely not the case,
and I'll show you why.
A better way to demonstrate time spent in the stack
was with the call stack printer.
This is by far the easiest output to understand.
When run it creates a visual display of the stack
where your code is spending the most time.
And if we look closer at the top
we can see that it starts with 100% in Global Num Method.
This is, okay, my screen went off, here.
This is just the beginning of our stack.
You will always spend 100% of your time
in the top of your stack.
Always, because that's where you started.
You can't spend less than 100% of time there.
What we want to focus on, though,
are the parts of the call stack graph
that jet out with higher percentages.
One of them is this one, which is
Action Dispatch Runner Reset,
and we're spending 12.56% of our time there.
This could be a good place to start looking
for problems in our code.
RubyProf is a great tool for showing you the big picture,
but at the same time,
the data can feel really overwhelming and frustrating
if you're not familiar with how to use it.
There are a lot of options that come with RubyProf,
so it's worth trying them all out.
Overall, RubyProf was most useful for finding
the differences between the code stacks
for integration tests and controller tests,
but in the end we didn't really use it
to improve the speed of our code.
Once I had these results, I reached out to members
of the Rails Core team to give me some insight
into the data I'd collected.
I got a lot of feedback, and at this point,
Aaron Patterson chimes in and says,
Oh, speeding up integration tests sounds like
a really great idea for my RailsConf talk.
Let's work together on this
so that I can use your code in my slides.
And the first tool that Aaron introduced me to
was StackProf.
StackProf was written by Aman Gupta
and is a Ruby profiler like RubyProf,
but it takes a sampling of the call stack,
so you can focus on the biggest problems
and slowdowns in your code.
Let's take a look at how to set this up.
Here is an example of the StackProf profiler code we used.
Again we add the integration test code
like we did in previous examples.
Then we call StackProf and set a file
for it to output to.
Inside the block we run many tests, 3000 times.
We need to run it this many times
so we can get a real sample of our recode.
And when we run this, we can run this
like we did before, and it's just outboarding to a file.
We're gonna need to open that specific file
after the script runs.
So to view StackProf output you can type
StackProf dash dash text and the path to the dump file
that you named in your script,
and this'll show you the sampled profile
of your Ruby code.
It seems confusing at first, but it's similar
to what we learned in in the RubyProf section,
where the stack is sorted by time spent in each frame.
First we have, the first column is the total number
of samples where this frame was in the stack.
Next is the percent of time all samples spent in this frame.
Then StackProf shows the number of samples
where this was the top most frame.
Then we have the percent of time
where this method was the topmost frame.
And lastly it shows the method
that was called in this frame.
Since StackProf output is sorted by most times
from the top, we can just focus on the first line,
which says we're spending 53% of our time
in Mini Test Runable On Signal.
Let's take a look at that On Signal method.
So basically this method is just yielding.
There's a lot of other stuff, but in the end we just yield.
So to test that integration tests are really spending
53% of our time here, we can just delete all the code,
and so all the method is doing is yielding.
There's no way that just this yield
would be the source of slowness in our code,
because it's yielding something else,
so that would be where the source of slowness is.
When we run the StackProf script again,
we can see that Mini Test On Signal is no longer
in our stack, and the new slowest code
is Search For File at 14.7%.
So we're done.
This is Mini Test's problem, now.
So let's open an issue, grab a beer,
and go work on something else.
But we forgot to do something.
Something very, very, very important.
We need to follow the number one rule
of verifying that we improved performance.
I call it A B B, always be benchmarking.
Every time you make a change to improve performance
you have to benchmark your code,
'cause otherwise you have no way to prove
that you actually made it faster.
So just be sure, let's run that benchmark script again
with the change that we made to On Signal and Mini Test.
We should see a massive improvement
now that we're not spending 53% of our time there.
When we run it again we can see that,
huh?
It's still 2.53 times slower and hasn't improved at all.
After a little bit of research,
and consulting the creator of StackProf,
we found out that there is a little bug, in OSX,
that causes CPU time to be completely inaccurate,
and not correctly profile the code.
(audience laughs)
This is something to remember if you use StackProf
or any other profiler that defaults to CPU time on OSX.
Linux does not have this problem,
so you're free to use CPU mode with StackProf on that OS.
So we can simply resolve this by using wall time instead.
If we change the mode in our StackProf script
from CPU to wall time, we get accurate results
that tell us where the real problems are.
And that problem wasn't in Mini Test.
It's not even represented in this stack sample.
StackProf in wall time actually pointed
to a regression in the Delegate method,
where we were spending 28.3% of our time.
We changed the Delegated methods
and routes set to be written out.
If we run the StackProf script again,
we can see that module Delegate is no longer
on the first, on my list,
and the slowest sample we now have is 4%
of time spent in Pro-Thread Registry.
Now we need to follow our number one rule
of always be benchmarking,
so we run the benchmark script again.
We'll see the integration test went from being
2.53 times slower to 1.4 times slower.
That's quite an improvement.
Here we can see that we actually lost some iterations
in the controller tests, but integration tests
have gained about 200 more iterations per second
from where we started.
We can see that we were definitely improving
the speed, and slowly closing the gap
between the two types of tests.
StackProf is really great in helping point out
the slowdowns in your code quickly,
because it sorts by slowest profiler,
rather than giving you all of the information at once.
But an issue that we had with StackProf was that it was
hard to pinpoint problems that were in methods
that were dynamically created,
or modules that were anonymous.
In fact, you wouldn't even have known there was a problem
in an anonymous module if it wasn't for RubyVM.stat method.
RubyVM.stat is a method in the Ruby Virtual machine
that provides information about the current method cache.
This is useful for being able to see
if your method cache is getting busted between runs.
Method cache is important to the performance of Ruby
because creating new globals and constants is expensive.
Here's an example of the script we used
to track the global and constant state
of the object in our test.
First we add the same integration test
and index method that we've done
with every previous example.
And then we output the RubyVM stats
before we run the test, and then run the test again,
and then output the RubyVM stats
after we run the test.
And when we run the test with the Ruby VM code,
we get the following output.
What we want to focus on is the global constant state,
which tells us that before we ran the test,
we had 7,823 constants,
but afterwards we had 7,824.
That means that we created a new constant
while running a single integration test.
And if we ran five more tests,
we'd have five more constants.
That is not good. (chuckles)
Creating constants while running tests is expensive,
because each time we are busting the method cache,
and then Ruby needs to resolve that method again.
We found that the method that was creating
an extra constant was URL Helpers.
Now finding this method took a lot of digging,
because we couldn't rely on StackProf to find it,
or standard debugging tools,
because the module is anonymous.
We had to go spelunking in the Rails internals to find it.
I looked back at the Git History and found this method
used to be memoized, so I figured just add the cache back.
Whatever, how hard can it be?
(audience laughs)
The diff is much larger than this, but you get the idea.
URL Helpers is now memoized, and we should see
some improvement in our benchmarks.
And so we run the benchmark script again,
and we saw a minor improvement.
Integration tests went from being 1.4 times slower
to 1.36 times slower, and saw an improvement
in the controller test as well.
But Aaron pointed out that there was a better way
to fix this besides just caching the URL Helpers.
When you're working with performance
it's really easy to think that you should just
cache all of the things,
because it will just fix all of your problems magically.
But caching isn't a free solution
to all of your performance problems.
Caching comes with a cost,
and if you're able to speed up your code
without just caching you should do that first.
You don't want to put a Band-Aid
on real problems in your code.
Optimize first, cache second.
The problem was that URL Helpers was being called
when it wasn't necessary.
The integration test didn't always need URL Helpers
when it was being called and initialized,
so a better way to fix this is to move it
where it is needed, rather than caching it.
The dip is quite large, but if we look closer,
we can see that this is moving URL Helpers
from when the integration session is created
to only be called when the session is reset.
Now that we've done that, we can run the RubyVM script again
to check the Global method cache,
and here we can see that the global constant state
remains the same before and after
we run the integration test.
So if we run our benchmark code again,
we can see that this improved integration tests even more
from when we cached them.
When we cached URL Helpers we were at
1.36 times slower.
But after this change, integration tests
are now 1.12 times slower than controller tests.
This proves how expensive
busting a global method cache can be.
That's a massive improvement for a fix
that on the surface seems kind of minor.
RubyProf, StackProf and RubyVM were instrumental
in helping us find the largest problems in Rails Source
that were causing slowdowns in integration tests.
Of course, I made a lot more improvements
that I didn't demonstrate today.
But we were starting to get to the point
we were making micro-improvements to performance.
One thing that we noticed after removing
the largest culprits, was that we were spending
a lot of time in garbage collection, 11.13%.
That's because we were creating a lot of objects
that needed to be cleaned up afterwards.
To figure out why we turned to AllocationTracer.
AllocationTracer was written by Koichi,
and is a tool for finding how many objects
were being allocated by Ruby,
and where they are in your code.
AllocationTracer focuses on a very specific problem
in your Ruby code and helps narrow down
unnecessary allocations of objects.
Here is an example of the AllocationTracer script we used
to track allocations in integration tests.
To use AllocationTracer, we need to include object space.
Then we need to find the index integration test
as we've done in the past scripts.
We then set up AllocationTracer, and called
path, line and type.
This allows to find the information that we want
from AllocationTracer's output,
just to get the path aligned, and the type
that the object was called on.
Then we assigned the tracer a result.
Again, we ran the test 3,000 times
to get an accurate example of allocations,
and finally we sort the result by the number of allocations.
Reverse puts the highest counts at the top,
and then we return the first file,
because we only want to find out
what the top five offenders are,
not all of the allocated objects.
Then we run the AllocationTracer script
like we've done with the past scripts.
The output is a bit confusing at first,
but once you get the hang of the data you're looking at,
it makes a lot of sense.
This shows us the path to the file allocated in the objects,
then the line number where the objects were allocated.
Next is the type of object that was allocated.
This could be a string, array, struct, hash,
all sorts of things.
The next array gives us the total count of objects created.
The old count of objects, which are objects
that haven't been garbage collected yet,
the total age of the objects,
the minimum age of the objects,
the maximum age of the objects,
and the total memory time.
Specifically, we found that a lot
of object allocation problems were in Rack and not Rails.
At least the top five biggest offenders were.
One of the ways in which we reduced the time spent
in garbage collection in Rack was to reduce the number
of string objects being created.
A simple way to do that was to free certain strings.
Under each one of these improvements,
we'd check our allocations.
Now here we don't need to be benchmarking,
because we're specifically not,
we're not trying to specifically improve
the speed of our code, but rather reduce time
spent in garbage collection, which will
speed up the time of a code, because we're not
spending our time wasted in garbage collection.
In this run we can see that Rails has replaced
three of the Rack allocations in our top five.
This is from changes where we froze strings.
The reason that we needed to freeze the strings
was because when you assign a string key and a hash,
the string is copied, which creates more objects.
Freezing prevents this, but makes your string immutable.
Now before you declare that winter is coming,
and that we need to freeze all of the strings,
you must first prove that strings allocations are a problem.
You don't want to prematurely optimize this code,
because you're making the strings immutable.
This isn't necessary if it's not slowing down your code,
and dangerous if the string needs to change.
This should only be done if absolutely necessary,
so I hope I don't see any of you send
a giant pull request to Rails with all the strings frozen
saying that I told you to do it.
(audience laughs)
You absolutely must first prove the string allocation
is a bottleneck before freezing it.
AllocationTracer was a great tool for finding
where objects were being unnecessary allocated in Rails,
and where GC was wasting our time.
We unfortunately didn't really see much
speed improvement from this, but it's also
because these are micro-performance improvements
that can be kind of tedious.
Comparing integration tests before we started
to where we ended up, we are almost faster
than the original controller tests.
Originally, we could only run 393 iterations
of integration tests.
Now we can run 828 iterations of integration tests.
That's a huge difference,
and calculates the integration test
being almost three times faster in current Rails
when they were in Rails 4.2.0.
In fact, most of these changes
didn't actually alter behavior, and were back ported,
so they're available in the newest Rails 4.2,
and you don't need to wait for Rails 5
to take advantage of these optimizations.
I hope this helps you better understand your tools,
and how to improve performance.
Be sure to know the advantages and disadvantages
of the tools that you're using.
You need to understand the limitations
so you don't end up frustrated with your tools.
No tool can do everything, so use multiple tools.
We were only able to speed up integration tests this much
because we combined the usage of benchmark-ips,
RubyProf, StackProf, RubyVM and AllocationTracer.
And of course, always be benchmarking.
It will help you be confident that you're actually
speeding up your code, and able to defend your changes.
It may even help you find a bug in your Ruby profiler.
I want to say thank you to a few people
who helped me make this work.
First thank you to Aaron Patterson for working with me
on speeding up integration tests.
I know that there was some concern
that he was serious when he said he stole my code
in his RailsConf talk,
but it was truly a joint effort. (chuckles)
And thank you to all of the creators
of the gems that we used
and the (stutters)
RubyProf (laughs), RubyConf for having me
speak here today, and thanks to all of you for listening.
I have got Basecamp logo and happy camper stickers,
so if you want one come find me.
(applause)
What? Oh, sorry.
Oh, yeah.
He asked if there are any tools
that automatically include benchmarks
in your commit message or pull request.
There aren't that I know of,
so if you want to make one, that might be kinda cool.
But it's also because you can't,
a lot of times when you are,
if you're sending a pull request,
it might not have,
it might not have every,
you wouldn't be able to pinpoint the part
of the code that you want to benchmark, necessarily.
So you've got a pull request with 10 changes,
and only one of them really improves,
or breaks performance.
You would need to like pinpoint each of those,
and measure each of those,
and then measure them together.
And so that's why it's,
why a lot of times when you send
performance pull requests,
there's specifically performance pull requests.
And when you do that, you should include
your benchmarks and your commit message
so that when someone is going through
and looking about why did I merge this,
you don't want to have to go back and ask
for those benchmarks.
Oh, the question was, did I use other tools
that I didn't introduce in the talk?
No.
(chuckles) Not for this one.
That doesn't mean that there aren't other tools
that I would use for performance stuff,
so if I was gonna do performance of applications,
MySQL calls, I would use Mini Profiler.
So that's a really good tool, but that's
not really relevant to this, so,
there are lots of other tools that I would use
depending on the type of performance
that I'm measuring, but for this
we didn't use anything else.