Wednesday, October 05, 2016

RubyConf 2015 - s/regex/DSLs/: What Regex Teaches Us About DSL Design by Betsy Haibel

This afternoon
we're going to be speaking about regexes
and specifically their DSL design
and what we can learn from it when we're deciding
on our DSLs.
So just to keep everyone on the same page,
we're going to start with a quick introduction
to regular expressions for anyone
who's not familiar with them, or for anyone
in the audience who could use a refresher
since they haven't worked with them in a while.
Here's the simplest regex I can think of.
It searches a given text for the letters d, o, and g
in that order and with no characters between them.
So it'll match any of these strings here.
And here's a less trivial example.
In this one, we use the period wild card
to match any character.
Since this wildcard matches any character,
the regular expression d.g, which is now on the screen,
thank you Google, can match the strings
dig, d space g, d exclamation point g,
or a lot of other things.
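
The slides for these two patterns are not part of the transcript. As a minimal Ruby sketch of the same idea, with test strings chosen only for illustration:

```ruby
# Literal pattern: the letters d, o, g, in that order, nothing between them.
dog = /dog/
# Wildcard pattern: d, then exactly one character of any kind, then g.
wild = /d.g/

puts dog.match?("hot dog")   # true
puts dog.match?("d o g")     # false, there are characters in between

["dog", "dig", "d g", "d!g"].each do |text|
  puts "#{text.inspect} matches d.g: #{wild.match?(text)}"
end

puts wild.match?("dg")       # false, the period needs one character to consume
```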
There are a lot of other little wildcards.
They can match more specific things as well.
Word characters, white space.
God damn it.
Even a thing called a word boundary,
which is the first or last character
of any given word.
Both characters and wild cards can be grouped,
if the default groupings aren't powerful enough,
and you can specify the number of characters
to be matched with other wildcards, like star and plus.
The specifics matter less right now than the mere fact
that there are a lot of things you can do.
Getting a little more complex,
you can use capture groups to single out
specific subsets of your match for special treatment,
and a back reference to refer to a previously
captured group
later on within a single regular expression.
Also, Peter Piper picked a peck of pickled peppers.
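
What the slide did with the Peter Piper sentence is not recorded here. A small sketch of a capture group plus a back reference, assuming a made-up pattern that finds an accidentally doubled word:

```ruby
# (\w+) captures one word; \1 requires that exact captured text to appear again.
doubled = /\b(\w+) \1\b/

match = doubled.match("Peter Piper picked picked a peck")
puts match[0]   # the whole match: "picked picked"
puts match[1]   # the first capture group: "picked"

# No word is repeated back to back here, so there is no match.
puts doubled.match("Peter Piper picked a peck").inspect   # nil
```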
So we've got all these building blocks,
and individually they're pretty simple.
Come on, good little computer.
There we go.
And I'm not going to pretend that all regex are simple.
This, for example, is an email validation regex
that someone, somewhere, for some reason,
recommended that other programmers use in production.
The simple elements that make up regexes can be combined
in ghastly hieroglyphic-esque ways and often are.
So at this point, you may be wondering some things.
Things like whether it is possible
to learn about designing DSLs or indeed
about designing anything from something
that produces screenfuls of mess
and that can't even fully parse
an email address in the process,
because, of course, that email validation regex
I just showed you did not actually work.
And the answer is that regex are old.
Like C, like shell scripts, like Vim,
regex are gawky and horrible and everyone has used them
for decades, anyway.
They are too bloody useful to erase.
They are too bloody useful to give up,
no matter how much we try to replace them
with tools that are not only aesthetically prettier.
Anything that bloody useful has to teach us design lessons,
whether its surface seems polished or not.
The biggest goal of software design,
over and above how elegant things are,
is getting the damn thing to work.
And regex, bless them, do that, if nothing else.
And that, as we will see later in this talk,
is because they get to cheat, but we can
still learn from the ways they cheat.
So how old are regex anyway?
They were first defined as a mathematical concept
back in 1958.
They were an outgrowth of set theory
used for describing the grammar of regular languages.
A decade later, they were implemented
as a simple, independent programming language.
Note that this first implementation
treated them as a programming language
in their own right.
A few years after that, they began to see wider use
when they were embedded into a concrete tool,
the Unix utility grep.
They then became embedded in more and more powerful tools
such as sed and awk, and were embedded
into the programming language Perl in 1987
as a first class language concept.
In other words, regular expressions
got a lot more powerful and useful
and therefore a lot more used
when they became a domain-specific language
for string processing, embedded
within a more general purpose language.
In the 28 years since Perl came on the scene,
regex implementations have been baked into countless other
programming languages.
We're at the point where they're considered
a language feature rather than
a language in their own right.
Most programmers have forgotten that or never knew.
And when I frame regex historically like that,
contrasting their early days as a programming language
in their own right, with their modern days
as an embedded DSL, it naturally leads to a question.
What are DSLs anyway?
Are they appreciably different
from programming languages?
While I don't necessarily think
that the c2 wiki is an authoritative source,
it's someplace where a lot of smart people
have had a number of informed opinions over the years.
And they define DSLs, in a consensus
that was reached through a sheer stunning amount of debate,
as programming languages
designed specifically to express solutions to problems
in a specific domain.
There are a lot of spirited discussions
about the merits of this pattern,
because two programmers and three opinions
and c2 wiki, but it's universally agreed
by all of these programmers with all of these opinions
that both their potential beauty and the potential horror
of DSLs stems from their place
as languages in their own right,
because languages are difficult to design.
They also do some talking about
whether regex are actually a DSL,
fascinatingly enough.
A lot of people don't think they're complex enough
to count as a language.
To each their own, but I'm the kind of person
who will die on the hill that CSS and SQL
are also programming languages.
And regex have far more complex control structures,
even if these control structures are not
actually powerful enough to avoid
this kind of email validation regex,
and to let you express those ideas
in a more concise fashion.
But that cautionary tale aside,
which is absolutely what we think of
when we think of regex in fear,
in the wild, most production regex are a lot closer
to this basic example,
and while d.g isn't necessarily what we think of,
it's a perfectly valid regular expression,
and it exemplifies one of regex's
genuine intuitive strengths.
It's just not that far a leap to figure out
that a regex containing the letter d
will match on the letter d.
More generally expressed, we can call this feature
of regular expressions tight domain integration.
Oh wow, I timed that right.
Remember, DSLs are programming languages
designed specifically to express solutions
to problems in a specific domain.
When DSLs tie themselves closely
to the quirks and structure specific to a domain,
they get a leg up in solving domain-specific problems.
This is something that goes a bit deeper
than the ordinary programmer superpower
of naming things.
You're not just importing concepts
from the problem domain into Ruby,
you're replacing the logic of Ruby
with the logic of that problem domain.
Regexes get to cheat a bit
when it comes to this tight domain integration.
They're a text-processing language,
and they're written using text.
Most DSLs we write don't get that automatic cheat.
But we can express this tight domain integration
with a little more work to figure it out.
For example, we're going to build a query language
that runs Twitter searches.
Targeted Twitter searches specifically,
and we'll start with the simplest query possible,
which is searching my Twitter feed
for photos of my cat.
We can see here why that is the simplest thing possible,
or we will in about 10 seconds.
At this point, right, you don't really need
a DSL to express the thought.
A simple hash interface would convey my intents clearly
and implementing that interface would be
far more straightforward.
But what if you want photos of cats
in my general social circle?
Suddenly a more complex query language
starts to make sense.
These two examples are roughly comparable.
But when we start to add more complicated logic
around the network diagram of my Twitter friends,
our quote-unquote simple hash interface
starts to look a lot less simple.
This hash will be difficult
for the search function to parse,
and difficult to actually use.
It would be difficult to document
and difficult to remember.
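
The hash slides are not in the transcript. As a rough sketch of how such an interface tends to grow, with every option name below invented for illustration:

```ruby
# Hypothetical options for the simple case: photos of my cat from my own feed.
simple_query = { from: "betsythemuffin", has: :photos, about: "cat" }

# Hypothetical options once the social graph gets involved.
social_query = {
  has: :photos,
  about: "cat",
  from: {
    any_of: [
      { user: "betsythemuffin" },
      { followed_by: "betsythemuffin" },
      { followed_by: { followed_by: "betsythemuffin" },
        except: { blocked_by: "betsythemuffin" } }
    ]
  }
}

# Even measuring how deeply a caller has to nest takes a recursive walk,
# which hints at what the search function itself would have to untangle.
def depth(value)
  case value
  when Hash  then 1 + (value.values.map { |v| depth(v) }.max || 0)
  when Array then value.map { |v| depth(v) }.max || 0
  else 0
  end
end

puts depth(simple_query)   # 1
puts depth(social_query)   # 4
```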
This is happening because we're defining our API
on Ruby's terms rather than our domain's terms.
It starts to look like a bad DSL, actually,
and specifically one without tight domain integration.
In the first example, by admitting
that we were writing a DSL, we were able to maintain
a tight focus on the core domain concepts
which ultimately led to a smoother design.
Now you'll note one thing that I am not saying here.
A lot of people talk about strings like this
as examples of successful API design
because they're Englishy.
What's actually happening, though, is more complex.
The two examples we're going to be looking at
in about five seconds are both RSpec
from different years of the framework.
They're both, I suppose, Englishy,
in the loose way that we were using the term before.
That is to say they both use English words to name things
and their grammar occasionally causes those English words
to flow together in a way that apes an English sentence.
The top example is definitely
the Englishier of the two.
It's pretty much a sentence in its own right.
But it's been supplanted by the second style as RSpec
has evolved, which goes against what we'd expect
if Englishy was always the goal of API design.
It's been replaced by that for a lot of reasons,
among them a much cleaner implementation.
It actually isn't any harder to work with, in practice,
which goes against the idea that Englishy is the goal.
The mark of a good DSL isn't
how closely it approaches English,
it's whether it enables programmers to write programs.
The RSpec DSL neatly encapsulates domain concepts
like test cases and assertions, achieving the same tight
and necessarily intuitive domain integration
that regex achieved by having dog match dog.
And only some of RSpec's tight domain integration
comes from choosing good names for things.
The vocabulary of the DSL makes sense,
but languages are made of grammar
as well as vocabulary, and this brings us
to our next big principle of good DSL design.
Namely, composability.
If I want to make a regex that searches for either dog or cat,
the answer is pretty easy.
Regex's grammar is simple and, for the most part, intuitive.
Ors or a combination of back references
are really as complicated as it ever gets.
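
A minimal sketch of the dog-or-cat case in Ruby; the sample strings are assumptions, not the slide's:

```ruby
# The vertical bar composes two patterns into "either one".
pets = /dog|cat/

puts pets.match?("my cat")    # true
puts pets.match?("hot dog")   # true
puts pets.match?("bird")      # false

# Parentheses limit the alternation to part of a larger pattern.
plural = /\b(dog|cat)s\b/
puts "cats and dogs".scan(plural).inspect   # [["cat"], ["dog"]]
```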
Since all it's doing is providing this facility
for simple text matching, and since it's made out of text,
it once again gets to cheat, and for the most part,
lean on its own structure to develop a grammar.
Since most domains aren't quite such natural fits
for one character after the next,
they each develop more complex composition rules.
When we build Ruby DSLs, we are building languages
that are implemented in Ruby and which lean
on the Ruby parser, and because of that,
we're constrained by Ruby's grammar
in deciding which composition rules to adopt.
In practice, this leads us to our three basic shapes.
The first and simplest is the class macro DSL.
Specifically, the class macro
with a lot of configuration options.
This sort of example is useful as
a type of hook interface between a library
and classes that want to make use of its features.
It's how a lot of the Rails framework, for example,
is expressed, as well as a lot of
image attachment libraries.
It's not necessarily that expressive
because you can only build concepts with it
that can be expressed in a configuration hash,
but it's easy to read, and easy to implement,
and hard to screw up.
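
As a sketch of the class macro shape, assuming a made-up attachment library (the module, macro, and option names are all hypothetical, not any real gem's API):

```ruby
# Hypothetical library module; including it adds the class macro.
module Attachable
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # The class macro: stores a configuration hash and defines accessors.
    def has_attachment(name, styles: {}, max_size: nil)
      attachments[name] = { styles: styles, max_size: max_size }
      attr_accessor name
    end

    # Per-class registry of everything declared with the macro.
    def attachments
      @attachments ||= {}
    end
  end
end

class User
  include Attachable

  has_attachment :avatar, styles: { thumb: "100x100" }, max_size: 2_000_000
end

p User.attachments
```

Everything the macro can express has to fit in that options hash, which is the limit on expressiveness described above.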
The next most complex of the DSL styles
we're going to talk about is method chaining.
In this style, which will hopefully appear
on the screen now, you use a series of methods
that return self to build code sequences
that continuously refine what an object means
before using that object.
This is a very common JavaScript DSL structure,
but in the Ruby world, I've mostly only seen it
in test libraries like Mocha mocks or RSpec matchers.
Honestly, I wish it were used much more often.
Since it's designed around the idea
of continuously modifying objects,
it's easy to manipulate and reason about,
and it can be bent to match
a lot of different domain models.
In our example Twitter query DSL,
our composition rules focus on the shapes
of the relationships that people have with each other.
In Mocha, they focus on the different properties
of mock objects.
In each case, the grammar which defines
how elements can be composed
also echoes the domain structure.
In other words, tight domain integration
matters at both the vocabulary and the grammar levels
of a domain-specific language.
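
The Twitter query slides are not reproduced in the transcript. One possible sketch of a method-chaining builder for that domain, where the class and every method name are assumptions made for illustration:

```ruby
# Hypothetical query builder: each method refines the query and returns self.
class TweetQuery
  attr_reader :criteria

  def initialize
    @criteria = { authors: [], media: nil, keywords: [] }
  end

  def from(user)
    @criteria[:authors] << { user: user }
    self
  end

  def from_people_followed_by(user)
    @criteria[:authors] << { followed_by: user }
    self
  end

  def with_photos
    @criteria[:media] = :photos
    self
  end

  def about(word)
    @criteria[:keywords] << word
    self
  end
end

query = TweetQuery.new
                  .from("betsythemuffin")
                  .from_people_followed_by("betsythemuffin")
                  .with_photos
                  .about("cat")
p query.criteria
```

The vocabulary (from, followed by, photos) and the grammar (one refinement after another) both come from the domain rather than from Ruby's hash syntax.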
The last common Ruby DSL style is the block structure.
In its simplest form, the one-level block DSL,
it's a common choice for tiny configuration DSLs.
It provides a really pretty interface
with a minimum of implementation.
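
A minimal sketch of a one-level configuration block, assuming a hypothetical Mailer library with invented settings:

```ruby
# Hypothetical library with three settings and their defaults.
module Mailer
  Config = Struct.new(:host, :port, :retries)

  def self.config
    @config ||= Config.new("localhost", 25, 0)
  end

  # The whole DSL: hand the config object to the caller's block.
  def self.configure
    yield config
    config
  end
end

Mailer.configure do |c|
  c.host = "smtp.example.com"
  c.retries = 3
end

puts Mailer.config.host      # smtp.example.com
puts Mailer.config.port      # 25, untouched default
puts Mailer.config.retries   # 3
```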
Come on.
There we go, little computer.
There we go.
You can also build nested block DSLs.
Since the style pushes you toward code
that takes on a tree or a nested structure,
it's a strong choice when the pattern
echoes the landscape of that domain.
In the Rails routing DSL, for example,
the tree shape echoes the directory structures
that web browsers visually imitate.
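
A toy sketch of a nested block DSL with a tree shape. It is loosely modeled on the look of a routing file and is not how Rails implements routing; the class and method names are assumptions:

```ruby
# Toy route collector: nesting in the source mirrors nesting in the paths.
class RouteSet
  attr_reader :paths

  def initialize
    @paths = []
    @prefix = []
  end

  def draw(&block)
    instance_eval(&block)   # run the block with this RouteSet as self
    self
  end

  def namespace(name, &block)
    @prefix.push(name)
    instance_eval(&block)
  ensure
    @prefix.pop             # leave the namespace even if the block raises
  end

  def resource(name)
    @paths << "/" + (@prefix + [name]).join("/")
  end
end

routes = RouteSet.new.draw do
  resource :posts
  namespace :admin do
    resource :users
    namespace :reports do
      resource :sales
    end
  end
end

puts routes.paths
```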
This block structure is a common one in Ruby DSLs.
It defines a grammar that feels removed
from the ordinary one method after another rhythm
of Ruby, and so it feels DSLy
in the same way that arranging things in sentences
feels Englishy.
It's not that hard to implement necessarily
from a lines-of-code perspective,
but because it relies on passing blocks of code
in between different concepts,
it's sometimes hard to reason about.
When things go wrong, it can be difficult
to intuit or even find the context
in which any given line of code is executing.
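
One concrete way this goes wrong, sketched with two invented classes: when a DSL evaluates your block with instance_eval, instance variables inside the block belong to the DSL object rather than to the object you wrote the block in.

```ruby
# Hypothetical DSL object that evaluates blocks in its own context.
class Builder
  def initialize
    @items = []
  end

  def item(name)
    @items << name
  end

  def build(&block)
    instance_eval(&block)   # inside the block, self is now this Builder
    @items
  end
end

class Report
  def initialize
    @title = "Q3"
  end

  def lines
    title = @title
    Builder.new.build do
      item @title   # nil: @title is looked up on the Builder, not the Report
      item title    # "Q3": local variables are still captured by the block
    end
  end
end

p Report.new.lines   # [nil, "Q3"]
```

Nothing on the calling side shows which self the block will run under, so the caller has to read the library to find out.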
And this leads to one of my most common frustration points
with other people's DSLs.
Namely, them using the block structure inappropriately
because it looked DSLy.
The slide illustrating my point
should appear in a few seconds.
But in the interest of time...
The abstraction that they tried to implement
with these inappropriate block structures
doesn't neatly fall into a nested structure, necessarily.
And so when I write code that tries to fit
what I'm trying to express within this nested structure
that doesn't fit it very well,
I wind up needing to pass around procs a lot
or use a bunch of instance_evals or both
in order to get things done in a DRY way.
Worse yet, because I'm passing around
all of these blocks that are evaluated
in various contexts that I know
very little about immediately,
I need to read the library's code
and really know a lot about what contexts
these blocks are being evaluated in.
I need to care about the internals
in a way that I wouldn't necessarily need to care
in a less leaky abstraction.
And to be frank, this talk was inspired
by a DSL that made me do that.
It also was designed in a way that wasn't easy
to extend or modify, and so I wound up needing
to monkey-patch it a lot.
It was a really bad perfect storm of frustration
and so I was trying to write a talk
to figure out why I hated that entire process and...
So all through the project I was working
with that DSL on,
I wanted two big things from it.
I wanted it to be easily extensible with ordinary
object-oriented techniques, so that I didn't need
to monkey patch it all the time,
and I wanted to be easily able
to merge blocks of code written in that DSL.
In other words, I wanted
better composability.
And when I started working on this talk,
I figured that those two were the same thing.
I really did think I was going to wind up proving
that DSLs were irrelevant, and I was wrong.
And here's why.
Regexes are made of strings.
You can trivially build a regex with Ruby
using perfectly ordinary string manipulation in Ruby.
You don't need to use class_eval and feel dirty about it,
the way I did in the regex examples I was showing before.
And I figured that as long as I was going to say
that you can do stuff like this with your DSL,
it was going to be perfectly fine, it was going to be great.
And this talk was just going to be about
how to make it possible to do that stuff.
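
A sketch of the kind of string-built regex being described; the word list is an assumption, and `Regexp.escape`, `Regexp.new`, and `Regexp.union` are standard Ruby:

```ruby
animals = %w[dog cat]

# Join escaped fragments into one alternation, then compile the string.
either = Regexp.new(animals.map { |a| Regexp.escape(a) }.join("|"))
puts either.source   # dog|cat

# Regexp.union does the same joining and escaping in one call.
puts Regexp.union(animals).source   # dog|cat

# Interpolating one regex's source into another composes them with plain Ruby.
wrapped = /\b(?:#{either.source})s?\b/
puts wrapped.match?("two cats")   # true
puts wrapped.match?("catalog")    # false, the word boundary blocks it
```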
But if we accept that domain specific languages
are just languages, then what actually is the difference
between combining regex fragments with Ruby
and intermixing Ruby with other languages?
What's the difference between the regex with embedded Ruby
up top and the JavaScript with embedded Ruby below?
There isn't all that much of one,
and if we poke at our instinctive "ew" reaction
to that JavaScript with embedded Ruby,
we can figure out why.
So in this example, we're initializing a JavaScript
array and then using embedded Ruby to manually build up
a set of literal push calls that reassembles a Ruby array
in JavaScript world.
When I've seen this first example in the wild,
and yes, I have seen it in
three different production code bases,
God help me,
it's generally been in the context of an application view.
In other words, the developer was writing that code
to transfer an in-memory Ruby array on the server
to an array on the client.
But of course, there's another more widely accepted way
to do that, it's the example below.
You just write an API endpoint on the server
that returns the array, and then the client-side
JavaScript accesses it using an ordinary Ajax call.
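
Neither slide is in the transcript. A rough reconstruction of the two styles using Ruby's standard `erb` and `json` libraries, where the array contents and the JavaScript variable name are assumptions:

```ruby
require "erb"
require "json"

names = ["dog", "cat"]

# Style 1: Ruby writes JavaScript statements, one push call per element.
template = ERB.new(<<~JS)
  var pets = [];
  <% names.each do |name| %>pets.push("<%= name %>");
  <% end %>
JS
generated_js = template.result(binding)
puts generated_js

# Style 2: the server only serializes data. An endpoint would return this
# body, and client-side code would fetch and parse it.
response_body = JSON.generate(names)
puts response_body   # ["dog","cat"]
```

In the first style, a name containing a double quote would produce broken JavaScript, because Ruby is pasting text into another language's source. The serializer in the second style handles that escaping itself.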
In writing the embedded Ruby, we are ignoring
an existing well-defined interface
for transferring information between the client side
and the server side,
and in ignoring that interface,
we can figure out what's going wrong.
It's not just that we're ignoring the interface, by the way.
When I first had this "ew" reaction to the array push,
I didn't actually know enough JavaScript
to understand that there was an accepted way
to not bullshit that.
But if there's a defined interface for us to ignore,
that means that we must have two objects
that the interface is between.
In this case, the objects are the Ruby server
and the JavaScript client.
But we can as easily think about that
as the Ruby and the JavaScript.
We can think of the languages as kind of objects
in a CS meta sense.
This is a little easier to understand
when we look at the regex example.
It's very clear that the two different objects
are the languages themselves.
And if a chunk of any given language
is an object in its own right, and again,
in some very interesting meta sense,
then what we're doing when we use Ruby
to compose a regex or assemble a JavaScript array,
is crossing those object boundaries of the language.
Those interpolated Ruby strings are not actually
spiritually different from using instance_eval
to call a private method.
They are reaching into the JavaScript's business
and messing around with it, which is part of why
code generated using this method is so very hard
to understand and debug.
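
For comparison, this is the Ruby-level boundary crossing being alluded to, sketched with an invented class:

```ruby
class Safe
  def initialize
    @combination = "12-34-56"
  end

  private

  # Deliberately not part of the public interface.
  def combination
    @combination
  end
end

safe = Safe.new

begin
  safe.combination
rescue NoMethodError => e
  puts "public call refused: #{e.class}"
end

# instance_eval runs the block with safe as self, so the private method is reachable.
puts safe.instance_eval { combination }   # 12-34-56
```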
And as soon as we've got that mental framework in place,
what's the difference between interpolating Ruby
into JavaScript like the example above,
and interpolating Ruby into RSpec?
And I know I just said a really weird thing.
RSpec is written using Ruby, so it sounds funny
to talk about interpolating Ruby into RSpec.
But again, in order for a DSL to be useful,
it needs to be a language in its own right.
We need to give it that respect.
And so we need to accord RSpec that respect,
and RSpec is kind of weird in this way, right,
because it expects you to embed Ruby into it,
but it expects you to embed this Ruby
in specific, cordoned-off, and well-defined places.
When you embed Ruby in a place that isn't one of those,
like by using an each loop to define
a group of similar examples,
then you're crossing language boundaries.
And it feels icky, in the way that that always does
and should always do.
If I were to try to use ordinary object-oriented techniques
to try and expand RSpec, like I wanted to be able to do
with that bad DSL I was talking about earlier,
that would also be crossing those boundaries.
When was the last time that you tried to extend the class
that all describe blocks build instances of?
For that matter, when was the last time,
outside of Sam's talk earlier,
that you thought about the fact that
describe blocks instantiate an object?
RSpec's language design successfully hides
these implementation details from you,
just like a good library and a good language should.
You don't think about C when you're writing Ruby
unless you're doing weird optimization.
More than that, it successfully obscures its own Rubyness.
We nearly forget, using it, that it was written in Ruby
and therefore must be made up of the objects and classes
that make up all Ruby implementations.
We get to do that because RSpec
has removed the need to think about it.
Instead of asking users to use ordinary object techniques
to extend RSpec, its maintainers have defined
some specific extension APIs such as
a shared example API and the matcher API.
And for matters connected to the actual purpose
of RSpec, namely the structure of example groups,
and examples and expectations,
you're expected to still not interfere.
In other words, any language's rules of composition
stay within the language.
Composability is not about how easy it is
to cross language boundaries to do whatever you want.
It's about how easy it is to do what you want
in a sensible way while staying within
the bounds of the language.
And that's great and all, but it doesn't solve
one of the problems I had with that other DSL,
the terrible one that I'm deliberately not naming.
That I couldn't do all the things that I wanted to do
with it, period.
Never mind sensibly while staying within the bounds.
That's why I needed monkey patches and internals.
And so how do we avoid that problem
in our DSL designs?
Well, we can provide a small defined extension API
like RSpec does, and that lets us define new words
in the language without bending its grammar out of shape,
but there's another way, and I like this one better.
And it's very simple.
One of the beautiful things about regular expressions
is that they search within text
and they occasionally replace text.
They do not try to do anything more.
They do not claim to do anything more.
They have chosen one specific problem space
and they don't try to solve any other problems.
As Stack Overflow's funniest answer is quick to remind us:
Regular expressions can only parse regular languages,
and those are a very small subset
of all the languages in the world.
They have their limits.
They are not a complete parsing engine for anything.
Especially not HTML.
And also, again, not to beat the dead horse,
email validations.
And that is totally okay, because they do not need
to do anything but search text.
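
A small illustration of that limit, using made-up HTML fragments. Plain patterns like these have no way to keep track of which closing tag belongs to which opening tag:

```ruby
lazy   = %r{<b>(.*?)</b>}   # stop at the first closing tag
greedy = %r{<b>(.*)</b>}    # run to the last closing tag

nested   = "<b>outer <b>inner</b> tail</b>"
siblings = "<b>a</b> and <b>b</b>"

# The lazy pattern pairs the outer opening tag with the inner closing tag.
puts nested[lazy, 1]       # outer <b>inner

# The greedy pattern swallows two separate elements as if they were one.
puts siblings[greedy, 1]   # a</b> and <b>b
```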
I'm going to call this closed domain integration.
It's not enough to integrate deeply within a domain,
you also need to go to the limits
of that domain and no further.
In order to get there, you need the flip side of this coin.
Namely, constraining the domain definition
so that you know what those limits are.
It's okay to define loose limits
with big, red, placeholder boxes,
like RSpec does, and say, "User code goes here,"
but you need to have that really specific definition.
You need to know where those boxes lie.
If you do that, it makes the problem of covering the domain
completely, one that is even solvable in the first place.
So I'll start wrapping up now.
As Rubyists, we are not going to stop creating DSLs
any time soon, it's one of the things
everyone jokes about us, but actually it's a strength
because DSLs are very powerful
and they're kind of cool when they're done right.
So the question then becomes,
how do we write the good ones rather than
the ones that Erin is having feelings about right here?
And so you can treat your DSL like you would any other API.
You can expose what people need,
you can close off the other stuff.
You can stay close to the domain you're describing
and have sensible composition rules,
and you can keep everything small enough to complete.
Getting there, though, is again, a very hard problem.
While a good DSL is often more usable
than a good vanilla library API,
a bad DSL is much less usable as we've all experienced
than a bad vanilla library API.
I'm not saying right now that you're doomed to screw up,
because, obviously, you've seen this talk
and every DSL you design from now on
is going to be perfect, but
a good DSL is a lot more work than
a decent vanilla API, and that's something
that you've got to respect.
You're going to need to write that decent vanilla API
anyway in order to implement the DSL,
and so I'm going to suggest that you do that first
and figure out if you need more,
and let things lie like that.
That's everything I need to say right now,
I'm Betsy Haibel again, I'm betsythemuffin on Twitter,
which is going to pop up on the screen
in about five seconds.
I'm very sorry about the AV issues,
I'm not entirely sure what's going on
with Google Docs.
This talk is going to be up on my website
at the URL on the screen shortly after this talk,
probably some time during the lightning talks
through dinner.
Whenever I can get a decent lock onto GitHub
with the conference internet, really.
I tweet about books, code, my cat, and feminism
@betsythemuffin, and I co-organize a meetup
back home called Learn Ruby in DC.
This is an informal space for newbies
to ask questions and find mentorship.
If you are interested in making a meetup
like that in your own hometown
or if you also run a meetup like that
and want to talk shop, then please, talk to me.
I think it's a really good model for building the community
and I would love to share nice stories
and also pitfalls so you can avoid them.
I work for a great little organization called ActBlue
that builds fundraising tech
for Democratic candidates and causes.
We focus on small-dollar donations,
which is a surprisingly powerful thing.
Our average donation size is around $30
and we've raised nearly
$850 million
over the approximate decade we've been in business.
This really helps those donors' voices be heard
in a way that keeps the party accountable
to the voices of people who only have $30
to spare at a time.
It's something that means a lot to me.
We are also committed to building sustainability at the
kind of scale that can bring in that much money over time.
We have a modern tested stack and we have a focus
on maintaining culture that, well,
my third day was one of our biggest days of all time, right?
And pretty much everyone on my team HipChatted me
over the course of the day saying,
"By the way, Betsy, I know it's end of quarter,
"you're going to close your laptop at 5:30
"and you're going to have dinner
"and you're going to do everything but be on call."
And we're also hiring Rails, UX and DevOps people
right now, so if the values I just outlined
sound good to you, if they resonate,
then please, talk to me.
I'd love to work with you.
Many thanks to Noel Rappin, Kenzi Connor, Chris Hoffman,
Tina Wuest, and the entire membership of Arlington Ruby
users group for invaluable feedback
while I was developing this talk.
(applause)
That I personally have built.
I have not built enough things that actually required a DSL.
Like I really do take the responsibility
to go up to those bounds and no further
quite seriously, and so I've built some templating stuff
that I'm pretty proud of, but other than that
I haven't worked in any problem spaces
that I feel require that level of power.
Unfortunately, that's all been proprietary stuff
so I can't point you to a GitHub repo.
The question from Walter is whether I have
any mental litmus tests for when something
does want a DSL.
So for that, let's kind of go back a few slides.
A lot of slides.
Hi, there, now you're working.
Now you're working quickly.
What the hell.
So, if you can see in that second example on the bottom,
we're getting an increasingly complex hash interface,
and one of the things about that is that you acquire
more and more options for what any given library
access point, we'll call it that, even though that sounds
really fancy and it's not a fancy concept.
When any given method call
that's at the front edge of your API
winds up starting to take a lot
and a lot and a lot and a lot
of parameters, you should start thinking about ways to
encapsulate all those parameters within an object,
and a lot of the time, a nice, simple, method-chaining DSL
is a great way to actually build that parameter object
in a way that's clean and readable.
One of the questions I kind of anticipated getting
was someone calling me on differences between
RSpec and MiniTest, because they're very different
stylistically in terms of implementation,
but in terms of the ways
the MiniTest DSL has evolved over the years,
and the RSpec DSL has evolved over the years,
one of the interesting things is that
they've actually evolved toward each other.
I think that
it's valid
to want something like the full-on Test::Unit
prefix-everything-with-test style.
It drives me bonkers.
And through the years we've seen a lot of things,
like RSpec, like MiniTest spec syntax, like Shoulda
that attempt to impose more structure
than the test case magic API gives you,
and you know, there's no hard and fast rules
in perk, I mean, so this is going to be matters of taste,
but
the outer edges
of the RSpec spec API with describe blocks and it blocks
seem to be something that a lot of different things
just ultimately eventually decide works for test cases
even if that's not where they start out.
Cool, wonderful.
Well, I will let you all get to the lightning talks.
Thank you so much.