Software Development: RubyConf 2015 - Stately State Machines with Ragel by Ian Duggan

Wednesday, October 05, 2016

RubyConf 2015 - Stately State Machines with Ragel by Ian Duggan

Welcome everyone, this is Stately State Machines with Ragel.
Thanks for coming to my talk,
I guess there was a couple to choose from,
so I appreciate that.
I'm a little jet-lagged, so if I seem like I don't know
what I'm talking about, that's probably why.
Who here has heard of Ragel before or worked with it,
just so I know (mumbles) online, okay.
Okay, all right, we'll just get started.
So the goals for this talk are to convince you that
Ragel's worth trying if you haven't used it before.
And to give you some intuition about how it works,
and to show you how to set it up
so that you can kinda do some basic parsing with it.
A little bit about me, my name is Ian Duggan.
I'm into lots of different things, I play hockey,
several times a week.
This is Stormy the Pig, where I come from, North Carolina,
Pigs mean hockey.
I play guitar.
I play banjo, mandolin, ukulele,
I have a fiddle that's gathering dust,
and my cats actually appreciate that.
Sometimes I fly, I've got a Cessna 120
that I fly out of Hayward, California.
It's got a whopping 85 horsepower, but it's fun
to putt around the Bay and the Valley with that.
I love my cats, a couple, you know,
gotta have cat pictures in a presentation, so here they are.
This is Purrington, we call him Boofus.
This is Minnie, she's a goofball, she likes to wear outfits.
She doesn't care, she thinks they're fun.
This cat likes being cozy, this one's a dopey,
Chill.
Minnie likes to sit in strange places.
So I'm a software guy, I code things.
I code the internets and the googles,
I'm a recovering technology entrepreneur,
been in and out of startup institutions for many years now.
My current status is that I work for Twitch.
We do online streaming social video games.
Obviously we're hiring, we have lots of Ruby,
the whole front end is Ruby on Rails,
the backend is getting redeveloped into Go,
for a lot of our services, 'cause we need the performance.
I've been using Ruby for quite some time now,
I started on 1.6 days, a friend of mine gave me
a 1.6 book and I just kinda played around with it,
thought it was a pretty cool language.
So I don't know when the transition happened,
but I've just kinda stuck with it since.
But enough about me, this talk is about Ragel.
Ragel is this really cool tool, and it's used in a lot of
sort of high-performance techs parsing areas,
and you know, some people seem to have discovered it,
and are using it for stuff like that,
and there's a lot of other use cases where people
are kinda flooding with regular expressions,
where I think Ragel could be really helpful.
It's got a lot of features that allow you to do
some kind of really cool tricks with it.
And if you don't have it in your bat-belt,
then I'm hoping that you'll add it today.
As I mentioned, there are many projects that use Ragel.
Mongrel, Unicorn, and Puma, are the common ones
that people know about most, Zed Shaw I think kinda
popularized it about 10 years ago when he
rewrote the Mongrel http protocol parsing API with it.
There's a really cool gem called Whitequark,
which actually parses Ruby using Ragel.
The mail gem RedCloth, Hpricot, Gherkin,
they all kinda make use of it.
So just really quickly what does Ragel look like?
We'll jump into it, this is DSL,
you can kinda define some actions, you've got some...
Ways of converting different terms into other terms.
So that's kinda what it looks like,
and we'll get into the details very shortly.
So most people are familiar with regular expressions.
Regular expressions are easy, for the most part.
People understand kind of the basic stuff
that we all understand.
Matching, grouping, alternation.
Matching zero, one over r, end times, end times, et cetera.
And Ruby has really great tools for regular expressions,
'cause it has a heritage from Pearl, Sed, and Awk,
which relied heavily on those.
So you could do fancy things like match stuff,
and in a block, match some more stuff,
and in a block, replace something else,
free regular expressions,
so you can get pretty fancy.
But at some point, they can become irregular expressions.
So if you've ever opened up a file and stared at
something like this, then Ragel might be something
you would consider.
They have their place, but there are better tools.
Sometimes you want more control,
I posit that this means you want some sort of automaton.
So just a little bit about automatons.
Like finite automata have, basically,
these devices they have states and transitions,
many people have probably seen these,
you've got circles (mumbles) and
some input comes in, causing a transition to another state.
There are two kind of basic types,
there's a deterministic finite automata
and non-deterministic finite automata.
It's just kind of an academic distinction.
Deterministic means it could just be of one state at a time.
Non-deterministic means you can be
simultaneously in two different states.
However, you could map those multiple states
as superstates of the single states and convert that
into a deterministic finite automata.
So in academia they talk about this equivalence of
regular expressions, NFAs, and DFAs.
So all this stuff that you're doing with crazy,
regular expressions you can kinda do with these other tools.
So you should choose the tool that kinda makes the job
easiest for you at that point in time.
Excuse me.
So as a superset, these are all state machines.
State machines are an important tool in
computer programming, Ragel is a wonderful tool for
creating them.
And state machines are everywhere.
They're in your stoplight, you have green,
the timer goes to yellow, goes to red,
goes to green, goes to yellow, goes to red,
goes to green, around and around forever.
They run your CPU.
If you imagine your computer, like every single bit
that 's in it, as kind of a definition of a state,
then every piece of input causes a transition to
another state, your computer is one gigantic state machine.
There's some kinda non-state machine-ish stuff
that happens on the edge when inputs (mumbles),
and stuff but for the most part,
you can think about it like that.
It's a big honkin' state machine.
And there's other examples everywhere.
So like, watches, vending machines,
I've mentioned traffic lights already.
Barcode scanners, pretty much any piece of machinery
you see these days that isn't analogue.
Has any kind of computing device in it,
it's gonna be a state machine.
So I think they are the cat's meow,
they are great for many reasons.
They're simple to understand, and there's a great
deal of research around finite automaton state machines.
With the right approach they can also produce code that is
faster, easier to maintain, and correct,
more correct, and thus more secure.
They're very specific about what inputs are accepted
and what aren't, and it's,
it's harder to get it to parse something incorrectly
than it is for it to admit something that it doesn't parse.
Which can be very useful.
If you're still not convinced, we should probably just
spend some time looking at what they are.
So let's go over some vocabulary.
So state machines, they all have something called
a start state, usually designated as zero,
this is as it sounds, like the first state that the
state machine starts at.
They also have, kind of on the other side,
an accept state, or final state.
They can have multiples of these,
just depending on what they are.
And this, when the machine transitions in this state,
it is said to have accepted the string.
Excuse me.
So within a state machine,
to go from state one to state two,
in this case when h was received,
you'd go from state one to state two,
that's called a transition, so usually you have an arrow,
in the diagram, it's labeled with an arrow.
Sorry, it's an arrow in this label.
There's a special transition called an epsilon transition.
This allows you to go from one state to another
without taking any input, so this could be useful to wire
you could take two state machines, stick 'em together,
and basically it works like a concatenation,
you don't need input to go from one to the next.
So with what we know so far, we've got a couple of
simple state machines that we build.
So if you think about a regular expression and how
it's parsing things, this is kinda what it looks like.
So a star will take any number of as,
and just kinda keep mapping on it.
As long as the first one was an a.
The second machine just maps a single a,
and then we're done.
This next one needs an a, and then,
as many as as you want after that point, so that's a+.
And from that we can get more complicated.
So this is a machine that will match
zero or more hellos.
So the start state is actually a final state,
so that's acceptable.
But once we've taken an h, we need E-L-L-O
before we can go back to a final state again.
So whenever this is in a final state,
it's accepting the input.
You may be asking, "What's the big deal?
"I mean these just look like regular expressions,
"I've done this before, I don't really need this."
So what is Ragel exactly?
Ragel is a finite state machine compiler with output
support for C, C++, C#, Objective-C,
D, Java, Ocaml, Go, and Ruby.
So if you're doing stuff across projects,
it's really useful.
It also supports a generation of several different ways of
working its way around the states.
And it's really useful for building lexical analyzers,
and protocol definitions, et cetera.
So I talked about state machine generation.
So another unique feature of it is that,
with these states and with these transitions,
you can attach arbitrary code at any point in time.
So as you're moving around the state machine,
you can pull things in and out of the input,
and jump around in the state machine, and this is,
these are the features of Ragel that make it like,
really different than just a regular expression.
The regular expression you can match a string or you can't.
With Ragel you can match half the string,
jump to the end of the string, do some stuff,
jump to getting on the string, give it some more data,
download something, upload it to YouTube,
and then come back, you can do that all inside Ragel.
If you want to, not sure I would recommend that though.
So I'm actually, wasn't sure how to pronounce it,
I've been saying Ragel forever.
Ray-something, so there's ray-gull, there's rah-jul,
there's a bunch of different ways to.
So you can get the answer from the horse's mouth,
because Adrian Thurston who created it
actually I guess sent an email.
And he said that it's actually Rah-jul,
which is not at all how I guessed.
And he said it was based on his, R and L,
for regular language, and his nickname was Ag,
even though he's Adrian, or maybe it's Ag, I don't know.
So maybe it's Ragel and maybe it's Rah-jul.
If he's Ag it would be Ragel, but he wrote Rah-jul,
so I still don't really know.
But if you know, please let me know.
So darn, I've been pronouncing it wrong
for quite some time.
So on a more basic level, Ragel is a DSL for creating
these state machines and it's especially useful,
as I mentioned, for parsing protocols, data formats.
It's incredibly powerful, it allows you to do this in
sort of an EBNF kind of format,
so you get this really declarative approach to parsing text,
as opposed to kinda while loops and for loops,
and function calls and returns.
So let's just take a look at the DSL,
and get a better idea of it.
So the general structure of a Ragel file is,
it'll be written in the host language,
so if you're in Ruby, it'll be mostly Ruby,
it'll have some sections in there which
kinda call out pieces that are Ragel.
So this is a machine which matches (mumbles).
You can name several machines,
and refer to them inside your file.
There are definitions and instantiations,
so this defines a piece of state machine that you can use,
and you can compose it into a bigger state machine.
And to actually generate states, you need to have
an instantiation of some sort,
and this is done with (mumbles) equals.
And like any kind of programming system,
you can include and import and things like that as well.
There's some slight semantic differences in terms of
what those mean, so you can (mumbles).
Whitespace is ignored.
Comments are started with the pound,
just like in Ruby.
There's different kinds of literals,
there are straight literals, regular expression literals,
grouping literals.
Got kind of standard escape characters that
we're familiar with.
And if you need to jump, if you're inside a Ragel section,
you need to jump out to the host language,
you kinda put it inside braces like this.
Numbers can be specified as either integers or hexadecimal,
which is useful because when you're parsing in Ragel,
you're actually looking at the binary value.
So if you're looking at an A, a capital A,
it's actually 65, so you need to be aware of that.
Then we've got a couple of keywords.
So these, kind of first section here,
this is the stuff that is more like a regular expression,
so this matches the word symbol.
So nothing too fancy there.
You could have Unix expressions,
similar to regular expressions.
We have zero length machines.
We have numerical literals, as I mentioned before,
like this is matching 42, so this would be the ASCII
character represented by the number 42.
And you can have regular expressions inside Ragel,
which could be a little bit mind-bending but regular
expressions actually compile that into state machine,
so this is actually pretty obvious when you think about it.
We have the range expression, so if we wanna match a to z,
same with regular expression a to z.
This is where we kind of start to jump out of what you
can do in a regular expression.
Here you can kind of name something called supercode,
or refer to it, and it definitely
plays inside the regular expression.
So you could name it once and use it
in five or six different places.
And there are a bunch of sort of built-in character classes,
any ASCII extend, (mumbles) control graph.
So for basic stuff, you can just reach to one of these,
you don't have to define it all up front.
So at this point we sort of have the basic building blocks
for building more complicated machines.
These are like levers, wedges, wheels, pulleys,
these are like the simple machines
that mankind has build.
On top of that we've build rocket ships
and gotten to the moon,
so that's what we're gonna do next.
Here's another cat picture.
So when I was first learning about Ragel,
I had heard it was useful, I wasn't qutie sure why.
I was reading through the user manual,
that's like if you really wanna understand how it works,
that's the best way to do it.
There's a really good PDF online about that,
but it has this really interesting thing
where you can take a really complex machine,
and add it to another machine, or subtract it,
or do intersection, or a difference, or union.
And it will do the math to kind of add and subtract the
states so that it does what you expect.
This is kinda mind-blowing when you see it in process.
Compositional operators are the things I just mentioned,
union, intersection, difference, strong difference,
concatenation, (mumbles) and we'll go,
we'll check into detail in a second here.
So a union you would take, this string of talk could be
matching you know, this could have a thousand states in it,
the one on the bottom could have a thousand states in it.
The state machine will,
traverse both machines at the same time,
and enter a final state.
But it does this by adding together all the states,
and lining up the final results.
Sorry, the final states inside it.
I'll get my slide here,
something about union examples I'll show.
We have intersection, so this would be,
matches any string that would be in both machines.
So this is an example of that, it's taken the machine
on the left here, and the machine on the right,
and it has merged them together
into kinda one megamachine.
So we can do a difference,
so this would take any string the first machine matches,
minus any string the second machine matches.
This is the regular, this is the,
state machine that you get from that expression,
so as you can see, it's pretty complicated,
and coming up with that yourself will be a mess,
and trying to do that in code will probably be a mess,
but it's actually pretty simple here in the DSL.
You have the notion of strong difference,
this'll match any string in the first,
sorry, any string of the first machine that does not
have any string of the second machine as a substring.
So one example would be I'd wanna match
any character except you know, control,
character return line feed.
Character return line feed is like a,
is a compound string as well.
You know, concatenation, and this is just
taking the epsilon states out of one machine,
setting up epsilon states which come out of the
final states of the first machine into the second machine,
which creates like a global machine.
Concatenation example.
You have other things related to regular expressions,
like based on (mumbles).
More basic repetition.
We've got the optional operator.
Optional example.
So this would admit any string that the
machine doesn't launch.
Occasionally (mumbles), sorry.
Special form of character-level negation.
If you just need to get negate on a single character,
there's a special operator for that, that's this.
When you go through the work
of putting these machines together, feeding them into Ragel,
it does a kind of final process called
state machine minimization where,
it does some optimizations of the states within
the state machine and...
Any of them that are redundant, it can kinda subtract out.
So you may have a thousand, you may initially have
a thousand states that you've kind of added together,
that could subtract it down to like 250 that are equivalent.
So I mentioned before like one of the powerful things
you can do with Ragel is attach user actions to
different parts inside the state machine,
so this is some examples how you would do that.
You can define an action statement inside the DSL.
And then attach them to the transitions, so,
there are different kinds of transitions,
there are entering transitions, so I'm going into a state.
You can attach there, you can attach to a finishing
transition which is coming out of a state.
You can attach to every single transition,
or you could attach to just the leaving transition,
and you would do this, you would choose based on
kinda what you're doing, which one of these you wanna do.
And the embedding operators can get really fancy,
there are just like two or three pages in the manual
about how to do this, so if you need more details,
just check there.
One of the problems we can come up with
is this notion of nondeterminism,
so you may have two machines that are stuck together,
and you can't get out of the first state to match on
the beginning of the second.
You can't get out of the first machine to match
on the beginning of the second machine because
they're matching the same characters.
So an example would be this machine here.
So the new line in the whitespace
will prevent the final newline from matching.
So the solution is to exclude it from the beginning.
There's also ambiguity problems, like if you're trying
to match like a C comment, for example,
you might say okay, match slash star,
give me anything and then give me slash star.
But you'll never match the slash star,
because you're matching on any.
You can do things like, okay match on anything
except slash star and kinda add things together like this,
but that gets ugly.
Ragel fortunately gives us some tools that are
easier to use out of bounds,
so you don't have to kinda do this ugly stuff.
The way it does this is it allows you to set
priorities on these states, and so,
you could say that I want the transitions that
go into the second machine to have higher priority.
So one of the techniques it gives you is a
special notation called guarded operations.
And these are called guarding concatenations,
so an example of the one we just looked at,
how do I match, say, on a C comment,
when you use those kinda colon greater greater,
this causes the slash star
in this final machine to match more strongly.
So I'm gonna match on any as long as it's not a
star slash and then it'll kinda come out of the machine.
So we've got, there's a couple different forms of that,
whether like kinda the right machine is stronger
or the left machine is stronger.
Or if you want like a longest match but you don't wanna
kinda stay inside it.
So one of the things that you,
so that's kinda for simple matching,
but a lot of times we're trying to build a scanner,
like we wanna scan a file, we just wanna recognize
some things in it, like one after another after another,
having keep going.
So Ragel has a notion of a scanner.
And this will keep matching on the input,
and it will take the longest match that I can,
and recognize that time after time.
So an example of that would be so if you cut out,
and this is a parsing headers.
So right here we're saying a header,
within a header, I can match on a word,
a space, or a new line, and this is what I'm gonna
do for each of those, so it's gonna keep matching on word,
there's a space that's gonna ignore it,
there's a new line that's gonna,
after a turn we're just actually kinda coming back
to where it called F call.
So as I mentioned, it's really useful for protocol parsing.
This is what Zed Shaw did, that's what Puma uses it for.
So if you want like a good detail example of how to like
parse http using Ragel, Puma I think is the best resource.
And...
Up to this point we've kinda been thinking sort of
on the level of regular expressions, but,
on the lower level, these are just states of transitions,
and Ragel actually lets you kinda go down into,
what it calls the assembly language of state machines,
which is just saying I want these states,
I want these transitions, I want these characters
to go from one to the next.
So you can also define those, so this is an example of
parsing an XML CDATA body.
So this actually defines three states,
start, one, and two.
And some of the transitions based on what it's matching
to go from one state to the next.
And you can use F colon F return to kinda
set up function calls within Ragel,
which is useful as well.
When you start parsing stuff with Ragel you're gonna,
you're probably gonna reach for it,
instead of running a scanner LX,
maybe I could use Ragel.
One of the problems with it is parsing recursive structures.
Regular expressions are not good for that,
and Ragel's equivalent to a regular expression,
more or less.
So how can you do that?
The trick is to take advantage of these actions that
you can embed, and when you recognize within your parse
that you are looking at a sort of a nested construct,
you can in your host language set up like a stack.
And push context through that and jump out,
and jump back, and look at your stack when you need to.
So because you've got these embedded actions,
you can like step outside what regular expressions allow.
You can also implement lookahead.
So the trick here is you would match on a couple
of characters farther than you want.
And then you can call fhold which will kinda
back the parser up, it's where each time you call,
it backs it up one character.
So...
Internally, when you're working with Ragel,
it has all these, it has these variables that you're
gonna interact with, and they represent the state.
So there's data which is a buffer which is
where you kind of stored your string,
so we'll see in a second here like when you,
when you call Ragel you basically set up data in a buffer.
You set a variable called P which points to the
front of the data, PE which points to the end.
And then you just tell Ragel to go and it'll match
on everything between P and PE.
When more data comes in, you come tack it on to the end,
move P and PE, and like tell it to go again.
So if you wanna match one character at a time,
you can just line up P and PE,
and just kinda move them along.
And when you're inside a scanner,
it actually uses some other variables, TS and TE,
because in order to do the scanning and do a longest match,
it actually looks ahead and stores some data on a stack
and backtracks when it needs to.
So roughly, you start in state 0, you feed it data.
It runs the exec loop,
the characters move through the state.
You consume this P to PE data.
And it's done, if your current state matches
the final state which gives you a list of that means
that you admitted a string.
For scanners it's similar except it uses ts te to stack.
The way you extract a string, when you're,
so this is like, as you're moving through the scanner,
like you'll use this kinda slash mark,
that marks the beginning of something you're interested in.
And this percent would say like, I'm making a transition,
that's what I wanna admit.
So when I enter, I store where I was, so data P,
and then when I leave I store data PE.
I store from my mark to my PE, I pull that out,
and I store it into my things array for example.
As I mentioned, there's a bunch of different host language,
languages and coding styles.
One of the nice things you can do with Ragel is you can
write a generic parser, and just refer to all these actions
by name, and then you could have multiple
other files that you would include.
So I could have one definition of http,
and then I could have my actions defined in C,
and my actions defined in Java,
and that's actually what Puma does.
You can also prototype something in Ruby,
get it working how you want, and then go,
convert it to C pretty easily.
'Cause you have most of your logic inside Ragel.
So the directives that you need to put in your file
are an init, which sets up (mumbles) state.
Data, which has all the states and transitions.
And exec is actually, causes it to run that loop
that I was describing.
So this is an example, some of the code that it generates.
So installing it, on a Mac it's really simple,
just install Ragel.
And to generate the Ruby from a Ragel file,
you just say ragel-r for Ruby.
Simple.rl would be the Ragel file in this example,
and you give it the output,
so this will build your Ruby file.
And we'll take a look at it in a second.
And...
Debugging these things, it's kinda hard to do visually,
or can be hard to do visually,
so Ragel actually it lets you,
emit these out to a graphviz dot file.
Which you can then kind of make an SVG,
or a PNG out of it, kinda look at all the states,
so I'll show you one of those in a second.
So this is an example of calling it from Ruby.
So this would parse the data, so take the data,
we have to unpack it into the binary format,
that I mentioned from characters.
We set the EOF as a variable that Ragel uses
to know when it gets to the end.
Tokens is our variable where we're storing things.
Call init call exec and now it's gonna process
everything that you put inside data.
So somewhere in the actions it's actually pulling
things out and storing them in the tokens array here.
So I created a tool here for kind of
doing some visualization of this,
so I'm gonna pull that up.
Just figure out how to get onto the other screen.
Can you see that?
I actually can't see it, so I'm gonna turn around here.
Actually you know what, that's dumb,
I'll just mirror the displays.
Where is my cursor.
Where did it go?
Lost a window here.
One second.
So this is actually a little app that I built in Volt,
which is a pretty cool framework if you guys haven't
used Volt, that'd be good to check,
but it allows me to kinda experiment here,
and kinda see what these graphs are.
So normally to build these, I would have to run a couple of,
command line commands, but I just kinda set it up
in here so I can do stuff like this.
So I've got examples of like the different machines in here.
So this is a machine that matches on simple.
Oops sorry, go to auto-update.
So this is an example of the state machines being created
by this.
So if I would match on SIM, or PLE, it would look like that.
Or if I wanted more of those.
Let's say I wanted one more of those.
So I can kinda build these things up and play with it.
But I don't have the code for this referenced here.
This is a, presentation window there we go.
This is irritating.
So if I want I can kinda create a new machine here,
so we'll just call this new machine.
So let's take a look at some examples here.
Everybody see that?
All right, so this is an example of the machine,
well it's like the machine I showed earlier,
where it was matching, well, this is the machine
I showed earlier, it's matching on zero (mumbles).
So I've set up sort of a Ragel parser base here.
Take a look at that.
So parser base I have, I'm setting up some variables
manually here, I have a way of feeding it data.
I have a way of asking if I'm in a final state.
I have a way of pulling out the correct match.
And I just sort of, I have it outputting some debugging
data here which I'll show you on the console in a second.
But I just have this set up so that it'll read from
standard so I can use it for experimentation.
So this is a hello, this is hello parser.
And this is what the file you work with looks like.
And then we take a look at like the generated file.
Oops, sorry, the hello.
Dot rb, so this is the generated file.
And Ragel is doing all this work for you.
So this setting up all the states up top here,
this is a pretty simple state machine,
so there's not much going on here.
But this is the init that we saw,
this is the perform method,
and this is the exec that was buried inside of it,
so it's setting up all the states.
It's doing go-tos and jumps and matches, et cetera,
and so on, so all this really nasty code,
which is heavily optimized, that's why it's so fast.
So this particular example here...
That one's not working, let's try the next one.
Oh, it's coming to that one as well.
So I typed h, you can kinda see that it's
in current state one now, P is at one, PE is at one.
And we have one piece of data at h.
If I say E-L-L,
our final state at the bottom here is still false,
so we're not in a final state, I press O,
we've now entered a final state, so we've matched on hello.
I press H again, we're not matching.
E-L-L-O.
If I press any other character, I'm gonna step outside of
what it knows, so I'm not in an error state,
and I'll never match again.
So that's an example of, a real basic example.
I have some arg parsers to find here, so there's,
I mentioned state charts and using state charts,
so I wanted to find here using
the sort of standard approach.
So this is the main part of my parser,
and I'm just saying I wanna match on an arg,
dash and flags, or dash dash long opt.
Long opt I have defined as a bunch of alphas.
Flags will be a bunch of alphas as well.
And what this is gonna do is, this parser's gonna
go around and around on a loop, and when it sees
dash dash long opt, it's gonna call long opt.
So that Ruby code is actually pretty simple there.
And these were the actions up here.
So I called mark arg, when I need to kind of
mark this a little bit.
Oh really, can't see that?
How do I change the color?
Yeah.
How do I do that?
What's that?
(audience member speaking)
Where is my system preferences?
Accessibility?
Colors, there you go.
Is that more legible?
Okay.
So what I'm doing here in the code is,
when alpha is kind of matched for the first time,
I'm kind of marking the beginning of the,
marking the beginning of the argument,
marking the beginning of the flags,
marking the long ops, and then when I get to saw arg,
which is what the scanner calls,
I'll use that previously-marked thing to kinda
pull out what was there.
Wow that's funky.
Let's see, should be...
So I could do dash, asdf.
And see it outputted flags asdf,
so that was (mumbles) there.
So now I could do like a long one,
I could say long opt,
so it matched that long opt.
So that's an example of kinda doing basic,
argument parsing in there.
And then I have an example of this,
in the sort of state chart for it as well.
So state chart's a little bit more complicated.
When you draw it out on paper it's simple,
and you can do the .biz visualization of this as well.
But I start and I start state here.
If I see a dash, I'll go to pre flag.
If I see a quote, I'll jump to quoted value.
And if I see anything else I'm gonna kinda start a,
a value arg.
So that's an example, this is one called,
val arg is just kind of a word,
quoted v, quoted word.
Flags would be this and like a long,
long flag would be that.
So the basic idea here is that you create some buffers,
and as you match or kinda go around in a circle,
in a state, I'll keep appending to that buffer.
And then when I leave the state I'll omit what I want,
looking up what I've stored.
The Ruby code for this again is very simple,
most of the work is being done up inside the parser.
So this is an example of the same thing,
but you can say asdf, and then val arg,
and I got a val arg.
Or I could say quote, quote the,
and I got a quoted value.
Space brings me back to start.
I could say dash dash long equals,
value long, quote,
quoted, and I got a quoted.
So I have links to all this code,
so after the presentation if you guys wanna like
pull it up and kinda like dig through here,
figure out what's going on, it's in there.
Let me go back to the presentation here,
if I can find it in this color.
The presentation got closed.
So while I'm looking for this,
does anyone have any questions?
I'm not familiar with what,
so Treetop's all in Ruby right?
Yeah so Treetop is...
It may be similar, presumably they might be doing
kinda similar techniques under the covers.
I'm not super familiar with how Treetop does things.
But this is super highly optimized,
like for the C version it can actually,
actually one of the versions compiles it down,
it just jumps within a while loop.
So like it'll go to the bottom, it'll go to,
it'll go to, go to, go to,
it's almost like optimized assembly.
So I'd be surprised if it wasn't faster,
but it probably really depends on the use case.
No definitely, it allows you to,
it allows you to do chunking, so you can kind of
think about your problem in smaller pieces,
and unit test those, and then compose them together.
So as I mentioned, the composition is,
there's two main things that make it really cool,
the composition is one of them, that's kinda mind-blowing.
And then the fact that you got these actions, so,
I can not be in a programming language,
and then suddenly be in a programming language.
Okay.
Right right.
Yeah, I mean if you're,
does Treetop do other languages as well?
Okay, yeah so if you,
if you were writing an http parser and you wanted it to work
quickly on JRuby so you wanted it down in Java,
and also like in CRuby, so you wanted it down in C,
then this would be a good choice.
Right.
Yeah so the question is like, how do you know,
this is using instance variables,
and kind of local variables, and so how would you like
turn this into kind of like a composable unit?
So that's actually kinda what I tried to demonstrate here.
So I created this Ragel parser base,
and I have several parsers which kind of inherit from it.
The default is to use local variables, which obviously,
it's kind of painful.
Part of the DSL, and you probably can't see it again,
so let me break this.
Yeah, so this is, it depends on where you put these,
it depends on where you put these things.
So I put these inside this object that I've created.
So whenever first to at, it's gonna be
in the context of this object.
Inside the context of this method.
And you can remap what it does,
so normally it'll just refer to P,
I have it referring to at P,
'cause I set up this stuff up here.
Oh, hey, there it is.
So I showed the demos and that was it,
so I've got these resources here.
I have the slides up, and I've got the,
I've got the playground code up there.
That's one of the features I wanted to add to it,
but I kinda ran out of time.
So I hope that was useful, sorry if I droned on.

Software Development

Pages

Wednesday, October 05, 2016

RubyConf 2015 - Stately State Machines with Ragel by Ian Duggan

About Me