Thursday, October 13, 2016
GOTO 2012 Introduction to NoSQL by Martin Fowler
had a couple of tracks on NoSQL databases.
My name is Martin Fowler. Stephen, who's hosting the track, asked me to kick
things off. Most of this track is going to be about the practical experience of
people making use of NoSQL databases, but this talk is the exception, because
this is really an introduction to what NoSQL databases are all about.
I'm going to do my best to cram into 15 minutes as much useful information as I
can that will give you a context for understanding a lot of what goes on
in the later talks. The first part of this is that I'm going to talk a little bit
about the history of NoSQL databases, because, as with many things, to understand
why something is the way it is, it's useful to know how it got there in the first
place.
When I started in the computer industry in the mid-eighties, it was just at the
point at which relational databases were really coming in and beginning their
rise. It's kind of hard to imagine that there was a time without relational
databases, but I remember when they were the new hot thing and there were people
arguing about whether they would be any good or not. And they brought us many
benefits. Obviously they look after the persistence of our data, and they're also
very important in the fact that they manage concurrency through transactions. SQL
has become a de facto standard language for talking to these databases. It's not
entirely standard, but it's standard enough that once you know SQL you can talk
to these different tools. They have also become very important for many
organizations for integration and reporting, which, as we'll see, has both its
upsides and downsides.
So SQL databases are a really good thing, but they also have some problems, and
the most obvious problem is one that most application developers run into as
they're working with them, which is that we assemble structures of objects in
memory, often as cohesive whole things, and then in order to save them off to the
database we have to strip them out into bits so that they go into those
individual rows in individual tables. A single logical structure, as far as the
user interface and processing in memory are concerned, ends up being splattered
across lots and lots of tables. This is referred to as the impedance mismatch
problem: the fact that we have these two very different models of how to look at
things, and the fact that we have to match them, causes difficulties. This is
what leads to object-relational mapping frameworks and all that kind of stuff.
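As a rough illustration of the impedance mismatch described above, here is a minimal Python sketch. The order structure, the two table names, and the helpers to_rows and from_rows are hypothetical, made up for this example; they are not from the talk or from any particular mapping framework.

```python
# One cohesive in-memory structure: an order with its line items.
order = {
    "id": 1001,
    "customer": "Ann",
    "line_items": [
        {"product": "book", "quantity": 2, "price": 30.0},
        {"product": "pen", "quantity": 5, "price": 1.5},
    ],
}


def to_rows(order):
    """Strip the order into rows for two hypothetical tables."""
    order_row = {"id": order["id"], "customer": order["customer"]}
    item_rows = [
        {"order_id": order["id"], "line_no": n, **item}
        for n, item in enumerate(order["line_items"], start=1)
    ]
    return {"orders": [order_row], "line_items": item_rows}


def from_rows(tables):
    """Reassemble the order from its rows (the work a mapping layer automates)."""
    order_row = tables["orders"][0]
    items = [
        {k: v for k, v in row.items() if k not in ("order_id", "line_no")}
        for row in sorted(tables["line_items"], key=lambda r: r["line_no"])
        if row["order_id"] == order_row["id"]
    ]
    return {**order_row, "line_items": items}


tables = to_rows(order)
```

One logical order becomes one row in one table plus two rows in another, and it takes explicit mapping code in both directions to get back the structure the application actually works with.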
Now the impedance mismatch problem is a sufficiently awkward problem that in the
mid-nineties people said: well, we think relational databases are going to go
away. Object databases are going to come in, and that way we can take our
in-memory structures and save them directly to disk without any of this mapping
between the two. But we know what happened there.
We didn't see the object databases. People like me, who thought they were going
to be the dominant thing in the future, were wrong, and you still listen to me,
but oh well, I guess you're easily taken in. We argue endlessly about why it is
that object databases didn't actually fulfill that potential, and I think at the
heart of it is the fact that SQL databases had become an integration mechanism.
Many people integrated different applications through SQL databases, and as a
result that really made it very hard for any other kind of technology to come in.
That led to relational continuing to be dominant right through into the 2000s.
So relational had 20 years of complete dominance of certainly the enterprise data
space, and plenty of other ones as well. Then we saw, with the science work at
the Large Hadron Collider, that they didn't really want to use relational
databases, but they had to, to some degree at least.
What changed, really, was the rise of the internet, and particularly sites that
have lots and lots of traffic: the big internet sites such as Amazon or Google or
something of that kind. As you get large amounts of traffic coming into your
data, what do you do when you need to scale things? One obvious route is to scale
things up by buying bigger boxes, but that approach has problems: it costs a lot,
and there are real limits as to how far you can go. So, as I hope you all know, a
lot of organizations, most famously Google, use a completely different approach:
lots and lots of little boxes, just basically CPUs, motherboards, disks,
commodity hardware, all thrown into these massive grids.
But here is an issue for the data storage. SQL was designed to run on those big
boxes, designed to run as a single-node system. It does not work very well with
large clusters of little boxes, and several of the big data players understood
this. They tried; I've talked to several people who have attempted to spread
relational databases and run them across clusters. The usual term that comes up
in conversation when they describe how they tried to do this is "unnatural acts".
It's very hard to do.
So a couple of organizations said: we've had enough of this, we need to do
something different. They developed their own data storage systems that were
really quite different from relational databases, and they started talking a
little bit about them, published papers and talked about what they were up to.
And it is this that really inspired a whole new movement of databases, which is
the NoSQL movement.
Now it's important at this point just to talk a little bit about where this term
NoSQL comes from. A lot of people complain about it, quite reasonably, because
they say it's a really odd term, trying to define a movement by something it's
not. And the origin is really very simple. It was this guy in London, Johan
Oskarsson, who had done a lot of work with Hadoop and things like that. He had to
go to a conference in California, and he wanted to take a look at all of these
various interesting databases that were poking around at the time, so he proposed
a meetup, a little meeting where people could discuss ideas. And of course, if
you're going to do that in the late 2000s, you absolutely need something that's
really, really important: you need a Twitter hashtag. So he asked around: what
would be a good hashtag? It's got to be short, it's got to be unique so we can
easily search on it. And a guy came up with the hashtag NoSQL.
That's all NoSQL was ever meant to be: a Twitter hashtag to advertise a single
meeting at one point in time. The fact that it has now become more than that, the
name of the whole movement, was completely accidental. Nobody thought that was
going to be the case. So, you know, this is the way language often goes, in very
unpredictable fits and starts.
So there's a whole bunch of people who turned up to that meeting; that's the list
of people there. That's not what we call the whole set of NoSQL databases, since
a lot of databases that weren't at that meeting are now considered part of the
NoSQL umbrella.
So this inevitably leads to the question of what the definition of NoSQL is, and
this is something I had to think about when writing a book about the subject.
It's important, if you're going to write a book about something, to define what
it is you're writing about. My conclusion is that we cannot define NoSQL
databases, because of this very odd history. What we can do is identify some
common characteristics of NoSQL databases, and there's a whole bunch of these.
Obviously NoSQL databases are not relational; it's actually more about
non-relational than it is about no SQL. There is a strong leaning towards cluster
friendliness, the ability to run on large clusters, because that's where the
original spark from Google and Amazon came from, but that's not an absolute
characteristic: there are some NoSQL databases that aren't really focused on
running on clusters. Most of these databases, rather interestingly, are open
source. Most of the things we generally call NoSQL databases are open source.
There are commercial tools that like to call themselves NoSQL databases, and
maybe over time open source will no longer be a common characteristic, but it is
still a common characteristic at the moment. Perhaps most importantly, they're
all things that have come out of the 21st-century website culture. There are
plenty of databases out there, going back long before relational databases, that
do not use SQL or the relational model, but we don't really call such things as
IMS or MUMPS (for those who have heard of either of those things) NoSQL
databases. So that's what I see as the common characteristics; I'll mention the
last one in a moment.
So one of the things that's interesting about NoSQL databases is that they use
different data models to the relational model, obviously, since the name says
that. And if we plot a picture of the most commonly referred to NoSQL databases,
to me what we see is that they get divided into four broad chunks based on their
data model. Let's dig into these data models a little bit more.
The simplest data model to talk about is that of the key-value store. The basic
idea is you have a key, and you go to the database and say: grab me the value for
this key. The database knows absolutely nothing about what's in that value. It
could be a single number, it could be some complex document, it could be an
image; the database doesn't know, doesn't care. You can think of this basically
as just a hash map, but persistent on disk.
Simple as that.
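To make the "persistent hash map" picture concrete, here is a minimal sketch using Python's standard shelve module as the on-disk map. The KeyValueStore class, its put and get methods, and the example keys are assumptions made for illustration; real key-value stores have their own client APIs.

```python
import os
import shelve
import tempfile


class KeyValueStore:
    """A persistent hash map: opaque bytes in, the same bytes out."""

    def __init__(self, path):
        self._path = path

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only files it under the key.
        with shelve.open(self._path) as db:
            db[key] = value

    def get(self, key: str) -> bytes:
        with shelve.open(self._path) as db:
            return db[key]


path = os.path.join(tempfile.mkdtemp(), "kv")
store = KeyValueStore(path)
store.put("order:1001", b'{"customer": "Ann"}')   # could be a document...
store.put("logo", bytes([137, 80, 78, 71]))       # ...or the start of an image
```

The only operations are "put under this key" and "get by this key"; anything smarter has to happen in the application after the bytes come back.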
Another data model that is very common is the document data model. A document
data model thinks of the database as this storage of a whole mass of different
documents, where each document is some complex data structure. Usually that data
structure is represented in the form of JSON, because JSON is what's fashionable
these days. I mean, you could do it in XML, but who wants to be seen wearing XML
in public? No one. So we have these different documents that all float around,
and the usual document databases will allow you to say: give me a document that
has these fields with these values. You can query into the document structure,
and you can usually retrieve portions of the document or update portions of a
document. So that's the big difference from the key-value store: where that is a
very opaque structure, the document is much more transparent.
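The kind of access a document model offers can be sketched in a few lines of plain Python. The documents list and the find and update helpers below are hypothetical stand-ins for illustration, not the API of any real document database.

```python
documents = [
    {"_id": 1, "customer": "Ann", "status": "open", "total": 67.5},
    {"_id": 2, "customer": "Bob", "status": "shipped", "total": 12.0},
    {"_id": 3, "customer": "Ann", "status": "shipped", "total": 40.0},
]


def find(docs, criteria, fields=None):
    """Return documents whose fields match the criteria, optionally projected."""
    result = []
    for doc in docs:
        if all(doc.get(k) == v for k, v in criteria.items()):
            if fields:
                # Retrieve only a portion of the document.
                result.append({f: doc[f] for f in fields if f in doc})
            else:
                result.append(dict(doc))
    return result


def update(docs, doc_id, changes):
    """Update a portion of one document in place."""
    for doc in docs:
        if doc["_id"] == doc_id:
            doc.update(changes)


ann_shipped = find(documents, {"customer": "Ann", "status": "shipped"},
                   fields=["_id", "total"])
update(documents, 1, {"status": "shipped"})
```

None of this is possible against an opaque value: the store has to be able to see inside the document to match on fields or return part of it.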
One thing to notice right away about document databases, and indeed all NoSQL
databases, is that they don't tend to have a set schema. With a relational
database you can only put data into the database as long as it fits the schema
that you've defined for that database. With almost all NoSQL databases you can
basically shove in anything you like, any stuff you like just goes in there, and
the NoSQL people will talk endlessly about how this increases your flexibility
and makes it easier to migrate data over time. It's all absolutely wonderful and,
as usual, that's not really the entire truth.
I mean, usually when you're talking to a database you want to get some specific
pieces of data out of it. You're going to say: I would like the price, I would
like the quantity, I would like the customer. As soon as you're doing that, what
you're doing is setting up an implicit schema. You are assuming that an order has
a price field, you are assuming that the order has a quantity field, you're
assuming that it is called price and not cost or price-to-customer or whatever
other thing you could think of. That implicit schema is still in place, and
you've got to manage that implicit schema in many ways similar to the way that
you manage the stricter relational schema. So "schemaless" is really a bit of a
weaselly term here.
Now, having no fixed storage schema does give you some options that you don't get
with relational databases. There is a difference, and there are advantages in
terms of flexibility as well, but you can't ignore the fact that you are always
dealing with an implicit schema. The only time you don't have to worry about an
implicit schema is if you do something like: give me all the fields in this
record and throw them up on the screen as field name and value. Occasionally you
want to do that, but most of the time you actually want to do something more
interesting.
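A small sketch of what an implicit schema looks like in practice. The field names and the order_total and try_total functions are assumptions invented for the example.

```python
def order_total(order: dict) -> float:
    # Implicit schema: this code assumes fields named exactly "price" and
    # "quantity", even though the store itself never checked for them.
    return order["price"] * order["quantity"]


good = {"price": 10.0, "quantity": 3}
renamed = {"cost": 10.0, "quantity": 3}   # some writer used "cost" instead


def try_total(order):
    """Show where the schema violation surfaces: at read time, in the reader."""
    try:
        return order_total(order)
    except KeyError as missing:
        return f"missing field: {missing.args[0]}"
```

The store happily accepted both records. The schema still exists, but it lives in the reading code, and a mismatch only shows up when that code runs.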
So I've talked about two data models, key-value and document, and I've presented
them as two quite different things, but actually the line between these two is a
hell of a lot more fuzzy than that. Many key-value data stores allow you to store
metadata about the value. This of course allows you to build more complicated
indexes. I mean, if you want to get all the orders for a particular customer, you
don't want to search every order in the database to find them, the moral
equivalent of a table scan; you want to index that. So key-value databases
typically allow you to store various metadata things, which kind of makes them
feel a bit like document databases. And then on a document database, yeah, you
can do all sorts of queries against the thing, but often there's an ID, and often
when you actually look something up you do it by saying: give me the thing with
that particular ID. And that ID is effectively the same as the key in a key-value
store. So the boundary between a key-value and a document database is, as I said,
somewhat blurry, and I've often heard a particular database sometimes described
as key-value and sometimes described as document. In reality I wouldn't worry too
much about the difference between them. Think of it as a kind of first
approximation to work with, but it's not actually that important.
What is important, though, is that both key-value and document databases have
this common notion that you're taking some complex structure that you can save as
a single unit into the database. Whether it be in a relatively transparent
document or a completely opaque value, that notion still exists, and that
commonality made me think that we really need some term to describe databases
that work kind of like that. And so for the book I came up with the term
aggregate-oriented database: one that allows you to store these big complex
structures.
Where did the term aggregate come from? It comes from this book here, written by
Eric Evans, Domain-Driven Design. How many people have read Domain-Driven Design?
Hopefully a good few of you. It's an excellent book; it really talks about how to
think about modeling domains. One of the key concepts in the early part of
Domain-Driven Design is that often when we want to model things we have to group
things together into natural aggregates, because when we're talking to a
database, even a relational database, it makes sense to think of those aggregates
when storing and retrieving data. If we're modeling orders, for instance, usually
we'll have separate classes for the orders and the line items. That's a pretty
standard Object Modeling 101 model. But we think of the order as a whole thing, a
single unit. So an aggregate may be many objects in many classes, it may be quite
a complex structure, but when we're talking about persisting it or retrieving it,
we think of it as one thing to pass back and forth. Now in a relational database
we have to splatter that aggregate across a whole bunch of tables, but the nice
thing about an aggregate-oriented database is that we can save that aggregate as
a single unit in the terms of the database itself.
So for a key-value database the aggregate is the value; in a document database
the aggregate is the document. That becomes the single unit that we move back and
forth, and I certainly find this is a much easier way to think about the
commonality of these classes of databases.
Now the third data model I want to briefly describe is that of column-family
databases. This is a bit more complicated a data model, but it is another
aggregate-oriented database. The column-family database basically says we have a
single key, they call it a row key, and then within that we can store multiple
column families, where each column family is a combination of columns that kind
of fit together. The column family here is effectively your aggregate, and you
address it by a combination of the row key and the column family name.
Now column families can also look different. Look at the lower one here: it's
effectively a list of items, the various orders for a customer. So that doesn't
feel so much like the typical record structure that you might know about, but it
is of course the same as storing an array in a document or something of that
kind. So again you get that kind of rich structure that you can build in here.
Column-family databases give you a slightly more complex data model to work with,
but the benefit you get is again in terms of retrieval: you can more easily pull
individual columns and things of that kind. But again the broad data model is
that of an aggregate-oriented picture.
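The addressing scheme described above (row key, then column family, then column) can be pictured as nested maps. This is only a sketch in plain Python; the row key, the family names "profile" and "orders", and the two helper functions are assumptions for illustration and do not reflect how any particular column-family database lays out its storage.

```python
# row key -> column family name -> columns
rows = {
    "customer:1234": {
        # A record-like family: a fixed handful of named columns.
        "profile": {"name": "Ann", "billing_address": "1 Main St"},
        # A list-like family: one column per order for this customer.
        "orders": {"order:1001": 67.5, "order:1002": 40.0},
    }
}


def get_family(row_key, family):
    """Fetch one aggregate: addressed by row key plus column family name."""
    return rows[row_key][family]


def get_column(row_key, family, column):
    """Pull out a single column without touching the rest of the family."""
    return rows[row_key][family][column]
```

The two families under the same row key show the two shapes mentioned above: one reads like a record, the other like an array stored in a document.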
So the great thing about this is that now, when you're taking your aggregate in
memory, instead of spreading it across lots of individual records you get to
store the whole thing in the database in one go, and the database knows what your
aggregate boundaries are. Now this is interesting.
Where it becomes really useful is when we talk about running the system across
clusters, because if you distribute data, what you want to do is distribute the
data that tends to be accessed together, and the aggregate tells you what data
is going to be accessed together.
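Using the aggregate as the unit of distribution can be sketched as follows. The three node names, the hash-based node_for placement rule, and the save and load helpers are assumptions chosen for the example; real clustered stores use their own placement schemes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]        # hypothetical cluster
cluster = {name: {} for name in NODES}        # each node's local storage


def node_for(aggregate_key: str) -> str:
    """Pick the single node that owns this aggregate, based on its key."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def save(key, aggregate):
    # The whole aggregate (order plus line items) lands on one node.
    cluster[node_for(key)][key] = aggregate


def load(key):
    # One node, one lookup: no gathering of rows from around the cluster.
    return cluster[node_for(key)][key]


save("order:1001", {"customer": "Ann",
                    "line_items": [{"product": "book", "quantity": 2}]})
```

Because the order and its line items travel together, reading the order never needs more than the one node that node_for returns.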
So by placing different aggregates on different nodes across your cluster, you
know that when somebody says, "give me the details about this particular order",
you're only going to go to one node on the cluster instead of shooting around
goodness knows how many to pick up different rows from different tables. So
aggregate orientation naturally fits in very nicely with storing data on large
clusters, and that's of course part of the whole thing with Bigtable and Dynamo.
They both effectively went for an aggregate-oriented approach: Bigtable very much
a column-family style approach, Dynamo much more a key-value store. It makes
running on clusters efficiently way more straightforward, and that's really been,
as I said, the driving factor here. However, nothing is perfect, and aggregate
orientation isn't always a good thing.
Let's imagine we've got our order system and we want to look at the data like
this: we want to say, given a particular product, tell me the revenue, tell me
its past revenue. We now don't care about orders at all. We only care about
what's going on with the individual line items of many orders, grouping them
together by product. Effectively what we're doing is saying we want to change the
aggregation structure from one where orders aggregate line items to one where
products aggregate line items. The product now becomes the root of the aggregate.
Now in a relational database this is straightforward: we just query the data
differently. It's very straightforward to rearrange the data into the structures
we might want in different cases. With an aggregate-oriented database it's a pain
in the neck. You can do it, and what they'll typically do is run MapReduce jobs
to rearrange all your data into different aggregate forms, and probably keep
those persistent, or maybe even do incremental updates, but it's always going to
be more complicated.
So being aggregate-oriented is an advantage if most of the time you're using the
same aggregate to push data back and forth into persistence.
It is a disadvantage if you want to slice and dice your data in different ways.
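The re-aggregation by product can be sketched in the map-and-reduce style mentioned above. This is plain single-process Python, not a real MapReduce framework; the sample orders and the names map_order and reduce_revenue are hypothetical.

```python
from collections import defaultdict

orders = [
    {"id": 1001, "line_items": [
        {"product": "book", "quantity": 2, "price": 30.0},
        {"product": "pen", "quantity": 5, "price": 1.5},
    ]},
    {"id": 1002, "line_items": [
        {"product": "book", "quantity": 1, "price": 30.0},
    ]},
]


def map_order(order):
    """Map step: emit (product, revenue) for every line item of one order."""
    for item in order["line_items"]:
        yield item["product"], item["quantity"] * item["price"]


def reduce_revenue(pairs):
    """Reduce step: total the emitted revenue per product."""
    totals = defaultdict(float)
    for product, revenue in pairs:
        totals[product] += revenue
    return dict(totals)


revenue_by_product = reduce_revenue(
    pair for order in orders for pair in map_order(order)
)
```

Every order aggregate has to be opened up and its line items regrouped under a new root, which is why this is a batch job here and only a different query in a relational database.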
So what I've done so far is cover some of these models, basically taking the
document, column-family and key-value models and lumping them together under this
aggregate-oriented category, and I think that's a useful abstraction, at least at
the level of what I can say in 15 minutes.
There's one very noticeable outlier that you see, though, and that is graph
databases. Graph databases are not aggregate-oriented at all. They use a
completely different data model. A graph database's data model is basically that
of a node-and-arc graph structure, not a bar chart or anything like that, but
just nodes and arcs, something that hopefully you're familiar with, at least from
a few boring computer science classes. The nice thing about a graph database is
that it's very good at handling moving across relationships between things.
With relational databases, you might think, with the word relation in there, that
they're good at handling relationships, but of course relation doesn't mean
relationship; it means something in set theory. Actually relational databases are
not terribly good at jumping across relationships. You have to set up foreign
keys and you have to do joins, and if you do too many joins you can get into a
mess. If you've modeled a graph structure, or a hierarchy, a special form of
graph structure, in a relational database, you'll have had this experience.
It's not straightforward; relational databases aren't good at this.
So graph databases come in and say: yeah, we can handle jumping around
relationships left, right and center. We make it easy to do, and we optimize to
make it fast to do that kind of thing.
Furthermore, we can come up with an interesting query language that is designed
around allowing you to query graph structures. This kind of query here, this is
Cypher from Neo4j, is all about saying: given a certain graph structure, let me
use that graph structure to express a more complex query. You can do some very
interesting graph-oriented queries in graph databases, things that would be very,
very difficult to write in terms of SQL, as well as being a pig in terms of
performance.
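The query shown on the slide is not reproduced in this transcript, so the following is only a plain Python sketch of the underlying idea, hopping across arcs from a starting node. It is not Cypher and not Neo4j's API; the edges map, the names in it, and the within_hops function are invented for illustration.

```python
from collections import deque

# Nodes and arcs: each name maps to the nodes it points at.
edges = {
    "Anna": ["Barbara", "Carol"],
    "Barbara": ["Dawn"],
    "Carol": ["Dawn", "Elizabeth"],
    "Dawn": [],
    "Elizabeth": [],
}


def within_hops(start, max_hops):
    """Nodes reachable from start in at most max_hops arcs (breadth-first)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue                      # do not expand past the hop limit
        for neighbour in edges[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return {name for name, hops in seen.items() if hops > 0}
```

In a relational schema each extra hop would typically mean another self-join on a relationship table, whereas here it is just one more step of following arcs.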
So in many ways we have gone in opposite directions. Aggregate-oriented databases
take a lot of stuff that's scattered around and put it into bigger lumps, while
graph-oriented databases break things apart into even smaller units and let you
play with those smaller units more carefully. You can still model relationships
in aggregate-oriented databases, just as you can in relational databases; you
basically refer to IDs in different documents, but it's a lot more messy. And so
part of your decision as to whether a NoSQL database is going to be interesting
to you is: how do you work with your data?
Do you tend to work with the same aggregates all the time? That would lead you
towards an aggregate-oriented approach.
Do you want to really break things up and jump across lots and lots of
relationships in a complex structure? That would lead you to a graph approach. Or
is a tabular structure working well for you?
In which case you want to stay with a relational approach.
So NoSQL divides into those two categories. All of these are schemaless, so the
graph databases as well allow you to add any bits of data to any node. You have
all that flexibility, but with the same caution about implicit schemas as well.
So that is kind of half of the picture, the data model part.
Now I'm going to move on to another issue, which is about consistency, and
effectively dealing with lots of people trying to modify the same data at the
same time. You've probably heard something like this: relational databases are
ACID, they do the familiar ACID transactions that we all know and love (atomic,
consistent, isolated, durable), and NoSQL, yeah, they don't do any of that kind
of thing. And of course the NoSQL people say: oh, we do BASE, which is an even
more contrived and meaningless acronym than ACID, and I won't even attempt to
tell you what it is, because I can only remember what it is on Tuesdays.
So basically what it boils down to is this: if you've got a single unit of
information and you want to split it across several tables, what you don't want
is to be caught in a position where you only get to write half the data and
somebody else reads it, or you get to write half the data and somebody takes the
same order and writes a different half of the data, and things get really messy.
In that kind of situation you need to have a mechanism that effectively gives you
atomic updates, and that's really what transactions are all about: atomic
updates, so that you either succeed or fail and nobody comes in the middle and
messes things up.
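Here is a minimal sketch of an atomic multi-table update, using Python's standard sqlite3 module with an in-memory database. The two-table schema and the save_order function are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE line_items (order_id INTEGER, product TEXT NOT NULL)")
conn.commit()


def save_order(order_id, customer, products):
    """Write the order row and all its line items, or nothing at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO orders VALUES (?, ?)",
                         (order_id, customer))
            for product in products:
                conn.execute("INSERT INTO line_items VALUES (?, ?)",
                             (order_id, product))
        return True
    except sqlite3.IntegrityError:
        return False


ok = save_order(1, "Ann", ["book", "pen"])
failed = save_order(2, "Bob", ["book", None])   # None violates NOT NULL
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
item_count = conn.execute("SELECT COUNT(*) FROM line_items").fetchone()[0]
```

The second order fails halfway through, after its order row and first line item were already written, yet neither survives: the update either succeeds as a whole or leaves no trace.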
Now when it comes to our nicely organized set of NoSQL databases, the first thing
to point out is that graph databases do tend to follow ACID updates, which makes
sense: they decompose the data even more than relational databases do, so they've
got even more of a need to make sure they use transactions to wrap things
together. So if anybody tells you "NoSQL databases, they don't do ACID", you now
know to immediately jump in: ah, but the graph databases do.
Now aggregate-oriented databases actually don't need transactions as much,
because the aggregate is a bigger, richer structure. In fact, if you read the
Domain-Driven Design book, one of the things it points out is that the aggregates
in domain-driven design are transaction boundaries. You shouldn't let
transactions cross aggregate boundaries, because if you do it will just be
complicated to manage the concurrency of your system. So the domain-driven design
community, from the beginning, even before NoSQL, said: keep your transactions
within a single aggregate. That's effectively what you do in the world of
aggregate-oriented databases. Any aggregate's update is going to be atomic, it's
going to be isolated, it's going to be consistent within itself. It's only when
you update multiple documents in a document-oriented database that you have to
worry about the fact that you haven't got ACID transactions, but that problem
occurs much more rarely than you'd think.
So that's the first thing about ACID versus BASE: some databases are fully ACID
anyway, and the aggregate-oriented databases that aren't are ACID within their
aggregates, which is kind of what really matters. But there is also a bit more to
thinking about consistency even than that, because even in a relational world
ACID transactions don't mean we get to be completely consistent and don't have to
worry about update anomalies. I will walk you through what hopefully is a very
familiar scenario to point this out, and also to illustrate how you deal with
some of this.
So imagine we have some typical multi-layered system. We've got a person talking
to a browser, the browser talks to the server, the server talks to a single
database, and we're going to have two people talking to the same data in the same
database at the same time, although through different browsers and servers. Here
is the basic little scenario we begin with.
Both people, left and right, take the same piece of data with a GET request.
Essentially they bring it up onto the screen, and now the human being goes: hmm,
I need to make some changes to this. And eventually the guy on the left (I was
getting my left and right confused) says: okay, I've got my updated data, let's
post some changes. And then shortly afterwards the guy on the right says: I've
got my data now, let's post some changes.
Now of course if we let that happen just like that, we get a conflict.
This is a write-write conflict: two people have updated the same piece of
information, they weren't aware of each other's update, and they've got
themselves in trouble.
ACID to the rescue, right? What do we do?
Well, what we'd have to do to prevent this conflict is wrap the entire
interaction, from getting the data onto the screen to posting it back again, in a
transaction. That way the database will ensure that we don't get a conflict.
Effectively one of them will be told: no, you've got to do this again, retrieve
your data again. We don't get conflicts; problem solved.
How many people do this on your production systems? Yeah.
Occasionally you can get away with this. Most of the time you can't. Why? Because
holding a transaction open for that length of time, while you've got a user
looking at and updating the data through a UI, is going to really suck the
performance out of your system.
All right, and I want to stress that you can do this in some circumstances. If
your performance needs are really very minor, you know, you've only got a handful
of people using the system at once, you might be able to get away with this
approach, and it is advantageous to do so, because a whole lot of problems go
away if you do this. But in most systems you can't afford to hold transactions
open that long, and most people who write about building systems like this will
tell you never to do this.
Don't hold transactions open for user interaction. What they say instead is that
you just wrap the transaction around that last bit of updating the database. And
that's a good thing, because it stops a collision where one half-done update
mixes up with another half-done update, where you get some tables updated over
here and some different tables updated differently over there, and the result is
an inconsistent mess. But you still effectively get a conflict, because the two
people made updates on the same piece of information without knowing the other
person did that. And this is what typically might happen even in an
aggregate-oriented database if you have to modify more than one aggregate,
because you might find one person modifies the first one and then goes over to
the second one, and the other person does it the other way around, and as a
result you end up with an inconsistency between aggregates. Now if you've come
across this, which you probably have, you probably also know how to solve it. You
basically use a technique which in one of my previous books I referred to as an
offline lock. Basically what that means, and a usual way of implementing it, is
that you give each data record, or each aggregate at least, a version stamp, and
when you retrieve it you retrieve the version stamp with the aggregate data. When
you post, you provide the version stamp of what you read, and then for the first
guy everything works out okay and the version stamp gets incremented. Then when
the second person tries to post, they've still got the old version stamp, and
then you know something's up, and you can do whatever conflict resolution
approach you take. You use the same basic technique again when you're working
with a NoSQL database.
The nice thing is you don't have to worry about transactions for this problem so
much, because the aggregate gives you that natural unit of update; it is your
transaction boundary. But once you cross aggregates, then you've got to think
about juggling version stamps and doing something of that kind.
But it's not really very different to what you have to do with a relational
database, because offline locks force you to do this juggling with version stamps
anyway. So yeah, you don't get these ACID transactions to the same degree that
you do with a relational database, but the impact is not as great as some people
think, because we actually have to deal with this stuff all the time anyway.
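The version-stamp technique described above can be sketched like this. The VersionedStore class, its method names, and the ConflictError exception are hypothetical, written for this example; in a real system the version check and the write would also need to happen atomically.

```python
class ConflictError(Exception):
    """Raised when a post is based on a stale version stamp."""


class VersionedStore:
    def __init__(self):
        self._data = {}                 # key -> (version stamp, aggregate)

    def create(self, key, aggregate):
        self._data[key] = (1, aggregate)

    def get(self, key):
        """Return the version stamp together with the aggregate data."""
        return self._data[key]

    def post(self, key, read_version, aggregate):
        current_version, _ = self._data[key]
        if read_version != current_version:
            # Somebody else updated since this caller read the data.
            raise ConflictError(f"{key}: read v{read_version}, "
                                f"now v{current_version}")
        self._data[key] = (current_version + 1, aggregate)


store = VersionedStore()
store.create("order:1001", {"quantity": 1})

left_version, left_copy = store.get("order:1001")     # both people read...
right_version, right_copy = store.get("order:1001")

store.post("order:1001", left_version, {"quantity": 2})   # first post succeeds
try:
    store.post("order:1001", right_version, {"quantity": 5})
    second_post = "accepted"
except ConflictError:
    second_post = "conflict"        # stale stamp: re-read and resolve
```

The second writer is not silently overwritten and does not silently overwrite; what to do next (retry, merge, ask the user) is the conflict resolution policy, which this sketch leaves open.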
now when we talk about consistency
I find it useful to think about actually two kinds of consistency the consistency
have been talking about so far is what I call logical consistency
these consistency issues occur whether you're running on a customer of machines
or whether you're running on one single machine
you always have to worry about these kinds of consistency issues
now when you start spreading data across multiple machines
this can introduce more problems when it comes to distributing data broadly you
can talk about it in two different ways money sharding data taking one copy of
the data and putting it on different machines so that each piece of data
lives in only one place but you're using lots of machines sharding doesn't really
change the picture very much you still get the same logical consistency
problems that you do with a single machine
they are exacerbated to some degree but the basic problems are still the same
another thing however that's common to do with clusters of machines is to
replicate data to put the same piece of data in lots of places
this can be advantageous in terms of performance because now you've got more
nodes handling the same set of requests
it can also be very valuable in terms of resilience if one of your nodes goes
down the other replicas can still keep going
so hence we'll talk a lot about availability and resilience with these
cluster-oriented approaches
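The difference between sharding and replication can be made concrete with a small sketch. Everything here is an assumption for illustration: the three node names, the CRC32-based placement, and the `room-101` key are hypothetical, and real databases use their own placement schemes.

```python
import zlib

NODES = ["node-a", "node-b", "node-c"]


def shard_for(key):
    """Sharding: each key lives on exactly one node, chosen by a stable hash."""
    return NODES[zlib.crc32(key.encode("utf-8")) % len(NODES)]


def replicas_for(key, copies=2):
    """Replication: the same key is stored on several consecutive nodes."""
    start = zlib.crc32(key.encode("utf-8")) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(copies)]


def readable_from(key, down, copies):
    """Nodes that can still serve the key when the nodes in `down` have failed."""
    return [node for node in replicas_for(key, copies) if node not in down]


key = "room-101"
home = shard_for(key)

# With a single copy, losing the home node makes the key unreadable.
sharded_survivors = readable_from(key, down={home}, copies=1)

# With two copies, the other replica can keep going.
replicated_survivors = readable_from(key, down={home}, copies=2)
```

The resilience gain is visible in the last two lines; the price is that two copies of the same data can now disagree.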
however as soon as you replicate data a new class of consistency problem starts
coming in and again I'll illustrate it with a simple example
so here we have two people myself and my co-author Pramod and we both want
to book a particular hotel room and so we send in a booking request and we
happen to be on different continents Pramod's in India I'm in the
US we send our requests to our local processing nodes
now the processing nodes at this point need to communicate
they need to work out what's going on here and the system as a whole needs to come
up with some kind of decision
essentially ensuring that one of us has to sleep on the streets in this case
this is what happens 99.99 whatever percent of the time
well let's take a kind of variation on this example again we both want to book
a hotel room
but now the communication line has gone down the two nodes cannot communicate we
send in our requests
what happens well actually there's two broad alternatives alternative one is
the system says our communication line's gone down sorry we can't take your
hotel bookings at the moment please try again later
the alternative is the system says yes we'll accept your booking thank you very
much
because we're really reliable and up-to-date and all the rest of it and
they proceed to double book the hotel room
I'm not that friendly with Pramod I mean we're good friends but you know there are
limits we may not want to share that hotel room
so basically what we're seeing is a choice
it's a choice between consistency which means no I'm not going to do anything
while the communication line's down and availability which says yes I'm going to
keep going but at the risk of introducing inconsistent behavior
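The two alternatives in the hotel example can be sketched as a flag on each processing node. This is a toy model under stated assumptions: `BookingNode`, `prefer_consistency`, and the two-node setup are hypothetical names for illustration, and a single shared room is assumed.

```python
class BookingNode:
    """One processing node that takes bookings for a single hotel room."""

    def __init__(self, name, prefer_consistency):
        self.name = name
        self.prefer_consistency = prefer_consistency
        self.peer = None
        self.link_up = True
        self.bookings = []  # guests this node has accepted

    def book(self, guest):
        if self.link_up:
            # Normal case: check with the peer before accepting.
            if self.bookings or self.peer.bookings:
                return "rejected: room already taken"
            self.bookings.append(guest)
            return "accepted"
        if self.prefer_consistency:
            # Consistency: refuse to act while the peer cannot be reached.
            return "unavailable: try again later"
        # Availability: decide on local knowledge only and risk a double booking.
        if self.bookings:
            return "rejected: room already taken"
        self.bookings.append(guest)
        return "accepted"


def run(prefer_consistency):
    india = BookingNode("india", prefer_consistency)
    us = BookingNode("us", prefer_consistency)
    india.peer, us.peer = us, india
    india.link_up = us.link_up = False  # the communication line has gone down
    results = (india.book("Pramod"), us.book("Martin"))
    return results, india.bookings + us.bookings


consistent_results, consistent_bookings = run(prefer_consistency=True)
available_results, available_bookings = run(prefer_consistency=False)
```

With `prefer_consistency=True` nobody gets the room during the outage; with `False` both requests are accepted and the room ends up double booked.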
the vital thing here to realize is that this is a choice and it's a choice
that can only be made by knowing about the business rules the domain rules
that you're working with
I mean it may sound really awful to say oh we're going to double book a hotel
room
possibly with complete strangers I mean that would be bad but actually maybe
the hotels have ways of dealing with this
maybe they have a block
of rooms that they always keep available till the last moment for emergencies
they can just use one of them or maybe they just send an apologetic groveling
letter and some frequent sleeper points out to try and make me happy
there's various ways in business that people will deal with inconsistencies as
they crop up
now I'm not saying you should always go for availability over consistency but
what is true is that it's always a domain choice
it is the business people who will have to decide what's more important the risk
of double booking the last room in the hotel or the fact that we have to bring
down the site and say sorry we can't accept any orders at the moment which is
kind of bad for business
this is one of the things that drove Dynamo Amazon wanted to make sure that the
shopping cart was always available you can always put things in the shopping cart
why is this because it's America what's the most important thing to do in
America shopping
we must maintain our retail destiny
we must always be able to shop and what happens
you look you come to check out you go why is this item in here twice or I'm sure
I put the so-and-so in here now computers don't make mistakes let me
just fix it or the worst could happen you actually send out the order and get
duplicate stuff you ring up Amazon
we're sorry sorry sorry and you get it all back
that's much better than actually someone not being able to shop for a few seconds
so the point is it's a business choice so this then ties into something you'll
hear endlessly about whenever someone talks about this stuff which is the CAP
theorem
everybody has heard of the CAP theorem how many people understand the CAP
theorem
some of you it's actually pretty straightforward
it's usually described very badly well not very badly but I don't think
it's terribly useful as there are these three concepts up here and you get to
pick any two
this is true but i think it's easier to reformulate it
it's a bit clearer if you say if you've got a system that can get a network
partition which basically means communication between different nodes in
the cluster breaking down
and if you have a distributed system by the way you are going to get network
partitions
if you get a network partition you have a choice do you want to be consistent or
do you want to be available
that's really what the CAP theorem boils down to if you've got a single database
running on a single server
it's not going to partition you don't have to worry you can be as available as that
node is and you're going to be consistent you can maintain everything but as
soon as you have a distributed system you have to make that choice but that
isn't a single binary choice right across your system
you actually have a spectrum you can go for a certain amount you can actually
trade off levels of consistency and availability
I'm not going to go into how just trust me you can furthermore it can vary
depending on the particular operation you want to do certain operations can be
highly consistent certain other operations can be highly available any
of the databases that do this kind of stuff will give you all the knobs and
tweaks to do this
and so you're going to have to learn how to trade them off and actually most of
the time when you are trading off consistency vs availability it's not availability
that's the issue and it's not even dealing with network partitions that's
the issue a lot of the time
what you're doing is you're trading off consistency vs response time because
what's happening is the more you want to have consistency across a cluster of
nodes that means the more nodes have to get involved in the conversation again
think of that hotel case the two nodes had to communicate that's going to slow down
the response time
so you might say even if the network is up you know I'm gonna let each node book its
own hotel stuff and sort it out later
even with the network up that'll still get you a faster response time rather than
doing all the communication I need to persist with the consistency
and again that's a business decision
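The consistency versus response time trade-off can be illustrated with a simple model that is not from the talk: assume a request is sent to every node in parallel and completes once a required number of them have answered. The function name and the latency figures below are hypothetical.

```python
def response_time(node_latencies_ms, nodes_required):
    """Time until `nodes_required` nodes have answered, contacting all in parallel."""
    if not 1 <= nodes_required <= len(node_latencies_ms):
        raise ValueError("nodes_required must be between 1 and the number of nodes")
    # The request finishes when the slowest of the required answers arrives.
    return sorted(node_latencies_ms)[nodes_required - 1]


# Hypothetical latencies: one local node and two on other continents.
latencies = [5, 120, 180]

local_only = response_time(latencies, 1)  # answer locally and sort it out later
majority = response_time(latencies, 2)    # wait for two of the three nodes
everyone = response_time(latencies, 3)    # every node joins the conversation
```

Under this model, each extra node that has to take part can only keep the response time the same or make it worse, which is the point made above about involving more nodes in the conversation.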
another thing that amazon said was we want to always get people shopping fast
because what's the most important thing in America shopping
so therefore we want really rapid times and even if all the nodes are available
and we could give you a completely consistent solution we want to be quick
and also it helps that
merging shopping carts dealing with the inconsistency of shopping carts is
relatively easy
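One simple way to merge two divergent copies of a cart is to keep everything that appears in either copy. This is only a sketch under that assumption, not a description of how Amazon actually does it; `merge_carts` and the cart contents are hypothetical.

```python
from collections import Counter


def merge_carts(cart_a, cart_b):
    """Merge two divergent cart copies, keeping every item at the larger quantity seen."""
    # Counter's | operator takes the maximum count for each key.
    return dict(Counter(cart_a) | Counter(cart_b))


# The same cart as seen by two nodes that could not talk to each other.
cart_on_node_1 = {"book": 1}
cart_on_node_2 = {"book": 1, "lamp": 2}

merged = merge_carts(cart_on_node_1, cart_on_node_2)
```

A merge like this never loses an added item, but an item removed on one node can reappear if the other node still has it.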
well there's this over here and that over there well clearly they wanted
both because this is America
everybody wants everything forget taking stuff out of shopping carts
why would we want to encourage any of that this is a broader trade-off
in terms of computing this is really just another aspect of the general
concurrency trade-off between safety and liveness and if you've gone to
concurrency classes and you've heard people talk about that
really this should actually seem fairly familiar in those kind
of terms now what I really wanted to do with this little segment on consistency
was focus on giving you a feel for how consistency is different in
particularly the aggregate oriented NoSQL world as opposed to how you may
have thought about consistency so far there's a lot of topics i could have
talked about here that I just haven't got time to talk about the important
thing to go away with is realizing that you have to think about consistency
issues differently
essentially because you've got this different data model and the
possibility of replicated data and in particular you have to think
about it in terms of this consistency
availability trade-off and that it's not up to just us as techies to make that
decision
it is actually up to the way the business works as to where we
make these trade-offs and if you want more
well I'm going to tell you to buy my book anyway so you know what to do
ok so the last little segment I'm going to talk a bit about when and why you
might want to use a NoSQL database and the way I think of it is there are
two drivers that push us towards a NoSQL database
the first one is the one that I've already talked about as the real driver
for the whole NoSQL movement itself and that is you've got to deal
with large amounts of data you've got more data than you can comfortably or
economically
fit onto a single database server
you're going to have to deal with some pain you can either take the pain of
trying to run a relational database across the cluster or you can go into
this new NoSQL stuff and you know most of the time I think I'd go with the
NoSQL stuff because running relational databases across clusters is
still somewhat of a black art
so big amounts of data is a big issue now some people have said and one of
the review comments on my book was yeah but only very few organizations have to
worry about this stuff
if you're google and amazon yes but not pretty much everybody else
now as i read that what i heard in my head was 640K is enough for almost
everybody the reality is there's tons of data coming at us and every
organization is going to be capturing and processing more and more data
so this large-scale data problem is only going to grow and that is a
factor but actually this is not the main reason i think why most people go into
NoSQL
there was a survey in the track on Monday that pointed out that
most people actually aren't interested in big amounts of data for NoSQL
databases
what they want to do is they want to be able to develop more easily
so a good example of this is I have some friends who work on the guardian
newspaper and website
many people heard of the Guardian good english language newspaper many of you
good and yep they're dealing with articles
they're saving articles updating articles pushing articles back and forth
the article to them is a natural aggregate spreading that article's data
and metadata across a relational database
it's a pain in the neck it's awkward but taking it as a single thing a single
article and pushing it into the database
that's much more straightforward the fact of the matter is the impedance mismatch problem
is drastically reduced if you've got a natural aggregate and many of the
projects that I've talked to in ThoughtWorks that have used a NoSQL
database have gone that route
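The article example can be sketched as follows. The field names, the `article-42` key, and the dictionary standing in for an aggregate-oriented database are all hypothetical; the table layout at the end is just one plausible relational split, shown for contrast.

```python
import json

# One article as a single aggregate: data and metadata travel together.
article = {
    "id": "article-42",
    "headline": "Example headline",
    "body": "Example body text.",
    "metadata": {"section": "technology", "tags": ["nosql", "databases"]},
}

store = {}  # stand-in for an aggregate-oriented database: key -> serialized aggregate


def save(aggregate):
    # The whole article goes in as one unit under one key.
    store[aggregate["id"]] = json.dumps(aggregate)


def load(key):
    # The whole article comes back as one unit, with no joins.
    return json.loads(store[key])


save(article)
loaded = load("article-42")

# The same data spread relationally needs several tables that must be joined back together.
relational_rows = {
    "articles": [("article-42", "Example headline", "Example body text.")],
    "article_sections": [("article-42", "technology")],
    "article_tags": [("article-42", "nosql"), ("article-42", "databases")],
}
```

The in-memory shape and the stored shape match in the aggregate version, which is the sense in which the impedance mismatch is reduced.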
they have said our data model doesn't really fit very well with relational
one of these NoSQL options is better
it might be a natural aggregate in which case we go the aggregate oriented route
or it might be we've got something that feels much like a graph structure so we
go the graph database route and that I think is the most common reason at the
moment why people are using NoSQL databases
because you're effectively getting rid of that
impedance mismatch problem
now of course that raises a question that was of course the promise of object
databases
they were going to get rid of the impedance mismatch problem but they got
clobbered because databases are being used for integration
why is that same problem not hitting us now
well it is hitting us but it's greatly reduced because now more and more people
are saying we don't want to integrate that way we want to hide our databases
inside a broader application or service and then we want to use some kind of
service oriented interaction between the two which may be web services
it can be something as really disgusting as SOAP on ESBs with god knows what
thrown in but the point is applications are controlling access to data
and if you're in a scenario where you can do that where you can
effectively encapsulate the database then the integration issue becomes a lot less
serious and that I think is a very important enabler to make it possible
for NoSQL databases to thrive and this is a good practice anyway even if
you've got relational databases
you do not want to be integrating through integration databases they
cause no end of trouble
believe me if you haven't experienced it yourself it's so much better to try and
encapsulate something like that
and if you're going to do that then you've got much more freedom for what
database to use and I think that's going to be a real driving force towards
this
another thing that's encouraging people to use these databases is to deal with
analytics we all know about data warehousing the usual data warehousing
project as far as i can tell is that a salesman turns up from one of the big
companies and says oh you want to do data warehousing well here is this project
plan by which every piece of data you could possibly have in your organization
is all put into one place so that everybody can get it easily
and it's a multi-year project with lots and lots of very diverse
stakeholders we know that story
I mean have people come across these big data warehousing projects that they felt have
succeeded
there's usually one or two no one is prepared to admit it are you prepared to
admit it but most of them go badly
what we look for instead is a different approach that says let's particularly
focus on one particular problem and see how do we grab the data for that
and the data by the way might not be in well-known relational or even NoSQL
stores it might be scattered around log files or you know what
truly runs most enterprises which is Excel spreadsheets
let's get that data and let's poke it and pull it together and NoSQL
databases play an important role in this
the graph databases allow you to easily do graph like analytics on the database
which is really quite nice
the aggregate oriented databases are generally less good at this because they
can't slice and dice so well
but what they can do is store large quantities of data
so if you are pulling stuff off devices or log files or the like
then they become very attractive and of course that's what's given a big
advantage to the likes of Amazon because they're able to mine all this
information
so with all of this does this mean that NoSQL is the future of databases
that relational databases are going to disappear and we're all going to be
doing
NoSQL stuff i don't think so i think really the future is something that i
refer to as polyglot persistence and what this means is we think that there's
going to be room for lots and lots of different kinds of databases with
relational databases still playing a big role if you're building an
application
maybe you'll use lots of different databases as part of your application
certainly across an organization you'll use lots of databases and what you're
doing is you're choosing the appropriate database for the nature of the problem
that you're working with and because there are different natures of problems
for different data stores the idea that whatever your problem is the
answer is a relational database will go away
now this is great it gives us lots of opportunities for the future but as
every cynic knows every opportunity is really a problem and there are plenty of them
you've now got to think about this kind of stuff you've got to decide what is the
appropriate NoSQL database for a problem
you've got to deal with organizational issues relational DBAs are not gonna
like this
in fact for some people that's a big advantage but let's not go there
NoSQL databases are immature we don't have the tools and the experience
and the knowledge of how to work with them well that we've had from 20 years of
relational databases and all of these consistency issues can still end up
biting you
so when it comes to what kind of project
I'd start with the drivers if you want rapid time-to-market fast cycle
time you need to be quick ease of development is really important
therefore if you can get that with NoSQL databases that's a reason to go
with them
similarly if you've got a very data intensive project then obviously
NoSQL's ability to deal with large amounts of data is very important but i
think there's another overriding point as well which is is your project really important
to the competitive advantage of your business what I refer to as a strategic
project because if it's a strategic project then it's worth taking on the
extra risk the unknowns of dealing with an immature and not so well known
technology which is what NoSQL is
if on the other hand you've got a project that's what i call a utility
project it's kind of straightforward it's not really vital to the business's
operation then that's maybe not the best place to bring in an unknown like
this in that kind of situation you're probably better off with the familiar
at least for a few years but there's lots of strategic projects out there and
certainly our experience over the last two or three years at ThoughtWorks has been
very positive with NoSQL databases I've heard remarkably few complaints and
ThoughtWorkers always complain about what they're working with so I certainly am
very much convinced that NoSQL databases have
an important part to play in the spectrum of future developments and the rest of
the talks in this track will explore different ways in which they've been
used so i hope you found that helpful if you want more depth
the book is very thin my target was a hundred and fifty pages and I only
missed it by two so 152 pages quick overview a bit more than what I just
gave you and I hope that will be handy I've also got a page on my website where
I collect together various other things that I've done or talked about in terms
of NoSQL and thank you for listening to me