Reusable Execution in Production Using Papermill (Google Cloud AI Huddle)

[MUSIC PLAYING] MATTHEW SEAL: Thanks
for the nice intro, and glad to be here
and talk and get to share a little bit about
what we’ve been working on in the open source community
and how we collaborate some with the Google Cloud team. If you’re a designer
in the room, close your eyes for
one second because I’m about to do
presentation blasphemy. There we go. Now you can open them. We’re on a new set of slides. So here today, I’m going
to talk a little bit about a topic around notebooks
and explain a little bit, if you don’t know what
notebooks are, what they are, and then how we’ve been using
this to move notebooks closer to a production tool and make it
usable with a technology called papermill. So I think the bio intro
for me was already there. I work on the big data
orchestration team at Netflix. And so what that really means
I’ll explain in a minute. And yeah. So I’ve done this
talk once before, so hopefully this will go really
nice and we’ll be able to dive a little deeper into things. And I’m excited to connect
with you all afterwards and see what your thoughts are. So to start off, we’re going
to talk a little bit about, what the heck does a
data platform team do? This is surprisingly mysterious,
even among data platform team members sometimes. So I’ll give a
little bit of outline of, at least at Netflix, what
a data platform team does. So the data platform
team is really focused around building
tools and services that get out of the way of
users doing their job. So you have things
on the left side of the slide here where you
maybe have events or user inputs or system metrics
and these things that contribute to this vague
idea called big data. All that data gets collected
up, and you need somehow to get that to the end result
of making business decisions around reports or delivering
information to individuals or making data models for AI. And at the end of
the day, the user just wants to be able to get– as a data engineer or a
machine learning developer, they want to be able to use this
data to make these outcomes. So the data platform
services at Netflix are really about
building the tools so this is easier and easier
for those users to do their job. And so we do a
lot of integration of many different systems
and platforms and tools so that they don’t
impede a user’s ability to achieve that. So what that boils down to is
data platform team, it opens doors, just not that door. All the things I’m going
to talk about today are actually open
source projects. So this is around Jupyter
open source ecosystem and the nteract ecosystem. They’re both building different
tools around notebooks, and they’re intercompatible. They’re not separate projects, on the whole. And all these things
have been contributed by Netflix and by Google
Cloud to get integrations into their platforms,
into their tools, or just as a general
usability for everyone. So real quick, how many of
you know what a notebook is, have used a notebook? How many know what the
difference between a Jupyter notebook and
another notebook is? Cool. We’ve got like four people. So I’m going to talk
a little bit about what the difference is there. And for those of you who
haven’t used a notebook, what the heck are they? So in the most basic
vein, a notebook is– you’ll see something like
this as an interface. [INAUDIBLE] This is a notebook. This is an interface
that basically gives you a broken down chunk of
code and documentation and logs that are
sent and received from some remote
[INAUDIBLE] or local REPL. So you have some
engine that’s going to execute code or execute
documentation requests and return back some display
and outcomes of what happened. So here, in this
example, you see it has this really
basic notebook. It’s going to import
an image, and then it’s going to render that
image in the notebook and then print out some text. Really super basic
stuff, but you can see this is a grouping
of code and execution in a nice, flowing manner. And you can see this as
a different interface for notebooks but
the same technology. These are all Jupyter
notebook interfaces. And in particular, the benefits
of using this technology lie around the fact that you
have code logs, documentation, execution results all
in the same document, and that this document
is shareable to others. This means it’s really useful
for iterative development. It’s really handy for sharing
results with colleagues, and it lets you integrate
various API calls in one central place
that’s well-formed. A little bit about
what’s on this page. What does it represent? What’s it mean? So you have things
like a status or save indicator in the upper right, or
something about the connection to a kernel, which is what’s
actually executing your code. You have the
standard menu items. And you have things
called code cells here. So you see this thing I’ve
outlined is a code cell. It’s a chunk of text
which represents code in a particular
language or framework. Jupyter Notebooks in particular
are actually language agnostic, though most of the
time you’ll see them in Python because
that’s probably the most popular kernel type. And so here you see there’s
some basic Python code. Maybe I want to load in a model
and assert that my model has some version on it. And then, maybe I want to pull
something out of that model and display it. So the second part that’s
really useful in notebooks is the ability to display
data or graphs or images or HTML right inside
the same place where you define the code
so you don’t have to jump between systems to see results. The other really neat
thing about notebooks is you can rerun just one cell. So it keeps track of
the state as they’re executing so you can run
everything linearly in a row, like we’re going to do later. Or if you are iterating
and trying to figure out where the display you’re
trying to get actually is, you can try and look
for it, run the cell. If it fails, try changing it. Rerun the cell. You don’t have to rerun
the model load that happened in the cell before. It’s already been loaded,
and you’re just iterating on a live code session. We talked a little bit
about already the wins here, but mainly it’s this
familiar interface for exploring data and
exploring problems. So how’s this work a bit? So in the Jupyter
space, independent of other types of notebooks,
the Jupyter space actually is really about
defining the protocols, about how to communicate
between the code executor and the front
end or the client that actually wants to execute this. So you’ll get
something like this, where we have some users on the
upper left, and those users– yes, upper left for you as well. So users in the upper left. And they all will be using
some sort of Jupyter UI. And they have some document,
an .ipynb file which represents the notebook. And then, they’re
interacting through the UI with a Jupyter server which
is communicating to the code executor called a kernel. And it basically just forwards
requests back and forth through the kernel
on your behalf. And then, there’s a
protocol that kernel follows so that any client in
any different UI or interface could do the same thing
with their own display. The difference between this and
other notebooks, by the way– step back a second. Other notebooks oftentimes
bundle the kernel and the server together. So they’ll have just the
API layer to the client, and they won’t emphasize
how you actually execute, which means
it’s harder and harder, if you want to make
a new type of kernel, to just execute a
new type of language. Or if you want to
make a new UI, you have to reinvent
more things here. The kernel and the UI APIs
are all independent units that are all well specced. So you’ll see many,
many different flavors of UI and extensions like the
ones we’re talking about today. And that gives
Jupyter Notebook space an advantage over many of
the other notebook options. So we’re going to talk a little
bit about why these came about and how come people
use notebooks. Especially if you’ve
never used one and someone’s
like, I really need to have this notebook in
production, and you’re like, what is this thing? Where’d it come from? Where it came from
was this ability to explore and analyze
data and find a result. So it was really handy for
data scientist workflows, where I need to load some
data that’s expensive to load, and I need to iterate on a model
over and over and over again. And maybe I don’t know what my
end code’s going to look like. Maybe I don’t even know what
I should be doing yet. So it makes it a really
nice way to explore without having to recompute
all the very expensive aspects of your code. And this comes into
the fact that this makes some really nice
qualities for certain workflows. It also lets you record outputs. And it’s also an
easy-to-modify tool. I can share this notebook
with somebody else. They can reproduce
what I’m doing, and they can modify it
slightly without having to understand a lot of systems. I can just ship them a file, and
they have that reproducibility. But there’s a lot
of things that make engineering teams
frustrated with trying to move these from a development
cycle on a scratch pad into a real production system. And some of these things are
around the lack of history. If you accidentally
edit a cell and save, it doesn’t tell you
what you edited, and it’s hard sometimes to go
find out what you actually did, especially if you close your
browser or something else. They can be really difficult
to test traditionally. So someone will write
1,000 lines of code, hand it over the wall, and
ask– or 10,000 lines of code– and ask, hey, make
this run in production. And you’re like, whoa. I don’t know how this works or
if it’s going to work reliably. It’s also a mutable document,
so you’re always editing in place in traditional UIs. And it becomes hard to
collaborate on in this sense, because two people trying
to edit that document at the same time, you can
run into a lot of issues without extensions
on the protocols. So one thing we
did at Netflix was try to fill some of these gaps. We had a ton of users
using notebooks, and the thing we
saw was that, instead of trying to push those users
away from using notebooks, we said, OK, we’ve
got a sizable number of users who all love
working in this environment, and they’re having a
lot of friction and pain moving from that to the systems
that are already in place. So they oftentimes
had to rewrite all the work they did or
had someone else rewrite all the work they did in another
place in the same exact pattern they already solved. So some things we
wanted to do is really help with improving the fact
that these notebooks aren’t versioned, so it’s hard to figure out what you actually ran at a particular time. They have mutable state. And they can’t
really be templated. It always requires a human
to go edit from the UI and change a variable to
rerun it, which is not a very programmatic pattern. But some things we
really wanted to preserve was results linked right
next to code, good visuals, the ease of sharing. These are the
benefits, and we didn’t want to move away from them. So we worked on this
library in the open source called papermill. And papermill is in this GitHub
project called nteract, which has a whole bunch of
utilities and tools around notebooks, in
particular, Jupyter Notebooks. And from a high level,
it works this way. It’s a program which runs like
a client, much like a UI would. And it’s going to take
some input notebook at some path, either in S3, in Google Cloud Storage like in this example, or in your local file system. It’s going to parameterize
and run that notebook, and then it’s going to output
it to an output location that’s independent of the input. So in this case, we maybe
have four different runs, where we’ve run the same notebook four times, each as a separate run. And how would this look in code? So it’s actually a
really simple interface. You don’t have to get
really complicated with it. Here, we’re going to
do some examples where we run papermill against a
Google Cloud papermill demo with some input notebook. And we’re going to basically
output to a new place. And here, we’re just running
the notebook verbatim as is. We say, hey, we want
to run this notebook, so what’s papermill doing? It’s actually executing the
whole notebook and the end, from the beginning to
the very last cell. And as each cell executes,
if no errors occur, it continues to the next cell. And as each cell executes,
it’s saving the outcome into the output location. So you get incremental
results as it’s going. And it also can log out live
what’s actually happening, even if you have
some interruption in the middle of a cell
or some hardware dies. But if you run this
where we have some output run of the demo, you see a folder in your Google Cloud that has these output runs. Maybe you’re running once for each day, so you put the date
into your output path. And now you actually can
see a history of what you ran every single day. You didn’t mutate the input. Actually, your output is
isolated from your inputs, so you get some nice
immutability guarantees. You also get some version
history, in the sense– if you’re running things,
you can see historically what you actually executed. But that isn’t where
we stop because one of the other things that we
can’t do is we can’t go in and edit like a human and
change a parameter or a variable inside the notebook and then
expect that notebook to run. So when someone
gives us a notebook, we actually need the
ability to templatize it, which is add parameters. So in this case,
you can see we’re going to run this input
notebook and pretend this is a notebook which
goes and maybe counts the number of clicks by
device in a certain region for some website
you’ve maybe heard of. The notebook that
you’re running, we would have this default
parameter cell in here where you would identify–
this is my default. You can set the region to US. Device is a PC. So if I were just to run this
notebook without papermill, it would run and
say, hey, I’m going to count everything in
the US that was a PC that did a click on the website. But then, when we parameterize
it in this way with papermill, you’re going to see that we’ve
overwritten region and device by inserting another code cell. So the way papermill works is
it will generate your inputs, create a code cell
out of those inputs, and inject that
into the notebook as though it
were a user input. And then, when it runs,
your output actually just has that written in as code. So when you read it, there’s going to be the cell with a comment saying parameters, and then all the parameters that got injected by the system on your behalf.
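As a sketch of the same call with parameters, assuming the Python API again (the paths and values are illustrative):

    # Hedged sketch: override the defaults in the notebook's parameters cell.
    import papermill as pm

    pm.execute_notebook(
        "gs://papermill-demo/click_counts.ipynb",         # hypothetical template
        "gs://papermill-demo/runs/click_counts_ca.ipynb",
        parameters={"region": "CA", "device": "mobile"},  # injected as a new cell
    )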
And you don’t have to be executing this from Python. Excuse me. You can also execute this from
command line, which is actually probably a more common
pattern, though if you’re trying to extend it, you want
to extend things in Python. But here, you can see
the same exact command where we did this Google Cloud
notebook input-output path. We can do the same
thing with command line. We’re just saying
papermill as a CLI, which wraps that other call. And then you can
pass in parameters in any sort of JSON format
or as individual key values. Technically, -y is
for YAML, so you can give it YAML as well,
though here we’re passing JSON. So let’s see what this would actually look like if I were calling it on the command line.
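It’d be something like this, roughly, as a sketch with illustrative paths and values:

    papermill gs://papermill-demo/click_counts.ipynb \
        gs://papermill-demo/runs/click_counts_ca.ipynb \
        -p region CA -p device mobile
    # or pass them all at once as a JSON/YAML string:
    #   -y '{"region": "CA", "device": "mobile"}'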
When I run that command, you’re going to see a printout of, hey, I’m grabbing an input
notebook from here. I’m going to an output. And then you’ll see
this progress bar going that prints out
progress is going and the rate at which
it’s processing. It’s configurable about
which of these things you want to render by
default but it gives you some basics about what’s
actually executing as it goes. And you could even
put it in a log mode so that it will print out
all the logs about each cell as well as saving
them into the cells. If we ran that
same outcome and we want to go look and say, hey,
what’s in your Google Cloud, we just ran that. And actually, the
code we ran up there was the exact same–
basically the exact same code. And then I point to this
papermill demo bucket, and you can see we
have the outcome and the input separated. Input has 993 bytes because it
didn’t have the image baked in. And then, the output
has 35 kilobytes because it was that
original notebook we looked at that prints an image. So how does this
change the picture of the interaction with
the tooling under the hood? Like this is replacing
this whole stack? Not really. It’s actually just
a different client to talk to the Jupyter kernels. So it’s following the same
protocols and specs as an API client would or any other
kind of codified interface you want to make or user
interface you want to make. And because Jupyter has these
really nice specs about how to communicate between the
REPL and the interface, the papermill can act like any
of those clients and execute. So when it runs,
it will actually launch a kernel
manager, which will go find the kernel by name,
start it up, run all your code, and shut it down.
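That lifecycle is roughly what the jupyter_client library exposes; here is a minimal sketch, assuming a kernel named python3:

    # Rough sketch of the kernel lifecycle papermill drives on your behalf.
    from jupyter_client.manager import KernelManager

    km = KernelManager(kernel_name="python3")  # find the kernel by name
    km.start_kernel()                          # start it up
    client = km.client()
    client.start_channels()
    client.execute("print('hello from the kernel')")  # run your code
    km.shutdown_kernel()                       # shut it down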
And then, the only other difference is that, for the read and write, we’re reading from one source
and writing to an outcome source or outcome destination. That’s a little bit different
than the maybe development cycle pattern. But the end of the day, you
always have an .ipnyb file that you can load in any system. If you look a little deeper
under the hood about what this actually
looks like, there’s a few modules that go in here. There’s sinks and
sources, which can be any kind of
schemaed destination. So things like Google
Cloud, S3, Azure, File System, [INAUDIBLE],
those are all baked in. And then, the parameters here. You can pass in any
kind of JSON-like value, and those are extendable. So if you want to have
more customized things for your platform,
it’s very easy to make an extension of how
the parameters get passed in so you could do things
like, with table references, actually load the table
data on behalf of the user. So the other thing this
enables is the idea that you might have
a notebook template, let’s say something that’s
a learning algorithm, and you want to adjust
how you actually run it. You can run that and
parameterize the notebook differently for each run. And then you can
run that notebook and have a documented outcome
that has visualizations, logs, and outcomes all in one
place for each parameter variation from your
template you started with.
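As a sketch of that fan-out, assuming the Python API (the template path and parameter are illustrative):

    # Hedged sketch: fan a template out over a hyperparameter sweep.
    import papermill as pm

    for lr in [0.1, 0.01, 0.001]:            # illustrative parameter values
        pm.execute_notebook(
            "train_template.ipynb",          # hypothetical ML template
            f"runs/train_lr_{lr}.ipynb",     # one documented outcome per run
            parameters={"learning_rate": lr},
        )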
This is really, really handy for machine learning patterns because oftentimes you want
to have a template explore a little bit, find a pattern. But then you maybe
don’t know what parameters you should
actually use in production. So this is a way you can
fan out that notebook without rewriting it along
many different parameters. And the end result here
is you can also then see what the result of
maybe a confusion matrix or some other
accuracy score on the outcome is for each of those notebooks. So there’s a few aspects
of notebook execution that this extends to. Part of it is we’ve
extended the use case to other types of users. And this is actually
where Netflix was really targeting with some of
this papermill investment. Before, the data scientists just had a nice iteration place that didn’t fit
very well into production. And now, we actually
have expanded this, so now analytics engineers
and data engineers are using notebooks
more and more because they get the
benefits of logging. They get the benefits of
associating the thing they’re executing to outcomes. And they can
iteratively develop. They all have an iteration cycle
that looks somewhat similar, although I would say
the data scientist’s is the most involved in
terms of iteration cycles. But now, they can develop
on those notebooks, and all of those
different user profiles can use papermill and then
schedule that on the platform or run it programmatically
at a later time for people. So one example of this. There was a blog post, which
is linked on the slide here, around how to use papermill
with TensorFlow on Google Cloud to actually execute your
notebook with GPU instances or other types of
configurations. So here’s just a little snippet. I had it cut a little
short to fit in the slide, but the blog has the full
code to execute this. But you’re basically just
going to run your Google Cloud compute instance create
call with an instance, and then you’re going to pass
in your GPU count and type. And then, you’re also
going to say, hey, my startup script is this
papermill input path, output path. And you can imagine adding any
kind of parameterization here in that you would want. And then, you can tell Google
Cloud to delete that instance when it’s done.
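A very rough sketch of that pattern, with illustrative flags and names (the linked blog post has the real, full script):

    gcloud compute instances create papermill-gpu-run \
        --accelerator type=nvidia-tesla-t4,count=1 \
        --maintenance-policy TERMINATE \
        --metadata startup-script='papermill gs://my-bucket/input.ipynb \
            gs://my-bucket/runs/output.ipynb'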
And this basically has now made an orchestration for executing a notebook on
top of another set of hardware, including hardware that
you might not have access to in your local dev. The other thing too is
this all open source code, so all the things I’m
showing aren’t proprietary, hidden behind walls. The Google Cloud
storage was actually merged into the open source,
so it’s very easy for anyone to access with their
tooling in the Google Cloud. S3 and Azure have similar stories. And it’s really easily
accessible to any other scheme we want to add. We’ll talk about how
they do that in a minute. And the whole papermill repo– which, by the way, the
1.0 release actually went out today, which was
feature completeness and all the little things
upstream finally got finished and merged
into releases. So it’s a really nice,
clean repo to read now. But all of it is plug and play. So how you read from
sinks and sources, how you actually
execute the notebook, and how you apply
parameters are all registered classes that you
can register your own custom versions of. Say, for example, we want to
implement an SFTP handler. Say we have some files
in SFTP because that’s where someone has
some notebooks, or they want their
notebooks to land there because they need to share
across some B2B channel that still is using SFTP. To implement this
would literally be all the code on here. If you implemented
read and write and then did the
setup tools here, this would be all the
code to be able to call the command at the
bottom, which is papermill execute with an SFTP scheme. It has this entry points
pattern you could do, where you can register
in your own setup pie an entry point about
how to register an IO interface in and for
the other interfaces of a similar pattern. So the code actually
execute things. If you read the whole repo,
it’s really short now. There’s not very
much code in there. So it’s pretty easy to
grok what it’s doing and to add new things. So one thing I
want to talk about is that the failure mode is actually really nice in notebooks. Traditionally, if
you think of, how would I have a bunch of scripts
that maybe I didn’t quite productionize and
put into the library, or maybe it’s calling things
across three different systems, how would I actually understand
what went wrong when it fails? And this is what keeps
people like me employed because it’s a hard
problem, so we can go build solutions for people. But one of the nice
things is with a notebook, if you failed with
a notebook and we call that same pattern
where we fan out over some parameter,
the failed notebook actually has a really rich
record of what happened, how it went wrong, and how to reproduce the problem. You get all of that
captured in that notebook. Because it failed, it’ll
actually still save the outcome, but it’ll save it with: here’s the stack trace, here’s the cell it failed on, here’s how the code
didn’t succeed. And your parameters, when
you’re running the papermill, are already baked
into that notebook, so you don’t have to
re-parameterize it. It will run exactly as it
ran when it was scheduled or running inside a platform. So here, for example,
we have this notebook. These are some of the
advantages that I just talked about with stack traces
and rerun-ability and execution logs. But let’s say that
notebook we just had failed and crashed. Well, something
we can go do is go find the issue in the notebook. Say here we found this
cell execution number 12– I’ve cut out the middle layers– has some sort of HTTP connection
failure talking to Spark. We can go load that notebook
into a notebook dev environment, iterate on it until
we fix the problem and identify what’s
actually wrong. And then we can
ship that fix back to the upstream notebook. In this case, for
example, we had a Spark job defaulting to
a hostname that didn’t exist. So it’s trying to
target some YARN cluster that doesn’t exist. So maybe we need to tell it
where the YARN cluster actually is. And now the job
actually succeeds, and we can push that change up. But we reproduce
the exact problem, and we didn’t have to go and ask
five systems what went wrong. So what this very simple change to how you execute a notebook, with a client that isolates the inputs and outputs, gives you is that you get immutable
that you’ve scheduled or are executing on
a programmatic basis. You get immutable
outputs if you’re saving your output
to a unique path– for example, we write to
whatever the user’s execution context plus a GUID
for every single run, and then we associate that back. And we put it into
someplace where it’s read only so users don’t
have access to go edit it. That means you have very
good reproducibility and audit history about
what actually executed. You also get nice
parameterized notebook runs. So no longer do you have to
have a big block of instructions for the human to go
change a notebook, like, when running this, change
these three lines in this cell, and then go change these
four lines in this cell. You don’t do that anymore. You can just parameterize
and templatize it so the user can just
pass in parameters. And then, the other thing
too is the configuration of sourcing and sinking. This means that you
can get an interface right into the platforms where we
have hosted systems without having to
rewrite the execution and wrap it in your own
copy and move and shift. So given all that, it also makes
a pretty compelling argument for being able to
test notebooks better with papermill as
opposed to before, testing it was usually
ship it to somebody and have them run it and
say, yeah, this is good, and hope they were right. So in terms of
notebooks, one thing to note when you’re talking
about testing notebooks is notebooks make pretty
terrible libraries. They make great
integration tools. But if you have a ton of
library code in there, they’re pretty hard to make
reliable and know that you can share and reuse them. I’ll talk about what that means. So they make a good
integration tool because notebooks are
really good at connecting different pieces of technology,
logging how it actually ran, building a result, or
taking some actions on disparate pieces
of technology without having to
leave an ecosystem. But they’re unreliable
when they’re really complex or they have a really
high branching factor. So when you have a notebook
that has 15 conditionals inside of it or many different
functions and loops, it can be really
hard to reason about if this thing’s going to run. The traditional
path areas with code is you would go unit
test each of those pieces and break them up. With notebooks, since unit
testing is a bit hard, we actually, at
least at Netflix, have adopted some of these
development guidelines for the more critical things. This is keeping a
low branching factor. Short and simple is better. Try to keep to one
primary outcome. So if you have a notebook
which does five things, maybe consider breaking
that into five notebooks and have five notebooks that
do one, each doing its own job. And try to– I put
this in parentheses because we fail at it too–
leave library functions in libraries. So if you get a really complex
notebook and it’s important and you want to move
production, move that code into a library that’s shared
across your notebooks. Maybe talk to your–
whoever’s helping support you to make that
happen if you don’t know how. But we do still get an
advantage of something we couldn’t do before, which
is integration testing. So if you want to write
an integration test, you basically just want to write
exactly what the user would have run, a template
call into your notebook. So say we have
that Spark template that we showed failing earlier. We want to write
an integration test because it failed for the
user because they didn’t have a cluster host to find. And we could have
easily caught that, that the default value
was going to fail. Here what we can do is run
that same Spark template, and then we’re going to
do some test run output. Maybe we would automate this and
have a test run ID in there. And then, what
we’re going to do is pass in a fake region, a fake
run date, and the debug flag just in– for notebooks that want
to be able to maybe dry run their execution if
they’re in debug mode. And in this case, we’re going
to be targeting this region that has tons of users, called luna. We’re going to pick a run date– maybe that doesn’t really matter– we’ll fix it to a hard-coded day.
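As a sketch, that integration test might look something like this (the template path and test values are illustrative):

    # Hedged sketch of an integration test for the Spark template.
    import papermill as pm

    def test_spark_template_runs():
        pm.execute_notebook(
            "spark_template.ipynb",        # hypothetical template path
            "test_runs/spark_template_out.ipynb",
            parameters={
                "region": "luna",          # fake region backed by dummy data
                "run_date": "2019-01-01",  # hard-coded day
                "debug": True,             # template can dry run on this flag
            },
        )
        # papermill raises if any cell errors, so reaching here means it ran.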
Really– is everyone awake, by the way? I’ve just talked
about, like, cool, finding people on the
moon, and no one even– OK, cool. So we’ve got the– here are parameters
that got passed in. You see region got assigned. The run date got assigned,
and our debug is true. And then, here, this Spark SQL
that we’re actually running, maybe it’s running a particular
query against an output table. And now, using our parameters,
we can reorient how this runs and run it against dummy data. So we’ll do this
a lot where we’ll run integration tests
at Netflix, where we’ll run all the templates. And we have default
tables with some known potentially problematic
patterns or just the simplest thing we can make. And then, we run
the same templates the users are going to
use in the same way they would use by targeting that
table or that piece of code. And this gives you
a really nice way to say, hey– this
notebook runs well. I know all the
defaults make sense, and I know I can
parameterize it in some way. You could also, if you
have a DAG execution, you can do things like
run the integration test and then have a follow up
unit in your DAG that says, hey, did this actually write
data, if you’re concerned about that, and you can branch. It fits right into the
wheelhouse of integration testing if you’ve
ever dived into how QA thinks about testing. It gives a nice
breadth of coverage for notebooks in that space. So that means is what are
some guidelines about how I test my notebook? Unit testing your
notebook is hard. The tooling out
there isn’t great. Or even if you had good tooling,
it’s a little bit cumbersome because how do you– you had
this one document you could bundle and send to somebody. But now, how do you
bundle it with unit tests? Do you put them in your notebook
and make it really cluttered? Do you put it alongside? Then, why don’t you just
make a library anyway? There are some hard
conversations there. So what we actually ended
up doing was saying, hey, use notebooks throughout
the good app, which is integrations or
reproducibility, and try to follow these
guidelines to make sure that you can trust that
they’re going to work. If you have a simple
one-pager of a notebook, one integration test
is probably sufficient. You’ll be fine. If you have something that
has maybe a little complexity, maybe it’s two pages
of code on your screen, maybe a couple integration
tests for a few of the patterns that could
go through it to make sure you cover more of the lines
usually covers you pretty well. And you’ll be pretty good
for production use cases. If you do a really
complex notebook– like I wrote my whole ML
training, loading, saving, and three different model
iterations in the same notebook, and it’s like
a 20-page document– one, maybe don’t
put that in prod. If it’s critical, move some
of that code into a library so that the shared code
between your model execution can be loaded in
all your notebooks. And then your notebooks
get nice and trim, and it’s more of an integration
where your library is one of the integration pieces. And then, if you can’t
do that and you still need to get it running,
trying to get one integration test per usage pattern of how
this will be used at least gives you some confidence. It’s not a silver bullet,
but it gets you further. And one thing I want
to really emphasize too is that I’ve talked
a lot about papermill and how you can use this
tool to change notebooks. There’s actually a ton of
other goodies in the ecosystem. Because it’s all spec
and protocol based, notebooks actually have
a whole rich environment of libraries that give you
different capabilities. And oftentimes,
in today’s world, a lot of notebook teams
that are being formed are really about how to
collect these goodies and put them into a platform
that works for your use cases. So a lot of tools that you
probably haven’t heard of but are used by platform
teams all around that I would encourage
you to look into. nbconvert is something that
actually papermill uses, but it can have
other export forms. So you can make HTML or PDF
outcomes of your notebook. commuter is a
read-only interface for notebooks, probably the
most useful out of this list because it makes sharing a
lot safer and a lot easier. You don’t have to have a
whole notebook server up just to share the results
of an execution or of how you’re
planning to do something. And there’s a few others
in here, some of which I’ll talk about for
a couple minutes. So scrapbook is one of them. Scrapbook is another
repo that I’ve written from scratch
that basically pulled some functionality that was
in the original papermill implementation and put
it in its own bucket. And what this is really
trying to do is, as a tool, Scrapbook is intended
to complete the story arc of notebooks as a function. So with papermill now, you
have an input notebook, and you have papermill. So you have the function
you’re actually going to run, and you have parameters
as the inputs. But your outcome today
is still a notebook. So when you output a
notebook, it’s still mostly human interpretable,
not machine interpretable. One of the things that
this adds is the ability to save results by name in your
notebook and then recall them. And what this is really
useful for– say I ran that notebook
that was an ML model and I want to iterate
all my parameters. One of the things I
might want to save is the confusion matrix result. And then I want to collect
all those confusion matrices and find the parameterization
that was best or fit some criteria. This lets you do that pretty
easily without leaving the notebook. I encourage you to take
a look at the GitHub to look into more details
about how that works. But here in this example,
inside the notebook in the first section,
we’re importing scrapbook and gluing our model results. And then, later, outside
the notebook, maybe after we’ve executed it
or in another notebook, we’re going to read
that outcome notebook and then go collect the
model result by key.
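A minimal sketch of that glue-and-read round trip (the scrap name and paths are illustrative):

    # Inside the executed notebook: record a named result.
    import scrapbook as sb
    sb.glue("model_results", {"accuracy": 0.93})   # illustrative value

    # Later, outside the notebook (or in another one): recall it by key.
    nb = sb.read_notebook("runs/train_out.ipynb")  # hypothetical output path
    print(nb.scraps["model_results"].data)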
And you can also save things like graphs and recall them. So one of the things
that’s really handy is if you have a performance
graph, you can actually save that graph alongside
your confusion matrix result. And then, when you find
the confusion matrix you like the most, you can actually
go render the graph for that run as well and not
have to re-execute it. It’s a handy little tool. Take a look. It’s a little bit early in the
sense of where it could go, but it has full feature
capability for the things I just described. And I talked a little
about commuter. Really highly recommend
looking into this one. This is the interface
for commuter. This is our Presto template
for how to execute Presto jobs. It also fits on one page. It’s nice and clean. There’s almost
nothing happening. But it’s really handy
because these calls here, we have schemaed input, and
then we have a gaze here that prints out the
job information, then a monitor at the end. Here, we can share this
notebook template with users, and when they actually
execute, they’ll see this but parameterized
with their logs in there. And we always share through
commuter to the user so they have a really
nice way of seeing what happened without the
risk of editing anything. So I’ve talked a lot
about the open source and hinted about
how we use it a lot. One of the things at Netflix,
about a year and a half ago, when we kind of started really diving into this adventure, was a strategic bet that the large swath of users that we have, and that we were hiring the most of– which were analysts– were actually using notebooks instead of other tools, so us leaning into making notebooks easier would benefit the company. And then we took it
a little further, almost to crazy levels, where
we were doing a new scheduler project. We decided to make every
single scheduled job be a notebook, which
might sound extreme, but actually, it
played out very well. Because what we really
did was we made templates. And the users didn’t even
know they were using notebooks at first. They say, I’m
running a Spark job. I’m running a
transport to Druid. They’re just giving the
inputs and parameters for that type of job that
is schemaed like any other job definition. But at the end of
the day, we always translate that to a notebook. So it’s really easy for
anyone to debug what happened. I’ll even put a point on this. We had a manager of a team who
hadn’t written code in a long time, and his on-call person– this is a machine learning team. His on-call was out sick, and
one of their critical jobs failed. And he told me afterwards– I didn’t even know he did
this– but he said, oh yeah, I got paged because he was sick. And I went and
looked at the job, and it had this notebook thing. I don’t know what it was. I clicked on it, and it had
this really nice visual, and then it told me exactly
what the stack trace error was. And I read it, and I
understood what was wrong, and I actually fixed it and then
reran the job, and it worked. And that was a manager of
a team who was able to do that. It lowers the
barrier quite a bit on understanding what actually
happened and what went wrong. So that’s a fun thing. And since then, since
we started this project, we have moved like
over 10,000 jobs, which produce like on the
order of 150,000 queries a day to all be running in
this notebook framework. And we’ve had
pretty good success being able to support that. Cool. That’s what I have
for the presentation, and then I think we have
some time for questions. But yeah. [APPLAUSE] AUDIENCE: I would like to
ask about that performance and potential impact
of the performance compared to running a
pure Python [INAUDIBLE]? Is there any difference, or– MATTHEW SEAL: Yeah. So the question was,
is there any difference between running pure Python
and running a Python notebook? AUDIENCE: Exactly. MATTHEW SEAL: The
answer’s pretty much no. In reality, this is all– the only thing you have is
a little bit of startup time in order to launch
the kernel, which is on the order of
a second or two. And then, after that,
you’re running Python code with IPython the same way you
would run Python code normally. So there’s no real
performance impact. Yeah? AUDIENCE: Does the kernel
have a built-in compiler or interpreter, or is it using
the system’s interpreters? MATTHEW SEAL: The kernel’s
responsible for making sure it can execute the code. So in most cases– like
in the Python case, it’s wrapping IPython, so
it’s launching a process– AUDIENCE: It’s built in. MATTHEW SEAL: It’s built in. AUDIENCE: So if you
want to make sure you’re compatible with a
server’s version, you have to do something else. MATTHEW SEAL: Yeah. Oftentimes, people
for version things, like Python or Scala
or some other– or R if you’re using it– they’ll name their
kernels with the version. So you’ll say, I am an
IPython version 3.7 kernel. So you know you’re
running Python 3.7. That’s the standard
you’re executing against. It’s much like you
would say, hey, execute this job with Python 3.7. Same way with Scala
and others and Spark. They’ll oftentimes
put the version in the metadata of the
kernel, so it’s very clear what you’re executing against. How it chooses the kernel is actually inside the notebook: when you run it or you connect to a kernel, it saves in the metadata of the notebook, I ran on this kernel. When it runs, it tries to find exactly that kernel, and then it has fallback mechanisms, if it can’t, to try best effort if you want.
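For reference, a quick way to see that saved kernel metadata, as a sketch using nbformat (the path is illustrative):

    # Peek at the kernelspec a notebook recorded; this is the name that
    # gets resolved at execution time.
    import nbformat
    nb = nbformat.read("notebook.ipynb", as_version=4)
    print(nb.metadata["kernelspec"])
    # e.g. {'name': 'python3', 'display_name': 'Python 3', 'language': 'python'}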
AUDIENCE: So before– MATTHEW SEAL: Oh, sorry. You, and then you. AUDIENCE: So before notebooks came to Netflix, what was the
other alternative for doing the same thing? MATTHEW SEAL: Yeah. So what is the
alternative to doing what we’re doing with notebooks now? Well, one is we had a
really old scheduler, and it was a real
pain to debug it. There were like two
guys who could do it, and everyone else was
like, hey, this went wrong. And those two guys got
really, really busy. So on the scheduling
side, the story was kind of unscalable
for what we had. We could have solved
it in other ways. On the iterative cycle
and moving things to the prod side of things with notebooks, the story was the story that’s in a lot of places today, which is every time
you wrote something in notebook, after
you got done, you got told to go rewrite
that someplace else. And sometimes you did
a good job of that, and sometimes you threw it over
the wall to data engineering. And sometimes they did it, and
sometimes they threw it back at you. So there was a lot of friction. And it was OK and
at a certain scale, but we were clearly
seeing the warning signs that the user base
that’s using this is growing way faster than
all the other user bases. So rather than running
away from the technology they like and use,
we wanted to move the platform closer to them so
that there was less friction. At the back there. Sorry. He’s had his hand
up a couple times. AUDIENCE: Is there a limit
on the computation methods in the kernel? So can you do
distributed [INAUDIBLE]? MATTHEW SEAL: Yeah. So the question is, is there
a limit on the computation within the kernel? It has the same limits as
any other execution process. So oftentimes, you’re executing
when you schedule this or run us on a platform. You’re running usually
in some container that has some resource limits,
and those are your limits. It’s the same as
any other process. Yeah? AUDIENCE: So when
you run several– when you copy and save several
Jupyter runs, do you do a diff
the [INAUDIBLE] that change, or do you save everything? MATTHEW SEAL: No,
it saves everything. It’s an output, totally. What it does when it runs is it actually reads the .ipynb. It’s a JSON schema. So it loads the .ipynb file in, verifies it matches the schema
for version 4.4. Then it executes it in
memory against it, updating. And every time it
saves, it converts that in memory representation
to an output totally independent of the input. So it will actually dump it. AUDIENCE: So if I
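A sketch of that read, validate, and dump cycle, assuming nbformat (paths illustrative):

    # The notebook is a JSON document with a versioned schema.
    import nbformat
    nb = nbformat.read("input.ipynb", as_version=4)  # load and parse
    nbformat.validate(nb)                            # check against the schema
    # ... execute cells, updating the in-memory representation ...
    nbformat.write(nb, "output.ipynb")               # dump the whole document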
change [INAUDIBLE] like 5% of my
notebook [INAUDIBLE] trying to experiment
with something, but it could also
save the remaining 95% of the [INAUDIBLE] MATTHEW SEAL: Today, every
time you save, you’re saving the whole notebook. So that’s something
that will probably change in notebook 5
format with some options for being a little more
scalable in that sense. Usually, it’s not a problem
unless you save really, really big displays. Like I save a gigabyte of data,
you’re going to have a bad– Like if you say print
and it’s a gigabyte, you’re going to have
a really bad time. Of course, you’d have a bad
time in most interfaces. But it’s a little worse in
notebooks in that sense. But yeah, every time you save,
it saves the whole document. A question over there. AUDIENCE: So if you
will use a Python file, you have all the tools that
help you write good quality code, like [INAUDIBLE]. What are you going
to use in notebooks? How do you encourage people
to write good quality code? MATTHEW SEAL: Yeah. I would encourage you to
bring some of those tools to notebooks. There are a few, like linting, and a lot of
UIs have some built-in stuff around certain languages. Though, since it’s
a polyglot system, a lot of those languages
lack nice autocompletes, nice linters, and
things like that. So if you jump over to
something like Scala, it’s still kind of rough. It’s getting made
better because there’s companies investing in it. But R studio still does
better than R in Jupyter in that sense. I would say those tools
can be integrated, and I would encourage to
bring those tools to Jupyter. There’s a few places
where they already exist. We got a bunch– I think you were next. AUDIENCE: So I’m able
to handle the cases when the kernel actually fails. For example, to code makes
the kernel run out of memory. I’m able to save the
last execution that falls out of the [INAUDIBLE]? MATTHEW SEAL: Yeah. So he’s talking about– the question was,
when things go wrong, like the kernel runs out
of memory, what happens, or is it handled well? Well, the nice thing is with
papermill 1.0, which just released, it now has a better OOM handler. So it’ll actually
raise that it ran out of memory instead
of running forever, which it used to do sometimes,
which is the worst thing it did out of things that were bad. But in terms of story arc, what
will happen when you actually run is it won’t save
that final execution. You’ll see if you
load the notebook, it’ll say, executing with
papermill, dot, dot, dot. And it’ll be stuck
in that mode, mostly because the processes die and
we don’t know how to recover. But what you do get is
when you run papermill– and we run it production. We run with dash-dash
log outputs. And that says– every time
we would get a message, we were buffering to
save in the notebook, we’d log it out right there. So we at least
have– the container that was executing papermill has
all the logs, so if something really catastrophic
happens, we can always read the logs from there, even
if the cell in the notebook failed. One thing that
would be nice, and I think we should
probably invest in that more, at least
in our platform, is when it does out of
memory of making a link, like rewriting
over that notebook to add a link back to the logs
or inject the logs afterwards. It doesn’t do that today, but
it would be pretty easy to add. Cool. AUDIENCE: Do you save it
or host it in version control? And then, how do you diff it? MATTHEW SEAL: Yeah. So the question was, do you save in version control, and how do you diff it? The answer is I save my
stuff in version control. Not everyone else does. I would say that diffing
is kind of painful. It’s been getting
better, actually. There’s a few tools out there. So nbdime is a tool in
Jupyter for diffing notebooks. It’s not so integrated
with Git everywhere, but there’s some efforts there. nbviewer is a nice– it’s got an open source
side and a closed source side for doing GitHub viewing. And I encourage you to go–
if you don’t use GitHub, they’ve got issues for other
Git systems that I’ve definitely plus-oned Go poke
them and tell Stash– or I mean Bitbucket
to go add an API so you can do the right thing. In terms of local diffing– so the reason why I ask
this, notebook dipping is really ugly because you’re
diffing JSON documents. They are pretty
printed, at least, so it’s not all one line. But when you want to
read it as a human, you usually give up after a bit. So the effort with
nbdime and other places to get integration
to your places where you do code reviews
and render the notebook is a dif, a
two-sided dif, rather than reading the JSON raw. Today, I read that
JSON raw sometimes, and it’s unfortunate. You had a follow up question. AUDIENCE: Yes. So you mentioned that you save
it in the GitHub [INAUDIBLE]. MATTHEW SEAL: Yeah. AUDIENCE: [INAUDIBLE]
company do not. So what’s the reason for that? MATTHEW SEAL: Some of that’s
just the analysts and even some of the data scientists– Git’s a new thing. They maybe just learned SQL. They can do reporting. They know Tableau. They know some of
those interfaces. And Git’s kind of new, so it’s
a scary thing to get into. And also, sometimes
people are just lazy because humans are humans. So while, yes, you should always
use Git and version control, the reality is sometimes
people skip that step if they don’t think it’s too important. And then, later, it
becomes important. They forgot to do that. What we are trying to do there– one of the tools I mentioned
there is called Bookstore. Bookstore tries to give
you linear versioning, so every time you save, it
kicks off a version to– right now, it does it to S3. A blasphemy in this room, but
Google Cloud, you can plug it in, make it happen. But Bookstore is a way to get
to some linear version control without having the user
have to do anything. Yeah? AUDIENCE: So let’s say
you have a notebook and you want to make
it into a Python file? Is there an easy
way to do that or do you just have to [INAUDIBLE]? MATTHEW SEAL: Yeah.
nbconvert, dash-dash to python. Oh, sorry. The question, for anyone that was online that couldn’t hear, was, can you convert a notebook into a regular Python file? And nbconvert’s the tool to do that.
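A minimal sketch of that command (the filename is illustrative):

    jupyter nbconvert --to script notebook.ipynb   # older releases: --to python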
Yeah? AUDIENCE: So Jupyter notebooks are open source, so can
anyone contribute to it, or if I wanted to contribute
to it, how would you start up [INAUDIBLE]? MATTHEW SEAL: Yeah. The question was– Jupyter’s open source. Can anyone contribute? Do we have a Donations Please
or a big Don’t Enter sign? No, we don’t. So I’m on the
Jupyter team as well. And yes, we try
to be as friendly as we can to pull people in. There are a few of the
packages or code pass that could use some
more professional love, so we do also try and tag
issues with New User Friendly. Papermill has had a lot
of brand new contributors contribute to
papermill specifically. Matter of fact, that’s actually
how Google Cloud– they just opened up a PR and
said, hey, we’d really love to get Google
Cloud working with this, and that went really easy. So I would say it’s open,
and there’s a lot of forums both on Discourse, Gitter,
and the Google Group are all good places to
engage if you’re trying to find a place to work. And then, Interact also
has a Slack channel for lots of things. You can go ask questions there. AUDIENCE: [INAUDIBLE]
as a URL parameter? MATTHEW SEAL: As
a URL parameter, like go fetch them from a URL? AUDIENCE: [INAUDIBLE] MATTHEW SEAL: I’m not sure
I quite follow the question. AUDIENCE: I mean when I open
the [INAUDIBLE] notebook in the browser, [INAUDIBLE]
URL for that, right? MATTHEW SEAL: Yeah. AUDIENCE: So can I add extra
parameters [INAUDIBLE]?? MATTHEW SEAL: Oh, I see. So he’s asking, when you
load a notebook in the UI, can you add query params to
parameterize your notebook? No, because that’s not actually
running through the templating path at all. That’s the UI human path. There will probably be
extensions to make it easy. Some of the interfaces now
have actually just made adding a notebook parameter cell a first-class thing
where you just click a button and it builds a cell for
you, though traditionally, today what you have to do is– So the way it will choose
where to put your parameters and how to apply them is it will
look for a cell with the cell tag parameters, and it’ll
treat that as the default. So anything in there it
thinks are the defaults, and then it’ll
inject after that. If it doesn’t find anything,
it’ll just put it at the top.
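A sketch of tagging that defaults cell programmatically, assuming nbformat (many UIs also expose a cell-tags editor):

    # Mark the first cell as the "parameters" cell papermill looks for.
    import nbformat
    nb = nbformat.read("template.ipynb", as_version=4)  # hypothetical path
    nb.cells[0].metadata["tags"] = ["parameters"]       # defaults live here
    nbformat.write(nb, "template.ipynb")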
But from a query-parameter point of view, it’s not really an interface
you’re parameterizing there because you would just
write the code cell yourself or write the default
tag cell yourself. It doesn’t automatically
do that for you. Cool. AUDIENCE: This is more
of a feature request. So do you have the [INAUDIBLE]
information for each cell? Because if something
ran, you want to know whether it took
20 minutes [INAUDIBLE] or it took 2 and 1/2 hours. MATTHEW SEAL:
Feature request was can you add something so you
can keep track of how long cells took to execute? If you run with
papermill, it does that. AUDIENCE: Oh, it has it already? MATTHEW SEAL: Yeah. AUDIENCE: Great. AUDIENCE: You might have
addressed this already, but if it does
fail, there’s no way to start to continue execution
for it where it failed from? Suppose not failed– I’m going to say that it [INAUDIBLE] interrupted. So with the metadata
from the cell, you can still see whether
it completed, whether it’s pending or running. So suppose something’s pending. Can you start or continue
execution where it left off? MATTHEW SEAL: Basically, the
question was, if a kernel fails and your papermill
execution stops, could you reattach the
kernel and run against it? Today, with tools
as is, it would not do that because it shuts
down the kernel when it fails, or it attempts to,
because the kernel may already be dead. You could write some code. It’d probably take on the
order of 50, 100 lines of code to make it not do that. And actually, that
extension part where I said where you can
plug and play the I/O sync, so like FSTP, you could
awesome plug and play a thing called Engine. So it has the ability to
register custom execution engines. And you could register a custom
engine that would hold on– you could monkey
patch to overwrite the way nbconvert shuts down
the kernel and not shut it down. So maybe have an engine
called debug or interrupt, and then you could run
that version of the engine when you execute, and
then you could leave it up so you could reconnect.
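A rough sketch of registering such an engine (the keep-alive behavior is left as a hypothetical stub, and the base-class name varies across papermill versions):

    # Hedged sketch: a custom engine registered under the name "debug".
    from papermill.engines import papermill_engines, NBConvertEngine

    class DebugEngine(NBConvertEngine):
        # Hypothetical: override shutdown so the kernel stays alive on
        # failure and you can reconnect to inspect state.
        pass

    papermill_engines.register("debug", DebugEngine)
    # then: papermill --engine debug in.ipynb out.ipynb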
It would be a little complicated from an interface point of view, but it’s
physically possible, and the out-of-the-box
doesn’t do that. AUDIENCE: So in
one of your slides, you showed that
there are Jupyter servers [INAUDIBLE] notebooks. So who maintains those servers? Whose job is to keep those
servers up and running? MATTHEW SEAL: Yeah. So the question was, in
the architecture diagram there’s these Jupyter servers. Who runs these servers? What do they do? Who owns them? The answer is usually a
notebooks platform team. So many places have
a notebooks team. Google does here. Netflix does. Most of your big vendors do. They’re usually the ones who
are hosting the servers that are running everything
and figuring out with the rest of platform
how to orchestrate that, so how to isolate
resources or on demand. So there’s a few tools
in the open source to help you with this, like
JupyterHub as a way of divvying out things. And I think that’s a pretty
rapidly evolving story. But generally, it’s usually
a notebook platform team that’s managing the servers. And then, when you
click Launch My Notebook or you try to load
a notebook, it’ll dynamically allocate
resources for you someplace. Depends on the platform. Some platforms,
you have dedicated resources or shared resources. It lets you do what your
platform thinks is right. Also, I don’t know if people
are on the remote call and want to ask questions, if
that’s possible, or people– AUDIENCE: [INAUDIBLE]. MATTHEW SEAL: OK. Sure. AUDIENCE: So how do
people go around building more complex pipelines? So for example,
you’ve got notebook A that you want to
fit into notebook B. Then you want to fan out
to a couple of notebooks that collect the results and
go to notebook B or something. So the normal API is on
the execute notebook. Do people [INAUDIBLE] to execute
these sort of complex things or is there some
tool [INAUDIBLE]?? MATTHEW SEAL: Yeah. So the papermill
and the Jupyter side basically just made it so it’s
very easy for any scheduling tool that has DAG execution
to do its job well, which is doing DAGs and
linking things together. So we use an internal schedule
tool that we beefed up, and we basically
have two job types. We have run a container and
run a container with papermill. That’s the two job types. But in reality, from
the user perspective, we have many templates that
we default in for them. But there, we really
leave it on the tool that knows how to schedule,
not knowing what papermill is and just executing
doing that job well. So an open source example
of that is Airflow. If you have Airflow,
it’s very easy to integrate
papermill and Airflow. A lot of people do it there.
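As a sketch, an Airflow DAG wrapping papermill might look like this (recent Airflow releases ship a papermill operator; the DAG name and paths are illustrative):

    # Hedged sketch: run a notebook as one task in an Airflow DAG.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.papermill_operator import PapermillOperator

    with DAG("notebook_pipeline", start_date=datetime(2019, 1, 1),
             schedule_interval="@daily") as dag:
        run_report = PapermillOperator(
            task_id="run_report",
            input_nb="templates/report.ipynb",       # hypothetical paths
            output_nb="runs/report_{{ ds }}.ipynb",  # one output per run date
            parameters={"run_date": "{{ ds }}"},
        )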
Luigi or some of the other schedulers can do it very easily too. But I would say, in the chain
of responsibilities, I would let schedulers
do scheduling well and DAG executors do
DAG executing well, and papermill do
the execution well.
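To make that concrete, here is a minimal sketch of the Airflow integration he mentions, assuming the airflow-providers-papermill package; the DAG id, notebook paths, and the fan-out-then-collect shape are made up to mirror the question:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    "notebook_pipeline",                 # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task is just "run a notebook with papermill"; Airflow owns
    # the DAG and retries, papermill owns each execution.
    run_a = PapermillOperator(
        task_id="notebook_a",
        input_nb="/notebooks/a.ipynb",
        output_nb="/runs/a-{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )
    run_b = PapermillOperator(
        task_id="notebook_b",
        input_nb="/notebooks/b.ipynb",
        output_nb="/runs/b-{{ ds }}.ipynb",
    )
    run_c = PapermillOperator(
        task_id="notebook_c",
        input_nb="/notebooks/c.ipynb",
        output_nb="/runs/c-{{ ds }}.ipynb",
    )
    collect = PapermillOperator(
        task_id="notebook_d",
        input_nb="/notebooks/d.ipynb",
        output_nb="/runs/d-{{ ds }}.ipynb",
    )
    run_a >> [run_b, run_c] >> collect   # fan out, then collect
```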
Yeah? AUDIENCE: You had some user personas on one of your slides, like data engineers and stuff like that. [INAUDIBLE] the
entire [INAUDIBLE] iterating to [INAUDIBLE]
or is there hand-offs between different [INAUDIBLE]? MATTHEW SEAL: Yeah. So the question was, with the
different personas we have, what’s the lifecycle
of a notebook with each of those personas? Are they handed
off or end-to-end? It depends more
actually on the team or how that
organization operates. So the tooling we’ve built spans
a bunch of orgs at Netflix. Most of the time, people
own end-to-end I would say. They iterate on the
notebook, and then they want to schedule
it, and then we ask them to follow some guidelines. Or if you want to burn yourself
and do something bad, you can. We have a hands-off
approach at Netflix, so it’s a little different
than some places. I would say for a few teams, and probably for the iteration cycle outside of Netflix, it's more the handoff model, which
is the traditional model. Hopefully, this
tooling helps lower that friction some so
the handoff is more to review and do more
like a code review and then templatizing
it rather than rewriting someone's work entirely. Yeah? AUDIENCE: Can my notebook input be a Google Colaboratory notebook? MATTHEW SEAL: Yes. So Google Colab is a Jupyter– follows the Jupyter specs. So if you have a Colab notebook,
it will run with papermill, and you can load in any other
notebook interface that’s a Jupyter Notebook interface. Colab, by the way,
if you’re not aware, is one of the Google-provided
notebook interfaces for managing notebooks
inside a Google Cloud– or not inside Google
Cloud but inside Google. AUDIENCE: Is there a better
visualization tool or for video in notebook or [INAUDIBLE]? MATTHEW SEAL: Is there a better
visualization tool for video in notebooks? There have been a few efforts there. It's more about the
component that renders. So the way– if
you actually look at what’s being sent as
messages to and from the kernel, they actually– you
send code to execute, and then asynchronously, you get
back these different messages. One of the messages
is called display, and display has
a MIME type, much like any other
front-end rendering system would understand. So then, your UI can
implement MIME type renderers. So you can implement a better video renderer by implementing a component that responds to that MIME type and then plugging it into the video player you like.
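For a rough sketch of that message flow from the kernel side: IPython's display machinery can publish a raw MIME bundle, and any front end that registers a renderer for the custom type can pick it up; the application/x-my-video+json type here is made up for illustration:

```python
from IPython.display import display

# Publish a display message keyed by MIME type. A UI with a renderer
# registered for the custom type renders the rich payload; every other
# UI falls back to the text/plain entry.
display(
    {
        "application/x-my-video+json": {"src": "clip.mp4"},  # hypothetical type
        "text/plain": "video: clip.mp4",                     # fallback
    },
    raw=True,
)
```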
So there have been some efforts in a few UIs at making better video. Mostly, those have been closed source extensions of Jupyter, not so much in the open source. I'm not aware of what
the current state is for open source video
rendering niceties. But if you want to
add a component that’s easy for all of them
to use, I’d encourage looking at the nteract
project's component libraries. That's a collection
of React components that are made to be used
in different notebook UI front ends. It’s really easy to add
a new MIME type render. Yeah? AUDIENCE: Hi. So I wonder how do you
guys reach to the decision that you guys want to do the
[INAUDIBLE] at a local level instead of [INAUDIBLE]
try different parameter within a notebook [INAUDIBLE]
inside a notebook? MATTHEW SEAL: Yeah. So the question was
around how do we come to the decision of
branching on the notebook level instead of maybe branching
within the notebook and iterating. And that actually has been– I see it as there’s two
design philosophies about how notebooks would evolve. One of them was the notebook
as a black box function that I’m going to execute,
or some larger system I’m going to execute. And the other was the
idea that each cell would be that functional
unit, and then you would reuse those cells somehow. Netflix ended up leaning
on the notebook side because it was simpler. So the KISS model, keep it simple, kept
it easier to reason about. It doesn’t mean the other model
is wrong or couldn’t be built. But we built tools
for what we saw as the path of least resistance
that would have the most reliability. One of the issues
with trying to– one of the opinions
papermill takes is, you’re going to run
this notebook linearly. Notebooks don’t require you
to do that from the UI side, but papermill does. And this simplifies
the problem and the reproducibility
question a bit. In the other model, how
you execute your cells, it becomes almost
an acyclic graph of execution of your
different cells. And notebooks don't have a great way to express that DAG within the notebook format today. So if we went that
route where you want to iterate on
executing a cell, probably would have pushed
more on building tooling around setting up a DAG around
inside the notebook object about how you’re going
to execute your cells and how they relate better. But we didn’t go
that way because we felt it was more
complicated at the time.
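A minimal sketch of that notebook-as-a-black-box-function model in papermill, where each branch is a separate linear run of the same source; the notebook paths and the region parameter are made up for illustration:

```python
import papermill as pm

# Branch at the notebook level: one immutable output notebook per
# parameterization, each executed top to bottom.
for region in ["us", "eu", "apac"]:
    pm.execute_notebook(
        "model.ipynb",
        f"runs/model-{region}.ipynb",
        parameters={"region": region},  # injected into the parameters cell
    )
```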
AUDIENCE: From an onboarding standpoint, when you have a new data
scientist that comes in, how do you teach them essentially
what are the best practices for here’s– and like you mentioned, a
lot of [INAUDIBLE] themselves and [INAUDIBLE] everywhere
[INAUDIBLE] access control for that or– MATTHEW SEAL: Yeah. So the question was, how
do you onboard people onto these best
practices, and how do you make the best practices
actually best practices instead of maybe best practices? That’s really– I can give
you the Netflix answer. I don’t think it’ll translate
to a lot of other places. And it’s not perfect. I think, in the Netflix case,
we lean a lot on this concept of freedom and responsibility. So teams, individuals are
free to do whatever they want, but we expect that we’ve
hired experienced people that are going to make
responsible decisions. So oftentimes, within an
org, that org or that team has made decisions about
how to best onboard people. And so we try to
provide the tools and notify and inform people that these tools and patterns and documentation exist someplace. And we leave the
responsibility on the teams to figure out which of these
they should keep track of and onboard their users with. There’s some
centralized onboarding. We do training about the
scheduler and notebooks for new hires every
month and for people who just don’t know what it is. So we do a little bit
of the training model and try to follow that way. It works pretty well
within our culture, but within other
cultures, you might want to do a different approach. Cool. I get to save my voice then? Awesome. I’m not sure where
we are on time. Oh, we have another question. OK. AUDIENCE: So I work
as a data scientist, and for most of the prototyping we work with the Jupyter [INAUDIBLE]. So my question is that
by using the papermill, you’re trying to put your output
in a separate notebook file. What's the advantage of that? Basically, for production, we have to somehow transfer the notebook back to a Python script for efficiency and other things, so I just wonder how or why you need to put that output in another file. MATTHEW SEAL: Yeah. So the question was
around why separate the notebook when it
goes to production because the data
science lifecycle lives within the same notebook. That’s how they operate. Now I have to think
about two notebooks that exist in the wild. There’s a couple
reasons we did this. One is it made data platforms
be OK with us doing this. So that’s one of the– that’s
actually a really compelling thing, is it crosses a
bridge where you give this immutability guarantee about I
ran this source and it resulted in this outcome. So from a pure execution,
independent of what it’s doing, that gives a lot of nice
security and re-playability and debug-ability for
people who are outside of knowing what your
notebook actually did or how it should operate. So that’s one aspect. The other aspect
we have is that we wanted to have a very
good recollection of how your notebook changed over time
or how it iterated last time. So many times, if something
fails and a platform engineer or someone tries
to help, they have to go look at what ran and
figure out why it didn’t work. And knowing what the last
successful version is is really handy because you can go look
at the last successful run, compare it with the current run. That can tell you a
lot of things about why something isn’t working. Now, how does that feed
back in as a model? Or how do you have a template? One thing that happens
as a development side, if you schedule something with
a particular parameterization, usually, when you want
to try playing with it, you want that same
parameterization. So many times, people in
the data science cycle would schedule a notebook, run
it, and they go see how it ran. And either they’ll just copy the
parameter cell back and run it because it’s the same
source, or they’ll copy the whole notebook. And then, when
they’re done, they’ll just delete the parameter
cell and save it.
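A minimal sketch of recovering a past run's parameterization from its output notebook, relying on the injected-parameters tag papermill puts on the cell it injects; the path is made up for illustration:

```python
import nbformat

# Read an executed output notebook and print the exact parameters it
# ran with, so they can be copied back for local iteration.
nb = nbformat.read("runs/model-eu.ipynb", as_version=4)
for cell in nb.cells:
    if "injected-parameters" in cell.metadata.get("tags", []):
        print(cell.source)
```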
So that part of the cycle is still a little bit manual. There could be tooling
to make it less manual. And I think that’s where a
lot of the tooling is going, is smoothing over these edges. It’s a little bit of a
bump for a really big win to be cooperative with
the rest of the ecosystem. Does that help
answer the question? OK. Awesome. Honestly, my voice–
and I’ll be around, so maybe I’ll lose it anyway. But thank you everyone,
and I hope you enjoyed. [APPLAUSE] [MUSIC PLAYING]
