Susan Tan – Let’s read code: the requests library – PyCon 2016

(host)
Well, good afternoon and welcome to our second talk of this session. Please help me welcome Susan Tan
who works at Cisco for her talk, Let’s read the code:
the requests library. Thank you. [applause] (Susan Tan)
Hi. I’m Susan. I’ve lived and worked
in California for the past four years. So I used to work at
a small cloud computing startup called Piston, and then last summer
they got acquired by Cisco, so now technically I work for Cisco
in San Francisco, California. The first time you open
a new code base, let’s say you’re a new engineer
at a new workplace, or you’re a new contributor
for a new open source project, you might run into a code base
with lots of files, like hundreds, thousands of files, and it may be pretty difficult
and daunting at first to look at this really large
unfamiliar code base and understand what’s going on
and how all the big pieces fit together. Navigating a code base
carries this assumption that, well, if you don’t understand
what this code is doing, then, well, it’s probably way over
your head and it’s too difficult. But I want to demystify the process
of reading a new and unfamiliar code base. Especially, I want to look at
the Python requests library as a case study for looking —
for reading it through and we’ll understand a lot of big chunks
of this Python requests library. We spend a lot more time
reading code than writing code. And so I want to share
some of my strategies that have worked for me
on how to read a new code base. And we’ll look at different parts
of the Python requests code base and really understand
how certain methods are implemented
underneath the hood. So the first step
is to prepare your editor. So for any editor of your choice,
you have to set up your editor such that you can jump into
any method or class definition when you’re highlighting it
or when you’re clicking on it. You also need to set up your editor
so you can search files by keyword. And optionally you may also
want to see the call hierarchy of any given class or method. The next step,
you should clone the repository, in this case
the kennethreitz/requests library. And then you open the repository.
In this case I’m using Sublime Text. And now you have
the code repository in front of you. Then you can set up
your local dev environment to get into the mindset
of what it’s like to edit the code. So in this case you would create
a new Python virtual environment, pip install the dependencies,
run all of the unit tests and function tests
and confirm that they all pass. This is directly the screenshot
from the official requests documentation and how to set up
the local working environment for making contributions
to requests library. More often than not, the reason why
you’re reading code in the first place is that you want to get familiar
with the codebase enough such that you can make changes
and edits to the code base. So setting up the development
environment is a big step in that. So if you follow
these instructions in the — for setting up the requests library,
it would take about five minutes to do, which is — I think
it’s pretty extraordinary. The requests library happens to be
on permanent feature freeze, so a lot of the features
that it currently has now won’t change much
later in the future. And you can’t really add
new features to the requests library. All right, so you have
your editor set up. You have your local
dev environment set up. You can now start
reading the code base. And I like to think of reading code base
as kind of like playing a game of Pac-Man. Sometimes the dots lead to
a pretty clear logical direction. Sometimes you hit a wall
and you have to retrace your steps and go a different direction
and try again until you figure out
what this code is doing. When you Google search
“Python requests library,” one of the first results you’ll see is
the official requests documentation page. And on that front page
is this code snippet that does a — so, a request, — a get request is being made
and you have a response. And then you can check
the different attributes of that response,
labeled like letter R. This includes the status code, headers,
encoding, text, JSON response. So the goal for today is
in the next 20 or so minutes we’ll figure out how this code snippet
works underneath the hood. So let’s look at the unit tests
for the requests library, and all of the tests
are located in test_requests.py. There’s over 1,600
lines of code here. And let’s narrow down
the range of things to look at. So let’s do a git grep
or keyword search on requests.get
and see what we get. There’s over 40 instances of the unit tests
that mention requests.get. So let’s just look at one test
and focus our efforts on that. So I’ve picked this test,
test_DIGEST_HTTP_200_OK_GET. So I’m going to go over
each section of this. So in the first section here,
there’s some test setup that’s happening. The auth object gets created, an HTTPDigestAuth instance
is being instantiated with the strings “user” and “pass”. And then there’s a URL object or variable that gets created
with the httpbin method. In the next section here, you can see that
there’s two requests that are being made, one with the auth object
and one without the auth object. When you have the auth object,
you can access the page. It’s a 200 response. Without the auth object,
you can’t access the page and instead you get
a 401 Unauthorized error. So this says that you need to have
some sort of correct auth — authentication credentials
in order to access this URL. Then in this third test here,
you can see that there is a session instance that gets created, and then the auth attribute
of this session gets populated. The get method gets invoked
on the session, and you can access the page. So there’s a number of questions,
and when I’m first reading this code for the very first time,
some of the questions I have are, like, what is this sessions data
that’s happening? What is this HTTPDigestAuth? And what is this httpbin? So I’m going to answer and figure out
the answers to all these questions. So let’s look at the first thing.
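For the HTTPDigestAuth question, it helps to know what digest authentication computes: instead of sending the password, the client sends a hash built from the password and values supplied by the server. The sketch below follows the MD5 scheme with `qop=auth` from RFC 2617 and uses only the standard library. The function name `digest_response` is hypothetical, and this is not the code inside the requests library, which also has to parse the server's challenge and keep track of nonces. The sample values are the worked example from RFC 2617.

```python
import hashlib


def md5_hex(text):
    """Return the hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, password, realm, method, uri, nonce, nc, cnonce, qop="auth"):
    """Compute the 'response' field of a digest Authorization header (MD5, qop=auth)."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # identifies the user in this realm
    ha2 = md5_hex(f"{method}:{uri}")                 # identifies what is being requested
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")


response = digest_response(
    username="Mufasa",
    password="Circle Of Life",
    realm="testrealm@host.com",
    method="GET",
    uri="/dir/index.html",
    nonce="dcd98b7102dd2f0e8b11d0f600bfb0c093",
    nc="00000001",
    cnonce="0a4f113b",
)
print(response)
```

The password itself never appears in the result, which is why a request without the auth object cannot produce a valid header and is rejected.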
So there’s HTTPDigestAuth. It takes in a user
and a pass as strings. Let’s do the function jump
and look at the class definition of HTTPDigestAuth. There’s a number of methods here
and I’m going to ignore most of them
except for the init method. And the init method takes in
the username and password, which are the required arguments to create an instance
of HTTPDigestAuth. Let’s look more in detail
about what is this digest auth. So the digest auth is a very popular
form of HTTP authentication, and it’s supported by
the requests library out of the box. And this is directly
from the documentation. You can see in the screenshot
code snippet here that the get request is being made and that line looks really similar
to what you just saw in a unit test, where an auth object gets passed
and the auth instance gets created. All right, so we have
a pretty good understanding about what HTTPDigestAuth is doing,
and we have the auth object all set up. Next let’s look at
the httpbin method and see how that’s defined. This is the definition
of the httpbin method located in conftest.py. And if you’re familiar with pytest,
all of the test fixtures are placed in conftest.py. So just from reading this, I can see that
in the bottom of this slide, an object is being passed
into httpbin, and then it also returns a function. So just reading from this,
I’m a little bit confused about what it’s doing. So what are my next two steps? I can try to look up the keyword
“httpbin” in the documentation. I can set some break points
and figure out what’s going on with that method. So let’s do that. I type in the word “httpbin”
inside of the search bar of the requests documentation website. I get some search results,
and eventually I go and find this GitHub page. It’s called httpbin and it’s a standalone API service
that’s publicly hosted. This particular service
is written and run by Kenneth Reitz, who’s also one of — who’s also
the creator of the requests library. And this httpbin.org
supports a number of endpoints including get, post, put, delete. So I’m going to try out
some of the endpoints. This is the result
of the cookies endpoint. I get out a JSON response. Another is the result
of the get endpoint. This is another JSON response. I can also do a post request on the post endpoint of this httpbin with a set of dictionary keys
in the data dictionary. And then in response I can see that
this is the same dictionary that I got. OK, so I’m starting to understand
what httpbin is doing. And when you look at test_requests.py, httpbin is pretty much
almost everywhere in unit tests, because every time a request
is made in a test, you have to be able to compare it
to some known correct output. And that known correct output
is provided by httpbin. So this is a pretty big step
in understanding how these unit tests are being constructed
in the requests repository. All right, so the next thing
I want to understand is, there’s a variable called URL
in the previous unit test that we saw. We can drop a debugger inside. You can use the debugger
of your choice. Some people prefer ipdb or pdb. I personally prefer pdb++. One of my co-workers recommended that
so I pretty much just stuck with it. All right, so put in a debugger
right after the first request is done. And from there I can inspect the URL,
the auth object, the response. This is the resulting URL string,
and it’s digest-auth/auth/user/pass. So what is this URL? I go to my browser
and I type in this URL as it says in the —
this end of results. And then there’s a pop-up that says
I need to enter my username and password. And I know what the username
and password is because that’s given already in the unit test.
It’s “user” and “pass”. So I enter that and I’m logged in. So I get this JSON response. This is the correct response,
this is 200. All right, so that’s pretty good.
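The URL string inspected in the debugger comes from the `httpbin` fixture, which takes an object and returns a function. Here is a minimal sketch of that pattern, assuming the returned function simply joins path segments onto a base URL. `make_url_builder` is a hypothetical name, and `https://httpbin.org` is assumed as the base here; the fixture in conftest.py gets its base URL from the test setup instead.

```python
from urllib.parse import urljoin


def make_url_builder(base_url):
    """Return a function that builds URLs under base_url from path segments."""
    base = base_url.rstrip("/") + "/"  # make sure there is exactly one trailing slash

    def build(*segments):
        # Join the segments into a relative path and resolve it against the base.
        return urljoin(base, "/".join(segments))

    return build


httpbin = make_url_builder("https://httpbin.org")
url = httpbin("digest-auth", "auth", "user", "pass")
print(url)
```

Returning a function keeps each test short: the test names only the endpoint it cares about, and the base URL lives in one place.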
We’re getting closer to understanding what this first line is all about,
requests.get. We looked at a unit test,
so now we can look further on what happens next. So, requests —
a request is being made to this URL. And I’m going to do
yet another function definition dive and look at the get method
and see how that’s defined. So, API.py is the user interface
for the entire requests library. It’s also where
not only get methods are defined but it’s also where the post, delete, and all the HTTP methods
are being defined. And that’s how the developer
interacts with the requests library. In this case, what’s happening here
in the last two lines of code is that there is this redirect value
that get sent — gets set,
and then a request is returned. So I’m just going to go deeper
into this code and really look at how this
request method is being implemented. All right. This is a request method. There’s a lot of docstrings here
and some comments. And if you remove
all the docstrings and the comments, this is really just
two lines of code. A session context manager
is being created. And then there’s a request method
on that session that gets invoked. So there’s two questions here,
and I keep asking myself more and more questions
and I keep trying to figure out, like, how does requests.get
get implemented? So my first question is, what is sessions?
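The two layers described so far, a `get` helper that sets a redirect default and a `request` function that opens a session and calls its `request` method, can be sketched in runnable form. `FakeSession` is a hypothetical stand-in that only records what it was asked to do. It is not the library's Session class, and nothing is sent over the network.

```python
class FakeSession:
    """Hypothetical stand-in for a session: records the call instead of sending it."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # A real session would release its connections here.
        return False

    def request(self, method, url, **kwargs):
        return {"method": method, "url": url, **kwargs}


def request(method, url, **kwargs):
    # Open a session for the duration of one call, then delegate to it.
    with FakeSession() as session:
        return session.request(method=method, url=url, **kwargs)


def get(url, params=None, **kwargs):
    # Follow redirects unless the caller says otherwise.
    kwargs.setdefault("allow_redirects", True)
    return request("get", url, params=params, **kwargs)


result = get("https://httpbin.org/get", params={"q": "1"})
print(result)
```

Seen this way, the top-level functions are thin wrappers, and the questions worth asking move to the session object.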
What is a sessions object? It — you know, there’s mention of it in a unit test, now there’s
a mention of it here. And I’m curious about
what is this request method that gets invoked
inside of the session object. All right, so let’s look at sessions. This is the class definition
of sessions in sessions.py. You can’t read it. That’s fine. I can also look at the extensive
and abundant documentation for the official requests library, and it talks about
sessions objects. That’s really cool,
really cool and convenient. So directly from the documentation,
what is a session? So a session is an object that
persists parameters across requests. It makes use of urllib3’s
connection pooling. It has all methods
that the request API has. And every time that you make
a request of any sort, a new session instance
gets created. The session also provides
default data to the request object. So all of these are pretty much
directly almost word for word from the documentation. All right, so I think we have
an understanding of what the session object is doing
and why it’s needed. So let’s figure out further,
what is this request method inside of the session class? So here we go again,
jumping deeper into the library. This is the request method
inside of the session class. It looks daunting at first,
but when reading it, I can break this up
into four important sections. So first there’s a request
that’s created, there’s a prep request object
that also gets created, the request gets sent, and then
the response gets returned. So there’s four steps to this. And let’s go over briefly
what these four steps are. First, a request object is created. Let’s see how that’s done. This is the class definition
for request. And you probably can’t see this,
but I’ll let you know that in the init method, there are
a number of required arguments including method, URL, headers,
files, data, JSON, etc. So all of these are —
most of them are parameters that you’ll need
to create a request. All right, so, since that
was pretty straightforward, the request gets created
and the second step is, we use that previously created request and then we make another object
called a prepared request object, we call it “prep” in here. So let’s figure out what this does. This is the prepare_request method. A prepared request instance is made, and then prepare gets invoked
on that instance. And prepare takes
a number of arguments including method, URL, files. And there’s a number of arguments
that it’s trying to — that’s here. So there’s also some arguments
from the self object, which is the sessions object. And so prepare is merging
these different request arguments to create a complete request object. All right, so we can go even deeper. This is the definition
for prepared requests. And this is the method
called prepare. And there’s a lot more
layers of abstraction. You can go even deeper to look at
what the prepare method is or what prepare URL is,
prepare headers, prepare cookies,
prepare body, prepare auth. But this can be a total rabbit hole. But I think we pretty much understand
what prepare request is doing. It’s combining a lot of different
attributes to create a full request. So we can just move on
to the next step in this process. So the next step is, there’s the request.
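The merging of session-level values with per-request values can be illustrated with a small sketch. It assumes a simple rule: values given on the request override the session's defaults, and a value of None removes a key. `merge_settings` is a hypothetical helper written for this illustration, not the library's implementation, and the header values are made up.

```python
def merge_settings(request_setting, session_setting):
    """Combine per-request values with session defaults; request values win."""
    if session_setting is None:
        return request_setting
    if request_setting is None:
        return session_setting
    merged = dict(session_setting)   # start from the session's defaults
    merged.update(request_setting)   # per-request values override them
    # Assumed convention: a value of None means "drop this key".
    return {key: value for key, value in merged.items() if value is not None}


session_headers = {"User-Agent": "demo-agent", "Accept": "*/*"}
request_headers = {"Accept": "application/json", "User-Agent": None}

headers = merge_settings(request_headers, session_headers)
print(headers)
```

The same idea applies to the other attributes that get combined, which is why one prepared object can carry everything needed to send the request.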
It gets created. It gets sent. And then we have a response
that gets returned. All right, to sum up the requests, the request gets sent, and it — there’s a send method
that happens that — you know, it takes in a request object and then
the response object gets created. So I’m going to dive deeper and take a look at the send method. And it’s a pretty long function,
but the important parts are in… …are right in the middle
of this function, where it talks about a thing called the adapter,
and the adapter gets created and then there’s a send method
on this adapter. So there’s a lot more questions
that are happening, and I’m diving deeper,
like five, six layers deep into the requests library. All right, so I want to know now,
what is this adapter object, and why do we need that? So this is the answer. So the adapter provides
a way for the requests library to talk to http and https
using the powerful urllib3 library. This is directly from
the requests documentation. There’s even more information
about the HTTP adapter class in the docstrings. So we’re finally getting to the part
where, six layers deep, finally, we’re seeing the underlying machinery
for how the Python requests library is powered, and that’s using urllib3. And at the top of this file,
adapters.py, we see a lot of different
urllib3 packages are being imported. So to go back to
what the send method is doing from previously,
we can go even deeper and look at the send method. And there’s a lot going on here. And the send method makes use of
a lot of different urllib3 functionality to be able to create
a response from it. So I’m curious about
what this send method is doing and what the final output is. So I place a break point
right before the return statement happens
to see what the response is. And I’ve run the same unit test
on the command line so that way I only get
the results for that test. So this is the shell for… …for looking at
the different variables. So the break point stops
on this line in the send method, and I can inspect the URL. And that’s pretty much
what I expected. It’s digest-auth/auth/user/pass. Cool, OK, that’s the URL
that I’ve been looking at this entire time. Great. So the next step is,
I’ll look at the response. And it looks like the response is
in fact a urllib3 type of response. OK, cool, that’s also what I expect. The next step is that I invoke
the build response method, which takes in the request
and the response object. And I get this response back. And that’s cool because I can look at
the attributes of this response, I look at the JSON data, I can look at the status code
and a bunch of other attributes. And so this is finally when you have
the response. It’s finally made. So finally, OK, 30 or so minutes
later we have the response from this requests.get. Cool. [laughter] All right, how did we get
to this point to figure out how requests.get is implemented? We looked at a unit test,
we looked at the get method, we looked at the request method,
then we looked at the sessions, and we looked at the request method
inside of the sessions, which has different — it’s doing
four different things here. And it might be a bit easier
to look at a map of all the different files
and the different methods and class names
that are getting invoked, to kind of keep your mind straight
on what’s going on. So, everybody reads code base — the same code base
a little bit differently. So there’s no one right or correct
way to read the code base. One alternative thing
that I could have done is to entirely skip the unit test
and instead place a break point directly in the adapters.py,
and then I can fire up a Python shell, make a request of any kind,
and then hit this break point. And then in the debugger,
I can take a look at the variables and see how the response
gets constructed. So there’s different ways
for walking through a code base. All right, so we finally figured out
what requests.get is doing, and we looked at the unit test,
we looked at docstrings, we looked at documentation,
we set break points. That’s pretty cool. And we’re lucky
that the requests library has a lot of really well-commented —
commented code and docstrings and a lot of documentation about
different parts of the library. But sometimes, you know,
you might get stuck figuring out what’s going on
for a code base that may not have all these different full comprehensive
unit tests and full documentation. So I think the best idea
if you get pretty stuck, in general, is to talk to the core developers,
or talk to the maintainers. And the first point of contact
would be either if they have a mailing list or an IRC channel,
you can ask them questions there. Alternatively,
you can also do a git blame to see which person
made that line change and get to look at the history
of the code base to figure this out. And if you’re stuck on some code and you
want to figure out what it’s doing, you can use your favorite
Python shell and debugger and take a look at a smaller sample
of the code base and go from there. So, overall, this has been
my perspective and my thought process for when I was first reading
the Python requests library, and, you know, things I discovered
like roadblocks and so forth. I am generally pretty curious about
what other people’s perspectives are in looking at
different types of code bases and trying to see
other people’s perspectives and how did they
figure these things out and how did they glue
all these pieces together? So for people who are, say,
avid users of a project or for a maintainer of a project,
I definitely recommend, say, you go and write
blog posts or give talks about how some feature is implemented
in the code base you’re familiar with. And I think that a lot of people
would be pretty curious to learn more about it. All right. And that’s
pretty much all I have. I’m happy to learn more
about different strategies that have worked for you
when you’re reading a new code base. And this is something
I’d love to talk more about. I’m in the hallways.
I’m also on Twitter. Thank you. [applause] (host)
We have time for some questions. Any questions for Susan? (audience member)
Out of curiosity, when do you go through this exercise for another code base? Like, is it when you’re implementing
a feature and you’re stuck, and then you’ll stop
and you’ll take a couple hours to understand the implementation
of something you’re using, or is this just something you do
as a standalone thing for fun, or is it when you’re
contributing to a new project? (Susan Tan)
I think really all of the above. Like, some — I think for me
when I’m looking at other people’s code, I usually already have
a task assigned to me or there’s a project I’m doing
that required me to look at upstream code like the Python
requests library or other code bases. So I think the answer is yeah,
for all of the above. You can do it for fun too if you want to. (host)
Other questions? (audience member)
I feel like here with requests you kind of zeroed right into
the core function like get and walked us through
a lot of big ones. But if you’re just
going to go into unit tests, there’s an equally likely chance
you’ll pick something mundane and boring and not really central
to the library, initially. So, any tips on how to get
straight to the good, meaningful ones? (Susan Tan)
Hmm, I think that’s — if you can find a unit test
that maybe has — that covers that particular
feature you’re looking at, then that would be the first thing
I would go to, personally. But I think it’s a lot harder
when there’s a lack of unit tests, or it’s not —
there is incomplete code coverage. And then you’ll have to go directly
and figure out what the code pathway is and place debuggers
and see where it would — what’s triggering
which part of the code. (host)
Other questions? Well, please help me thank Susan
for a wonderful talk. (Susan Tan)
Thank you. [applause]
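The talk suggests keeping a map of the files and methods that get invoked, and the last answer mentions finding out which part of the code gets triggered. One way to produce such a map without stepping through a debugger is to record which Python functions run during a call. This sketch is an addition and was not shown in the talk. It uses the standard library's `sys.setprofile` hook, and `record_calls` is a hypothetical helper name. It traces `urllib.parse.urljoin` only because that function ships with Python; the same helper could wrap a call into any pure-Python library.

```python
import os
import sys
from urllib.parse import urljoin


def record_calls(func, *args, **kwargs):
    """Run func and return its result plus the (file, function) pairs it called."""
    calls = []

    def profiler(frame, event, arg):
        # "call" fires each time a Python-level function starts executing.
        if event == "call":
            code = frame.f_code
            calls.append((os.path.basename(code.co_filename), code.co_name))

    previous = sys.getprofile()
    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(previous)  # always restore the earlier hook
    return result, calls


result, calls = record_calls(urljoin, "https://httpbin.org/", "get")
print(result)
for filename, name in calls:
    print(f"{filename}: {name}")
```

The printed list reads top to bottom in call order, so it can serve as a first draft of the map before any breakpoints are placed.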

  1. Another nice debugger is pudb, which has a graphical interface in the terminal (a la turbo pascal). I find it helpful for exploring new codebases.
