\chapter{Creating a Server}\label{s:server}
Now that we have a data manager (\chapref{s:dataman}),
the next step is to create a server to share our data with the world,
which we will build using a library called \hreffoot{https://expressjs.com/}{Express}.\index{Express}
Before we start writing code,
though,
we need to understand how computers talk to each other.
\section{HTTP}\label{s:server-http}
Almost everything on the web communicates via the HyperText Transfer Protocol (\gref{g:http}{HTTP}).
The core of HTTP is a \grefdex{g:http-request}{request}{HTTP!request}/\grefdex{g:http-response}{response}{HTTP!response} cycle
that specifies the kinds of requests applications can make of servers,
how they exchange data,
and so on.
\figref{f:server-cycle} shows this cycle in action for a page that includes one image.
\figpdf{figures/server-cycle.pdf}{HTTP Request/Response Cycle}{f:server-cycle}
\begin{enumerate}
\item
The client (a browser or some other program) makes a connection to a server.
\item
It then sends a blob of text specifying what it's asking for.
\item
The server replies with a blob of text and the HTML.
\item
The connection is closed.
\item
The client parses the text and realizes it needs an image.
\item
It sends another blob of text to the server asking for that image.
\item
The server replies with a blob of text and the contents of the image file.
\item
The connection is closed.
\end{enumerate}
This cycle might be repeated many times to display a single web page,
since a separate request has to be made for every image,
every CSS or JavaScript file,
and so on.
In practice,
a lot of behind-the-scenes engineering is done to keep connections alive as long as they're needed,
and to \gref{g:cache}{cache} items that are likely to be re-used.
An HTTP request is just a block of text with two important parts:
\begin{itemize}
\item
The \gref{g:http-method}{method} is almost always either \texttt{GET} (to get data) or \texttt{POST} (to submit data).
\item
The \gref{g:url}{URL} is typically a path to a file,
but as we'll see below,
it's completely up to the server to interpret it.
\end{itemize}
The request can also contain \grefdex{g:http-header}{headers}{HTTP!header},\index{header (HTTP)}
which are key-value pairs with more information about what the client wants.
Some examples include:
\begin{itemize}
\item
\texttt{"Accept:\ text/html"} to specify that the client wants HTML
\item
\texttt{"Accept-Language:\ fr,\ en"} to specify that the client prefers French, but will accept English
\item
\texttt{"If-Modified-Since:\ 16-May-2018"} to tell the server that the client only wants the data if it has changed since that date
\end{itemize}
\noindent
Unlike a dictionary, a key may appear any number of times,
which allows a request to do things like specify that it's willing to accept several types of content.
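To make the point about repeated keys concrete,
here is a minimal sketch of how header lines could be collected into a structure that keeps every value for a key.
The function name \texttt{parseHeaders} is hypothetical,
and real HTTP libraries handle many more details than this does:
\begin{minted}{js}
// Collect header lines into a Map from lower-cased name to a list of values.
// `parseHeaders` is a hypothetical helper for illustration only.
const parseHeaders = (text) => {
  const headers = new Map()
  for (const line of text.split('\n')) {
    const colon = line.indexOf(':')
    if (colon === -1) {
      continue // skip blank or malformed lines
    }
    // Header names are case-insensitive, so normalize them.
    const name = line.slice(0, colon).trim().toLowerCase()
    const value = line.slice(colon + 1).trim()
    if (!headers.has(name)) {
      headers.set(name, [])
    }
    headers.get(name).push(value)
  }
  return headers
}

const example = [
  'Accept: text/html',
  'Accept: application/json',
  'Accept-Language: fr, en'
].join('\n')
const parsed = parseHeaders(example)
console.log(parsed.get('accept')) // both values are kept
\end{minted}
\noindent
Storing a list per key is one simple way to avoid losing values when a header is repeated.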
The \grefdex{g:body-http}{body}{body!of HTTP request} of the request is any extra data associated with it,
such as files that are being uploaded.
If a body is present,
the request normally contains the \texttt{Content-Length} header
so that the server knows how much data to read
(\figref{f:server-request}).
\figpdf{figures/server-request.pdf}{Structure of an HTTP Request}{f:server-request}
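As a rough illustration of this structure,
the sketch below assembles the text of a request by hand.
The helper \texttt{formatRequest}, the path \texttt{/upload}, and the body are all made up for this example;
in practice a library builds this text for us:
\begin{minted}{js}
// Build the text of a simple HTTP/1.1 request.
// `formatRequest` is a hypothetical helper, not part of Node or Express.
const formatRequest = (method, path, headers, body = '') => {
  // The first line holds the method, the path, and the protocol version.
  const lines = [`${method} ${path} HTTP/1.1`]
  for (const [name, value] of Object.entries(headers)) {
    lines.push(`${name}: ${value}`)
  }
  if (body.length > 0) {
    // Content-Length counts bytes, not characters.
    lines.push(`Content-Length: ${Buffer.byteLength(body, 'utf-8')}`)
  }
  // A blank line separates the headers from the body.
  return lines.join('\r\n') + '\r\n\r\n' + body
}

const request = formatRequest(
  'POST', '/upload', { Host: 'example.org' }, 'name=C\u00e9r\u00e8s')
console.log(request)
\end{minted}
\noindent
The body here contains two accented characters that each take two bytes in UTF-8,
which is why the length is computed with \texttt{Buffer.byteLength} rather than the string's \texttt{length}.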
The headers and body in an HTTP response have the same form, and mean the same thing.
Crucially,
the response also includes a \grefdex{g:http-status-code}{status code}{HTTP!status code}\index{status code (HTTP)}
to indicate what happened:
200 for OK, 404 for ``page not found'', and so on.
Some of the more common codes are shown in \tblref{t:server-codes}.
\begin{table}
\begin{tabular}{llp{0.5\textwidth}}
\textbf{Code} & \textbf{Name} & \textbf{Meaning} \\
100 & Continue & The client should continue sending data \\
200 & OK & The request has succeeded \\
204 & No Content & The server completed the request but there is no data \\
301 & Moved Permanently & The resource has moved to a new permanent location \\
307 & Temporary Redirect & The resource is temporarily at a different location \\
400 & Bad Request & The request is badly formatted \\
401 & Unauthorized & The request requires authentication \\
404 & Not Found & The requested resource could not be found \\
408 & Request Timeout & The server gave up waiting for the client \\
418 & I'm a Teapot & An April Fool's joke \\
500 & Internal Server Error & A server error occurred while handling the request \\
601 & Connection Timed Out & A non-standard code: the server did not respond before the connection timed out
\end{tabular}
\caplbl{HTTP Status Codes}{t:server-codes}
\end{table}
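We do not have to memorize these names:
Node's built-in \texttt{http} module includes a table of the registered ones called \texttt{http.STATUS\_CODES}.
The short sketch below looks a few of them up;
the helper name \texttt{describe} is our own invention:
\begin{minted}{js}
const http = require('http')

// Look up the standard name for a status code, if Node knows one.
// `describe` is a hypothetical helper for illustration.
const describe = (code) => {
  const name = http.STATUS_CODES[code]
  return name === undefined
    ? `${code} (not a registered code)`
    : `${code} ${name}`
}

for (const code of [200, 404, 500]) {
  console.log(describe(code))
}
\end{minted}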
One final thing we need to understand is the structure and interpretation of URLs.
This one:
\begin{minted}{text}
http://example.org:1234/some/path?value=deferred&limit=200
\end{minted}
\noindent
has five parts:
\begin{itemize}
\item
The protocol \texttt{http}, which specifies what rules are going to be used to exchange data.
\item
The \gref{g:hostname}{hostname} \texttt{example.org}, which tells the client where to find the server.
If we are running a server on our own computer for testing,
we can use the name \texttt{localhost} to connect to it.
(Computers rely on a service called \gref{g:dns}{DNS}
to find the machines associated with human-readable hostnames,
but its operation is out of scope for this tutorial.)
\item
The \gref{g:port}{port} \texttt{1234}, which tells the client where to call the service it wants.
(If a host is like an office building, a port is like a phone number in that building.
The fact that we think of phone numbers as having physical locations
says something about our age{\ldots})
\item
The path \texttt{/some/path} tells the server what the client wants.
\item
The \grefdex{g:query-parameter}{query parameters}{query parameter} \texttt{value=deferred} and \texttt{limit=200}.
These come after a question mark and are separated by ampersands,
and are used to provide extra information.
\end{itemize}
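We can check this breakdown with the \texttt{URL} class that is built into Node and modern browsers.
This is only a quick sketch to show where each part ends up:
\begin{minted}{js}
// The URL class is available globally in Node and in modern browsers.
const url = new URL('http://example.org:1234/some/path?value=deferred&limit=200')

console.log(url.protocol) // includes the trailing colon
console.log(url.hostname)
console.log(url.port) // a string, not a number
console.log(url.pathname)
for (const [name, value] of url.searchParams) {
  console.log(`${name} = ${value}`)
}
\end{minted}
\noindent
Note that every part is reported as a string,
so a program that wants \texttt{limit} as a number has to convert it itself.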
It used to be common for paths to identify actual files on the server,
but the server can interpret the path however it wants.
In particular,
when we are writing a data service,
the segments of the path can identify what data we are asking for.
Alternatively,
it's common to think of the path as identifying a function on the server that we want to call,
and to think of the query parameters as the arguments to that function.
We'll return to these ideas after we've seen how a simple server works.
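The idea of treating a path as a function name and the query parameters as its arguments can be sketched without any server at all.
The paths \texttt{/repeat} and \texttt{/shout} and the function \texttt{dispatch} below are invented for this example:
\begin{minted}{js}
// Hypothetical table mapping paths to functions.
const handlers = {
  '/repeat': ({ word, count }) => word.repeat(Number(count)),
  '/shout': ({ word }) => word.toUpperCase()
}

// Treat the path as a function name and the query parameters as arguments.
const dispatch = (address) => {
  const url = new URL(address)
  const handler = handlers[url.pathname]
  if (handler === undefined) {
    return null // no function is registered for this path
  }
  // Query parameters always arrive as strings.
  const args = Object.fromEntries(url.searchParams)
  return handler(args)
}

console.log(dispatch('http://localhost:3418/repeat?word=ab&count=3'))
console.log(dispatch('http://localhost:3418/shout?word=hello'))
\end{minted}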
\section{Hello, Express}\label{s:server-express}
A Node-based library called Express handles most of the details of HTTP for us.
When we build a server using Express,
we provide callback functions that take three parameters:
\begin{itemize}
\item
the original request,
\item
the response we're building up, and
\item
what to do next (which we'll ignore for now).
\end{itemize}
We also provide a pattern with each function that specifies what URLs it is to match.
Here is a simple example:
\begin{minted}{js}
const express = require('express')

const PORT = 3418

// Main server object.
const app = express()

// Return a static page.
app.get('/', (req, res, next) => {
  res.status(200).send('<html><body><h1>Asteroids</h1></body></html>')
})

app.listen(PORT, () => { console.log('listening...') })
\end{minted}
The first line of code loads the Express library.
The next defines the port we will listen on,
and then the third creates the object that will do most of the work.
Further down,
the call to \texttt{app.get} tells that object to handle any \texttt{GET} request for \texttt{/}
by sending a reply whose status is 200 (OK)
and whose body is an HTML page containing only an \texttt{h1} heading.
There is no actual HTML file on disk,
and in fact no way for the browser to know whether there is one or not:
the server can send whatever it wants in response to whatever requests it wants to handle.
Note that \texttt{app.get} doesn't actually get anything right away.
Instead,
it registers a callback with Express that says,
``When you see this URL, call this function to handle it.''
As we'll see below,
we can register as many path/callback pairs as we want to handle different things.
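The difference between registering a callback and running it can be seen in a toy model.
The class \texttt{TinyApp} below is our own simplification and is \emph{not} how Express is implemented;
it only mimics the idea of storing path/callback pairs and calling them later:
\begin{minted}{js}
// A toy model of route registration; this is not how Express is implemented.
class TinyApp {
  constructor () {
    this.routes = []
  }

  // Remember the callback; nothing is called yet.
  get (path, callback) {
    this.routes.push({ path, callback })
  }

  // Later, when a request arrives, find and call the matching callback.
  handle (req, res) {
    const route = this.routes.find((r) => r.path === req.url)
    if (route === undefined) {
      return false
    }
    route.callback(req, res)
    return true
  }
}

const tiny = new TinyApp()
const log = []
tiny.get('/', (req, res) => { log.push('home page requested') })
console.log(log.length) // still 0: registering did not run the callback
tiny.handle({ url: '/' }, {})
console.log(log)
\end{minted}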
Finally,
the last line of this script tells our application to listen on the specified port,
while the callback tells it to print a message as it starts running.
When we run this, we see:
\begin{minted}{shell}
$ node static-page.js
\end{minted}
\begin{minted}{text}
listening...
\end{minted}
Our little server is now waiting for something to ask it for something.
If we go to our browser and request \texttt{http://localhost:3418/},
we get a page with a large title \texttt{Asteroids} on it.
Our server has worked,
and we can now stop it by typing Ctrl-C in the shell.
\section{Handling Multiple Paths}\label{s:server-paths}
Let's extend our server to do different things when given different paths,
and to handle the case where the request path is not known:
\begin{minted}{js}
const express = require('express')

const PORT = 3418

// Main server object.
const app = express()

// Root page.
app.get('/', (req, res, next) => {
  res.status(200).send('<html><body><h1>Home</h1></body></html>')
})

// Alternative page.
app.get('/asteroids', (req, res, next) => {
  res.status(200).send('<html><body><h1>Asteroids</h1></body></html>')
})

// Nothing else worked.
app.use((req, res, next) => {
  res
    .status(404)
    .send(`<html><body><p>ERROR: ${req.url} not found</p></body></html>`)
})

app.listen(PORT, () => { console.log('listening...') })
\end{minted}
The first few lines are the same as before.
We then specify handlers for the paths \texttt{/} and \texttt{/asteroids},
each of which sends a different chunk of HTML.
The call to \texttt{app.use} specifies a default handler:
if none of the \texttt{app.get} handlers above it took care of the request,
this callback function will send a ``page not found'' code
\emph{and} a page containing an error message.
Some sites skip the first part and only return error messages in pages for people to read,
but this is sinful:
making the code explicit makes it a lot easier to write programs to scrape data.
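To see why an explicit status code helps,
consider a program that pulls the heading out of a page.
In the sketch below,
\texttt{response} is a plain object standing in for whatever an HTTP client returns,
and \texttt{extractHeading} is a hypothetical helper
(a regular expression is a crude way to read HTML, but it keeps the example short):
\begin{minted}{js}
// `response` is a plain object with `status` and `body` fields,
// standing in for whatever an HTTP client library would return.
const extractHeading = (response) => {
  if (response.status !== 200) {
    return null // the status alone tells us something went wrong
  }
  const match = response.body.match(/<h1>(.*?)<\/h1>/)
  return match === null ? null : match[1]
}

const good = {
  status: 200,
  body: '<html><body><h1>Asteroids</h1></body></html>'
}
const bad = {
  status: 404,
  body: '<html><body><p>ERROR: /test not found</p></body></html>'
}
console.log(extractHeading(good))
console.log(extractHeading(bad))
\end{minted}
\noindent
If the server had sent the error page with a status of 200,
the program would have had to inspect the text of the page to discover that the request had failed.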
As before, we can run our server from the command line
and then go to various URLs to test it.
\texttt{http://localhost:3418/} produces a page with the title ``Home'',
\texttt{http://localhost:3418/asteroids} produces one with the title ``Asteroids'',
and \texttt{http://localhost:3418/test} produces an error page.
\section{Serving Files from Disk}\label{s:server-files}
It's common to generate HTML in memory when building data services,
but it's also common for the server to return files.
To do this,
we will provide our server with the path to the directory it's allowed to read pages from,
and then run it with \texttt{node\ server-name.js\ path/to/directory}.
We have to tell the server whence it's allowed to read files
because we definitely do \emph{not} want it to be able to send everything on our computer to whoever asks for it.
(For example,
a request for the \texttt{/etc/passwd} password file on a Linux server should probably be refused.)
Here's our updated server:
\begin{minted}{js}
const express = require('express')
const path = require('path')
const fs = require('fs')

const PORT = 3418
const root = process.argv[2]

// Main server object.
const app = express()

// Handle all requests.
app.use((req, res, next) => {
  const actual = path.join(root, req.url)
  const data = fs.readFileSync(actual, 'utf-8')
  res.status(200).send(data)
})

app.listen(PORT, () => { console.log('listening...') })
\end{minted}
The steps in handling a request are:
\begin{enumerate}
\item
The URL requested by the client is given to us in \texttt{req.url}.
\item
We use \texttt{path.join} to combine that with the path to the root directory,
which we got from a command-line argument when the server was run.
\item
We try to read that file using \texttt{fs.readFileSync},
which blocks the server until the file is read.
We will see later how to do this I/O asynchronously
so that our server is more responsive.
\item
Once the file has been read, we return it with a status code of 200.
\end{enumerate}
If a sub-directory called \texttt{web-dir} holds a file called \texttt{title.html},
and we run the server as:
\begin{minted}{shell}
$ node serve-pages.js ./web-dir
\end{minted}
\noindent
we can then ask for \texttt{http://localhost:3418/title.html}
and get the content of \texttt{web-dir/title.html}.
Notice that the directory \texttt{./web-dir} doesn't appear in the URL:
our server interprets all paths as if the directory we've given it
is the root of the filesystem.
\figpdf{figures/server-mapping.pdf}{Mapping URLs to Files}{f:server-mapping}
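Joining the root directory and the requested path does not by itself guarantee that the result stays inside the root:
depending on the client,
a request may contain \texttt{..} segments that climb out of it.
One possible guard is sketched below.
The helper \texttt{resolveInside} is hypothetical,
and it deliberately ignores details such as percent-encoded characters:
\begin{minted}{js}
const path = require('path')

// Return the absolute path for a request, or null if it would escape `root`.
// `resolveInside` is a hypothetical helper, not part of Node or Express.
const resolveInside = (root, requestUrl) => {
  const rootDir = path.resolve(root)
  const urlPath = requestUrl.split('?')[0] // drop any query parameters
  // Prefix with '.' so the request is treated as relative to the root.
  const actual = path.resolve(rootDir, '.' + urlPath)
  const inside = actual === rootDir ||
    actual.startsWith(rootDir + path.sep)
  return inside ? actual : null
}

console.log(resolveInside('./web-dir', '/title.html'))
console.log(resolveInside('./web-dir', '/../../etc/passwd')) // null
\end{minted}
\noindent
A server using a check like this could refuse any request for which the result is \texttt{null}
instead of trying to read the file.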
If we ask for a nonexistent page like \texttt{http://localhost:3418/missing.html}
we get this:
\begin{minted}{text}
Error: ENOENT: no such file or directory, open 'web-dir/missing.html'
    at Object.openSync (fs.js:434:3)
    at Object.readFileSync (fs.js:339:35)
    ... etc. ...
\end{minted}
We will see in the exercises how to add proper error handling to our server.
\begin{aside}{Favorites and Icons}
If we use a browser to request a page such as \texttt{title.html},
the browser may actually make two requests:
one for the page,
and one for a file called \texttt{favicon.ico}.
Browsers do this automatically,
then display that file in tabs, bookmark lists, and so on.
Despite its \texttt{.ico} suffix,
the file is often a small PNG-formatted image,
and by default browsers look for it in the root directory of the website.
\end{aside}
\section{Content Types}\label{s:server-content-types}
So far we have only served HTML,
but the server can send any type of data,
including images and other binary files.
For example,
let's serve some JSON data:
\begin{minted}{js}
// ...as before...
app.use((req, res, next) => {
  const actual = path.join(root, req.url)

  if (actual.endsWith('.json')) {
    const data = fs.readFileSync(actual, 'utf-8')
    const json = JSON.parse(data)
    res.setHeader('Content-Type', 'application/json')
    res.status(200).send(json)
  }

  else {
    const data = fs.readFileSync(actual, 'utf-8')
    res.status(200).send(data)
  }
})
\end{minted}
What's different here is that when the requested path ends with \texttt{.json}
we explicitly set the \texttt{Content-Type} header to \texttt{application/json}
to tell the client how to interpret the bytes we're sending back.
If we run this server with \texttt{web-dir} as the directory to serve from
and ask for \texttt{http://localhost:3418/data.json},
some browsers (such as Firefox) will provide a folding display of the data
rather than displaying the raw text.
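As the number of file types grows,
a chain of \texttt{if}/\texttt{else} tests becomes awkward.
One alternative is a lookup table keyed by filename suffix,
as in the sketch below.
The names \texttt{CONTENT\_TYPES} and \texttt{contentTypeFor} are our own,
and the table is far from complete:
\begin{minted}{js}
const path = require('path')

// Hypothetical lookup table; a real server would need many more entries.
const CONTENT_TYPES = {
  '.html': 'text/html',
  '.json': 'application/json',
  '.txt': 'text/plain'
}

// Choose a content type from the filename's suffix,
// falling back to the generic type for arbitrary binary data.
const contentTypeFor = (filename) => {
  const suffix = path.extname(filename).toLowerCase()
  return CONTENT_TYPES[suffix] || 'application/octet-stream'
}

console.log(contentTypeFor('web-dir/data.json'))
console.log(contentTypeFor('web-dir/title.html'))
console.log(contentTypeFor('web-dir/unknown.xyz'))
\end{minted}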
\begin{aside}{So What's an API?}
A library's \gref{g:api}{Application Programming Interface} (API) is simply
the set of functions other programs are allowed to call.
This is usually a subset of all of the functions defined in the library,
since many may be helpers intended for internal use only.
Similarly,
a server's API is the set of requests it knows how to respond to.
For example,
NASA's near-approach asteroid API (\secref{s:interactive-fetching})
can handle requests that include an authentication key and a starting date,
while the server we have built in this chapter
can respond to requests for HTML and JSON files.
We will look a little more closely at API design in \secref{s:capstone-server}.
\end{aside}
\section{Exercises}\label{s:server-exercises}
\exercise{Report Missing Files}
Modify the version of the server that returns files from disk
to report a 404 error if a file cannot be found.
What should it return if the file exists but cannot be read
(e.g., if the server does not have permission to read it)?
\exercise{Serving Images}
Modify the version of the server that returns files from disk
so that if the file it is asked for has a name ending in \texttt{.png} or \texttt{.jpg},
it is returned with the right \texttt{Content-Type} header.
\exercise{Delayed Replies}
Our file server uses \texttt{fs.readFileSync} to read files,
which means that it stops each time a file is requested
rather than handling other queries while waiting for the file to be read.
Modify the callback given to \texttt{app.use} so that it uses \texttt{fs.readFile} with a callback instead.
\exercise{Using Query Parameters}
URLs can contain query parameters in the form:
\begin{minted}{text}
http://site.edu?first=123&second=beta
\end{minted}
\noindent
Read the online documentation for \hreffoot{https://expressjs.com/}{Express} to find out
how to access them in a server,
and then write a server to do simple arithmetic:
the URL \texttt{http://localhost:3654/add?left=1\&right=2} should return \texttt{3},
while the URL \texttt{http://localhost:3654/subtract?left=1\&right=2} should return \texttt{-1}.
\section*{Key Points}
\input{keypoints/server}