book/ssh-to-remote-server.Rmd at master · tfelice25/book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
# Using Remote Server {#remote-server}

Sooner-or-later you are in a situation where you have to work on a
**distant networked computer**.  There are many reasons for this,
either your laptop is to weak for certain tasks, or certain
data is not allowed to be taken out from where it is, or you are
expected to use the same computer as your teammates.  Or maybe you want to set up a website and your laptop, obviously, travels around with you instead of staying in one place with reliable internet connection all the time.  The server you use may
be a standalone box located in a rack in your employer's server room,
or it may be a virtual machine in a cloud like Amazon EC2.  You may
also want to set up your own server, or your own virtual machine.


## Server Setup

There are many ways one can set up a distant machine.  It may be
Windows or linux (or any of the other unixes).  It may or may not have
graphical user interface (GUI) installed or otherwise accessible (many
unix programs can display nice windows on your laptop while still
running on the server).  It may or may not have RStudio
available over web browser.  Here we discuss a barebone option
with no access to GUI and no web access to RStudio.  We assume this server is already set up for you and do not discuss installation here.

This is a fairly common setup, for instance when dealing with
sensitive data, in organizations where computer skills and
sysadmin's time is limited, or when you rent your own cheap but limited
server (graphical user interface takes a lot of memory).


## Connecting to the Remote Server

Given the server is already running, your first task is to connect to it.  Here it means
that you will enter commands on your laptop, but those command are
actually run on the server.

The most common way to connect to remote server is via _ssh_.  ssh
stands for "secure shell" and means that all
communication between you and the remote computer is encrypted.  You connect to the server
as
```sh
ssh myserver.somewhere.com
```
_ssh_ is nowadays pretty much the industry standard for such connections, it comes pre-installed on macs and it is included with _gitbash_ too.

When you ssh to the remote server, it asks for your password and opens remote shell
connection.  If this is your first time to connect from this particular laptop, you may also be asked to accept it's fingerprint.  This is an additional security measure to ensure that you are actually talking to the computer you think you are talking to.

The remote machine  will offer you a similar bash shell environment as you are using
on your computer but most likely you see a different prompt, one that
contains the server's name.  You may also see some login
messages.   Now all the commands you are issuing are
running on the remote machine.  So `pwd` shows your working
directory on the server, which in general is not the same as on the
local machine, and `ls` shows the files on the server, not on your
laptop.  Now you can use `mkdir` to create the project folder on the
server.
<p class="alert">
Note: when entering your password, it usually does not
print anything in response, not even asterisks.  It feels as if your
keyboard is not working.  But it is working, and when you finish and press enter, you will be logged in.
</p>

By default, ssh attempts to login with your local username.  If your
username on the server differs from that on your laptop, you want to add it to the ssh command:
```sh
ssh username@myserver.somewhere.com
```

![Local and remote shell window](img/ssh-to-remote-server/ssh-windows.png)

<p class="caption">
The screenshot above shows two command line windows, the upper one connecting remotely on _info201_, and the lower one running locally at a computer called _is-otoometd5060_.  In the upper one, you can see the login command `ssh otoomet@info201.ischool.uw.edu` and various start-up messages.  The `pwd` command shows the current working directory being _/home/otoomet_, and `ls` shows there are for objects there.  Below, we are on the local computer _is-otoometd5060_.  Current working directory has the same name, but on the local computer it contains rather more entries.
</p>

Finally, when done, you want to get out.  The polite way to close the
connection is with
command
```sh
exit
```
that waits until all open connections are safely closed.  But usually you
can as well just close the terminal.


## Copying Files

Before you can run your R scripts, or build a website on the server, you have to get your code and data copied over.  There are several possibilities.

### scp

The most straightforward approach is `scp`, **s**ecure **c**o**p**y.  It comes pre-installed on mac and gitbash and it works in a similar fashion as `cp` for the local files, just `scp` can copy
files between your machine and a remote computer.  Under the hood it uses ssh
connection, just like `ssh` command itself, so the bad guys out there cannot easily see what you are doing.  It syntax is rather
similar to that of `cp`:
```bash
scp user1@host1:file1 user2@host2:file2
```
This copies "file1" from the server "host1" under username "user1" to
the other server.  Passwords are asked for as needed.  The "host" part
of the file must be understood as the full hostname including dots,
such as "hyak.washington.edu".  "file" is the full path to file,
relative to home directory, such as `Desktop/info201/myscript.R`.
When accessing local files, you may omit the "user@host:" part.  So,
for instance, in order to copy your `myscript.R` from folder
`info201` on your laptop's Desktop to the folder `scripts` in
your home folder on the server, you may issue
```bash
scp Desktop/info201/myscript.R myusername@server.ischool.edu:scripts/
```
(here we assume that the working directory of your laptop is the one above
`Desktop`.)
Note that exactly as with `cp`, you may omit the destination file name
if the destination is a directory: it simply copies the file into that
directory while preserving its name.

![copying a local file to the remote machine](img/ssh-to-remote-server/scp-local-remote.png)
<p class="caption">
`scp` in action.  The upper shell window, running locally, depicts _scp_ in action, copying file _startServer.R_ from directory _api_ to the remote server into _api_ directory (while retaining the same name).  The lower window shows the remote machine: first, `ls` command shows we have an _api_ folder in our home directory, and second `ls -l api` shows the content of the _api_ directory in long form.  _startServer.R_ is copied over there.
</p>

After running your script, you may want to copy your results back to
your laptop.  For instance, if you need to get the file
`figure.png` out of the server, you can do
```bash
scp myusername@server.ischool.edu:scripts/figure.png Desktop/info201/
```
As above, this copies a file from the given directory, and drops it
into the `info201` folder on your Desktop.

<p class="alert alert-warning">
Always issue `scp` command locally on your laptop.  This is because your laptop can access the server but usually not the way around.  In order to be connected via _ssh_ (and _scp_), a computer must have public ip-address, and ssh server up and running.  It is unlikely you have configured your laptop in this way.
</p>


### rsync

`rsync` is a more advanced approach to `scp`.  It works in many ways
like `scp`, just it is smart enough to understand which files
are updated, and copy the updated parts of the files only.  It is the
recommended way for working with small updates in large files.
Its syntax is rather similar to that of `scp`.  To copy `file` to the
remote server as `file2` (in the home directory), we do
```
rsync file user2@host2:file2
```
and in order to copy a `file1` from server as local `file` (in the
current working directory):
```
rsync file user1@host1:file1 file
```
I also recommend to
explore some of its many options, for instance `-v` (verbose) reports
what it's doing.
The example above with your code and figure might now look like that:
```bash
rsync -v Desktop/info201/myscript.R myusername@server.ischool.edu:scripts/
# now run the script on the remote machine
rsync -v myusername@server.ischool.edu:scripts/figure.pdf Desktop/info201/
```

Maybe the easiest way to copy your files is to copy (or rather update) the whole
directories.  For instance, instead of the code above, you can do
```bash
# copy all files to server:
rsync -v Desktop/info201/* myusername@server.ischool.edu:scripts/
# now run the script on the remote machine
# ... and copy the results back:
rsync -v myusername@server.ischool.edu:scripts/* Desktop/info201/
```
Here `*` means _all files in this directory_.  Hence, instead of
copying the files individually between the computers, we just copy
all
of them.  Even better, we actually do not copy but just update.  Huge
files that do not change do not take any bandwidth.


### Graphical Frontends

Instead on relying on command line tools, one can also use graphical
front-ends.  For instance, "WinSCP" is a nice Norton Commander-Style
frontend for copying files between the local and a remote machine over scp
for Windows.  It provides a split window representing files on the
local and the remote end, and one can move, copy-and-paste and interact
with the mouse on these panes.  On Mac you may take a look at
"Cyberduck".


### Remote Editing

Besides copying your files, many text editors also offer a "remote
editing" option.  From the user perspective this looks as if directly
working on the remote server's hard disk.  Under the hood, the files
are copied back and forth with scp, rsync or one of their friends.
Emacs and vi do it out-of-the box, VSCode, Atom and sublime require a
plugin.  AFAIK it is not possible with RStudio.

It is also possible to mount (attach) the harddisk of the remote
server to your laptop as if it were a local disk.  Look yourself for
more information if you are interested.


## R and Rscript

When your code has been transferred to the server, your next task is to
run it.  But before you can do it, you may want to install the
packages you need.  For instance, you may want to install the _ggplot2_ and _dplyr_.  This must be done from R console using
`install.packages()`.  You start R interactively by the command
```sh
R
```
It opens an R session, not unlike what you see inside of RStudio, just
here you have no RStudio to handrail you through the session.  Now all loading,
saving, inspecting files, etc must be done through R commands.

The first time you do it, R complains about
non-writeable system-wide library and proposes to install and create
your personal libary.  You should answer "yes" to these prompts.  As
Linux systems typically compile the packages during installations, installation is slow and you see many messages (including warnings) in the
process.  But it works, given that the necessary system libraries are available.  You may alo open another terminal and ssh to the server from there while the packages are compiling in the other window.

Now you can finally run your R code.  I strongly recommend to do it
from the directory where you intend to run the project before starting
R (`cd scripts` if you follow the example directory setup above).  There are two options: either start R
interactively, or run it as a script.
If you do it from an interactive R session, you have to _source_ your script:
```R
source("myscript.R")
```
The script will run, and the first attempt most likely ends with an error message.  You have
to correct the error either on your laptop and copy the file over to
the server again, or directly on the server, and
re-run it again.  Note that you don't have to exit from the R session when
copying the files between your laptop and the server.  Edit it, copy it over
from your laptop (using `scp` or
other tools), and just re-source the file from
within the R session.  If you need an open R session on the server, you may want to have several terminals connected to the server at the same time: in one, you have the R session, in another you may want to copy/move/edit files, and it may also be handy to have a window with `htop` too see how your running code is doing (see below).

![multiple terminal connections](img/ssh-to-remote-server/multiple-connections.png)
<p class="caption">
Three terminals connecting to a remote server at the same time.  The top one has been used for file management, the middle one shows tha active processes by user _otoomet_, and the bottom one has open R session for package installations.  Multiple open connections is often a convenient way to switch frequently between different tasks.
</p>

Opening a separate R session may be useful for installing packages.
For running your scripts, I recommend you to run it entirely from
command line, either as
```bash
R CMD BATCH myscript.R
```
or
```bash
Rscript myscript.R
```
The first version produces a little more informative error messages,
the other one handles the environment in a little more consistent and
efficient manner.


### Graphics Output with No GUI

If the server does not have any graphics capabilities, you have to
save your figures as files.  For instance, to save the image in a pdf
file, you may use the following code in your R program:
```R
pdf(file="figure1.pdf", width=12, height=8)
    # width and height in inches
	# check also out jpeg() and png() devices.
# do your plotting here
plot(1:10, rnorm(10))
# done plotting
dev.off()
    # saves the image to disk and closes the file.
```
Afterwards you will have to copy the image file _figure1.pdf_ to your laptop for future use.  Note that the file will be saved in the current working directory (unless you specify another folder) for the R session.  This is normally the folder where you execute the `Rscript` command.

Besides of pdf graphics, R can also output jpg, png, svg and other formats.  Check out the corresponding devices `jpeg`, `png`, `svg` and so forth.  Additionally, _ggplot_ has it's own dedicated way of saving plots using `ggsave` although the base R graphics devices, such as `pdf` will work too.


## Life on Server

The servers operate the same in many ways as the command line
on your own computer.  However, there are a number of differences.

### Be Social!

While you laptop is yours, and you are free to exploit all its
resources for your own good, this is not true for the server.  The server is a
multiuser system, potentially doing good work for many people at the
same time.  So
the first rule is: **Don't take more resources than what you need!**

This that means don't let the system run, grab memory, or occupy disk space
just for fun.  Try to keep your R workspace clean (check out `rm()`
function) and
close R as soon as it has finished (this happens automatically if you
run your script through `Rscript` from command line).  Don't copy the dataset without a
good reason, and keep your copies in a compressed form.  R can open
gzip and bzip2 files on the fly, so usually you don't even need to
decompress these.  Avoid costly recalculations of something you
already calculated.  All this is even more important the last days before the deadline
when many people are running using the server.

Servers are typically well configured to tame misbehaving programs.
You may sometimes see your script stopping with a message "killed".
This most likely means that it occupied too much memory, and the system
just killed it.  Deal with this.


### Useful Things to Do

There are several useful commands you can experiment with while on the
server.
```bash
htop
```
(press `q` to quit) tells you which programs run on the server, how much memory and cpu do
these take, and who are their owners (the corresponding users).  It
also permits you to kill your misbehaving processes (press `k` and
select `SIGKILL`).  Read more with `man htop`.

```bash
w
```
(**w**ho) prints the current logged-in users of the server.

```bash
df -h
```
(**d**isplay **f**ree in **h**uman-readable units) shows the free and
occupied disk space.  You are mainly influenced by what is going on in the file system
`/home`.

### Permissions and ownership

Unix systems are very strict about ownership and permissions.  You are
a normal user with limited privileges.  In particular, you cannot
modify or delete files that you don't own.  In a similar fashion, you
cannot kill processes you did not start.  Feel free to attempt.  It
won't work.

In case you need to do something with elevated privileges (as
"superuser"), you have to contact the system administrator.  In practice,
their responsiveness and willingness to accommodate your requests will
vary.

### More than One Connection

It perfectly possible to log onto the server through multiple terminals at the
same time.  You just open several terminals and log onto the
server from each of these.  You can use one terminal to observe how your script is
doing (with `htop`), the other one to run the script, and the third one to inspect
output.  If you find such approach useful, I recommend you to
familiarize yourself with gnu screen (command `screen` that includes
many related goodies.)


## Advanced Usage

### ssh keys, .ssh/config

Without further configuration, every time you open a ssh connection,
you have to type your password.  Instead of re-entering it
over and over again&mdash;this may not be particularly secure and it is definitely not convenient&mdash;you can configure
your ssh keys and copy it to the server.  Next time, you will be
automatically authenticated with the key and you don't have to type
the password any more.  Note: this is the same ssh key that is used by GitHub if
you use ssh connection to GitHub.

As the first step, you have to create the key
with `ssh-keygen` (you may choose an empty passphrase) unless you
already have created one.  Thereafter copy
your public key to the server with `ssh-copy-id`.  Next time you log
onto the server, no password is needed.  A good source for help with creating
and managing ssh keys is
[GitHub help](https://help.github.com/articles/connecting-to-github-with-ssh/).

You can also configure your ssh to recognize abbreviated
server names and your corresponding user names.  This allows you to
connect to server with a simple command like `ssh info201`.  This
information is stored in the file
`~/.ssh/config`, and should contain lines like
```
Host info201
	User <your username>
	Hostname info201.ischool.uw.edu
```
The `Host` keyword is followed by the abbreviated name of the server,
the following lines contain your username and the publicly visible
hostname for the server.  Seek out more information if you are interested.


### More about command line: pipes and shell patterns

_bash_ is a powerful programming language.  It is not particularly well suited to peform calculations or produce graphs, but it is excellent in glueing together other programs and their output.

One very powerful construct are _pipes_.  These are in many ways similar to _magrittr_ pipes in R, or perhaps we should say the other way around as shell pipes were introduced in 1973, a quarter of century before R was created.  Pipes connect output of one command into input of another command.  For instance, let's take commands `ls -s` and `head`.  The former lists the files (in long form) and the latter prints out a few first lines of a text file.  But `head` is not just for printing files, it can print the first few lines of whatever you feed it.  Look, for instance, the following command (actually a compound command):
```sh
ls -l | head
```
`ls -l` creates the file listing (in long form).  But instead of printing it on screen, it will now send it over pipe `|` to the `head` utility.  That one will extract the first lines and print only those.

![example with `ls` and shell pipes](img/ssh-to-remote-server/shell-pipe.png)
<p class="caption">
Example of `ls -l` command that prints a number of files (above).  Below, the same command is piped through `head -3` that retains only the three first lines (and prints those).  Note that the first line is not a file, but a total size of files in this directory (in kilobytes).
</p>

Pipes are not limited to two commands only.  You can pipe as many commands together as you with.  For instance, you may want to see a few first lines in a large compressed csv file that contain the word _Zilong_.  We use the following commands:

* **bzcat** prints bzip-compressed data (you normally invoke it like `bzcat file.txt`).  But it just prints and does not do anything else with the output.
* **grep** searches for a pattern in text.  This can be used as `grep pattern file`, for instance grep `salary business-report.txt`.  Note that _pattern_ is a regular expression (rather similar as in R `gsub` and `grep` functions), so `grep` can search for a wide range of patterns.  However, it cannot open compressed files, and neither can it limit the output to just a few lines.
* **head** prints first few lines of text.  You can print out the first _n_ lines of a file as `head -n file.txt`, but again--this does not work with compressed files.

We pipe the commands together as
```sh
bzcat data.csv.bz2 | grep Zilong | head
```
and achieve the result we want.  So pipes are an excellent way to join  small commands, each of which is good at only a single task, into complex compound tasks.

Another handy (albeit much less powerful) tool in shell is the shell patterns.  These are a little bit like regular expressions for file names, just much simpler.  There are two special characters in file names:

* **\*** means any number of any characters.  For instance, `a.*` means all files like `a.`, `a.c`, `a.txt`, `a.txt.old`, `a...` and so on.  It is just any number of any characters, including none at all, and "any" also means dots.  However, the pattern does not cover `ba.c`.
* **?** means a single character, so `a.?` can stand for `a.c` and `a.R` but not for `a.txt`.

Shell patterns are useful for file manipulations where you have quicly sort though some sort of fine name patterns.  These are handled by shell and not by individual commands, so they may not work if you are not at shell prompt but running another program, such as R or a text editor.

For instance, let's list all _jpg_ files in the current directory:
```sh
ls *.jpg
```
This lists all files of patten `*.jpg`, i.e. everything that has _.jpg_ at it's end.
Now let us copy all _png_ files from server to the current directory:
```sh
scp user@server.com:*.png .
```
This copies all files in the form `*.png` from the server here, i.e. all files that end with _.png_.


### Running RScript in ssh Session

Passwordless ssh connection gives you new wonderful possibilities.
First, you don't even have to log into the server explicitly.  You can
run a one-command ssh session on the server directly from your
laptop.  Namely, ssh accepts commands to be run on the remote
machine.  If invoked as
```bash
ssh myusername@server.ischool.edu "Rscript myscript.R"
```
It does not open a remote shell but runs `Rscript script.R` instead.
Your command sequence for the whole process will accordingly look something like:
```bash
rsync -v Desktop/info201/* myusername@server.ischool.edu:scripts/
ssh myusername@server.ischool.edu "Rscript scripts/myscript.R"
rsync -v myusername@server.ischool.edu:scripts/* Desktop/info201/
```
All these command are issued on your laptop.  You can also save these
to a text file and run all three together as a single **shell
script**!

Further, you don't even need the shell.  Instead, you may explain R on
your laptop how to start R on the
remote server over ssh.  In this way you can turn your laptop and
server combination
into a high-performance-computing cluster!  This allows you
to copy the script and run it on the server directly from within your
R program that runs
on your laptop.  Cluster computing is out of scope of this chapter, but if you
are interested, look up the **makePSOCKcluster()** function in **parallel**
package.