wget

D. Hugh Redelmeier hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org
Sat May 27 15:43:47 UTC 2006


| From: Paul King <pking123-rieW9WUcm8FFJ04o6PK0Fg at public.gmane.org>

| On Fri, 2006-05-26 at 16:15 -0400, Daniel Armstrong wrote:

| To do systematic downloads you have two choices: either download
| specific files, or mirror the entire site. I don't know of an
| "in-between" solution, and I have done both.

wget is a horribly complicated and poorly described program.

To figure out how to use wget, it helps to have some conceptual
framework.

- wget does http and ftp

- the ftp protocol has a directory-listing command (LIST), so it is
  possible to find all the files in an ftp-able tree (unless games are
  being played).

- the http protocol has no such listing command.  The only way to
  attempt to find everything is to walk every link.

- the http protocol does not just deal with trees: some URLs are
  actually queries.  There can even be an infinite number of
  apparently distinct queries.  So walking every link is potentially
  dangerous and useless.

- Does "every link" include URLs not extensions of the initial URL?
  Surely it does, but how far do you go?  The whole web?  Only this
  site?  There is no single correct answer.  I don't actually know
  what wget does.

- put another way,

  + FTP sites conventionally contain trees of files
    with a standard way of finding files that are intended to be found.
    They may, in fact, be DAGs (i.e. with links but not cycles) without
    causing big problems.

  + HTTP sites have no similar expectations.  So traversing them
    is not guaranteed to work (the sketch below contrasts the two cases).
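
To make the contrast concrete, here is a rough sketch.  Everything
under example.com is made up; only the flags are real:

	# FTP: the server can list the directory, so a remote wildcard works
	wget 'ftp://ftp.example.com/pub/videos/*.rm'

	# HTTP: no listing; wget must fetch a starting page and walk its
	# links, constrained so that it does not wander off the site
	wget --recursive --no-parent --accept '*.rm' http://example.com/videos/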

Some other factoids:

- traditionally FTP file names are suitable for any file system
  (including TOPS-10), so spaces, question marks, accented characters,
  etc. are not found there.

- http "file names" (URLs) can contain a lot of wierd stuff.  Shell
  scripts often break when they hit these oddities.

- wget with a simple wildcard only works with FTP.  Note that the
  wildcard is expanded on the remote system; quoting it keeps your
  local shell from trying to expand it first.
  + ok: wget ftp://ocw.mit.edu/ans7870/7/7.012/f04/video/*.rm
  + better: wget 'ftp://ocw.mit.edu/ans7870/7/7.012/f04/video/*.rm'
  + no: wget http://ocw.mit.edu/ans7870/7/7.012/f04/video/*.rm

- wget treats 'http://site/x' differently from 'http://site/x/'!
  With --no-parent, the first form treats x as a file whose parent
  directory is /, so a sibling such as http://site/y can still be
  downloaded; the second form confines the retrieval to x/.
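  A rough sketch of the two forms (made-up host):
	wget --mirror --no-parent http://example.com/video   # parent dir is /
	wget --mirror --no-parent http://example.com/video/  # confined to /video/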

- some sites have robots.txt to stop robots accessing their stuff.
  wget only ignores robots.txt if you use -erobots=off

- I find --mirror useful.  I don't fully understand it, but I
  definitely use --no-parent with it.  Otherwise it has a tendency to
  go wild.

- I always use -N so that timestamps are set usefully.  (This is
  redundant when --mirror is used.)
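  Putting the above together, an invocation might look like this
  (made-up host; --mirror already implies -r and -N, so -N need not be
  repeated):
	wget --mirror --no-parent -e robots=off http://example.com/docs/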

- my firewall only allows passive_ftp (and only outbound).  So I have
  a .wgetrc file with the line:
	passive_ftp = on

- If you want to be selective when mirroring, --accept or --reject
  might do the trick (I've not tried them).  If you use them,
  subsequent mirrorings probably must do more work (i.e. refetching
  files that were not saved just so that all the links get walked).
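  For example (untested here too; host and patterns are made up):
	wget --mirror --no-parent --accept '*.rm' http://example.com/videos/
	wget --mirror --no-parent --reject 'index.html*' http://example.com/videos/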

| I have downloaded specific files over wget in a systematic way in order
| to get, say things like MP3s of radio simulcasts of programs I like to
| hear/play frequently. For those, I write a perl script, and keep the
| site and path in one string, and for the specific filename, I only vary
| the changeable parts of the string by listing them in an array.

That way you avoid the automated discovery problem that I've talked
about.
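
If the filenames follow a known pattern, even a small shell loop will
do the same job.  A sketch with a made-up URL and dates (Paul does the
equivalent in Perl):

	base='http://radio.example.com/shows/morning'
	for d in 20060501 20060508 20060515; do
		wget -N "$base-$d.mp3"
	done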

| On Fri, 2006-05-26 at 16:15 -0400, Daniel Armstrong wrote:

| > If I use wget to download a single video file from this location:
| > 
| > wget http://ocw.mit.edu/ans7870/7/7.012/f04/video/ocw-7.012-lec-mit-10250-22oct2004-1000-220k.rm
| > 
| > ...it works as expected.

Not too surprising, but good to know.

| > 
| > But I would like to know how to use wget to download *all* the video
| > files of a certain compression size with a single command. I checked
| > the manpage and used the "-A" option to specify a filetype, using this
| > command:
| > 
| > wget -A "*220k.rm" http://ocw.mit.edu/ans7870/7/7.012/f04/video/
| > 
| > ...which returns the following error...
| > 
| > --16:10:53--  http://ocw.mit.edu/ans7870/7/7.012/f04/video/
| >            => `index.html'
| > Resolving ocw.mit.edu... 209.123.81.89, 209.123.81.96
| > Connecting to ocw.mit.edu|209.123.81.89|:80... connected.
| > HTTP request sent, awaiting response... 404 Not Found
| > 16:10:53 ERROR 404: Not Found.
| > 
| > How do I manage to setup wget to ignore the fact that there is no
| > index.html at this location, and just download the *.rm files I
| > requested? wget would be a perfect tool for downloading a series of
| > files like this unattended vs. downloading each file by hand
| > one-by-one... Thanks in advance for any help.

Go back to the framework I mentioned.  If there is no index page at
	http://ocw.mit.edu/ans7870/7/7.012/f04/video/
then how is wget to find a starting point from which it can walk every
link?

Can you find a page that lists all the files?  If you point wget at
that AND suitably constrain wget (that's tricky!), you might be able
to get what you want.
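
For instance, if such a listing page exists, something along these
lines might do it (the page name below is only a placeholder, not a
real URL; the flags are real):

	wget -r -l 1 --no-parent --no-directories -A '*220k.rm' \
	     'http://ocw.mit.edu/SOME-PAGE-THAT-LISTS-THE-LECTURES.html'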
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




