
Anyone know how to download an entire website?

Posted: 2007-04-04 05:17pm
by Adrian Laguna
I was recently pointed in the direction of the official Crimson Skies website, which I'm interested in reading because the game is that awesome. However, as it says on the front page, the site will be closed soon. This sucks because I want to be able to read it at my leisure.

So, anyone know how I can transfer the whole thing to my hard-drive?

Posted: 2007-04-04 05:25pm
by General Zod
Flashget might do it.

Posted: 2007-04-04 05:29pm
by phongn
wget to the rescue!

Posted: 2007-04-04 06:54pm
by rhoenix
phongn wrote:wget to the rescue!
I concur - grabbing a copy of wget (hopefully one with a graphical interface - nice, but certainly not required) should work for you just fine.
A command like the one sketched below will recursively vacuum up an entire website, three link-levels deep; if you don't specify a depth, wget defaults to five. (corrections welcome)
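Something along these lines, say - the URL here is just a placeholder, -r turns on recursion and -l caps the link depth:

wget -r -l 3 http://www.example.com/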

Posted: 2007-04-04 07:11pm
by Netko
Yeah, at the end of the day, wget is the best option. I tried various Windows-based utilities hoping one of them would work, but in the end it was wget that got the site you're talking about with a minimum of fuss. The only issue was that it first downloaded the site into the void, because I had wget sitting in Program Files and didn't specify an output directory, so Windows ignored wget's output.

Get wget for Windows from this link, then unzip it to a directory.

Then run cmd.exe, navigate to the directory (folder) where you unzipped it (type "cd C:\nameofdirectory\nameofsubdirectory", without the quotes and with the proper names, obviously), then run wget with something like this (you can get help by typing "wget -h"):

wget -r -np -p --directory-prefix=C:\dirwhereyouwanttodownload www.crimsonskiesuniverse.com/story/

(this will get you the story pages - unfortunately, the links seem to be screwed up because they were made as static links in a weird way, and apparently the -k option isn't smart enough to parse them and compensate)
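(for reference, -k is wget's --convert-links option, so the attempt with link conversion switched on would just be the same command with it tacked on:)

wget -r -np -p -k --directory-prefix=C:\dirwhereyouwanttodownload www.crimsonskiesuniverse.com/story/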

Posted: 2007-04-04 08:54pm
by aerius
I've used HTTrack to download quite a few porn sites as well as archiving several tech reference sites. Comes with a GUI and more options than I can count, including link-level depth, bandwidth usage, passwords for pay sites, and many others.

Posted: 2007-04-04 09:07pm
by Solauren
Ultrasucker is also very good, but can be annoying to configure "just right"

Webreaper is also pretty good

Posted: 2007-04-05 04:16am
by Bounty
I've just used the Linux version of wget to mirror the CS website and it worked like a charm.

Posted: 2007-04-05 04:21am
by Shroom Man 777
aerius wrote:I've used HTTrack to download quite a few porn sites as well as archiving several tech reference sites. Comes with a GUI and more options than I can count, including link-level depth, bandwidth usage, passwords for pay sites, and many others.
You mean to say I can download any pornsite, without paying?!

Posted: 2007-04-05 04:41am
by Einhander Sn0m4n
Shroom Man 777 wrote:
aerius wrote:I've used HTTrack to download quite a few porn sites as well as archiving several tech reference sites. Comes with a GUI and more options than I can count, including link-level depth, bandwidth usage, passwords for pay sites, and many others.
You mean to say I can download any pornsite, without paying?!
How about'sa test? NSF FUCKING W for the slower people in the audience!

And I believe the act being discussed in this thread is called a 'siterip'. Ignore the pornsite links at the top :P

Posted: 2007-04-05 04:44am
by Bounty
Shroom Man 777 wrote:
aerius wrote:I've used HTTrack to download quite a few porn sites as well as archiving several tech reference sites. Comes with a GUI and more options than I can count, including link-level depth, bandwidth usage, passwords for pay sites, and many others.
You mean to say I can download any pornsite, without paying?!
If you have the password, yes - but any competent admin will make sure leaked passwords don't work for long.

And damnit man, there are better ways of getting free porn than ripping websites.

Posted: 2007-04-05 04:48am
by Ace Pace
Bittorrent? :wink:

Posted: 2007-04-09 04:00pm
by Adrian Laguna
Well, I got around to trying a couple of these programs out. HTTrack first, since it seemed the simplest, but I couldn't figure it out. Then I used wget with Netko's instructions, and it almost worked: I seem to have downloaded a lot of files, but not a coherent webpage.

This whole business of grabbing a website off the internet is more complicated than I thought it would be.

Posted: 2007-04-09 04:06pm
by Bounty
Adrian Laguna wrote:Well, I got around to trying a couple of these programs out. HTTrack first, since it seemed the simplest, but I couldn't figure it out. Then I used wget with Netko's instructions, and it almost worked: I seem to have downloaded a lot of files, but not a coherent webpage.
No default.htm? Wget worked fine for me.

Posted: 2007-04-09 04:22pm
by Adrian Laguna
There is, actually, a "default.htm", though its complete name is "default.htm@MSID[bunch of numbers]". Windows can't open it.

Posted: 2007-04-09 04:24pm
by Bounty
Adrian Laguna wrote:There is, actually, a "default.htm", though its complete name is "default.htm@MSID[bunch of numbers]". Windows can't open it.
Remove everything after the .htm, or force Windows to open it with Firefox. If you remove the letters after the extension, the link back to the main page will break, but that's a minor issue.
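If renaming things by hand gets old, wget also has an -E (--html-extension) switch which, as far as I know, tacks .html onto any saved page whose name doesn't already end in .htm or .html, so Windows at least knows to hand it to a browser. Something like:

wget -r -np -p -k -E --directory-prefix=C:\dirwhereyouwanttodownload www.crimsonskiesuniverse.com/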

Posted: 2007-04-09 04:35pm
by Adrian Laguna
Okay, I tried again with a slightly different approach. Everything works except the pop-up pages. Basically, each of the subsections in the "universe" page has subjects (corporations, aircraft, pilots, etc.) that, when clicked, show a pop-up page about that subject. But all of the pop-ups are blank.

I put the following in the command line:
wget -r -np -p -k --directory-prefix=C:\ http://crimsonskiesuniverse.com/

Posted: 2007-04-09 04:39pm
by Bounty
Crapsicles, wget didn't get those. You can get to them by adding the corp name to the /universe/corporations URL (or pilots, or tech or whatever). Start downloading, man!

So, can wget be made to follow Javascript links?

Posted: 2007-04-10 06:57pm
by Rogue 9
I have the Universe and Story pages from the Crimson Skies site saved already if that's all you want.

Posted: 2007-04-10 08:11pm
by Adrian Laguna
Do they work completely with all the links and stuff? Because that would be really awesome.

Posted: 2007-04-10 09:55pm
by Pu-239
You could write a script to grep all links that are in javascript and run a 2nd pass using wget...

Posted: 2007-04-11 06:14am
by Rogue 9
Adrian Laguna wrote:Do they work completely with all the links and stuff? Because that would be really awesome.
I didn't bother altering the HTML to make them link to each other, mainly because I only barely know what I'm doing and would fuck it up. But I have them organized into a folder system similar to the tiers of pages on the site.

Posted: 2007-04-11 09:01pm
by Pu-239

Code: Select all

#! /bin/bash
# Second-pass scraper: pull the URLs hidden inside javascript calls
# (they show up as ('something.htm') in the mirrored pages) and hand them to wget.
#ROOT=$1
ROOT='crimsonskiesuniverse.com'
#wget -m -k -p $1
# Every ('path/file.ext') occurrence in the mirror, printed as filename:match
LIST=`egrep -roH "\('[a-zA-Z_./]*\.[a-zA-Z]{3,4}'\)" $ROOT`
for F in $LIST; do
        # K = the quoted path with the (' and ') stripped off
        K=`echo $F|cut -f 2 -d':'|sed "s/('//"|sed "s/')//"`
        if [ `echo $K|sed 's/\(^.\{1\}\).*/\1/'` = '/' ]; then
                # Absolute path: just glue it onto the site root
                TOGET=$ROOT$K
        else
                # Relative path: use the directory of the file the link was found in
                J=`echo $F|cut -f 1 -d':'|sed "s#^(.*##"|sed "s/\/[^/]*\$//"`
                if [ ! -z "$J" ]; then
                        I=$J
                fi
                TOGET=$I/$K
        fi
        wget -r -l 1 -N -p $TOGET
done
will suck down the javascript and css links as well as downloading most of the site - I still need to fix the absolute/relative URL problems in the scripts. It would probably work if you threw the site mirror up in the root directory of a server; absolute paths don't work well when browsed straight as files, though, and I'm too lazy to get apache back up to test. It should also be fairly straightforward to write a script to fix all the links, but again, I'm lazy. I think most of the content is there, unless the CSS/javascript is inconsistent and uses double quotes as well as single quotes (which would be trivial to rectify). Only 1 level of recursion on the 2nd pass for the javascript links, since it looked like it went into infinite recursion otherwise.
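In case anyone wants to try it, the rough workflow would be: do the normal mirror first, then run the script from the directory that ends up containing the crimsonskiesuniverse.com folder (jsgrab.sh is just a made-up name for whatever file you save the script as):

Code: Select all

wget -m -k -p crimsonskiesuniverse.com   # first pass: ordinary mirror (the commented-out line in the script)
chmod +x jsgrab.sh                       # jsgrab.sh = whatever you saved the script as
./jsgrab.sh                              # second pass: chase down the javascript/css links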