Anyone know how to download an entire website?

GEC: Discuss gaming, computers and electronics and venture into the bizarre world of STGODs.

Moderator: Thanas

Adrian Laguna
Sith Marauder
Posts: 4736
Joined: 2005-05-18 01:31am

Anyone know how to download an entire website?

Post by Adrian Laguna »

I was recently pointed to the official Crimson Skies website, which I'm interested in reading because the game is that awesome. However, as stated on the front page, it will be closed soon. This sucks because I want to be able to read it at my leisure.

So, anyone know how I can transfer the whole thing to my hard-drive?
General Zod
Never Shuts Up
Posts: 29211
Joined: 2003-11-18 03:08pm
Location: The Clearance Rack

Post by General Zod »

Flashget might do it.
"It's you Americans. There's something about nipples you hate. If this were Germany, we'd be romping around naked on the stage here."
phongn
Rebel Leader
Posts: 18487
Joined: 2002-07-03 11:11pm

Post by phongn »

wget to the rescue!
rhoenix
Jedi Council Member
Posts: 1910
Joined: 2006-04-22 07:52pm

Post by rhoenix »

phongn wrote:wget to the rescue!
I concur - finding an open-source version of wget (hopefully with a graphical interface - nice, but certainly not required) should work for you just fine.
That snippet will recursively vacuum an entire website, three link levels deep. Five levels is wget's default depth rather than a hard limit; the -l option changes it. (corrections welcome)
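The snippet itself did not survive in this post. As a rough sketch only, and not necessarily what was originally posted, a depth-limited recursive fetch with GNU wget could be wrapped as below. The helper name `mirror_cmd` and the example.com URL are made up for illustration; `-r`, `-l` and `-np` are standard wget options. The function only prints the command, so nothing is downloaded until you run the printed line yourself.

```shell
#!/bin/bash
# Sketch: build (but do not run) a depth-limited recursive wget command.
# mirror_cmd is a hypothetical helper name, not part of wget.
mirror_cmd() {
    local url=$1
    local depth=${2:-3}   # link levels to follow; wget's own default is 5
    # -r recurse into links, -l N limit the recursion depth,
    # -np never ascend to the parent directory
    printf 'wget -r -l %s -np %s\n' "$depth" "$url"
}

# Placeholder URL; prints the command for a three-level fetch
mirror_cmd "http://example.com/" 3
```

Copy the printed line into the shell once the URL and depth look right.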
Netko
Jedi Council Member
Posts: 1925
Joined: 2005-03-30 06:14am

Post by Netko »

Yeah, at the end of the day, wget is the best option. I tried various Windows-based utilities in the hope of finding something that works, but in the end I got the site you were talking about with wget, which was the only one that managed it with a minimum of fuss. The only issue was that it first downloaded the site into the void: I had it in Program Files and didn't specify an output directory, which caused Windows to ignore wget's output.

Get wget for Windows on this link, and then unzip it to a directory.

Then run cmd.exe, navigate to the directory (folder) where you unzipped it (type "cd C:\nameofdirectory\nameofsubdirectory", without quotes and with proper names obviously) then execute wget with something like this (you can get help by typing "wget -h"):

wget -r -np -p --directory-prefix=C:\dirwhereyouwanttodownload www.crimsonskiesuniverse.com/story/

(this will get you the story pages - unfortunately, the links seem to be screwed up because they were made as static links in a weird way and apparently the -k option is not smart enough to parse them correctly to compensate)
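For anyone on Linux or another bash shell, the same call can be sketched with each option spelled out. This is only an illustration: `build_story_cmd` and the default `./crimsonskies` directory are names invented here, and the command is printed rather than run so you can check it first.

```shell
#!/bin/bash
# Sketch: assemble the wget call above as an array, one commented option per line.
# build_story_cmd and the default output directory are illustrative names only.
build_story_cmd() {
    local outdir=${1:-./crimsonskies}
    local cmd
    cmd=(
        wget
        -r                              # recurse into linked pages
        -np                             # no parent: stay below /story/
        -p                              # also fetch page requisites (images, CSS)
        --directory-prefix="$outdir"    # save everything under this directory
        www.crimsonskiesuniverse.com/story/
    )
    printf '%s\n' "${cmd[*]}"
}

# Print the command with an output directory under the home directory
build_story_cmd "$HOME/sites"
```

Giving an explicit output directory avoids the "downloaded into the void" problem described above.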
aerius
Charismatic Cult Leader
Posts: 14801
Joined: 2002-08-18 07:27pm

Post by aerius »

I've used HTTrack to download quite a few porn sites as well as archiving several tech reference sites. Comes with a GUI and more options than I can count, including link-level depth, bandwidth usage, passwords for pay sites, and many others.
aerius: I'll vote for you if you sleep with me. :)
Lusankya: Deal!
Say, do you want it to be a threesome with your wife? Or a foursome with your wife and sister-in-law? I'm up for either. :P
Solauren
Emperor's Hand
Posts: 10390
Joined: 2003-05-11 09:41pm

Post by Solauren »

Ultrasucker is also very good, but can be annoying to configure "just right"

Webreaper is also pretty good
Bounty
Emperor's Hand
Posts: 10767
Joined: 2005-01-20 08:33am
Location: Belgium

Post by Bounty »

I've just used the Linux version of wget to mirror the CS website and it worked like a charm.
Shroom Man 777
FUCKING DICK-STABBER!
Posts: 21222
Joined: 2003-05-11 08:39am
Location: Bleeding breasts and stabbing dicks since 2003

Post by Shroom Man 777 »

aerius wrote:I've used HTTrack to download quite a few porn sites as well as archiving several tech reference sites. Comes with a GUI and more options than I can count, including link-level depth, bandwidth usage, passwords for pay sites, and many others.
You mean to say I can download any pornsite, without paying?!
"DO YOU WORSHIP HOMOSEXUALS?" - Curtis Saxton (source)
shroom is a lovely boy and i wont hear a bad word against him - LUSY-CHAN!
Shit! Man, I didn't think of that! It took Shroom to properly interpret the screams of dying people :D - PeZook
Shroom, I read out the stuff you write about us. You are an endless supply of morale down here. :p - an OWS street medic
Pink Sugar Heart Attack!
Einhander Sn0m4n
Insane Railgunner
Posts: 18630
Joined: 2002-10-01 05:51am
Location: Louisiana... or Dagobah. You know, where Yoda lives.

Post by Einhander Sn0m4n »

Shroom Man 777 wrote:
aerius wrote:I've used HTTrack to download quite a few porn sites as well as archiving several tech reference sites. Comes with a GUI and more options than I can count, including link-level depth, bandwidth usage, passwords for pay sites, and many others.
You mean to say I can download any pornsite, without paying?!
How about'sa test? NSF FUCKING W for the slower people in the audience!

And I believe the act being discussed in this thread is called a 'siterip'. Ignore the pornsite links at the top :P
Bounty
Emperor's Hand
Posts: 10767
Joined: 2005-01-20 08:33am
Location: Belgium

Post by Bounty »

Shroom Man 777 wrote:
aerius wrote:I've used HTTrack to download quite a few porn sites as well as archiving several tech reference sites. Comes with a GUI and more options than I can count, including link-level depth, bandwidth usage, passwords for pay sites, and many others.
You mean to say I can download any pornsite, without paying?!
If you have the password, but any competent admin will make sure leaked passwords don't work for long.

And damnit man, there are better ways of getting free porn than ripping websites.
Ace Pace
Hardware Lover
Posts: 8456
Joined: 2002-07-07 03:04am
Location: Wasting time instead of money

Post by Ace Pace »

Bittorrent? :wink:
Brotherhood of the Bear | HAB | Mess | SDnet archivist |
Adrian Laguna
Sith Marauder
Posts: 4736
Joined: 2005-05-18 01:31am

Post by Adrian Laguna »

Well, I got around to trying a couple of these programs out. I tried HTTrack first, since it seemed the simplest, but couldn't figure it out. Then I used wget with Netko's instructions, and it seems that it almost worked. I seem to have downloaded a lot of files, but not a coherent webpage.

This whole getting an internet webpage seems more complicated than I thought it would be.
Last edited by Adrian Laguna on 2007-04-09 04:20pm, edited 2 times in total.
Bounty
Emperor's Hand
Posts: 10767
Joined: 2005-01-20 08:33am
Location: Belgium

Post by Bounty »

Adrian Laguna wrote:Well, I got around to trying a couple of these programs out. I tried HTTrack first, since it seemed the simplest, but couldn't figure it out. Then I used wget with Netko's instructions, and it seems that it almost worked. I seem to have downloaded a lot of files, but not a coherent webpage.
No default.htm? Wget worked fine for me.
Adrian Laguna
Sith Marauder
Posts: 4736
Joined: 2005-05-18 01:31am

Post by Adrian Laguna »

There is, actually, a "default.htm", though its complete name is "default.htm@MSID[bunch of numbers]". Windows can't open it.
Last edited by Adrian Laguna on 2007-04-09 04:24pm, edited 1 time in total.
Bounty
Emperor's Hand
Posts: 10767
Joined: 2005-01-20 08:33am
Location: Belgium

Post by Bounty »

Adrian Laguna wrote:There is, actually, a "default.htm", though its complete name is "default.htm@MSID[bunch of numbers]". Windows can't open it.
Remove everything after the .htm, or force Windows to open it with Firefox. If you remove the characters after the extension, the link back to the main page will break, but that's a minor issue.
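If there are many such files, the renaming can be scripted. A minimal sketch, assuming a bash shell (Linux, or something like Cygwin on Windows) and that nothing important comes after the "@" in the name; `strip_at_suffix` is a made-up helper name, and as noted above any links that still point at the long names will break.

```shell
#!/bin/bash
# Sketch: strip everything from the first "@" in file names such as
# "default.htm@MSID..." so the .htm extension is recognised again.
# strip_at_suffix is a hypothetical helper; pass it the mirror directory.
strip_at_suffix() {
    local dir=$1 f base target
    # find prints one path per line; this assumes no newlines in file names
    find "$dir" -type f -name '*@*' | while IFS= read -r f; do
        base=${f##*/}                   # file name without its directory
        target="${f%/*}/${base%%@*}"    # same directory, name cut at the first "@"
        # never overwrite an existing file that already has the short name
        if [ -e "$target" ]; then
            echo "skipping $f: $target already exists" >&2
        else
            mv -- "$f" "$target"
        fi
    done
}

# Only runs when a directory is given on the command line
if [ $# -gt 0 ]; then
    strip_at_suffix "$1"
fi
```

Run it on a copy of the download first, since the renaming is not undone automatically.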
Adrian Laguna
Sith Marauder
Posts: 4736
Joined: 2005-05-18 01:31am

Post by Adrian Laguna »

Okay, I tried again with a slightly different approach. Everything works but the pop-up pages. Basically, each subsection of the "universe" page has subjects (corporations, aircraft, pilots, etc.) that, when you click on one, show a pop-up page telling you about it. But all of the pop-ups are blank.

I put the following in the command line:
wget -r -np -p -k --directory-prefix=C:\ http://crimsonskiesuniverse.com/
Bounty
Emperor's Hand
Posts: 10767
Joined: 2005-01-20 08:33am
Location: Belgium

Post by Bounty »

Crapsicles, wget didn't get those. You can get to them by adding the corp name to the /universe/corporations URL (or pilots, or tech or whatever). Start downloading, man!

So, can wget be made to follow Javascript links?
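wget does not run JavaScript, so links that exist only inside scripts are invisible to it. One workaround, following the URL trick above, is to generate the pop-up URLs yourself and hand the list to wget. A rough sketch: `popup_urls` is a made-up helper, the names in the example call are placeholders rather than real entries from the site, and appending the name after a slash is my assumption about how the URLs are built.

```shell
#!/bin/bash
# Sketch: print one URL per name for a given section of the site.
# popup_urls is a hypothetical helper; the URL layout is an assumption.
popup_urls() {
    local section=$1 name
    shift
    for name in "$@"; do
        printf 'http://crimsonskiesuniverse.com/universe/%s/%s\n' "$section" "$name"
    done
}

# Placeholder names; substitute the real ones from the site.
popup_urls corporations name1 name2
# To download, pipe the list into wget, which reads URLs from stdin with -i -:
#   popup_urls corporations name1 name2 | wget -p -i -
```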
Rogue 9
Scrapping TIEs since 1997
Posts: 18681
Joined: 2003-11-12 01:10pm
Location: Classified

Post by Rogue 9 »

I have the Universe and Story pages from the Crimson Skies site saved already if that's all you want.
It's Rogue, not Rouge!

HAB | KotL | VRWC/ELC/CDA | TRotR | The Anti-Confederate | Sluggite | Gamer | Blogger | Staff Reporter | Student | Musician
Adrian Laguna
Sith Marauder
Posts: 4736
Joined: 2005-05-18 01:31am

Post by Adrian Laguna »

Do they work completely with all the links and stuff? Because that would be really awesome.
Pu-239
Sith Marauder
Posts: 4727
Joined: 2002-10-21 08:44am
Location: Fake Virginia

Post by Pu-239 »

You could write a script to grep all links that are in javascript and run a 2nd pass using wget...
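A rough sketch of the first half of that idea, assuming the links show up in the downloaded pages as single-quoted paths inside parentheses, e.g. ('page.htm'). `list_js_links` is a name made up here, and the pattern is a guess that would need adjusting to the actual markup.

```shell
#!/bin/bash
# Sketch: list paths that appear as ('something.ext') in a downloaded mirror.
# list_js_links is a hypothetical helper; pass it the mirror directory.
list_js_links() {
    local dir=$1
    # -r recurse, -h omit file names, -o print only the match, -E extended regex
    grep -rhoE "\('[A-Za-z0-9_./-]+\.[A-Za-z]{3,4}'\)" "$dir" |
        sed -e "s/^('//" -e "s/')\$//" |   # drop the surrounding (' and ')
        LC_ALL=C sort -u                   # each path once, in a fixed order
}

# Only runs when a directory is given on the command line
if [ $# -gt 0 ]; then
    list_js_links "$1"
fi
```

The resulting list still has to be turned into full URLs before the second wget pass, because relative paths depend on which page they came from.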

ah.....the path to happiness is revision of dreams and not fulfillment... -SWPIGWANG
Sufficient Googling is indistinguishable from knowledge -somebody
Anything worth the cost of a missile, which can be located on the battlefield, will be shot at with missiles. If the US military is involved, then things, which are not worth the cost if a missile will also be shot at with missiles. -Sea Skimmer


George Bush makes freedom sound like a giant robot that breaks down a lot. -Darth Raptor
Rogue 9
Scrapping TIEs since 1997
Posts: 18681
Joined: 2003-11-12 01:10pm
Location: Classified

Post by Rogue 9 »

Adrian Laguna wrote:Do they work completely with all the links and stuff? Because that would be really awesome.
I didn't bother altering the HTML to make them link to each other, mainly because I only barely know what I'm doing and would fuck it up. But I have them organized into a folder system similar to the tiers of pages on the site.
Pu-239
Sith Marauder
Posts: 4727
Joined: 2002-10-21 08:44am
Location: Fake Virginia

Post by Pu-239 »

Code: Select all

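#!/bin/bash
# Second pass: fetch files that are referenced only from JavaScript/CSS as ('path.ext').
# Run from the directory that contains the first-pass mirror.
#ROOT=$1
ROOT='crimsonskiesuniverse.com'
#wget -m -k -p "$ROOT"    # first pass: mirror the site into ./$ROOT
# grep prints each match as file:('path.ext'); the pattern assumes single quotes.
grep -rEoH "\('[a-zA-Z_./]*\.[a-zA-Z]{3,4}'\)" "$ROOT" |
while IFS= read -r F; do
        # K: the matched path with the surrounding (' and ') removed
        K=$(printf '%s\n' "${F##*:}" | sed -e "s/^('//" -e "s/')\$//")
        if [ "${K:0:1}" = '/' ]; then
                # Absolute path: resolve against the site root
                TOGET=$ROOT$K
        else
                # Relative path: resolve against the directory of the referencing file
                J=${F%:*}
                TOGET=${J%/*}/$K
        fi
        wget -r -l 1 -N -p "$TOGET"
done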
This will suck down the JavaScript and CSS links as well as downloading most of the site. I still need to fix the absolute/relative URL problems in the scripts. It would probably work if you put the site mirror in the root directory of a server; absolute paths don't work well straight from files, though, and I'm too lazy to get Apache back up to test. It should also be fairly straightforward to write a script to fix all the links, but again, I'm lazy. I think most of the content is there, unless the CSS/JavaScript is inconsistent and uses double quotes as well as single quotes (this should be trivial to rectify). I only used one level of recursion on the second pass for the JavaScript links, since it looked like it went into an infinite recursion state.
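For the link-fixing part, here is a minimal sketch of one way it could be done. It is not tested against the real site. It assumes GNU sed (for `-i`), touches only the single-quoted ('/path') form that the script above looks for, and `fix_abs_links` is a made-up name. Ordinary href attributes are left alone; wget's -k option is meant to handle those.

```shell
#!/bin/bash
# Sketch: rewrite ('/abs/path.htm') into a path relative to each file,
# so pop-up links work when the mirror is opened straight from disk.
# fix_abs_links is a hypothetical helper; pass it the mirror directory.
fix_abs_links() {
    local root=${1%/} f rel prefix
    find "$root" -type f \( -name '*.htm' -o -name '*.html' -o -name '*.js' \) |
    while IFS= read -r f; do
        rel=${f#"$root"/}           # path of the file inside the mirror
        prefix=
        # one "../" for every directory level between the file and the root
        while [ "$rel" != "${rel#*/}" ]; do
            prefix="../$prefix"
            rel=${rel#*/}
        done
        # ('/x') becomes ('x') in the root, ('../x') one level down, and so on
        sed -i "s#('/#('${prefix}#g" "$f"
    done
}

# Only runs when a directory is given on the command line
if [ $# -gt 0 ]; then
    fix_abs_links "$1"
fi
```

It edits files in place, so try it on a copy of the mirror first.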
