The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Extracting links from a webpage

OK guys, I have a question. Have you seen the cool feature in Firefox where you click on Page Info, go to the Links tab, and get a list of all the URLs linked from that web page? How could I do that using IWebBrowser, IHTMLDocument and C++?

Any pointers and help will be appreciated.  Thanks.
Wednesday, March 01, 2006
IE automation. You can control IE through the COM interface it exposes. WATIR is a great example, but it is written in Ruby.
smalltalk
Thursday, March 02, 2006
The IHTML interfaces that IE exposes through COM give you a few fields, but if you want to go in a browser-independent direction, you'd need an HTML parser. For your purposes, a very naive one would do.

We use a home-grown HTML parser in Printer Friendly to avoid browser dependencies, and it wasn't that difficult to implement. There is free code floating around if you want to look into other people's implementations of such things.
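
To give a rough idea of how little a "very naive" parser needs for link extraction, here is a minimal sketch in portable C++. It is not the Printer Friendly parser; the function name extract_links and its behavior are assumptions made up for illustration. It only scans for <a ...> tags and pulls out the href value, and it knowingly ignores comments, script bodies, and a '>' inside a quoted attribute value.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Lowercase copy so tag and attribute names can be matched case-insensitively.
static std::string to_lower(std::string s) {
    for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

static bool is_space(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper (not from any library): returns the href value of every
// <a ...> tag, in document order. Naive by design: no handling of comments,
// <script> bodies, or '>' characters inside quoted attribute values.
std::vector<std::string> extract_links(const std::string& html) {
    std::vector<std::string> links;
    const std::string lower = to_lower(html);
    std::size_t pos = 0;
    while ((pos = lower.find("<a", pos)) != std::string::npos) {
        const std::size_t after = pos + 2;
        // Require whitespace after "<a" so <abbr>, <area>, etc. are skipped.
        if (after >= lower.size() || !is_space(lower[after])) {
            pos = after;
            continue;
        }
        const std::size_t end = lower.find('>', after);
        if (end == std::string::npos) break;  // unterminated tag: stop scanning

        // Look for an href attribute inside this tag only.
        std::size_t h = after;
        while ((h = lower.find("href", h)) != std::string::npos && h < end) {
            const bool starts_attr = is_space(lower[h - 1]);  // rejects e.g. data-href
            std::size_t v = h + 4;
            while (v < end && is_space(lower[v])) ++v;
            if (starts_attr && v < end && lower[v] == '=') {
                ++v;
                while (v < end && is_space(lower[v])) ++v;
                std::size_t stop;
                if (v < end && (html[v] == '"' || html[v] == '\'')) {
                    const char quote = html[v++];
                    stop = html.find(quote, v);
                    if (stop == std::string::npos || stop > end) stop = end;
                } else {
                    // Unquoted value runs to the next whitespace or the end of the tag.
                    stop = v;
                    while (stop < end && !is_space(html[stop])) ++stop;
                }
                links.push_back(html.substr(v, stop - v));
                break;
            }
            h += 4;
        }
        pos = end + 1;
    }
    return links;
}
```

Calling extract_links on the page source returns the raw href strings as written in the markup. Relative URLs are not resolved against the page address, so that step would still be needed to match what Firefox shows.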

- Cheers
Andrey Butov
Thursday, March 02, 2006
...and for actual plugging into IE through COM to get the page details in the first place, lookie here:
Andrey Butov
Thursday, March 02, 2006
IIRC, there is a sample application called IEDocMon that you might find instructive.

However, if you want complete control over things like script execution, whitespace preservation, etc., or if you have some control over the input HTML, you may also consider writing your own parser.  SuperBot is one example of a web analysis tool with its own parsing engine:

Chris Marshall
Thursday, March 02, 2006
IHTMLDocument2 interface in MSHTML has a *links* collection.

Open MSDN, go to index tab and type IHTMLDocument2.
Umair
Saturday, March 04, 2006
Hi, I have been using IWebBrowser, i.e. the IE browser objects.  The problem comes with dynamically created web pages: for example, if the page has a script that creates a frame and adds some links or other content dynamically, I cannot seem to get the elements from that frame.  I tried IHTMLFramesCollection2 and other approaches; they work on "normal" web pages, but not on dynamically updated ones. :-(  Thanks anyway.
Tuesday, March 07, 2006

This topic is archived. No further replies will be accepted.
