Details on the Crawler
(C) 1997-98, Christof Prenninger
10-credit project (Java)
This page explains the internal structure and functions of the Crawler object. A developer doesn't necessarily need to read this, but it may help in understanding what some of the components do.
[Architecture diagram of the Crawler. Dotted lines are code fragments of the Crawler itself.]
Overview:
When the user starts the Crawler (e.g. by clicking Start in the Controller window), the root node of the tree is created and sent to the Readers to be downloaded. The locally stored HTML file is then sent to the Parsers, which usually find links in that page. Those links are in turn sent to the Readers, and so on.
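The loop between Readers and Parsers could be sketched in Java roughly as follows. This is only an illustration of the flow described above, not the actual code: the queue types, class names and stub methods are all assumptions.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative sketch of the Reader/Parser pipeline (names are assumed).
    class PipelineSketch {
        static final BlockingQueue<String> readQueue  = new LinkedBlockingQueue<>();
        static final BlockingQueue<String> parseQueue = new LinkedBlockingQueue<>();

        public static void main(String[] args) {
            // Reader: takes a URL, downloads it, hands the local file to the Parsers.
            new Thread(() -> {
                try {
                    while (true) parseQueue.put(download(readQueue.take()));
                } catch (InterruptedException e) { /* shut down */ }
            }).start();

            // Parser: finds links in a page and feeds them back to the Readers.
            new Thread(() -> {
                try {
                    while (true)
                        for (String link : extractLinks(parseQueue.take()))
                            readQueue.put(link);                 // ...and so on
                } catch (InterruptedException e) { /* shut down */ }
            }).start();

            readQueue.add("http://example.com/");                // the root node
        }

        // Stubs standing in for the real download and parsing work.
        static String download(String url)          { return "<html>...</html>"; }
        static List<String> extractLinks(String page) { return List.of(); }
    }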
The Crawler uses different kinds of nodes to represent links. There are URLNodes, which only represent a URL and can't be loaded (mail, gopher, ...); LoadableNodes, which represent a URL that can be downloaded (pictures, FTP files); and HTMLNodes, which represent HTML files. Since LoadableNodes are also URLNodes, and HTMLNodes can also be loaded, the class hierarchy of the node types looks like this:
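(The original page showed this hierarchy as a picture. The following sketch reproduces it in Java; only the three class names come from the text, while the fields and method bodies are assumptions.)

    class URLNode {                        // any URL that can't be loaded (mail, gopher, ...)
        java.net.URL url;
    }

    class LoadableNode extends URLNode {   // a URL whose content can be downloaded
        void load() { /* download the content into a local file */ }
    }

    class HTMLNode extends LoadableNode {  // downloadable and parseable
        boolean loadSons;                  // see the note on parsing below
        java.util.List<URLNode> sons = new java.util.ArrayList<>();
    }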
Readers:
Parsers:
Whenever something is removed from a queue, or one of the Reader or Parser threads finishes, the Crawler is informed and sends VisualizerMessages to all attached Visualizers (see the green arrows going to the Visualizer in the diagram).
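This is an observer-style notification. A minimal sketch of how it might look, assuming VisualizerMessage and Visualizer are the project's types but with made-up method names:

    import java.util.ArrayList;
    import java.util.List;

    class CrawlerSketch {
        private final List<Visualizer> visualizers = new ArrayList<>();

        void addVisualizer(Visualizer v) { visualizers.add(v); }

        // Called when a queue entry is removed or a Reader/Parser thread finishes.
        void fireEvent(VisualizerMessage msg) {
            for (Visualizer v : visualizers)
                v.receiveMessage(msg);     // the green arrows in the diagram
        }
    }

    interface Visualizer { void receiveMessage(VisualizerMessage msg); }
    class VisualizerMessage { }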
Before an HTMLNode is sent out to be parsed, the Controller is asked whether its sons will be loaded in the future. Remember that a Parser loads the content info of every newly found son node, and this takes time. If a node's sons are not expected to be downloaded, the Parser shouldn't load that content info, to save time. Whether an HTMLNode's sons will be loaded is stored in the HTMLNode itself.
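Using the HTMLNode sketch from above, that check could look like this. Controller is a real component of the project, but the method name willLoadSons and the dispatch class are assumptions for illustration:

    interface Controller { boolean willLoadSons(HTMLNode node); }

    class ParserInput {
        final java.util.Queue<HTMLNode> parseQueue = new java.util.LinkedList<>();

        // Before an HTMLNode goes to the Parsers, ask the Controller whether
        // its sons will ever be downloaded, and remember the answer in the node.
        void sendToParsers(HTMLNode node, Controller controller) {
            node.loadSons = controller.willLoadSons(node);
            parseQueue.add(node);   // a Parser skips loading the sons'
                                    // content info when loadSons is false
        }
    }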