® WeOnlyDo! COM (2002-2006)
 

General information

wodHtmlParser is a lightweight component for parsing HTML documents. It creates a series of Entity objects, each corresponding to an HTML tag found in the document, and provides information about each tag's text, size, attributes, and so on. You can use wodHtmlParser to easily extract the information you need from a document.

HTML documents are parsed so that each HTML tag is found: its start and end positions are recorded, its child tags are enumerated, links are extracted, and so on. To extract pictures from a document you need just one line of code with wodHtmlParser, something like this:

Set ImageLinks = wodParser.Parts.Filter("IMG").Search(ByAttributeName, "SRC")
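
For readers who want to see the filter-then-search idea spelled out, here is a minimal sketch in Python using only the standard library's html.parser module. It is not wodHtmlParser code and does not reflect the component's API; the names ImageLinkCollector and image_links are made up for this illustration. It keeps only IMG tags and collects the value of the attribute that holds the picture address.

```python
from html.parser import HTMLParser


class ImageLinkCollector(HTMLParser):
    """Collects the src attribute of every IMG tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                self.links.append(value)


def image_links(html):
    """Return the picture addresses found in html, in document order."""
    collector = ImageLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = '<p><IMG SRC="a.png"><img src="b.jpg" alt="x"/><a href="c.html">c</a></p>'
print(image_links(sample))
```

The link in the A tag is skipped because the tag filter is applied before the attribute lookup, which mirrors the Filter and Search order in the one-liner above.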



How does it work?

wodHtmlParser creates a tree-like structure from the document. For example, an HTML document like this:

<HTML><TITLE>this is title</TITLE>
<BODY>
<TABLE><TR><TD>this is one cell</TD></TR><TR><TD>this is another cell</TD></TR></TABLE>
</BODY></HTML>


will create several wodHtmlEntity objects, accessible through the wodHtmlParser.Parts property:

TITLE
BODY
 TABLE
  TR
   TD
  TR
   TD


Some of those entities also contain child entities, which are accessible through the wodHtmlEntity.Parts property, for example:

TR
  TD
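
The nesting described above can be sketched in Python with the standard library's html.parser module. This is only an illustration of a tag tree, not wodHtmlParser's implementation: Node, TreeBuilder, VOID_TAGS and outline are hypothetical names, and the parts list merely borrows its name from the Parts property. Unlike the listing above, this sketch also reports the outer HTML tag as the top-level entry.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not become parents.
VOID_TAGS = {"AREA", "BASE", "BR", "COL", "EMBED", "HR", "IMG",
             "INPUT", "LINK", "META", "SOURCE", "TRACK", "WBR"}


class Node:
    """One parsed tag: its name, attributes, and child nodes."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or [])
        self.parts = []  # child nodes


class TreeBuilder(HTMLParser):
    """Builds a nested Node tree from start and end tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag.upper(), attrs)
        self._stack[-1].parts.append(node)
        if node.name not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        name = tag.upper()
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].name == name:
                del self._stack[i:]
                break


def outline(node, depth=0):
    """Return the tag names below node as an indented list of lines."""
    lines = []
    for child in node.parts:
        lines.append(" " * depth + child.name)
        lines.extend(outline(child, depth + 1))
    return lines


html = (
    "<HTML><TITLE>this is title</TITLE>\n"
    "<BODY>\n"
    "<TABLE><TR><TD>this is one cell</TD></TR>"
    "<TR><TD>this is another cell</TD></TR></TABLE>\n"
    "</BODY></HTML>"
)

builder = TreeBuilder()
builder.feed(html)
builder.close()
print("\n".join(outline(builder.root)))
```

Each Node holds its own parts list, so walking from a TR node reaches only the TD nodes beneath it.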




Easy to recreate original document

wodHtmlParser only reads the document; you cannot use it to replace attributes or text. If you need to do so, however, recreating the original document from the data provided in the RawData property is fairly easy. Also, if you need to parse only part of an HTML document, you can assign that part to the wodHtmlParser.Body property and start a new parse on just the piece of data you provide.
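
One way to picture the recreation step is as splicing replacement text into the original at known positions. The Python sketch below assumes you already have zero-based start offsets and exclusive end offsets for the pieces you want to change; that convention, and the rebuild function itself, are assumptions made for this example and are not taken from the component.

```python
def rebuild(document, edits):
    """Return document with each (start, end, replacement) edit applied.

    Offsets are zero-based and end is exclusive (an assumption of this sketch).
    """
    out = []
    cursor = 0
    for start, end, replacement in sorted(edits):
        if start < cursor or end < start:
            raise ValueError("edits must not overlap or run backwards")
        out.append(document[cursor:start])  # untouched text before the edit
        out.append(replacement)
        cursor = end
    out.append(document[cursor:])  # untouched tail of the document
    return "".join(out)


original = "<TR><TD>this is one cell</TD></TR>"
old_text = "this is one cell"
start = original.index(old_text)
print(rebuild(original, [(start, start + len(old_text), "new text")]))
```

Everything outside the listed ranges is copied through unchanged, so the rest of the markup is preserved exactly.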



Licensing

wodHtmlParser is free to use for all wodHttpDLX customers. It cannot be purchased as a standalone product.