Wednesday, April 30, 2014

Parsing HTML Pages using HTML::Parser

Introduction to HTML::Parser

There are times when you will need to read an HTML file and extract a field from that file. Perl has a module called HTML::Parser that simplifies this task.

This module reads an HTML document and lets you define actions to run when it encounters a start tag, text, or an end tag. To do this, you define subroutines (handlers) that are executed when these events occur. The HTML::Parser documentation lists all the events that can occur during parsing. Here we discuss only the start, text, and end events.

You define the subroutine to handle an event in this format:

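event_h => [\&handler, 'argspec']
  1. event is the name of the event (start, text, or end); the option name is the event name followed by _h, as in start_h
  2. handler is the name of the subroutine
  3. argspec is a string that lists the values to be passed to the subroutine. To pass the tag name to the subroutine, you specify the literal 'tag'. For an end tag, the value passed is prefixed with "/".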
This will be clearer in the sample code.
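
As a minimal sketch of the handler format described above, the snippet below registers a start handler whose argspec asks for two values, 'tagname' and 'attr' (both are argspec names listed in the HTML::Parser documentation). The handler name link_rtn, the @links array, and the sample HTML string are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @links;    # hypothetical: collects href values found in <a> tags

# The argspec 'tagname, attr' passes the tag name and a hash
# reference of the tag's attributes to the handler, in that order.
my $p = HTML::Parser->new(start_h => [\&link_rtn, 'tagname, attr']);

$p->parse('<p>See <a href="http://example.com/">this page</a> and ' .
          '<a href="/about.html">that one</a>.</p>');
$p->eof;    # Tell the parser there is no more input

sub link_rtn {
    my ($tagname, $attr) = @_;
    # Only <a> tags that carry an href attribute are of interest here
    return unless $tagname eq 'a' && defined $attr->{href};
    push @links, $attr->{href};
}

print "$_\n" foreach @links;
```

The handler receives its arguments in the same order as they are named in the argspec string.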

Sample Code for HTML::Parser

The first thing to do is to create an instance of the parser. When you create the instance, you specify which subroutine handles each event.
# Define modules to use
use strict;
use warnings;
use HTML::Parser ();

# Create instance and register a handler for each event
my $p = HTML::Parser->new(start_h => [\&start_rtn, 'tag'],
   text_h => [\&text_rtn, 'text'],
   end_h => [\&end_rtn, 'tag']);

# Start parsing the following HTML string
$p->parse(' <html> <head> <title>Sample HTML Page</title> </head> ' .
   '<body>Hello World This is a test</body> </html>');
$p->eof; # Signal end of input so any buffered text is flushed

sub start_rtn { # Execute when a start tag is encountered
   foreach (@_) {
      print "===\nStart: $_\n";
   }
}

sub text_rtn { # Execute when text is encountered
   foreach (@_) {
      print "\tText: $_\n";
   }
}

sub end_rtn { # Execute when an end tag is encountered
   foreach (@_) {
      print "End: $_\n";
   }
}


Save this and run it. The result will be something like this:

	Text:  
===
Start: html
	Text:  
===
Start: head
	Text:  
===
Start: title
	Text: Sample HTML Page
End: /title
	Text:  
End: /head
	Text:  
===
Start: body
	Text: Hello World This is a test
End: /body
	Text:  
End: /html

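Notice that text_rtn runs for every piece of text, including the whitespace between tags, which is why some of the Text lines appear empty. Likewise, start_rtn runs every time a start tag is encountered, and end_rtn runs every time an end tag is encountered.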
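
Because the text handler fires even for the whitespace between tags, a handler will often want to skip those chunks. Here is a minimal sketch, assuming whitespace-only text is of no interest; the @chunks array and the sample string are illustrative only.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @chunks;    # hypothetical: non-blank text chunks seen so far

my $p = HTML::Parser->new(text_h => [\&text_rtn, 'text']);
$p->parse('<html> <body> <p>Hello World</p> ' .
          '<p>This is a test</p> </body> </html>');
$p->eof;    # Tell the parser there is no more input

sub text_rtn {
    my ($text) = @_;
    return unless $text =~ /\S/;    # Ignore chunks that are only whitespace
    push @chunks, $text;
}

print "Text: $_\n" foreach @chunks;
```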

What use is this then?

You can write routines that act only when a specific tag is encountered. You can also write routines that act only on text that is inside a specific tag.
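
One way to act only on text inside a specific tag is to keep a flag that the start and end handlers switch on and off. The sketch below collects the page title this way; the $in_title and $title variables are names chosen for this example, and it uses the 'tagname' argspec so that end tags arrive without the leading "/".

```perl
use strict;
use warnings;
use HTML::Parser ();

my $in_title = 0;     # True while the parser is inside <title>...</title>
my $title    = '';    # Accumulates the text found inside the title

my $p = HTML::Parser->new(
    start_h => [\&start_rtn, 'tagname'],
    text_h  => [\&text_rtn,  'text'],
    end_h   => [\&end_rtn,   'tagname'],
);

$p->parse('<html><head><title>Sample HTML Page</title></head>' .
          '<body>Hello World</body></html>');
$p->eof;    # Tell the parser there is no more input

sub start_rtn { $in_title = 1 if $_[0] eq 'title' }    # Entering <title>
sub end_rtn   { $in_title = 0 if $_[0] eq 'title' }    # Leaving </title>
sub text_rtn  { $title .= $_[0] if $in_title }         # Keep title text only

print "Title: $title\n";
```

The body text is still passed to text_rtn, but it is ignored because the flag is off by then.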
In our example, we passed an HTML string to the parser. You can also parse a file by using the parse_file($file) method of the module.
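
A minimal sketch of parse_file follows. So that it can run on its own, it first writes a small temporary HTML file using File::Temp; in practice you would pass the name of an existing file. The @tags array and the file contents are assumptions made for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::Parser ();

# Write a small HTML file to parse (a stand-in for a real file on disk)
my ($fh, $file) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<html><body><h1>Hello World</h1><p>This is a test</p></body></html>\n";
close $fh or die "Cannot close $file: $!";

my @tags;    # hypothetical: names of the start tags seen, in order

my $p = HTML::Parser->new(start_h => [sub { push @tags, $_[0] }, 'tagname']);

# parse_file reads the file and signals end of input itself,
# so no separate call to eof is needed
$p->parse_file($file) or die "Cannot parse $file: $!";

print "Start tags: @tags\n";
```

If the file cannot be opened, parse_file returns an undefined value and $! holds the reason, which is why the call is followed by "or die".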
