Introduction to HTML::Parser
There are times when you need to read an HTML file and extract a field from it. Perl has a module called HTML::Parser that simplifies this task.
This module reads an HTML file and lets you define actions to run when it encounters a start tag, the text in the body, and an end tag. To do this, you define subroutines that are executed when these events occur. The HTML::Parser documentation lists all the events that can happen during processing. For our discussion, we will cover only the start, text and end events.
When you create the parser, you define the subroutine to handle an event in this format:
event_h => [\&handler, 'argspec']
event is the name of the event (start, text or end), followed by the suffix _h.
handler is the name of the subroutine.
argspec is a string naming the values to be passed to the subroutine. To pass the tag name to the subroutine, you specify the string 'tag' (for end tags the name is prefixed with a slash, as in /title).
This will be clearer in the sample code.
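As a minimal sketch of how the argument specification works, the string can name more than one value, separated by commas. The example below assumes the 'tagname' and 'attr' argspec names from the HTML::Parser documentation; the subroutine name collect_link and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @links;

# Called for every start tag. It receives the values named in the
# argspec string below, in the same order: tag name, then a hash
# reference holding the tag's attributes.
sub collect_link {
    my ($tagname, $attr) = @_;
    push @links, $attr->{href} if $tagname eq 'a' && defined $attr->{href};
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&collect_link, 'tagname, attr'],
);
$p->parse('<p><a href="one.html">One</a> and <a href="two.html">Two</a></p>');
$p->eof;

print "$_\n" for @links;
```

Here the handler ignores every tag except a and keeps only the href attribute, which is one way to extract a single field from a page.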
Sample Code for HTML::Parser
The first thing to do is to create an instance of the parser. When you create the instance, you specify which subroutine handles each event.
# Define modules to use
use strict;
use warnings;
use HTML::Parser ();

# Create instance and register a handler for each event
my $p = HTML::Parser->new(start_h => [\&start_rtn, 'tag'],
                          text_h  => [\&text_rtn, 'text'],
                          end_h   => [\&end_rtn, 'tag']);

# Start parsing the following HTML string
$p->parse('
<html>
<head>
<title>Sample HTML Page</title>
</head>
<body>
Hello World
This is a test
</body>
</html>');

# Signal the end of the document so any buffered text is flushed
$p->eof;

sub start_rtn {
    # Execute when a start tag is encountered
    foreach (@_) {
        print "===\nStart: $_\n";
    }
}

sub text_rtn {
    # Execute when text is encountered
    foreach (@_) {
        print "\tText: $_\n";
    }
}

sub end_rtn {
    # Execute when an end tag is encountered
    foreach (@_) {
        print "End: $_\n";
    }
}
Result
Save this and run it. The result will be something like this:
	Text:
===
Start: html
	Text:
===
Start: head
	Text:
===
Start: title
	Text: Sample HTML Page
End: /title
	Text:
End: /head
	Text:
===
Start: body
	Text:
Hello World
This is a test
End: /body
	Text:
End: /html
Notice that the text subroutine is executed for every piece of text, including the line breaks between tags. Likewise, every time a start tag is encountered, start_rtn is executed.
What use is this then?
You can write routines that act only when a specific tag is encountered. You can also write routines that act on text only when it is inside a specific tag.
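One way to act on text only inside a specific tag is to keep a flag that the start and end handlers switch on and off. The sketch below pulls out the page title this way. It assumes the 'tagname' argspec (the tag name without a leading slash) and 'dtext' (text with entities decoded) from the HTML::Parser documentation; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $in_title = 0;    # true while the parser is between <title> and </title>
my $title    = '';

my $p = HTML::Parser->new(
    api_version => 3,
    # Switch the flag on at <title> and off at </title>
    start_h => [sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname'],
    end_h   => [sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname'],
    # Collect text only while the flag is set
    text_h  => [sub { $title .= $_[0] if $in_title }, 'dtext'],
);
$p->parse('<html><head><title>Sample HTML Page</title></head><body>Hello World</body></html>');
$p->eof;

print "Title: $title\n";
```

The text handler still runs for every piece of text, but it ignores anything outside the title element.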
In our example, we passed an HTML string to the parser. You can also pass a file to it by using the parse_file($file) method of the module.
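A minimal sketch of parse_file follows. To stay self-contained it first writes a small temporary file with File::Temp; in practice you would pass the name of an existing HTML file. The file contents and the @tags variable are only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::Parser ();

# Write a small HTML file to parse (a stand-in for a real file on disk)
my ($fh, $file) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<html><body><h1>Hello World</h1></body></html>\n";
close $fh or die "Cannot close $file: $!";

my @tags;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [sub { push @tags, $_[0] }, 'tagname'],
);

# parse_file returns undef if the file cannot be opened
$p->parse_file($file) or die "Cannot parse $file: $!";

print "Start tags: @tags\n";
```

Unlike parse, parse_file reads the whole file and signals the end of the document itself, so no separate eof call is needed.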