Optimize your code - faster HTML parsing in PowerShell

First idea

In PowerShell I needed to parse an HTML page to get some data from an HTML table. The web page is really simple: it contains a single HTML table. I searched Google and found out that I could use a lot of built-in functionality.

So my first attempt was the following:

# Assumes $url and $filterDate are defined earlier in the script.
$result = Invoke-WebRequest $url
$startDate = Get-Date
$downloadedFiles = 0

# ParsedHtml is backed by Internet Explorer; every property read below goes through COM.
$result.ParsedHtml.getElementsByTagName('tr') | ForEach-Object {
    $tr = $_
    $filename = $tr.childNodes.item(0).outerText
    # Strip "-" and ":" and turn the space into "T" so the value can be compared with $filterDate.
    $time = $tr.childNodes.item(5).outerText.Replace("-", "").Replace(" ", "T").Replace(":", "")

    if ($time.StartsWith($filterDate))
    {
        Write-Host $filename " " $time
        $downloadedFiles++
    }
}
Write-Host "Found $downloadedFiles files in " ((Get-Date) - $startDate)

The result of the code above was: Found 192 files in 00:20:34.3135602. The table has 3490 rows in total. FAIL.

Why did this happen? Because by default PowerShell parses HTML through IE and communicates with it via COM objects. This probably slows down the process too much.
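To put that number in perspective, here is a small sketch that turns the measurement above into a per-row cost. The variable names are mine; the figures are the ones reported above. It works out to roughly 0.35 seconds per row, which fits the suspicion that each row's property reads (childNodes, item, outerText) are expensive when they go through COM.

```powershell
# Figures taken from the first run reported above.
$elapsed = [TimeSpan]::Parse('00:20:34.3135602', [cultureinfo]::InvariantCulture)
$rows = 3490

# Average time spent on a single table row, in milliseconds.
$msPerRow = $elapsed.TotalMilliseconds / $rows
Write-Host ("{0:N0} ms per row" -f $msPerRow)
```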

Speed up

The optimization was simple: instead of walking the HTML through IE, I parse it as XML using built-in functions. But there is a problem: my page is not valid XML. So I extracted the table, fixed the XML (removed the unclosed <BR> tags) and did the following:

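# Reuses $result and $filterDate from the first attempt.
$startDate = Get-Date   # restart the timer for this run
$downloadedFiles = 0
# IE is used only once here, to grab the table's markup.
$tableHTML = $result.ParsedHtml.getElementsByTagName('table').item(0)
# Drop the unclosed <BR> tags so the markup can be cast to [xml].
[xml]$tableXML = $tableHTML.innerHTML.ToString().Replace("<BR>", "")

# The first node is the root element of the extracted markup; its children are the table rows.
$tableXML.ChildNodes.Item(0).ChildNodes | ForEach-Object {
    $tr = $_
    $filename = $tr.childNodes.item(0).InnerText
    $time = $tr.childNodes.item(5).InnerText.Replace("-", "").Replace(" ", "T").Replace(":", "")

    if ($time.StartsWith($filterDate))
    {
        Write-Host $filename " " $time
        $downloadedFiles++
    }
}
Write-Host "Found $downloadedFiles files in " ((Get-Date) - $startDate)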

The result: Found 192 files in 00:00:03.0420975.

Much better :D
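If you want to try the XML approach above without the real page, here is a minimal, self-contained sketch. The table markup, the file names and the $filterDate value are made up for illustration; only the technique (remove the unclosed <BR> tags, cast to [xml], walk the rows) comes from the code above.

```powershell
# Hypothetical sample markup standing in for the table's innerHTML (not the real page).
$tableHtml = @'
<TBODY>
<TR><TD>a.zip</TD><TD>1</TD><TD>2</TD><TD>3</TD><TD>4</TD><TD>2015-03-01 10:20:30<BR></TD></TR>
<TR><TD>b.zip</TD><TD>1</TD><TD>2</TD><TD>3</TD><TD>4</TD><TD>2015-03-02 11:00:00<BR></TD></TR>
</TBODY>
'@
$filterDate = '20150301'   # hypothetical filter value

# Remove the unclosed <BR> tags so the fragment is well-formed, then cast it to XML.
[xml]$tableXml = $tableHtml.Replace('<BR>', '')

# Walk the rows in memory; no IE or COM involved from here on.
$found = foreach ($tr in $tableXml.DocumentElement.ChildNodes) {
    $filename = $tr.ChildNodes.Item(0).InnerText
    $time = $tr.ChildNodes.Item(5).InnerText.Replace('-', '').Replace(' ', 'T').Replace(':', '')
    if ($time.StartsWith($filterDate)) {
        [pscustomobject]@{ FileName = $filename; Time = $time }
    }
}
$found = @($found)   # force an array even when there are zero or one matches
Write-Host "Found $($found.Count) files"
```

With this sample data only the first row matches the filter. Collecting the matches as objects instead of writing them straight to the host is my own choice here; it makes the result easier to reuse later in a script.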

This post is licensed under CC BY 4.0 by the author.