HtmlAgilityPack Extension to Load a Uri/Url that handles redirects

Posted by Blake on 11/20/2012

This extension lets me load an HtmlDocument from a Uri. I wrote it as a function that also returns a Uri, in case the one you specified redirects to another page. This matters because you need the redirected Uri to turn any relative links on the page into absolute links. I also show an example of how to do this.
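To make the redirect point concrete, here is a minimal sketch of how a relative link resolves against the Uri of the page that actually responded. The `ToAbsolute` helper and the example.com address are made up for illustration; the only framework call is the `Uri(baseUri, relativeString)` constructor.

```vbnet
Imports System

Module RelativeLinkDemo
    ' Hypothetical helper: resolves a link found on a page against that page's Uri.
    Function ToAbsolute(ByVal pageUri As Uri, ByVal link As String) As String
        Return New Uri(pageUri, link).AbsoluteUri
    End Function

    Sub Main()
        ' Assumed address of the page that responded after a redirect.
        Dim responseUri As New Uri("http://www.example.com/blog/2012/post.html")

        Console.WriteLine(ToAbsolute(responseUri, "images/logo.png")) ' http://www.example.com/blog/2012/images/logo.png
        Console.WriteLine(ToAbsolute(responseUri, "/about"))          ' http://www.example.com/about
        Console.WriteLine(ToAbsolute(responseUri, "../archive.html")) ' http://www.example.com/blog/archive.html
    End Sub
End Module
```

If the links were resolved against the Uri you originally requested rather than the one that responded, the folder part of these results could be wrong.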

First, the extension method for “LoadUri” that extends HtmlDocument (this should be cleaned up to allow you to also specify a user agent):

VB.Net

        ''' <summary>
        ''' Loads an HtmlDocument from the specified Uri.  The System.Uri returned by this function is the page that
        ''' actually responded (in the case of a redirect).  This Uri can then be used to correctly turn relative links
        ''' into absolute links.
        ''' </summary>
        ''' <param name="hd">The HtmlDocument to load the response into.</param>
        ''' <param name="uri">The Uri to request.</param>
        ''' <param name="timeoutMs">The request timeout in milliseconds.</param>
        ''' <returns>The Uri of the page that responded.</returns>
        ''' <remarks>The document is only loaded when the response content type is text/html.</remarks>
        <Extension()> _
        Public Function LoadUri(ByVal hd As HtmlDocument, ByVal uri As System.Uri, ByVal timeoutMs As Integer) As System.Uri
            Dim hwr As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
            hwr.Timeout = timeoutMs
            hwr.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

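            ' Dispose the response (and its stream) once the document has been loaded.
            Using resp As HttpWebResponse = DirectCast(hwr.GetResponse(), HttpWebResponse)
                ' Only parse responses that are actually HTML.
                If resp.ContentType.StartsWith("text/html", StringComparison.InvariantCultureIgnoreCase) Then
                    Using resultStream = resp.GetResponseStream()
                        hd.Load(resultStream)
                    End Using
                End If

                ' ResponseUri is the final address after any redirects.
                Return resp.ResponseUri
            End Using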
        End Function
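As a rough sketch of the clean-up mentioned above, an overload could take the user agent as a parameter instead of hard-coding it. The `userAgent` parameter and the module name are my own additions for illustration; everything else mirrors the LoadUri method above.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Runtime.CompilerServices
Imports HtmlAgilityPack

Public Module HtmlDocumentLoadExtensions
    ''' <summary>
    ''' Sketch of a LoadUri overload with a caller-supplied user agent (hypothetical parameter).
    ''' </summary>
    <Extension()>
    Public Function LoadUri(ByVal hd As HtmlDocument, ByVal uri As System.Uri,
                            ByVal timeoutMs As Integer, ByVal userAgent As String) As System.Uri
        Dim hwr As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
        hwr.Timeout = timeoutMs

        ' Leave the header unset when no user agent is supplied.
        If Not String.IsNullOrEmpty(userAgent) Then
            hwr.UserAgent = userAgent
        End If

        Using resp As HttpWebResponse = DirectCast(hwr.GetResponse(), HttpWebResponse)
            ' Only parse responses that are actually HTML.
            If resp.ContentType IsNot Nothing AndAlso
               resp.ContentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase) Then
                Using resultStream As Stream = resp.GetResponseStream()
                    hd.Load(resultStream)
                End Using
            End If

            ' The final address after any redirects.
            Return resp.ResponseUri
        End Using
    End Function
End Module
```

A call would then look like `hd.LoadUri(New System.Uri("http://www.blakepell.com"), 3000, "MyCrawler/1.0")`, where the user agent string is only a placeholder.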

Second, we have an extension method off of HtmlDocument that will return all the links on a page. This particular extension requires the baseUri to be provided in order to construct absolute links from relative ones. The returnUri from the previous LoadUri method is what would be passed into this method:

VB.Net

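        ''' <summary>
        ''' Returns all links from an HTML document as a generic list of strings.  The baseUri is used to turn relative
        ''' links into absolute links.
        ''' </summary>
        ''' <param name="doc">The loaded HtmlDocument to collect links from.</param>
        ''' <param name="baseUri">The Uri of the page that responded, as returned by LoadUri.</param>
        ''' <returns>The links on the page as absolute URLs.</returns>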
        <Extension()> _
        Public Function GetLinks(ByVal doc As HtmlAgilityPack.HtmlDocument, ByVal baseUri As Uri) As List(Of String)
            ' The single-argument GetLinks(doc) overload (not shown here) returns the raw href values.
            Dim linkList As List(Of String) = GetLinks(doc)
            Dim newList As New List(Of String)

            For Each link As String In linkList
                Dim resolved As Uri = Nothing

                ' Resolve against the base Uri so that root-relative ("/about") and parent-relative ("../x")
                ' links come out right; absolute links pass through unchanged.  Unparseable links are skipped.
                If System.Uri.TryCreate(baseUri, link, resolved) Then
                    newList.Add(resolved.AbsoluteUri)
                End If
            Next

            Return newList

        End Function
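The method above calls a single-argument GetLinks(doc) that this post does not include. A minimal stand-in, assuming it simply collects the href value of every anchor tag, might look like the following; the module name is made up, and the real overload may well differ.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Runtime.CompilerServices
Imports HtmlAgilityPack

Public Module HtmlDocumentLinkExtensions
    ''' <summary>
    ''' Hypothetical stand-in for the GetLinks(doc) overload: returns the raw href values of all anchor tags.
    ''' </summary>
    <Extension()>
    Public Function GetLinks(ByVal doc As HtmlDocument) As List(Of String)
        Dim links As New List(Of String)

        ' SelectNodes returns Nothing (not an empty collection) when no node matches.
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors Is Nothing Then
            Return links
        End If

        For Each anchor As HtmlNode In anchors
            Dim href As String = anchor.GetAttributeValue("href", String.Empty).Trim()
            If href.Length > 0 Then
                links.Add(href)
            End If
        Next

        Return links
    End Function
End Module
```

The values come back exactly as written in the page, so relative links stay relative until they are resolved against the base Uri.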

Here is how I would use it in a simple test. In this test case, I just dump the contents of the list into a WinForms RichTextBox to quickly see its values. I also have an extension method on List that lets me turn it into a delimited string quickly (you can replace that with a For Each loop, or set a breakpoint to inspect the contents of the list):

VB.Net

        Dim hd As New HtmlAgilityPack.HtmlDocument
        Dim responseUri As System.Uri = hd.LoadUri(New System.Uri("http://www.blakepell.com"), 3000)
        Dim linkList As List(Of String) = hd.GetLinks(responseUri)

        RichTextBox1.Text = linkList.ToDelimitedString(vbCrLf)
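The ToDelimitedString extension used above is not part of this post. If you want something to drop in, a minimal stand-in (my assumption of what it does, built on String.Join) could be:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Runtime.CompilerServices

Public Module ListExtensions
    ''' <summary>
    ''' Hypothetical stand-in for ToDelimitedString: joins the items with the given delimiter.
    ''' </summary>
    <Extension()>
    Public Function ToDelimitedString(ByVal items As IEnumerable(Of String), ByVal delimiter As String) As String
        Return String.Join(delimiter, items)
    End Function
End Module
```

Because it extends IEnumerable(Of String), it works on the List(Of String) returned by GetLinks as well as on arrays.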

This may need some touching up; it doesn’t handle things like javascript in links, etc. It’s a basic starting point for collecting the links on a page.