start date: Tue, 14 Aug 2007 11:01:26 +0500,
posted on: microsoft.public.dotnet.framework.aspnet
Thread Index:
1. Enigma Boy
2. Alexey Smirnov
3. Jesse Houwing
Retrieving hyperlinks from a web page in code
Hi folks,
I am retrieving a web page from a site using HttpWebRequest. What I want to
do with the retrieved page is list all the hyperlinks in it. If I do a
simple regex search for <a, I also get links that are commented out in the
markup, and I don't want those. I want only links that are actually active.
This is for a reciprocal link check.
Can someone please point me in the right direction?
Thanks.
Date: Tue, 14 Aug 2007 11:01:26 +0500
Author: Enigma Boy
Re: Retrieving hyperlinks from a web page in code
On Aug 14, 8:01 am, "Enigma Boy" wrote:
> Hi folks,
>
> I am retrieving a web page from a site using HttpWebRequest. What I want to
> do with the retrieved page is list all the hyperlinks in it. If I do a
> simple regex search for <a, I also get links that are commented out in the
> markup, and I don't want those. I want only links that are actually active.
> This is for a reciprocal link check.
Hi, you can strip the HTML comments from the text before you extract the
links. For example:
html_code = Regex.Replace(html_code, "<!--.*?-->", "", RegexOptions.Singleline);
This replaces every comment, including multi-line ones, with an empty
string, so any links you extract afterwards come only from active markup.
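The comment-stripping step above can be combined with the link extraction itself. What follows is only a minimal sketch: the class name LinkExtractor, the method GetActiveLinks, and the anchor pattern are illustrative assumptions, not something from the original post. The pattern handles only quoted href values and will be fooled by unusual markup, such as a "<a" inside a script block or a data-href attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LinkExtractor
{
    // Matches an HTML comment, including comments that span several lines.
    private static readonly Regex Comment =
        new Regex("<!--.*?-->", RegexOptions.Singleline);

    // Matches an opening <a> tag and captures a single- or double-quoted href value.
    private static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> GetActiveLinks(string html)
    {
        // Step 1: drop everything that is commented out.
        string withoutComments = Comment.Replace(html, "");

        // Step 2: collect the href of every remaining anchor tag.
        var links = new List<string>();
        foreach (Match m in Anchor.Matches(withoutComments))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }
}
```

Calling LinkExtractor.GetActiveLinks on the string you read from the HttpWebRequest response should give you the candidate list for the reciprocal link check. For anything beyond simple pages, a real HTML parser is likely to be more reliable than regular expressions.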
Date: Tue, 14 Aug 2007 01:17:02 -0700
Author: Alexey Smirnov
Re: Retrieving hyperlinks from a web page in code
Hello Enigma,
> Hi folks,
>
> I am retrieving a web page from a site using HttpWebRequest. What I
> want to do with the retrieved page is list all the hyperlinks in it.
> If I do a simple regex search for <a, I also get links that are
> commented out in the markup, and I don't want those. I want only links
> that are actually active. This is for a reciprocal link check.
>
> Can someone please point me in the right direction?
>
> Thanks.
Have a look at the HTML Agility Pack. It parses the HTML into a document
tree that you can query as if it were XML, for example with XPath.
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
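As a rough sketch of what that might look like, assuming the HtmlAgilityPack assembly is referenced in your project (it is a third-party library, not part of the .NET Framework): the class name AgilityLinkExtractor and the method GetLinks are made up for illustration. As far as I know the parser keeps a comment as a single comment node, so anchors inside comments should not be returned by the XPath query, but verify that against your own pages.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, must be referenced separately

// Hypothetical helper for illustration only.
public static class AgilityLinkExtractor
{
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // Select every <a> element that has an href attribute.
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The html argument would be the response body you already read via HttpWebRequest. Compared with a regex, this copes better with unquoted attributes and odd spacing, at the cost of an extra dependency.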
--
Jesse Houwing
jesse.houwing at sogeti.nl
Date: Tue, 14 Aug 2007 10:56:55 +0000 (UTC)
Author: Jesse Houwing