How can I recursively visit links without revisiting links?

I want to check a site for links and then recursively check the pages it links to for more links, but I don't want to fetch the same page twice. The problem is my logic. This is the Perl code:

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTTP::Request;
use URI::URL;

my $ua           = LWP::UserAgent->new;
my $starting_url = '...';    # URL to start from

my %urls_to_check = ();
my %checked_urls  = ();

&fetch_and_parse($starting_url);

use Data::Dumper; die Dumper(\%checked_urls, \%urls_to_check);

sub fetch_and_parse {
    my ($url) = @_;

    if ($checked_urls{$url} > 1) { return 0; }
    warn "Fetching 'me' links from $url";

    my $p = HTML::TreeBuilder->new;

    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req, sub { $p->parse($_[0])});
    $p->eof();

    my $base = $res->base;

    my @tags = $p->look_down(
        "_tag", "a",
    );

    foreach my $e (@tags) {
        my $full = url($e->attr('href'), $base)->abs;
        $urls_to_check{$full} = 1 if (!defined($checked_urls{$full}));
    }

    foreach my $url (keys %urls_to_check) {
        delete $urls_to_check{$url};
        $checked_urls{$url}++;
        &fetch_and_parse($url);
    }
}


But that doesn't seem to do what I want.

Any ideas?

EDIT: I want to get the URLs from $starting_url and then extract all the URLs from the resulting pages. But if one of those URLs links back to $starting_url, I don't want to fetch it again.


5 answers


If you have a queue of links to check and want to skip duplicates, use a hash to flag the ones you've already visited. Skip the links that are in this hash:

my @need_to_check = (...); # however you make that list
my %already_checked = ();

while (my $link = shift @need_to_check)
    {
    next if exists $already_checked{$link};
    ...;
    $already_checked{$link}++;
    }


The situation is a little more complicated with URLs that look slightly different but end up at the same resource, like http://example.com , http://www.example.com , http://www.example.com/ , and so on. If I cared about those, I would add a normalization step by creating a URI object for each link and then pulling the URL back out as a string. If it were a big problem, I would also look at the URL the response headers claim I actually received (say, after a redirect) and note that I had seen those too.
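For example, a minimal normalization sketch along those lines (the normalize_url helper is just an illustrative name, not part of any module):

use URI;

# Reduce a URL to its canonical string form before using it as a hash
# key, so variants like http://Example.COM and http://example.com/
# collapse into the same entry.
sub normalize_url {
    my ($url) = @_;
    return URI->new($url)->canonical->as_string;
}

my %already_checked;
for my $link ('http://Example.COM', 'http://example.com/') {
    my $key = normalize_url($link);
    next if $already_checked{$key}++;
    print "would fetch $key\n";   # printed only once for this pair
}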



The simplest thing would be not to reinvent the wheel and use CPAN.
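For instance, a rough sketch of what that might look like with WWW::Mechanize, one of several CPAN modules that handle the fetching and link extraction for you (the starting URL and variable names here are placeholders, and error handling is minimal):

use strict;
use warnings;
use WWW::Mechanize;

my $start = 'http://example.com/';               # placeholder starting URL
my $mech  = WWW::Mechanize->new( autocheck => 0 );

my %seen;
my @queue = ($start);

while ( my $url = shift @queue ) {
    next if $seen{$url}++;                       # skip anything already visited

    $mech->get($url);
    next unless $mech->success;

    for my $link ( $mech->links ) {
        my $abs = $link->url_abs->as_string;
        push @queue, $abs unless $seen{$abs};
    }
}

In practice you would also want to restrict this to your own host, or it will happily wander off across the rest of the web.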





I would guess that the problem is that

foreach my $url (keys %urls_to_check) {...}

does not iterate the way you think it does. For every URL you fetch, you end up recursively calling your function again for every URL it finds, which is very inefficient.

Even though you are writing a program to "recursively" scan web pages, in your code you need to use iteration, not recursion:

sub fetch_and_parse {
    my ($url) = @_;
    $urls_to_check{$url} = 1;
    while (%urls_to_check) {
        # Grab a URL and process it, putting any new URLs you find
        # into %urls_to_check
    }
}

Of course, as other posters have pointed out, there are other tools out there that can automate this for you.
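For completeness, a minimal sketch of that iterative loop, reusing the LWP::UserAgent / HTML::TreeBuilder / URI::URL pieces from the question (the %urls_to_check and %checked_urls names are carried over; the start URL is a placeholder and error handling is kept to a minimum):

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTTP::Request;
use URI::URL;

my $ua = LWP::UserAgent->new;
my %checked_urls;
my %urls_to_check = ( 'http://example.com/' => 1 );   # placeholder start URL

while (%urls_to_check) {
    my ($url) = keys %urls_to_check;          # grab any pending URL
    delete $urls_to_check{$url};
    next if $checked_urls{$url}++;            # skip if already fetched

    my $p   = HTML::TreeBuilder->new;
    my $res = $ua->request( HTTP::Request->new( GET => $url ),
                            sub { $p->parse( $_[0] ) } );
    $p->eof;
    next unless $res->is_success;

    my $base = $res->base;
    for my $a ( $p->look_down( _tag => 'a' ) ) {
        next unless defined $a->attr('href');
        my $abs = url( $a->attr('href'), $base )->abs->as_string;
        $urls_to_check{$abs} = 1 unless $checked_urls{$abs};
    }
    $p->delete;                               # free the parse tree
}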



If you want to extract all the links from a page, I recommend HTML::LinkExtor by Gisle Aas; a quick CPAN search will turn it up. You can then traverse the found links by pushing them onto a list and popping them off, checking first with a hash whether you've already visited them, as you did.
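As a small sketch of that module's interface (the $html and $base values below are placeholders; passing a base URL to the constructor makes the extracted links absolute):

use HTML::LinkExtor;

my $base = 'http://example.com/';   # base URL of the page (placeholder)
my $html = '...';                   # page content, fetched elsewhere

my $p = HTML::LinkExtor->new(undef, $base);
$p->parse($html);
$p->eof;

my @found;
for my $link ($p->links) {
    my ($tag, %attrs) = @$link;
    next unless $tag eq 'a' && $attrs{href};
    push @found, "$attrs{href}";    # stringify the absolute URI object
}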



Maybe this can help you: blog.0x53a.de/where-do-my-links-go/ It performs a breadth-first search starting at a given website. The HTML::LinkExtractor module it uses may also be of interest.

Regards, Manuel
