
How to keep the crawl depth very shallow?

The spider is trying to parse all kinds of files: includes, CSS, etc.
I've tried setting the level to 1, but that doesn't seem to work.

It's still parsing hello/anotherlevel/more/here etc.

Do I need to restart the program, or is there a specific setting to only parse top-level HTML files?
Thanks!

Comments

  • s4nt0s Houston, Texas
    Maybe trying a filter would help here, but @sven might have a better solution when he gets back.

  • OK, so in the "don't parse URLs with..." filter:

    How do I add folder paths, /*/* etc.?

  • Sven www.GSA-Online.de
    edited June 2014
    Level doesn't mean the number of subfolders in a URL; it means the number of clicks you need to reach a certain page on the website. For example, you open www.gsa-online.de, click on Products, and then on the wanted product, which makes 3 clicks / levels.
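
    A minimal sketch of that click-based notion of level, in illustrative Python (fetch_links is a hypothetical helper returning the hrefs found on a page; this is not GSA's actual code):

    ```python
    from collections import deque
    from urllib.parse import urljoin

    def crawl(start_url, fetch_links, max_level=1):
        """Breadth-first crawl where 'level' counts clicks from the start
        page, not the number of subfolders in the URL path."""
        seen = {start_url}
        queue = deque([(start_url, 0)])    # (url, click level)
        while queue:
            url, level = queue.popleft()
            yield url, level
            if level >= max_level:         # don't follow links past the level cap
                continue
            for href in fetch_links(url):  # hypothetical: hrefs on the fetched page
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    ```

    Under this definition, a deeply nested URL like hello/anotherlevel/more/here still counts as level 1 if the start page links to it directly, which is why path-based expectations don't match.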
  • Since the bot gets pretty slow with the several thousand URLs I've scraped from Google Maps:

    Is there ANY way to limit the spider to mainhomepage.html and mainhomepage.html/contact,
    and ignore everything else?

    I'm getting JavaScript includes and all sorts of site pages not relevant to my "targeted emails".
  • Sven www.GSA-Online.de

    1. Limit parsing to one level deep only. This would skip all the unrelated pages, since the contact page is usually linked from every page and should be reachable in one click / level.

    2. Watch the URL queue for a while and add the unwanted URLs to the filter, like *badjavapart* (a sketch of such a filter follows this list).
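
    A rough sketch of how such a wildcard blacklist behaves, assuming glob-style patterns (the exact syntax of the "don't parse URLs with..." field is an assumption here; Python's fnmatch stands in for it):

    ```python
    from fnmatch import fnmatchcase

    # Hypothetical blacklist: file types, the *badjavapart* example above,
    # and a pattern requiring at least three slashes after the host.
    FILTERS = ["*.css", "*.js", "*badjavapart*", "*://*/*/*/*"]

    def is_blocked(url):
        return any(fnmatchcase(url, pat) for pat in FILTERS)

    print(is_blocked("http://example.com/static/app.js"))  # True  (*.js)
    print(is_blocked("http://example.com/a/b/c.html"))     # True  (deep path)
    print(is_blocked("http://example.com/contact"))        # False (kept)
    ```

    With glob semantics each extra /* simply demands another slash in the URL, which is one way to read the /*/* folder-path question above.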

  • edited June 2014
    Thanks for your fast response.

    I'm already limiting to 1 level deep, but the spider, crawling the open web, still gets snagged on thousands of unrelated URLs.
    The "nothing off domain" option is also set.

    Adding unwanted URLs one by one, given the chaos that is the web, is really not practical.

    What is needed is more pattern matching / machine learning.
    Even simple matching would help (see the sketch below), like:
    - only crawl a page if a <form> or an <input type=text> is found
    - crawl pages where a captcha is found
    etc.
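
    That kind of content gate could be prototyped along these lines (a sketch, not an existing GSA feature; plain stdlib Python, and the captcha check is just a naive keyword match):

    ```python
    import re
    import urllib.request

    def worth_parsing(url):
        """Keep a page only if it contains a <form>, a text input,
        or something that looks like a captcha, per the idea above."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read(200_000).decode("utf-8", errors="replace")
        except OSError:
            return False                  # unreachable pages are skipped
        return bool(
            re.search(r"<form\b", html, re.I)
            or re.search(r"<input[^>]*type=['\"]?text", html, re.I)
            or "captcha" in html.lower()
        )
    ```

    In a real crawler this check would run right after fetching, so each page is downloaded once and then either parsed for emails or dropped.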
