
How to keep the crawl depth very shallow?

The spider is trying to parse all kinds of files: includes, CSS, etc.
I've tried setting the level to 1, but that doesn't seem to work.

It's still parsing hello/anotherlevel/more/here etc.

Do I need to restart the program, or is there a specific setting to only parse top-level HTML files?
Thanks!

Comments

  • s4nt0s Houston, Texas
    Maybe trying a filter would help here, but @sven might have a better solution when he gets back.

  • OK, so in the "don't parse URLs with..." filter:

    How do I add folder paths, /*/* etc.?

  • Sven www.GSA-Online.de
    edited June 2014
    Level doesn't mean the number of subfolders in a URL; it means the number of clicks you need to reach a certain page on the website. For example, you open www.gsa-online.de, click on Products, and then on the wanted product, which makes 3 clicks / levels.
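
    A minimal sketch of that click-based notion of level, in illustrative Python (fetch_links is a hypothetical helper returning the hrefs found on a page; this is not GSA's actual code):

    ```python
    from collections import deque
    from urllib.parse import urljoin

    def crawl(start_url, fetch_links, max_level=1):
        """Breadth-first crawl where 'level' counts clicks from the start
        page, not the number of subfolders in the URL path."""
        seen = {start_url}
        queue = deque([(start_url, 0)])    # (url, click level)
        while queue:
            url, level = queue.popleft()
            yield url, level
            if level >= max_level:         # don't follow links past the level cap
                continue
            for href in fetch_links(url):  # hypothetical: hrefs on the fetched page
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    ```

    Under this definition, a deeply nested URL like hello/anotherlevel/more/here still counts as level 1 if the start page links to it directly, which is why path-based expectations don't match.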
  • Since the bot gets pretty slow with the several thousand URLs I've scraped from Google Maps:

    Is there ANY way to limit the spider to mainhomepage.html and mainhomepage.html/contact,
    and ignore everything else?

    I'm getting JavaScript includes and all sorts of site pages not relevant to my "targeted emails".
  • Sven www.GSA-Online.de

    1. Limit parsing to one level deep only. This would skip all the unrelated pages, since the contact page is usually linked from every page and should be reachable in one click / level.

    2. Watch the URL queue for a while and add the unwanted URLs to the filter, like *badjavapart* (a sketch of such a filter follows this list).
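
    A rough sketch of how such a wildcard blacklist behaves, assuming glob-style patterns (the exact syntax of the "don't parse URLs with..." field is an assumption here; Python's fnmatch stands in for it):

    ```python
    from fnmatch import fnmatchcase

    # Hypothetical blacklist: file types, the *badjavapart* example above,
    # and a pattern requiring at least three slashes after the host.
    FILTERS = ["*.css", "*.js", "*badjavapart*", "*://*/*/*/*"]

    def is_blocked(url):
        return any(fnmatchcase(url, pat) for pat in FILTERS)

    print(is_blocked("http://example.com/static/app.js"))  # True  (*.js)
    print(is_blocked("http://example.com/a/b/c.html"))     # True  (deep path)
    print(is_blocked("http://example.com/contact"))        # False (kept)
    ```

    With glob semantics each extra /* simply demands another slash in the URL, which is one way to read the /*/* folder-path question above.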

  • edited June 2014
    Thanks for your fast response.

    I'm already limiting to 1 level deep, but the spider, crawling the open web, still gets snagged on thousands of unrelated URLs.
    The "nothing off domain" option is also set.

    Adding unwanted URLs one by one, given the chaos that is the web, is really not practical.

    What is needed is more pattern matching / machine learning.
    Even simple matching would help (see the sketch below), like:
    - only crawl a page if a <form> or an <input type=text> is found
    - crawl pages where a captcha is found
    etc.
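
    That kind of content gate could be prototyped along these lines (a sketch, not an existing GSA feature; plain stdlib Python, and the captcha check is just a naive keyword match):

    ```python
    import re
    import urllib.request

    def worth_parsing(url):
        """Keep a page only if it contains a <form>, a text input,
        or something that looks like a captcha, per the idea above."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read(200_000).decode("utf-8", errors="replace")
        except OSError:
            return False                  # unreachable pages are skipped
        return bool(
            re.search(r"<form\b", html, re.I)
            or re.search(r"<input[^>]*type=['\"]?text", html, re.I)
            or "captcha" in html.lower()
        )
    ```

    In a real crawler this check would run right after fetching, so each page is downloaded once and then either parsed for emails or dropped.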
