
[GSA Email Spider] How to exclude useless directories?

edited August 2013 in GSA Email Spider

When scraping emails, a site will usually have one specific directory that contains the emails, and pretty much all the rest is useless. I don't really need to scrape those other directories, yet I waste dozens of hours on them.


Yet for large sites, there are thousands of useless pages in those directories. Would it be possible to do something like:


- If a lot of emails are found in a specific directory, ignore the other directories. This means that if I set depth to level 3, it will go to level 3 in that directory only.


In other words, is there a smarter, more efficient way to detect which directories have contact info, and make it ignore all the rest?




Edit: I normally have 1,000+ URLs, so I can't look up each site manually and enter the directory with contacts; I'd like to auto-detect it somehow.
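The heuristic asked for above (once one directory clearly dominates the email count, narrow the crawl to it) isn't a feature of the software, but it can be sketched. This is a hypothetical illustration in Python, not GSA Email Spider's actual behavior; the page list, thresholds, and function names are all made up for the example:

```python
import re
from collections import Counter
from urllib.parse import urlparse

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def top_dir(url):
    """First path segment of a URL, e.g. /staff/jones.html -> 'staff'."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[0] if parts and parts[0] else ""

def dominant_directory(pages, min_emails=10, min_share=0.8):
    """Tally emails per top-level directory; if one directory holds
    most of them, return it so further crawling can be narrowed."""
    counts = Counter()
    for url, html in pages:
        counts[top_dir(url)] += len(EMAIL_RE.findall(html))
    total = sum(counts.values())
    if total >= min_emails:
        best, n = counts.most_common(1)[0]
        if n / total >= min_share:
            return best
    return None  # no clear winner yet; keep crawling everything

pages = [
    ("http://example.com/staff/a.html", "a@example.com b@example.com"),
    ("http://example.com/staff/b.html", "c@example.com d@example.com e@example.com"),
    ("http://example.com/blog/post1", "no addresses here"),
]
print(dominant_directory(pages, min_emails=5))  # staff
```

The `min_emails` floor keeps the crawler from locking onto a directory before it has seen enough evidence.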


  • SvenSven

Usually you have the "Contact" link on the first page, or at least visible on all pages. It makes no sense to use a parsing level of 3 here; 1 level is enough.

To only collect and spider pages with such a name in it you can simply go to options->filter and add that in the box with "URL must have" (e.g. enter *contact* or *about* or *whatever*).
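The "URL must have" patterns above are shell-style wildcards. As an illustrative sketch (not the program's actual matching code; the pattern list is assumed), a URL would be kept only if it matches at least one pattern:

```python
from fnmatch import fnmatch

# Hypothetical filter list mirroring the "URL must have" box.
MUST_HAVE = ["*contact*", "*about*", "*team*"]

def keep(url):
    """Keep a URL only if it matches at least one wildcard pattern."""
    return any(fnmatch(url.lower(), pat) for pat in MUST_HAVE)

print(keep("http://example.com/contact-us"))    # True
print(keep("http://example.com/blog/2013/07"))  # False
```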

  • Sure I'd love to do that, but those sections vary so much. Sometimes it's called Contact, other times Team, People, Contact Us, Management, etc.

Am I using the software wrong? What I do is set it to level 3 depth, don't follow outside links, and subdomain only, so it stays within the same site. Then I import 1,000 or 5,000 URLs from a list I have and run it. It takes several days to finish; it goes through millions of URLs.

    What would be the smartest way to scrape emails from my list of urls?



  • SvenSven
Level 3 is too much. Maybe you see the level as folders, but a level means "clicks" needed to reach the page you want from the one you import. So usually it's one click/level to get to the contact page. Try that and it will reduce the parsing time dramatically.
  • I see what you mean, but I did tests from level 1 all the way to level 5! 3 is the best balance, although level 4 gets me even more emails but takes way longer.

    This is why I did 3: on most sites, you click contact, then you click People or Directory, so you get a directory of people, then you click on each person to get their email and full details, so that's three clicks.

    With many sites it takes 4 clicks to get to the email part.

Do you have any other ideas that might save me some time? I only wish that after it finds the right section for contacts, it would ignore all the others. Or is there any other clever trick you can think of?

    Thank you for the support! I appreciate it!
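The depth-as-clicks model discussed above (home -> contact -> people -> person = three clicks) can be sketched as a breadth-first crawl with a level limit. This is a minimal illustration using a made-up link graph instead of real HTTP fetches; it is not how GSA Email Spider is actually implemented:

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "/": ["/contact"],
    "/contact": ["/contact/people"],
    "/contact/people": ["/contact/people/jane"],
    "/contact/people/jane": [],
}

def crawl(start, max_level):
    """Breadth-first crawl; a page N clicks from `start` is level N."""
    seen, queue = {start}, deque([(start, 0)])
    visited = []
    while queue:
        page, level = queue.popleft()
        visited.append(page)
        if level < max_level:
            for nxt in LINKS.get(page, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, level + 1))
    return visited

print(crawl("/", 1))  # one click reaches only the contact page
print(crawl("/", 3))  # three clicks reach the individual person page
```

This also shows why each extra level can blow up the run time: every level multiplies the page count by the typical number of links per page.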

  • SvenSven
    edited August 2013

Sorry, there is not much more you can do besides the filters. And I don't think you want to invest in "GSA Address Completion", or? Because that one will do exactly what you want: get the contact data from one website or address.

Would Address Completion work better than Email Spider? Keep in mind all I want is the emails, so I assume it will work the same way, no?
  • SvenSven
Try the demo version, please. I don't know if it works better for you, but with a webpage as input it searches for just one email belonging to that site, and that's it. Maybe that's not what you want, though.
I will try it. Thanks for your help; I will follow your suggestion, make a list of keywords related to contact pages, and make it scrape only those.

    You're great thanks for the help.

  • SvenSven
You're welcome