
[GSA Email Spider] How to exclude useless directories?

edited August 2013 in GSA Email Spider

When scraping emails, a site will have something like:

domain.com/contacts or domain.com/people/contact, or something similar: one specific directory that holds the emails, while pretty much everything else is useless. I don't really need to scrape domain.com/shop or domain.com/products, yet I waste dozens of hours on them.

Yet on large sites, there are thousands of useless pages in those directories. Would it be possible to do something like:

- If a lot of emails are found in a specific directory, ignore the other directories. That way, if I set depth to level 3, it will go to level 3 in that directory only.

In other words, is there a smarter, more efficient way to detect which directories hold the contact details, and to make the spider ignore all the rest?


Thanks!


Edit: I normally have 1,000+ URLs, so I can't look up each site manually and enter its contacts directory; I would like to auto-detect it somehow, as in the sketch below.
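
To illustrate what I mean by auto-detect, here is a rough Python sketch of the idea (the min_emails threshold is made up, and the simple regex only approximates real email extraction):

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Crude email pattern; good enough for scoring directories.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def top_directory(url):
    """First path segment of a URL, e.g. '/people' for domain.com/people/contact."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return "/" + segments[0] if segments else "/"

def productive_directories(pages, min_emails=5):
    """Given (url, html) pairs fetched at a shallow level, count emails per
    top-level directory and keep only the directories that look productive;
    deeper levels would then be crawled inside these directories only."""
    counts = Counter()
    for url, html in pages:
        counts[top_directory(url)] += len(EMAIL_RE.findall(html))
    return {d for d, n in counts.items() if n >= min_emails}
```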

Comments

  • SvenSven www.GSA-Online.de

    Usually you have the "Contact" link on the first page, or at least visible on all pages. It makes no sense to use a parsing level of 3 here; 1 level is enough.

    To collect and spider only pages with such a name in the URL, simply go to Options->Filter and add it in the "URL must have" box (e.g. enter *contact* or *about* or *whatever*).
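
    In effect, the filter keeps a URL only if it matches at least one wildcard pattern. A minimal Python sketch of that behavior (not the actual implementation, and the pattern list is only an example):

    ```python
    from fnmatch import fnmatch

    # Example patterns in the spirit of the "URL must have" box.
    MUST_HAVE = ["*contact*", "*about*", "*team*"]

    def url_passes_filter(url):
        """Keep a URL only if it matches at least one wildcard pattern."""
        return any(fnmatch(url.lower(), pattern) for pattern in MUST_HAVE)
    ```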

  • Sure, I'd love to do that, but those sections vary so much. Sometimes the page is called Contact, other times Team, People, Contact Us, Management, etc.

    Am I using the software wrong? What I do is set depth to level 3, don't follow outside links, and restrict it to the subdomain, so it stays within www.domain.com. Then I import 1,000 or 5,000 URLs from a list I have and run it. It takes several days to finish and goes through millions of URLs.
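
    Roughly, that setup behaves like the following self-contained sketch of a depth-limited, same-host crawl (my approximation of the settings, not GSA's internals; the requests call and the naive href regex are assumptions):

    ```python
    import re
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests

    # Naive link extractor; a real spider would parse the HTML properly.
    LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

    def crawl(start_url, max_depth=3):
        """Breadth-first crawl that never leaves the start URL's host
        ("don't follow outside links" / "subdomain only") and stops
        max_depth clicks from the imported page (the depth setting)."""
        host = urlparse(start_url).netloc
        seen = {start_url}
        queue = deque([(start_url, 0)])
        pages = []
        while queue:
            url, depth = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            pages.append((url, html))
            if depth == max_depth:
                continue  # depth limit reached; don't follow further links
            for href in LINK_RE.findall(html):
                link = urljoin(url, href)
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return pages
    ```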

    What would be the smartest way to scrape emails from my list of URLs?

    Thanks!

  • SvenSven www.GSA-Online.de
    Level 3 is too much. Maybe you see the levels as folders, but a level means the number of "clicks" needed to reach the page you want from the one you import. Usually it's one click/level to get to the contact page. Try that and it will reduce the parsing time dramatically.
  • I see what you mean, but I tested everything from level 1 to level 5! Level 3 is the best balance; level 4 gets me even more emails but takes way longer.

    This is why I chose 3: on most sites you click Contact, then you click People or Directory to get a listing of people, then you click each person to get their email and full details. That's three clicks.

    On many sites it even takes 4 clicks to reach the emails.

    Do you have any other ideas that might save me some time? I only wish that after it finds the right contact section it would ignore all the others. Any other clever trick you can think of is welcome too.

    Thank you for the support! I appreciate it!

  • SvenSven www.GSA-Online.de
    edited August 2013

    Sorry, there is not much more you can do besides the filters. And I don't think you want to invest in "GSA Address Completion", or? Because that one will do the exact thing you want: get the contact data from one website or address.

  • Would Address Completion work better than Email Spider? Keep in mind, all I want is the emails. So I assume it will work the same way, no?
  • SvenSven www.GSA-Online.de
    Please try the demo version. I don't know if it works better for you, but with a webpage as input it searches for just one email belonging to that site, and that's it. Maybe that's not what you want, though.
  • I will try it. In the meantime, thanks for your help; I will follow your suggestion, make a list of keywords related to contact pages, and make it scrape only those.
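
    A starter list, using only the section names mentioned in this thread, might look like this:

    ```python
    # Wildcard keywords for the "URL must have" filter; purely a starting
    # point, to be extended as new section names show up.
    CONTACT_KEYWORDS = [
        "*contact*", "*contact-us*", "*team*", "*people*",
        "*about*", "*management*", "*directory*",
    ]
    ```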

    You're great, thanks for the help.

  • SvenSven www.GSA-Online.de
    You're welcome.