[GSA Email Spider] How to exclude useless directories?

spider1 · August 2013

When scraping emails, a site will have something like:

domain.com/contacts or domain.com/people/contact, or something similar: one specific directory that holds the emails. Pretty much everything else is useless. I don't really need to scrape domain.com/shop or domain.com/products, and I waste dozens of hours on them.

Yet for large sites, there are thousands of useless pages in those directories. Would it be possible to do something like:

- If a lot of emails are found in a specific directory, ignore the other directories. This means that if I set depth to level 3, it will go to level 3 in that directory only.

In other words, is there a smarter, more efficient way to detect which directories hold the contact details, and make it ignore all the rest?
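The idea above could be sketched roughly as follows. This is only an illustration of the heuristic, not a feature of GSA Email Spider: the function names, the `min_share` threshold, and the sample URLs are all hypothetical, and it assumes you already have a list of crawled URLs with the number of emails found on each.

```python
from collections import Counter
from urllib.parse import urlparse

def top_directory(url):
    """Return the first path segment of a full URL, e.g. '/people' ('/' for the root)."""
    # Assumes the URL includes a scheme (http://...), otherwise urlparse
    # treats the host name as part of the path.
    segments = [s for s in urlparse(url).path.split("/") if s]
    return "/" + segments[0] if segments else "/"

def pick_contact_directory(results, min_share=0.5):
    """results: iterable of (url, email_count) pairs.

    Return the top-level directory that holds at least min_share of all
    emails found so far, or None if no directory dominates.
    """
    counts = Counter()
    for url, email_count in results:
        counts[top_directory(url)] += email_count
    total = sum(counts.values())
    if total == 0:
        return None
    directory, found = counts.most_common(1)[0]
    return directory if found / total >= min_share else None

# Hypothetical results from a shallow first pass over one site.
sample = [
    ("http://domain.com/", 1),
    ("http://domain.com/shop/item1", 0),
    ("http://domain.com/people/jane", 1),
    ("http://domain.com/people/bob", 1),
]
chosen = pick_contact_directory(sample)
```

A deeper crawl could then be restricted to URLs under the chosen directory. The 0.5 threshold is an arbitrary starting point and would need tuning against real sites.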

Thanks!

Edit: I normally have 1,000+ URLs, so I can't look up each site manually and enter the directory with contacts. I would like to auto-detect it somehow.

Sven · August 2013

Usually you have the "Contact" link on the first page, or at least visible on all pages. It makes no sense to use a parsing level of 3 here. 1 level is enough.

To only collect and spider pages with such a name in the URL, you can simply go to options->filter and add it in the box labelled "URL must have" (e.g. enter *contact* or *about* or *whatever*).
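To see what such masks would keep before running a long job, you can test them against a URL list outside the program. The snippet below is a minimal sketch using Python's standard `fnmatch` module. It assumes the filter treats `*` as "any characters" and ignores case, which may not match exactly how the "URL must have" box behaves; the pattern list and sample URLs are made up.

```python
from fnmatch import fnmatchcase

# Hypothetical masks, written the way they would be typed into the filter box.
PATTERNS = ["*contact*", "*about*"]

def url_passes(url, patterns=PATTERNS):
    """Return True if the lowercased URL matches at least one wildcard mask."""
    lowered = url.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns)

# Made-up sample URLs.
urls = [
    "http://domain.com/Contact-Us",
    "http://domain.com/about/team",
    "http://domain.com/shop/item1",
    "http://domain.com/products",
]
kept = [u for u in urls if url_passes(u)]
```

Adding more masks (for example *team* or *people*) widens the filter at the cost of more pages to parse.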

spider1 · August 2013

Sure, I'd love to do that, but those sections vary so much. Sometimes it's called Contact, other times Team, People, Contact Us, Management, etc.

Am I using the software wrong? What I do is set it to level 3 depth, don't follow outside links, and subdomain only, so it stays within www.domain.com. Then I import 1,000 or 5,000 URLs from a list I have and run it. It takes several days to finish and goes through millions of URLs.

What would be the smartest way to scrape emails from my list of urls?

Thanks!

Sven · August 2013

Level 3 is too much. Maybe you see the levels as folders, but a level means the number of "clicks" needed to reach the page you want from the one you import. So usually it's one click/level to get to the contact page. Try that and it will reduce the parsing time dramatically.
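The difference between folders and clicks can be illustrated with a small breadth-first walk over a link map. This is only a sketch of the concept, not the spider's actual code: the `links` dictionary is a made-up site, and `click_depths` is a hypothetical helper.

```python
from collections import deque

def click_depths(links, start, max_level):
    """Breadth-first walk: return each reachable page with its click count from start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] == max_level:
            continue  # do not follow links beyond the level limit
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Made-up site: the contact page sits three folders deep but is one click from home.
links = {
    "/": ["/shop", "/about/team/contact"],
    "/shop": ["/shop/item1"],
    "/about/team/contact": ["/people"],
    "/people": ["/people/jane"],
}
level_1 = click_depths(links, "/", 1)
level_3 = click_depths(links, "/", 3)
```

With a limit of 1, the walk reaches `/about/team/contact` despite its folder depth, but never reaches `/people/jane`, which is three clicks away. Each extra level also pulls in every other page at that click distance, which is where the parsing time goes.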

spider1 · August 2013

I see what you mean, but I did tests from level 1 all the way to level 5! 3 is the best balance, although level 4 gets me even more emails but takes way longer.

This is why I did 3: on most sites, you click contact, then you click People or Directory, so you get a directory of people, then you click on each person to get their email and full details, so that's three clicks.

With many sites it takes 4 clicks to get to the email part.

Do you have any other ideas that might save me some time? I only wish that after it finds the right section for contacts, it would ignore all the others. Or any other clever trick you can think of.

Thank you for the support! I appreciate it!

Sven · August 2013

Sorry, there is not much more you can do besides the filters. And I don't think you want to invest in "GSA Address Completion", or do you? Because that one will do the exact thing you want: get the contact data from one website or address.

spider1 · August 2013

Would Address Completion work better than email spider? Keep in mind all I want is the emails. So I assume it will work the same way, no?

Sven · August 2013

Try the demo version please. I don't know if it works better for you, but with a webpage as input it searches for just one email belonging to that site and that's it. Maybe that's not what you want though.

spider1 · August 2013

I will try it. Thanks for your help, though. I will follow your suggestion, make a list of keywords related to contact pages, and have it scrape only those.

You're great, thanks for the help.

Sven · August 2013

You're welcome
