
[GSA Email Spider] How to exclude useless directories?

edited August 2013 in GSA Email Spider

When scraping emails, a site will usually have one specific directory that contains the emails, and pretty much all the rest is useless. I don't really need to scrape those other directories, yet I waste dozens of hours on them.


Yet for large sites, there are thousands of useless pages in those directories. Would it be possible to do something like:


- If a lot of emails are found in a specific directory, ignore the other directories. This means that if I set depth to level 3, it will go to level 3 in that directory only.


In other words, is there a smarter, more efficient way to detect which directories have contact info, and make it ignore all the rest?




Edit: I normally have 1,000+ URLs, so I can't look up each site manually and enter the directory with contacts; I'd like to auto-detect it somehow.
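The heuristic asked for above (once one directory clearly dominates the email count, narrow the crawl to it) isn't a feature of the software, but it can be sketched. This is a hypothetical illustration in Python, not GSA Email Spider's actual behavior; the page list, thresholds, and function names are all made up for the example:

```python
import re
from collections import Counter
from urllib.parse import urlparse

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def top_dir(url):
    """First path segment of a URL, e.g. /staff/jones.html -> 'staff'."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[0] if parts and parts[0] else ""

def dominant_directory(pages, min_emails=10, min_share=0.8):
    """Tally emails per top-level directory; if one directory holds
    most of them, return it so further crawling can be narrowed."""
    counts = Counter()
    for url, html in pages:
        counts[top_dir(url)] += len(EMAIL_RE.findall(html))
    total = sum(counts.values())
    if total >= min_emails:
        best, n = counts.most_common(1)[0]
        if n / total >= min_share:
            return best
    return None  # no clear winner yet; keep crawling everything

pages = [
    ("http://example.com/staff/a.html", "a@example.com b@example.com"),
    ("http://example.com/staff/b.html", "c@example.com d@example.com e@example.com"),
    ("http://example.com/blog/post1", "no addresses here"),
]
print(dominant_directory(pages, min_emails=5))  # staff
```

The `min_emails` floor keeps the crawler from locking onto a directory before it has seen enough evidence.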


  • SvenSven

Usually you have the "Contact" link on the first page, or at least visible on all pages. It makes no sense to use a parsing level of 3 here; 1 level is enough.

To only collect and spider pages with such a name in it you can simply go to options->filter and add that in the box with "URL must have" (e.g. enter *contact* or *about* or *whatever*).
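The "URL must have" patterns above are shell-style wildcards. As an illustrative sketch (not the program's actual matching code; the pattern list is assumed), a URL would be kept only if it matches at least one pattern:

```python
from fnmatch import fnmatch

# Hypothetical filter list mirroring the "URL must have" box.
MUST_HAVE = ["*contact*", "*about*", "*team*"]

def keep(url):
    """Keep a URL only if it matches at least one wildcard pattern."""
    return any(fnmatch(url.lower(), pat) for pat in MUST_HAVE)

print(keep("http://example.com/contact-us"))    # True
print(keep("http://example.com/blog/2013/07"))  # False
```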

  • Sure I'd love to do that, but those sections vary so much. Sometimes it's called Contact, other times Team, People, Contact Us, Management, etc.

Am I using the software wrong? What I do is set it to level 3 depth, don't follow outside links, and subdomain only, so it stays within the same site. Then I import 1,000 or 5,000 URLs from a list I have and run it. It takes several days to finish; it goes through millions of URLs.

    What would be the smartest way to scrape emails from my list of urls?



  • SvenSven
Level 3 is too much. Maybe you see the level as folders, but a level means "clicks" needed to reach the page you want from the one you import. So usually it's one click/level to get to the contact page. Try that and it will reduce the parsing time dramatically.
  • I see what you mean, but I did tests from level 1 all the way to level 5! 3 is the best balance, although level 4 gets me even more emails but takes way longer.

    This is why I did 3: on most sites, you click contact, then you click People or Directory, so you get a directory of people, then you click on each person to get their email and full details, so that's three clicks.

    With many sites it takes 4 clicks to get to the email part.

Do you have any other ideas that might save me some time? I only wish that after it finds the right section for contacts, it would ignore all the others. Or is there any other clever trick you can think of?

    Thank you for the support! I appreciate it!
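The depth-as-clicks model discussed above (home -> contact -> people -> person = three clicks) can be sketched as a breadth-first crawl with a level limit. This is a minimal illustration using a made-up link graph instead of real HTTP fetches; it is not how GSA Email Spider is actually implemented:

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "/": ["/contact"],
    "/contact": ["/contact/people"],
    "/contact/people": ["/contact/people/jane"],
    "/contact/people/jane": [],
}

def crawl(start, max_level):
    """Breadth-first crawl; a page N clicks from `start` is level N."""
    seen, queue = {start}, deque([(start, 0)])
    visited = []
    while queue:
        page, level = queue.popleft()
        visited.append(page)
        if level < max_level:
            for nxt in LINKS.get(page, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, level + 1))
    return visited

print(crawl("/", 1))  # one click reaches only the contact page
print(crawl("/", 3))  # three clicks reach the individual person page
```

This also shows why each extra level can blow up the run time: every level multiplies the page count by the typical number of links per page.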

  • SvenSven
    edited August 2013

Sorry, there is not much more you can do besides the filters. And I don't think you want to invest in "GSA Address Completion", or? Because that one will do exactly what you want: get the contact data from one website or address.

Would Address Completion work better than Email Spider? Keep in mind all I want is the emails, so I assume it will work the same way, no?
  • SvenSven
Try the demo version, please. I don't know if it works better for you, but with a webpage as input it searches for just one email belonging to that site, and that's it. Maybe that's not what you want, though.
I will try it. Thanks for your help; I will follow your suggestion, make a list of keywords related to contact pages, and make it scrape only those.

    You're great thanks for the help.

  • SvenSven
You're welcome