
GSA Email Spider Preparser

edited January 2013 in GSA Email Spider
Does anyone have a list of values to put in the preparser for better scraping results?
The support PDF I have mentions changing href="javajopup(' to href="
I wondered if anyone had more "tricks" for getting emails from sites where no configuration of the settings seems to work.
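
For anyone following along, here is a rough Python sketch of what a preparser-style replacement does in general. It is not how GSA Email Spider is implemented; the names `PREPARSER_RULES`, `preparse` and `extract_links`, and the sample HTML, are made up for illustration. The idea is simply that the raw page text is rewritten before links are parsed, so a link wrapped in a JavaScript call becomes readable.

```python
import re

# Hypothetical rule list: (text to find, text to put in its place).
# The single rule mirrors the example quoted from the support PDF.
PREPARSER_RULES = [
    ('href="javajopup(\'', 'href="'),
]

# Simplified link pattern: capture everything after href=" up to the next quote.
LINK_RE = re.compile(r'href="([^"\']+)')


def preparse(html, rules=PREPARSER_RULES):
    """Apply each plain-text replacement rule to the raw HTML."""
    for search, replacement in rules:
        html = html.replace(search, replacement)
    return html


def extract_links(html):
    """Return the href targets found by the simplified pattern."""
    return LINK_RE.findall(html)


# Made-up sample page: the real target is hidden inside a JavaScript call.
sample = '<a href="javajopup(\'page2.html\')">Contact</a>'

links_before = extract_links(sample)            # picks up the JavaScript call, not the page
links_after = extract_links(preparse(sample))   # picks up the page name
```

Because the rule is a blind text replacement, it only makes sense for sites that actually use that wrapper.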

There is one site that should be so easy to scrape, but I just can't get the program to pick up the emails.

Thanks

Comments

  • Sven www.GSA-Online.de

    How about pasting the URL and I have a look? ;)

    Usually you set things up in there for unique sites only. That won't work for all sites in general.

  • Here is a page with an email on it 

    Regarding the preparser, can values be set in it and left alone or can values there also have a negative effect on a scrape?

    Thanks
  • Sven www.GSA-Online.de
    That page is readable without any preparser settings.
  • That is what I would expect, but when I load a URL list from that site the program does not find them. The sites go from 10001 to 15000 or something like that. I loaded that list and let the spider go, but came up with an email for only every 10th site or so.

    Any ideas what I am doing wrong? A setting, proxy servers or something?

    Thanks 
  • Sven www.GSA-Online.de
    Try resetting the config to the default one... skip proxies here as they're not really needed, and see for yourself. If you still get a page that's not parseable as expected, give it to me and I'll have a look.
  • Hi Sven,
    I did those things and loaded the URL list. When I started the spidering the list went through very fast and I got one result, noc_admin@163.com, which is a cache administrator. 

    If I visit the site one page at a time I can scrape the email but I can't seem to make it work with automation by importing a list. 

    Any help would be appreciated. 
  • Sven www.GSA-Online.de
    Please send me that URL list so I can have a closer look.
  • Sven, 
    You were a great help and the program seems to work better than before. 

    Thanks
  • Sven,
    Ran into a site where I can't get the email to pull down.
    Seems so obvious...

  • edited February 2013
    Sven,
    I thought I had it on the organogold, but I got 3 emails and not the one on the page.

    They are using the simple mailto: tag.

  • Sven www.GSA-Online.de
    Have a look in the source; no email is visible in it. It's all protected by JavaScript, which even I didn't find by looking at it.
  • I did notice that but was not sure if the email can be scraped with some settings. 
    So if there is no email in the source code there is nothing the spider can do?
  • Sven www.GSA-Online.de
    At least not with the one above.
  • Thanks Sven
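
To illustrate the point about emails that are not in the page source, here is a small Python sketch. It assumes a scraper that only reads the raw HTML and never runs JavaScript; the function name `emails_in_source` and both sample pages are invented for the example, and the obfuscation shown is just one common style, not necessarily what the site above uses.

```python
import re

# Simplified pattern for a mailto: link with a plain address after it.
MAILTO_RE = re.compile(
    r"mailto:([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)


def emails_in_source(html):
    """Return addresses that appear literally in the raw HTML text."""
    return MAILTO_RE.findall(html)


# Made-up page 1: the address is written directly into the markup.
plain = '<a href="mailto:info@example.com">Email us</a>'

# Made-up page 2: the browser assembles the same link at run time,
# so the raw source never contains "mailto:" followed by an address.
obfuscated = """<script>
var u = "info", d = "example" + ".com";
document.write('<a href="mai' + 'lto:' + u + '@' + d + '">Email us</a>');
</script>"""

found_plain = emails_in_source(plain)
found_obfuscated = emails_in_source(obfuscated)
```

The first page yields the address; the second yields nothing, even though a browser would show the same clickable link on both.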
