
Adding Scraped URLs

edited March 2013 in Need Help
When we are using SB and/or other tools to scrape for URLs to add into GSA, is there a certain format we have to adhere to? For example, do we need to trim to root or append anything before adding?

Comments

  • Good to know. That makes things nice and simple
  • Brandon Reputation Management Pro
    Definitely don't trim to root, but you should almost always remove dupe domains. If you trim to root, you will miss things like http://www.example.com/pligginstall/. Removing dupe domains will drop /pligginstall2/ and /jcowinstall/ on the same domain, but it's unlikely that someone has installed a bunch of open source software packages on one site, so chasing those isn't worth the time IMO.
  • It can be beneficial to run your scrapes in separate batches. According to Sven, you should remove duplicate URLs for blog & image comment sites and remove duplicate domains for everything else.
  • Correct, but the best and fastest way is to import everything into SER, let SER organize it, and then dedupe.
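
For anyone who wants to pre-clean a scraped list outside of SER, here is a rough Python sketch of the two kinds of dedup discussed above: removing duplicate URLs (for blog and image comment targets) and removing duplicate domains (for everything else). The function names and sample list are made up for illustration and this is not how SER does it internally. It assumes every URL includes a scheme such as http://, and it treats "domain" as the hostname with a leading www. stripped, so subdomains count as separate domains.

```python
from urllib.parse import urlsplit


def dedupe_urls(urls):
    """Keep the first occurrence of each exact URL, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


def dedupe_domains(urls):
    """Keep the first full URL seen for each host, preserving order.

    Assumption: "domain" means the hostname minus a leading "www.".
    The path is kept intact, so nothing is trimmed to root.
    """
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        host = urlsplit(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept


# Hypothetical scrape results, reusing the example paths from this thread.
scraped = [
    "http://www.example.com/pligginstall/",
    "http://www.example.com/pligginstall2/",
    "http://example.com/jcowinstall/",
    "http://blog.example.org/post-1/",
    "http://blog.example.org/post-1/",
    "http://blog.example.org/post-2/",
]

unique_urls = dedupe_urls(scraped)        # for blog / image comment lists
unique_domains = dedupe_domains(scraped)  # for everything else
```

With this sample, the URL dedup only drops the repeated post-1 line, while the domain dedup keeps one full URL per host (the /pligginstall/ one and the first blog post), which matches the trade-off Brandon describes.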