When we use SB and/or other tools to scrape URLs to add to GSA, is there a certain format we have to adhere to? E.g., do we need to trim to root or append anything before adding?
Comments
rodol
Nope
WebMagic
Good to know. That makes things nice and simple
Brandon Reputation Management Pro
Definitely don't trim to root, but you should almost always remove dupe domains. If you trim to root, you will miss things like http://www.example.com/pligginstall/. Removing dupe domains will lose /pligginstall2/ and /jcowinstall/ on the same domain, but it's unlikely that someone has installed a bunch of open source software packages on one site, so chasing those isn't worth the time IMO.
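To make the difference concrete, here is a rough Python sketch of the two operations run outside of any tool. The function names `trim_to_root` and `remove_duplicate_domains` and the sample list are made up for illustration; this is not how SB or SER implement it internally, just the general idea.

```python
from urllib.parse import urlsplit

def trim_to_root(url):
    """Cut a URL down to scheme + host, discarding the path."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"

def remove_duplicate_domains(urls):
    """Keep only the first URL seen for each host, with its full path intact."""
    seen = set()
    kept = []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        if host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

# Hypothetical scrape results, for illustration only.
scraped = [
    "http://www.example.com/pligginstall/",
    "http://www.example.com/pligginstall2/",
    "http://www.example.com/jcowinstall/",
    "http://www.example.org/blog/post-1",
]

# Trimming to root throws away every install path.
trimmed = sorted({trim_to_root(u) for u in scraped})

# Removing dupe domains keeps one full URL per host.
deduped = remove_duplicate_domains(scraped)

print(trimmed)
print(deduped)
```

With trimming, the /pligginstall/ path is gone entirely. With dupe-domain removal, the first full URL per host survives and only the later ones on that host are dropped.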
DavidA2
It can be beneficial to run your scrapes in separate batches. According to Sven, you should remove duplicate URLs for blog & image comment sites and remove duplicate domains for everything else.
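If you pre-process the batches yourself, the rule could be sketched like this in Python. The helper names (`dedupe`, `by_url`, `by_domain`) and the sample batches are assumptions for illustration, and treating "duplicate URL" as an exact string match after trimming whitespace is a simplification.

```python
from urllib.parse import urlsplit

def dedupe(urls, key):
    """Keep the first URL for each distinct key value, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        k = key(url)
        if k not in seen:
            seen.add(k)
            kept.append(url)
    return kept

def by_url(url):
    # Exact-match key: only identical URLs count as duplicates.
    return url.strip()

def by_domain(url):
    # Host key: any two URLs on the same host count as duplicates.
    return urlsplit(url.strip()).netloc.lower()

# Hypothetical batches, scraped separately per platform type.
blog_comment_batch = [
    "http://blog.example.com/post-1",
    "http://blog.example.com/post-1",
    "http://blog.example.com/post-2",
]
other_batch = [
    "http://forum.example.net/register",
    "http://forum.example.net/login",
    "http://wiki.example.org/index.php",
]

# Blog/image comment targets: every distinct post is worth keeping.
blog_clean = dedupe(blog_comment_batch, by_url)

# Everything else: one URL per domain is enough.
other_clean = dedupe(other_batch, by_domain)

print(blog_clean)
print(other_clean)
```

Keeping the batches separate is what makes it possible to apply a different key to each one.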
rodol
Correct. But the best and fastest way is to import everything into SER, let SER organize it all, and then dedupe.