Identified folder: Time to sort out harvested URLs + Optimal folder size

June 2014

hey guys,

I have a couple of million scraped URLs (SER footprints + KWs), and I am importing them into SER for sorting. It takes SER about 2 days to go through such a list.

Questions are:

1/ Is there any other way to speed up the process?
2/ Any tips on keeping the Identified folder thinner? I end up with >1 million URLs in it.

Feedback appreciated,

June 2014

The identification process runs using the number of threads you specify in Options, so if you haven't already raised that to a level your server can handle, I'd suggest playing around with that number. Also, I take it your scraped URLs are deduped? Unless you're importing blog comment or trackback URLs, you'll be safe deduping at domain level before importing (no point in identifying, say, a social network site two or more times).
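If you want to do the domain-level dedupe outside SER before importing, here is a rough Python sketch of the idea. It is not anything SER ships with; the function names and the one-URL-per-line file layout are my own assumptions. It keeps the first URL seen per hostname, treats `www.example.com` and `example.com` as the same site, and treats other subdomains as separate sites.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "https://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def dedupe_by_domain(urls):
    """Keep the first URL seen for each hostname, preserving input order."""
    seen = set()
    kept = []
    for raw in urls:
        url = raw.strip()
        if not url:
            continue
        # urlsplit only fills in the host when a scheme or "//" is present,
        # so prefix bare entries like "example.com/page" with "//".
        if _SCHEME.match(url) or url.startswith("//"):
            target = url
        else:
            target = "//" + url
        try:
            host = urlsplit(target).hostname or ""
        except ValueError:
            continue  # malformed entry (e.g. broken IPv6 brackets): skip it
        host = host.lower()
        if host.startswith("www."):
            host = host[4:]
        if not host or host in seen:
            continue
        seen.add(host)
        kept.append(url)
    return kept


def dedupe_file(src_path, dst_path):
    """Read one URL per line from src_path, write the deduped list to dst_path.

    Both paths are hypothetical; point them at your own scraped list.
    Returns the number of URLs kept.
    """
    with open(src_path, encoding="utf-8", errors="ignore") as src:
        kept = dedupe_by_domain(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(kept) + "\n")
    return len(kept)
```

As noted above, skip this for blog comment or trackback lists, where several different URLs on the same domain are all worth keeping.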

June 2014

Yeah, I was not sure if threads applied to identification, but I am running 100 threads with CPU at 90+%. RAM: 16 GB; 3.2 GHz processor.

Good idea to dedupe on domain level, that will save me time...

Thanks @cherub

Also, is there any way to keep the Identified folder thin, perhaps using macros (if doable) to pull URLs from a file?
