
Identified folder: Time to sort out harvested URLs + optimal folder size

hey guys,

I have a couple of million scraped URLs (SER footprints + KWs), and I am importing them into SER to sort them out. It takes SER about 2 days to go through such a list.

Questions are:

1/ Is there any other way to speed up the process?
2/ Any tips on keeping the Identified folder thinner? I end up with more than 1 million URLs in it.

Feedback appreciated,

Comments

  • The identification process runs using the number of threads you specify in Options, so if you haven't already raised that to a level your server can handle, I'd suggest experimenting with that number. Also, I take it your scraped URLs are deduped? Unless you're importing blog comment or trackback URLs, you'll be safe deduping at domain level before importing (there's no point in identifying, say, a social network site two or more times).
  • edited June 2014
    Yeah, I was not sure if threads applied to identification, but I am running 100 threads at 90%+ CPU. 16 GB RAM; 3.2 GHz processor.

    Good idea to dedupe at domain level, that will save me time...

    Thanks @cherub

    Also, is there any way to keep the Identified folder thin, perhaps using macros (if doable) to pull URLs from a file?
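
For anyone who wants to do the domain-level dedupe outside of SER before importing, here is a minimal Python sketch. It is only an illustration, not something SER provides: the function name `dedupe_by_domain` is made up for this example, and it assumes one URL per entry and that a leading "www." can be treated as the same site. As noted above, don't run it on blog comment or trackback lists, where several URLs per domain are wanted.

```python
from urllib.parse import urlsplit


def dedupe_by_domain(urls):
    """Keep only the first URL seen for each host (hypothetical helper)."""
    seen = set()
    kept = []
    for raw in urls:
        original = raw.strip()
        if not original:
            continue  # skip blank lines
        # urlsplit only fills in the host when a scheme is present,
        # so add one for bare entries like "example.com/page"
        candidate = original if "://" in original else "http://" + original
        try:
            host = urlsplit(candidate).hostname
        except ValueError:
            continue  # malformed entry, skip it
        if not host:
            continue
        # Assumption: "www.example.com" and "example.com" are the same site
        if host.startswith("www."):
            host = host[4:]
        if host in seen:
            continue
        seen.add(host)
        kept.append(original)
    return kept
```

To use it on a scraped list, read the file into a list of lines, pass it to `dedupe_by_domain`, and write the returned list back out one URL per line before importing into SER.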