Filtering out binary files

I just noticed that .pdf files sometimes show up in my scraped lists. Does the platform identifier have a filter so that it doesn't try to download and parse URLs that don't return a text/html Content-Type header? I'm asking because I've noticed 100% CPU usage since the latest updates, using the same number of threads as weeks before, and I couldn't locate the reason; before, it was taking just a few percent of CPU. I'm running v1.14. Two updates back I was running flawlessly at 800 threads with almost unnoticeable CPU usage; now it takes all of it and I have to run at 100 threads. The server is a dedicated machine: 2 CPUs / 12 cores / 24 threads at 3 GHz, 72 GB RAM. My guess is that I now have more junk in my scraped URLs, which makes the parser get stuck on them. What do you think?
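
For readers who want to pre-filter a list themselves, the Content-Type check asked about above can be sketched in Python as follows. This is only an illustration, not how the platform identifier works internally: the function names, the allow-list, and the choice of a HEAD request are all assumptions, and some servers answer HEAD requests incorrectly or not at all.

```python
from http.client import HTTPException
from typing import Optional
from urllib.request import Request, urlopen

# Assumed allow-list: the post only mentions text/html; XHTML is added here
# as a guess at what an HTML parser would also accept.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html_content_type(header_value: Optional[str]) -> bool:
    """Return True if a Content-Type header value names an HTML media type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


def url_serves_html(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and inspect Content-Type before fetching any body."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return is_html_content_type(response.headers.get("Content-Type"))
    except (OSError, ValueError, HTTPException):
        # Network failures, HTTP error statuses and malformed URLs: skip the URL.
        return False
```

Calling `url_serves_html(url)` on each scraped URL before parsing would drop PDFs and other binaries at the cost of one extra request per URL.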

Comments

  • s4nt0s Houston, Texas
    Programmer is looking into this but he said from doing a quick check he can't see anything in the latest few versions that would change the CPU usage. He's going to look into it deeper.

    Did you happen to change the bandwidth limit? Unchecking/changing that can cause higher CPU too.

    The next update will be hard-coded to skip pdf, jpg, jpeg, bmp, tff, gif, png, doc, xls, ppt, mp4, avi, mpg, mpeg, 3gp, mov, mvi.
  • No changes at all; I still do the imports and runs with the same settings. No bandwidth limit, no deep checking. The only difference could be how I deduped the list: before I was deduping by domain, now by URL. I don't know if that could make a difference.
  • s4nt0s Houston, Texas
    OK, we're looking into it. You should definitely have the bandwidth limit on; that will lower your CPU usage.
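
Until the hard-coded skip list ships, a scraped list can be cleaned by extension before import. The Python sketch below is one possible way to do that, not the tool's actual implementation: the extension set is copied verbatim from the list quoted in the thread, while the function names and the decision to look only at the URL path (ignoring the query string) are assumptions.

```python
from posixpath import splitext
from urllib.parse import urlsplit

# Extensions copied as-is from the list given in the thread.
SKIPPED_EXTENSIONS = frozenset({
    "pdf", "jpg", "jpeg", "bmp", "tff", "gif", "png", "doc", "xls",
    "ppt", "mp4", "avi", "mpg", "mpeg", "3gp", "mov", "mvi",
})


def has_skipped_extension(url: str) -> bool:
    """Return True if the URL path ends in one of the skipped extensions."""
    path = urlsplit(url).path  # excludes the query string and fragment
    extension = splitext(path)[1].lstrip(".").lower()
    return extension in SKIPPED_EXTENSIONS


def filter_urls(urls):
    """Keep only the URLs whose path does not end in a skipped extension."""
    return [url for url in urls if not has_skipped_extension(url)]
```

An extension check costs no network traffic, but it cannot catch binaries served from extensionless URLs; those would still need a Content-Type check.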