Filtering out binary files
I just noticed that .pdf files sometimes show up in my scraped lists. Does platform identifier have a filter so that it does not try to download and parse URLs that don't return a text/html Content-Type header? I am asking because I have seen 100% CPU usage since the latest updates while using the same number of threads as weeks before, and I couldn't locate the reason; previously it took just a few percent of CPU. I am currently running v1.14. Two updates back I was running flawlessly at 800 threads with almost unnoticeable CPU usage; now it uses all of the CPU and I have to run at 100 threads. The server is dedicated: 2 CPUs / 12 cores / 24 threads at 3 GHz, 72 GB RAM. My assumption is that I now have more junk in my scraped URLs, which makes the parser get stuck on them. What do you think?
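To illustrate the kind of filter I mean, here is a minimal Python sketch. It is not how the tool works internally; the function names and the extension list are hypothetical, made up only for this example. It shows two checks: dropping URLs by file extension before any request is made, and accepting a response only when its Content-Type header says HTML.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list of extensions to skip; extend as needed.
BINARY_EXTENSIONS = {".pdf", ".zip", ".exe", ".jpg", ".png", ".gif", ".mp4", ".doc"}


def looks_binary_by_extension(url):
    """Return True if the URL path ends in a known binary extension."""
    path = urlparse(url).path  # query string and fragment are excluded
    ext = posixpath.splitext(path)[1].lower()
    return ext in BINARY_EXTENSIONS


def is_html_content_type(header_value):
    """Return True if a Content-Type header value denotes an HTML document."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def filter_urls(urls):
    """Keep only URLs that do not look like binary files by extension."""
    return [u for u in urls if not looks_binary_by_extension(u)]
```

The extension check is cheap but only a guess, since a URL without an extension can still serve a PDF. The Content-Type check is more reliable but needs the response headers, so it would have to run after the request starts and before the body is downloaded and parsed.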
Comments