Imported scraped targets causing high CPU
rastarr
Thailand
hey @Sven
I've been doing some scraping of target URLs with Scrapebox.
Now the first run produced over 7 million so I exported them in 100,000 target URL batch files.
I'm unsure if there's anything else to do before importing them, but when I imported 200,000 URLs across 88 projects, the CPU hit the roof. Even with the scheduler running 30 projects every 30 minutes, the CPU stays at over 90%.
Any ideas what might be the cause of this? (P.S. I sent you a bug report with this thread URL as the description, in case that helps)
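The pre-processing step described above (turning a 7-million-URL scrape into 100,000-line batch files) can be sketched in a few lines. This is just an illustration of de-duplicating and batching before import, not anything Scrapebox or GSA SER does internally; the batch size and file naming are assumptions.

```python
# Sketch: de-duplicate a large scraped URL list and split it into
# fixed-size batches before importing them into projects.
def unique_batches(urls, batch_size=100_000):
    """Yield lists of at most batch_size unique, non-empty URLs,
    preserving the order in which they were first seen."""
    seen, batch = set(), []
    for u in urls:
        u = u.strip()
        if not u or u in seen:
            continue  # skip blanks and duplicates
        seen.add(u)
        batch.append(u)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Usage (file names are illustrative): read the raw scrape, write one
# file per batch, e.g. batch_000.txt, batch_001.txt, ...
# with open("scraped_urls.txt") as fh:
#     for i, batch in enumerate(unique_batches(fh)):
#         with open(f"batch_{i:03d}.txt", "w") as out:
#             out.write("\n".join(batch) + "\n")
```

De-duplicating first also reduces wasted work in SER, since identical targets imported into many projects multiply the CPU load.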
Comments
Thanks heaps
Import URLs (holding site lists) << this will download each URL from the source you give it, extract every URL linked on it (plain-text URLs as well), and sort them in. A typical source is a pastebin.com URL where many URLs are listed one per line.
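To make the "holding site lists" behaviour concrete, here is a rough sketch of what fetching a source page and pulling out every URL (linked or plain text) amounts to. This is a simplification for illustration only, assuming a naive regex; it is not GSA SER's actual parser.

```python
# Sketch: fetch a page (e.g. a pastebin listing) and extract all URLs
# that appear in it, whether inside href attributes or as plain text.
import re
import urllib.request

# Naive pattern: anything starting http(s):// up to whitespace/quotes/brackets.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def extract_urls(page_text: str) -> list:
    """Return unique URLs found in a page body, in first-seen order."""
    seen = set()
    return [u for u in URL_RE.findall(page_text)
            if not (u in seen or seen.add(u))]

def fetch_and_extract(url: str) -> list:
    """Download one source URL and extract the URLs listed on it."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_urls(resp.read().decode("utf-8", errors="replace"))
```

Doing this for every imported line explains why large imports are expensive: each source URL costs a download plus parsing before anything is sorted into the site lists.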
[1] Import URLs (identify platform and sort in) - I'm presuming that identified platforms will end up in the site list marked 'Identified', right? It isn't going to sort them into 'Submitted' and/or 'Verified', correct?
[2] If that's true, then surely I must also enable the 'Identified' site list in my projects for any newly scraped targets to make it into 'Submitted' and/or 'Verified' - is that correct, or is there something else going on behind the scenes?
I'm just trying to figure out the optimal way to integrate newly found Scrapebox targets into my GSA SER projects. Any hints or insights gratefully appreciated, from you or others.
Can you also explain why people would use both Submitted and Verified lists in a project, instead of just Verified?
What's the gain, if any, in using Submitted, since those targets have already been tested by at least one other project?
Isn't it better to use only Verified, rather than waste time submitting more than once?
I'm baffled why people are using both.