GSA does not write all URLs to custom file

I am scraping using GSA on the backend via Options -> Advanced -> Tools -> Search online for URLs.

I have set it to write to "custom file".

I have checked where it is writing the file to, and it is writing to the correct file. The file size (of the .txt file) keeps going up, but it is not putting all the URLs in there: the scraper says 11,907 URLs identified, but there were only 2,900 URLs in the .txt file. I have done it a few times and it is still doing the same.

I know GSA does not always write to the file straight away, but I have taken account of this as well.

Every time, it only puts a percentage of the URLs in the .txt file.

It is putting some of the scraped URLs in the file, so it is sending the URLs to the correct file, just not all of them.

Comments

  • Another thing as well.

    When stopping GSA, it leaves one project running and will not stop that project, so you can't restart using the scheduler. And if you do a forced close on GSA and restart it, the projects are changed from how they were when you closed it down.

    This happens quite often.


  • You have to close it using Task Manager; clicking the top-right "X" does not close it, and when it is reopened again, the projects have changes made to them.
  • Sorry, but another question I meant to ask earlier.

    When GSA takes the URLs from the "custom file", or from the "identified" file, or the "verified" file, does it take the URLs in sequence, i.e.
    line 1, then
    line 2, then
    line 3,

    or does it take them randomly? (As I think it does.)

  • SvenSven www.GSA-Online.de
    Search + write to "custom file": This is normal. It takes one URL, downloads it and checks how many engines it would match. It can match more than one engine, and then the number of saved entries is less than the number identified.
    ---
    Stopping project: If you see it is not stopping and keeps trying for a long time, hit HELP -> Create Bugreport and I will have a closer look.
    ---
    Entries from a site list are always read in random order.
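
    To illustrate the counting described above, here is a minimal sketch (illustrative only, not GSA's actual code; the URLs and footprints are made up): each engine match adds to the "identified" count, while the URL itself is written to the custom file only once, so the file holds fewer entries than the identified count.

        # Illustrative only - not GSA's code. One URL can match more than one
        # engine; each match is counted as "identified", but the URL itself is
        # written to the custom file only once.
        urls = [
            "http://example.com/blog/post-1?trackback=1",   # made-up URLs
            "http://example.org/forum/thread-9",
        ]

        # Made-up footprints; GSA's real engine definitions differ.
        engine_footprints = {
            "Blog Comment": ["/blog/"],
            "Trackback":    ["trackback"],
            "Forum Post":   ["/forum/"],
        }

        identified = 0
        saved_to_custom_file = set()
        for url in urls:
            matches = [engine for engine, patterns in engine_footprints.items()
                       if any(p in url for p in patterns)]
            identified += len(matches)          # one count per matching engine
            if matches:
                saved_to_custom_file.add(url)   # the URL is stored only once

        print("identified:", identified)                 # 3
        print("saved URLs:", len(saved_to_custom_file))  # 2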
  • Thank you for answering the questions.

    You have answered my questions fully and I understand it now, but I don't think the "custom list" (.txt file) is being written properly, because it was adding URLs and then just stopped, despite another 5,000 being "identified".

    Based on what you said above about the number identified being more than the number of URLs, it is impossible to know how many should be in the .txt file, but after getting up to 15 KB (file size) it just stopped, even though thousands more "identified" are being added to the list.

    If it is storing the URLs in the .txt file and not taking the URLs out when it is using them, then the file size should keep increasing.
  • SvenSven www.GSA-Online.de
    Sorry, I don't get what you are referring to. No URLs are ever removed from site lists. Instead, a random amount from a random position is read and stored to the project *.targets or *.new_targets file, from which the project reads when it has no targets left in cache/memory.
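
    A minimal sketch of the behaviour described above, under the assumption of plain line-based .txt site lists (the function and file names are made up; the real cache is the project's *.targets / *.new_targets file):

        # Minimal sketch, not GSA's implementation: read a random amount of
        # entries from a random position in a site list and append them to a
        # per-project targets cache, from which the project later reads.
        import random

        def pull_random_targets(site_list_path, targets_path, max_count=100):
            with open(site_list_path, encoding="utf-8", errors="ignore") as f:
                lines = [ln.strip() for ln in f if ln.strip()]
            if not lines:
                return 0
            start = random.randrange(len(lines))     # random position
            count = random.randint(1, max_count)     # random amount
            chunk = lines[start:start + count]
            with open(targets_path, "a", encoding="utf-8") as out:
                out.write("\n".join(chunk) + "\n")   # nothing is removed from the site list
            return len(chunk)

        # Hypothetical call:
        # pull_random_targets("identified_Articles.txt", "MyProject.new_targets")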
  • I didn't do that; I read the other post about the "custom list"
    here (after doing a search on the forum, this is the only post that mentions it),

    and it just says that it has to be a ".txt" file - nothing else.

    So I have set it up in the
    "Documents" folder
    folder = CUSTOM FILE
    .txt file = CUSTOM FILE

    If I am doing it wrong, how come it puts the URLs in the ".txt file = CUSTOM FILE"?

    Because it does put the scraped URLs in there, but as I said before, not all of them. It has scraped another 32,000 URLs (sorry, not URLs - 32,000 identified, I don't know how many URLs), but the file is still 15 KB in size, exactly the same as about 12 hours ago.

    Please tell me the exact procedure: where to put everything and what to name the files.

    It was definitely working, writing the .txt file with the scraped URLs, but it was obvious that it was not doing all of them. The file was getting larger, but then it stopped. I have done it a few times, and definitely checked that it is set to "save urls to custom list" (on the scraper interface).

    It started putting the URLs into the .txt file, then stopped at 15 KB.

    Another time it got a lot larger than 15 KB, and I have not done anything to alter/change what it was doing that would stop it saving the URLs to the .txt file.

    ----------------------------
    As a separate issue:
    "Instead, a random amount from a random position is read and stored to the project"
    I could never understand this, because if it was taking them sequentially,
    line 1, then
    line 2, then
    line 3,

    then it would do all of the URLs once and only once. If it is taking them randomly, it will not do all of the URLs and it will do some of them many times.
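
    A quick simulation of that concern (illustrative only: it models naive random picking with replacement, which is not necessarily what GSA does, since already-submitted sites are filtered out as noted later in the thread):

        # Illustrative only: picking 100 times at random (with replacement)
        # from 100 entries repeats some and never touches others, while one
        # sequential pass touches every entry exactly once.
        import random
        from collections import Counter

        entries = list(range(100))                    # stand-ins for 100 URLs
        picks = Counter(random.choice(entries) for _ in range(100))

        never_picked = [e for e in entries if e not in picks]
        repeated = [e for e, n in picks.items() if n >= 2]

        print("never picked:", len(never_picked))     # typically around 37
        print("picked 2+ times:", len(repeated))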




  • SvenSven www.GSA-Online.de
    I think there is a misunderstanding about "custom file". The custom site list is "just" a bunch of different folders you define that can have different site lists in them.
    The custom file when identifying URLs is just a file for yourself...whatever you want to do with it. This is not something you can use as a site list.
    --
    The reason to pull content from a random position of site lists is so it does not have to take care of the changes you make to them (clear dupes, add new entries...).
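
    A minimal sketch of that last point (an assumption about the rationale, not GSA code): a saved sequential position goes stale as soon as the list is deduped or re-sorted, while a random pick needs no saved position at all.

        # Assumption-based sketch, not GSA code: a sequential reader that
        # remembers "resume at line N" points at the wrong entry once the
        # site list is deduplicated; a random pick has no such state.
        site_list = ["a.com", "b.com", "b.com", "c.com", "d.com"]
        next_index = 3                                # meant to resume at "c.com"

        deduped = list(dict.fromkeys(site_list))      # user clears duplicates
        # deduped == ["a.com", "b.com", "c.com", "d.com"]

        print(site_list[next_index])                  # c.com (intended)
        print(deduped[next_index])                    # d.com (c.com is skipped)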
  • So what is the point of the tick box saying "save to custom list"? (On the scraper.)

    Because it's completely pointless?

    Is it still writing to the identified lists? (From the scraper.)

    Why can't it have a "custom list" where it writes the scraped URLs to one .txt file?
    Because that would save a hell of a lot of time; then just verify all the identified URLs by ticking all the platforms.

    ----------------
    "The reason to pull content from a random position of site lists is so it does not have to take care of the changes you make to them (clear dupes, add new entries...)."

    No, I think it would be more efficient if it was doing it sequentially.

    If it keeps bringing up "already parsed", then it is doing that unnecessarily when it could do everything in order - am I wrong?
    And for the same reason, some URLs will not be done at all.



  • SvenSven www.GSA-Online.de
    >So what is the point of the tick box saying "save to custom list"? (On the scraper.)

    It's an option for people who requested it. No one forces you to use it. You can e.g. use that file and import it directly into the projects you want.

    >Because it's completely pointless?

    Just because you see no use in it doesn't mean it's useless for everyone else.

    >Is it still writing to the identified lists? (From the scraper.)

    No, because you set it to save to a custom file instead of the site lists.

    >Why can't it have a "custom list" where it writes the scraped URLs to one .txt file?
    >Because that would save a hell of a lot of time; then just verify all the identified URLs by ticking all the platforms.

    Because each project would read from that file with all kinds of engines in it. Usually you are not using the same engines for each project; otherwise it's a bad setup anyway.

    >No, I think it would be more efficient if it was doing it sequentially.

    Sure, then code your own app please.

    >If it keeps bringing up "already parsed", then it is doing that unnecessarily when it could do everything in order - am I wrong?
    >And for the same reason, some URLs will not be done at all.

    Of course it will not load sites that have been submitted before. It sorts them out by itself. When you get an "already parsed", it means it was a different URL on the same domain or similar.
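
    A minimal sketch of that last point (an assumption about the logic, not GSA's code): if targets are filtered per domain, a new URL on a domain that was already used is skipped and reported, even though that exact URL was never submitted.

        # Assumption-based sketch, not GSA's code: domain-level filtering can
        # report "already parsed" for a different URL on the same domain.
        from urllib.parse import urlparse

        used_domains = set()

        def try_target(url):
            domain = urlparse(url).netloc
            if domain in used_domains:
                return "already parsed"       # same domain, different URL
            used_domains.add(domain)
            return "submitted"

        print(try_target("http://example.com/blog/post-1"))   # submitted
        print(try_target("http://example.com/blog/post-2"))   # already parsed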


  • "It's an option for people who requested it. Noone forces you to use it. You can e.g. use that file and import directly to your wanted projects."

    it doesnt, it stops writing to the file,


    ">because its completely pointless ?"

    it doesnt, it stops writing to the file,& i selected to use the custom file & the projects NOT TO IMPORT from the "identified" list, but import from the "custom list" & the projects are not getting any urls because it is not writing to the custom list.


    ----------
    >Is it still writing to the identified lists? (From the scraper.)

    No, because you set it to save to a custom file instead of the site lists.

    Yes, and it is not writing to the custom list, so the projects have no URLs to post to.

    --------------------------
    >Why can't it have a "custom list" where it writes the scraped URLs to one .txt file?
    >Because that would save a hell of a lot of time; then just verify all the identified URLs by ticking all the platforms.

    Because each project would read from that file with all kinds of engines in it. Usually you are not using the same engines for each project; otherwise it's a bad setup anyway.


    the "SAVE TO CUSTOM FILE" is specifically at
    options
    advanced
    tools
    search online for urls

    so therfore it is being used for scraped urls, & putting it through a project (see below) to change identified to verified would be normal procedure

    ------------------


    OK, please tell me if I am wrong - I may well be - BUT:

    Scrape using Options -> Advanced -> Tools -> Search online for URLs.

    Scrape using all the footprints you are going to use.
    Save to the "custom list" (one .txt file).
    Set up a project with all the platforms/engines you are going to use (for anything).
    In that project, under "options" -> "use urls from global site list", leave everything unticked but tick "custom".

    That means you are getting the URLs from just one file, saving time, and then getting verifieds out at the other end.

    OR
    Do the above, but don't import into the project until scraping is finished, then import from the file (custom list) into the project.
    This way it does every single URL, but only once - not half a dozen times or not at all.

    But both of these are not working, because it is not writing to the file.

    Doing the above saves going through the identifieds many times. Most identifieds are not verifieds, so just feed them through a project to get the verifieds out at the other end.


    ---------------------------
    "> no, i think it would be more efficient if it was doing it sequentially.Sure, then code your own app please."
    everything else is done in order/sequence.

    if number 3 is next, & it does number :-
    25
    36
    85
    12
    25
    100
    175
    36
    25
    25
    45
    25

    it will take more time.



    -------------------------
    "Of course it will not load sites that have been submitted before. It sorts them out by itself. When you get an "already parsed", it means it was a different URL on the same domain or similar."

    How can it be a different URL when it is "already parsed"?

    In order to be "already parsed", it must be the same URL.

    "Of course it will not load sites that have been submitted before."
    Yes it does - it says "already parsed". If it was doing it sequentially, it would not do that, because it would be doing
    1
    2
    3
    4
    5

    not the same ones again.

    Please tell me if I am wrong.

  • PS

    It would be even better if the scraper on the backend could send the URLs directly into a project set up to process the URLs from identified to verified.

    But that was not available, so I worked with what I had in front of me (the custom list), but found that although it sends URLs to the custom list that I set up, it stopped.


  • SvenSven www.GSA-Online.de
    This "you said, I said" thing is really annoying. I have explained it all before and just because you mix terms like "custom file" with "custom site list" doesn't mean there is some bug in it.
    * Content is written to a file when using options->advanced->tools->search...
    * A custom site list has nothing to do with "custom file" from the procedure above.
    Now please let's move on.