independent tool for importing and sorting URLs
Would it be possible to create an independent tool for importing and sorting URLs, rather than doing it via GSA SER as it's done currently? I frequently scrape with Scrapebox, and as a result I'm constantly importing these newly discovered URLs into GSA SER so they can be sorted, but this has become a real pain. While importing, the program constantly freezes until the importing and sorting process has finished (which takes some time, given the size of the files I import and how often I import them), and it's becoming a nightmare to use alongside actually creating backlinks with GSA SER.
At the moment, I pretty much have to stop my projects from running and set aside a specific time to import and sort my URLs, which obviously cuts into the time I'm actually able to use GSA SER to create backlinks. I then have to remove duplicate URLs, which again takes time.
Would it not be possible to create an independent tool, in the same way as GSA Indexer, which can run alongside GSA SER? You would specify where you want the identified URLs stored, and it would import and sort them within that folder. If you have GSA SER running at the same time, it could then also make use of the newly imported URLs that have been identified, assuming you're making use of the global list. It would also be great if, at the end of an import and sort run, you could set it up to automatically remove duplicate URLs, rather than me having to do this manually.
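To make the idea concrete, something along these lines is all the dedupe step would need to do. This is only a rough Python sketch under my own assumptions (that the site lists are plain text files with one URL per line, and that the folder path and .txt filter below are just placeholders, not how GSA SER actually names or stores things):

```python
import os

# Example location of the 'identified' site lists; this path is an
# assumption for illustration, not GSA SER's actual default.
IDENTIFIED_DIR = r"C:\GSA\site_lists_identified"

def dedupe_file(path):
    """Rewrite one site-list file with duplicate URLs removed, keeping order."""
    seen = set()
    unique = []
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                unique.append(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(unique) + "\n")
    return len(unique)

if __name__ == "__main__":
    for name in os.listdir(IDENTIFIED_DIR):
        path = os.path.join(IDENTIFIED_DIR, name)
        # Assumes the lists are stored as .txt files, one per platform.
        if os.path.isfile(path) and name.endswith(".txt"):
            print(f"{name}: {dedupe_file(path)} unique URLs kept")
```

Rewriting each file in place keeps the first copy of every URL, so an independent tool could run something like this after each import without touching anything already sorted.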
Comments
Been thinking about it, and I think I get the gist of what you mean. At the moment, I leave my 'identified' folder pretty much alone. As a result, after all the scraping I've been doing, I have some files which are massive; a blog comment platform file comes in at over 500 MB, for example, even with duplicate URLs removed. I think this is probably causing an issue, as I assume the program has to write to this file each time it imports and sorts a new URL into that platform.
When importing and sorting, do you scrape together a fresh list, which then gets put into your 'identified' folder? You then run through this list, with the program saving 'verified' and 'submitted' URLs into the correct folders. When your projects run out of links to create because your 'identified' folder is exhausted, do you then delete the contents of the 'identified' folder, import and sort a new list which you previously scraped, and then run your projects making use of this new list, and so on?
I have a feeling my issue is probably down to my identified folder being massive, which is causing the program to hang frequently as it writes to the huge files it contains.
Thanks so much for the share. I scrape roughly 8 to 10 million lines a day.
Would you mind PMing me a copy of your list as well, @mmtj?