I have imported around 90 million unique URLs as new targets and they get deleted with these settings

edited June 15 in Bugs
I crawled quite a few URLs myself and deduplicated them.

I import them into a fresh project.

After a while, the project's target URLs get deleted.

Here are my settings:

Is there a maximum target URL file size that GSA SER supports?

The imported target URLs file is 6.13 GB.

The import itself succeeds, because I can see the target URLs file in the project's folder.

GSA SER starts by fetching around 16k URLs.

Today, when I woke up about 10 hours later, I saw that the target URLs file in the project folder had been reset and contained only several megabytes of target URLs.

I am trying again.

The fresh project starts with:

Loaded 16777 URLs from imported sites

And damn, the target URLs have already been reset.

Comments

  • edited June 15
    I am going to try splitting it into files of 10M lines each.

    Each file comes to around 610 MB.

    Let's see whether it works this time.

    OK, now it shows an average of 10M URLs left.

    This is different.



  • Try duplicating your project, e.g. 20 times, and splitting the URLs across the copies.

  • andrzejek said:
    Try duplicating your project, e.g. 20 times, and splitting the URLs across the copies.

    I have split the URLs into 10 pieces.

    Now it is working.

    Each piece is about 600 MB.

    So there is certainly a limit to what the application can handle.
  • Sven www.GSA-Online.de
    A 6 GB file imported directly into the project? Did that project have any content before? Would it be possible to send me that file?
  • edited June 16
    Sven said:
    A 6 GB file imported directly into the project? Did that project have any content before? Would it be possible to send me that file?
    The project was completely reset: right-click, reset data, with all options selected except articles.

    I then right-clicked the project and selected the import target urls from file option.

    Sure, I will send you the file as a private message.
  • Sven www.GSA-Online.de
    Thanks. However, 6 GB is a lot. I don't know if I can easily support that in a 32-bit app, but I'll try my best.
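
For a sense of scale, here is a rough sketch of why a 6 GB file can be awkward for a 32-bit application. It assumes, purely for illustration, that a file size or offset ends up in a 32-bit integer somewhere; the thread does not say what the actual cause inside SER was. `fits_in_32bit` and the constants are hypothetical names, and the 6.13 GB figure is treated as decimal gigabytes.

```python
# Illustrative only: compares file sizes from this thread with 32-bit integer limits.
# Assumption: sizes are decimal (1 GB = 10**9 bytes, 1 MB = 10**6 bytes).
FILE_SIZE_BYTES = int(6.13 * 10**9)   # the full import file
PIECE_SIZE_BYTES = 610 * 10**6        # one of the split pieces

MAX_SIGNED_32 = 2**31 - 1     # largest value of a signed 32-bit integer
MAX_UNSIGNED_32 = 2**32 - 1   # largest value of an unsigned 32-bit integer


def fits_in_32bit(size_bytes: int, signed: bool = True) -> bool:
    """Return True if size_bytes can be represented in a 32-bit integer."""
    limit = MAX_SIGNED_32 if signed else MAX_UNSIGNED_32
    return size_bytes <= limit


print(fits_in_32bit(FILE_SIZE_BYTES, signed=True))    # False
print(fits_in_32bit(FILE_SIZE_BYTES, signed=False))   # False
print(fits_in_32bit(PIECE_SIZE_BYTES, signed=True))   # True
```

Under that assumption, the full file is past both limits while a 610 MB piece is well inside them, which would be consistent with the split pieces importing fine.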
  • Sven www.GSA-Online.de
    Fixed in the next update.
  • @sven: is it already fixed, or will it be fixed in the next update?


    I had some 15-20 million identified URLs, and when I ran the clean-up as shown in the figure, the number of identified URLs went up to 50 million. All projects were stopped while this ran, and it took about 2 days to finish. I have seen this happen 2-3 times on different servers over the past 2 years.



  • Sven www.GSA-Online.de
    Hmm, this clean-up process was not touched or changed at all. I am not aware of any issues there.
  • It has been causing problems with 15-20 million URLs for nearly 2 years that I know of.
  • Sven www.GSA-Online.de
    Well, no one has told me so far. Please open another thread and post details on how to reproduce it.
  • A good workaround for memory issues is to use filemacro.
  • 1linklist FREE TRIAL Linklists - VPM of 150+ - http://1linklist.com
    edited June 21
    Split it up into files. I've found GSA starts to have problems at around 350-400k targets per project. We had this problem with 1Linklist years ago.

    You can then select all relevant projects and all relevant split files, and import to them all at once. Click the randomize option that pops up.

    Wait a while. GSA will let you know when it has everything imported.

    Another pro tip: you're going to want to clear your history and unused account histories frequently (every few days). They will fill up FAST and slow GSA down.

    If you're a Linux user, the "split" command should help you out. If not, I'm sure you can google up some Windows app that does the same thing.

    Or "Cygwin" might work. I'm not sure if it includes split or not, but it probably does.
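
If no split tool is at hand, the same idea can be sketched in a few lines of Python. This is a minimal illustration and not something SER provides: `split_by_lines`, the part naming scheme, and the default of 10 million lines per file (the size the original poster used) are all assumptions.

```python
from pathlib import Path


def split_by_lines(src, out_dir, lines_per_file=10_000_000):
    """Split src into numbered part files of at most lines_per_file lines each.

    Reads and writes bytes, so no URL is re-encoded, and only one line is
    held in memory at a time. Returns the list of part paths in order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    parts = []
    out = None
    try:
        with src.open("rb") as fin:
            for index, line in enumerate(fin):
                # Start a new part file every lines_per_file lines.
                if index % lines_per_file == 0:
                    if out is not None:
                        out.close()
                    name = f"{src.stem}_part{len(parts) + 1:03d}{src.suffix}"
                    part = out_dir / name
                    parts.append(part)
                    out = part.open("wb")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

For example, `split_by_lines("targets.txt", "parts")` (hypothetical file names) would write `parts/targets_part001.txt`, `parts/targets_part002.txt`, and so on, which can then be imported into separate projects.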

  • Sven www.GSA-Online.de
    Splitting target import files should not be necessary anymore. It really was an issue with files that were too big, where not everything got imported. It is now fixed, and big files should no longer slow things down either.
    ---
    But you are right, the unused accounts thing may need a bit of work. SER should have removed these accounts itself once they were no longer needed.
  • 1linklist FREE TRIAL Linklists - VPM of 150+ - http://1linklist.com
    edited June 21

    That's great news on the import fix. That will actually save me a fair bit of time going forward.

    Clearing unused accounts isn't really an issue for me. It takes me five minutes every couple of days. But they really do pile up if a person does not know to keep an eye on them.

    Some kind of time-based clearing, or just a notice when the size gets too big (sometimes mine will end up with more than a million), would be great though.

    Regards,

    -Jordan
  • andrzejek Polska
    edited June 21
    I have 20 million URLs in my project, imported from a file, and it's working fine.
  • @andrzejek
    I have tried with 5 million links, but I don't know why the target URL count drops to zero within an hour. I am still confused about how they can be processed so quickly with so few links built from them. I have tried many lists. I thought the problem was only on my side, but after seeing this thread, it looks like others are facing the same thing.
  • Sven www.GSA-Online.de
    Did you remove the duplicates from that file (options->advanced->tools->...)?
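
For lists too large to deduplicate comfortably in memory, one generic approach outside SER is to bucket lines by hash first and deduplicate each bucket separately. The sketch below is an illustration of that technique, not the SER tool mentioned above; `dedupe_large_file` and its `buckets` parameter are hypothetical names. It matches lines exactly (no URL normalization) and does not preserve the original order.

```python
import tempfile
import zlib
from pathlib import Path


def dedupe_large_file(src, dst, buckets=64):
    """Write the unique non-empty lines of src to dst and return their count.

    Pass 1 spreads lines over temporary bucket files by CRC32, so identical
    lines always land in the same bucket. Pass 2 deduplicates one bucket at
    a time, so only one bucket's lines are held in memory at once.
    """
    src, dst = Path(src), Path(dst)
    unique = 0
    with tempfile.TemporaryDirectory() as tmp:
        paths = [Path(tmp) / f"bucket_{i}.txt" for i in range(buckets)]
        files = [p.open("wb") for p in paths]
        try:
            with src.open("rb") as fin:
                for raw in fin:
                    url = raw.strip()
                    if not url:
                        continue  # skip blank lines
                    files[zlib.crc32(url) % buckets].write(url + b"\n")
        finally:
            for f in files:
                f.close()

        with dst.open("wb") as fout:
            for path in paths:
                seen = set()
                with path.open("rb") as fin:
                    for line in fin:
                        if line not in seen:
                            seen.add(line)
                            fout.write(line)
                unique += len(seen)
    return unique
```

Raising `buckets` lowers peak memory at the cost of more temporary files; the bucket files need roughly as much free disk space as the input.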
  • Yes, I always do.
  • 1linklist FREE TRIAL Linklists - VPM of 150+ - http://1linklist.com

    Do you have "Stop posting if no proxies are available" checked? It's possible you're trying to run with bad proxies.

    I'd also enable "Stop posting if no internet connection is available".

    That's the only thing I can think of that could cause you to burn through a list that size that quickly. It would take hours just to crawl 5 million pages, let alone attempt to post to them.

  • I am using 200+ Blazing SEO semi-dedicated proxies, and I just got them fresh.
  • There are 2 problems here. One is the "already parsed" message, and the other is that links are not being posted.
    I tried processing the list with GSA PI first to clean it up as much as I could. I even trimmed the list size, but the result is the same for me.
  • 1linklist FREE TRIAL Linklists - VPM of 150+ - http://1linklist.com
    Well, "already parsed" means you're getting a lot of duplicates. In your project settings, make sure you have "Allowing posting on same sites again" enabled. That could possibly be the problem.