Where are a project's remaining URLs located?

I have tried the software's built-in de-dupe and it is taking forever.

So I plan to code a faster de-duper, which will obviously use more RAM.

So my question is: where are the project URLs located, so I can de-dupe them?

Thank you

Comments

  • SvenSven www.GSA-Online.de
    Accepted Answer
    Even though I don't think it can be faster, you can try it.
    The targets are located in:
    c:\users\<login>\appdata\roaming\gsa search engine ranker\projects\<name>.targets
    c:\users\<login>\appdata\roaming\gsa search engine ranker\projects\<name>.new_targets
  • edited June 8
    By the way, I loaded the entire list of target URLs into RAM (a HashSet in C#) and it is de-duped in less than a minute, whereas GSA SER takes more than an hour.

    A 2 GB text file takes around 2 GB of RAM.
  • SvenSven www.GSA-Online.de
    what hash algo do you use?
  • edited June 8
    Sven said:
    what hash algo do you use?
    I didn't use any hash algorithm, because it would make it slower. It would reduce RAM usage, though: first hash the link, e.g. with SHA-256 (making every entry a constant size, 64 hex characters), then check the hash set; if the hash doesn't exist, append the URL to the output stream and add the hash to the set; if it exists, continue.
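    A minimal sketch of the hashed variant described above (fixed-size SHA-256 digests in the set instead of full URL strings). The class and method names here are illustrative, not from the original code:

        // Store 64-character SHA-256 digests instead of full URL strings,
        // so per-entry memory is bounded regardless of URL length.
        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.Security.Cryptography;
        using System.Text;

        static class UrlDeduper
        {
            public static void DedupeByHash(string inputPath, string outputPath)
            {
                var seen = new HashSet<string>();
                using var sha = SHA256.Create();
                using var writer = new StreamWriter(outputPath);

                foreach (var line in File.ReadLines(inputPath))
                {
                    // 64 hex characters per URL, regardless of URL length.
                    var digest = Convert.ToHexString(
                        sha.ComputeHash(Encoding.UTF8.GetBytes(line)));

                    if (seen.Add(digest))   // Add returns false if already present.
                        writer.WriteLine(line);
                }
            }
        }

    The trade-off is CPU time for hashing versus memory: the set holds one 64-character string per unique URL instead of the URL itself.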

    But as a quick-and-dirty way, I just added it like this:



        int irReadLines = 0;
        HashSet<string> hsReadLines = new HashSet<string>();

        foreach (var vrLine in File.ReadLines(srSelectedFileName))
        {
            hsReadLines.Add(vrLine);
            irReadLines++;
            if (irReadLines % 10000 == 0)
            {
                Dispatcher.BeginInvoke(new Action(delegate ()
                {
                    lblStatus.Content = $"number of lines read : {irReadLines.ToString("N0")} \t hash set count: {hsReadLines.Count.ToString("N0")}";
                }));
            }
        }
        File.WriteAllLines(srSelectedFileName, hsReadLines);
    This of course limits the file size to the amount of RAM you have, since it loads all URLs into memory. However, the input can be split into smaller files (for example by first letter, or by some other partitioning key) and those files can be merged at the end.
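    A sketch of that split-and-merge idea for files larger than RAM: partition lines into buckets by a cheap key (a hash of the line here, rather than the first letter), de-dupe each bucket independently, then concatenate. All names are illustrative:

        using System.Collections.Generic;
        using System.IO;
        using System.Linq;

        static class ExternalDeduper
        {
            public static void Dedupe(string inputPath, string outputPath, int buckets = 16)
            {
                var writers = Enumerable.Range(0, buckets)
                    .Select(i => new StreamWriter($"{outputPath}.bucket{i}"))
                    .ToArray();

                // Pass 1: route each line to a bucket. Identical lines always
                // land in the same bucket, so buckets can be de-duped independently.
                foreach (var line in File.ReadLines(inputPath))
                    writers[(line.GetHashCode() & 0x7FFFFFFF) % buckets].WriteLine(line);
                foreach (var w in writers) w.Dispose();

                // Pass 2: de-dupe each bucket in memory, append to the output.
                using var output = new StreamWriter(outputPath);
                for (int i = 0; i < buckets; i++)
                {
                    var path = $"{outputPath}.bucket{i}";
                    foreach (var line in new HashSet<string>(File.ReadLines(path)))
                        output.WriteLine(line);
                    File.Delete(path);
                }
            }
        }

    Only one bucket's unique lines need to fit in RAM at a time, so raising the bucket count lowers the peak memory footprint.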
  • SvenSven www.GSA-Online.de
    well I do it like this:
    1. read file one line after the other
    2. take url, make a hash (Murmur2)
    3. find hash in memory
    4. add to new file + list if not found
    5. read next line

    As you see, it's almost the same as your way, except that I don't read everything into memory, just one URL at a time plus a hash table that grows. I still think mine should be the same speed, but I'm here to learn. Can you send me the file you used with your algorithm, plus timings, and I'll try to optimize mine against it?
  • Sven said:
    well I do it like this:
    1. read file one line after the other
    2. take url, make a hash (Murmur2)
    3. find hash in memory
    4. add to new file + list if not found
    5. read next line

    As you see, it's almost the same as your way, except that I don't read everything into memory, just one URL at a time plus a hash table that grows. I still think mine should be the same speed, but I'm here to learn. Can you send me the file you used with your algorithm, plus timings, and I'll try to optimize mine against it?
    I have sent you a PM with my URLs file and my source code file.


  • SvenSven www.GSA-Online.de
    Thank you, I will review this tomorrow.