Where are a project's remaining URLs located?

I have tried the software's built-in de-dupe and it is taking forever.

So I plan to code a faster de-duper, which will obviously use more RAM.

So my question is: where are the project URLs located, so I can de-dupe them?

Thank you

Comments

  • SvenSven www.GSA-Online.de
    Accepted Answer
    Even though I don't think it can be faster, you can try it.
    The targets are located in:
    c:\users\<login>\appdata\roaming\gsa search engine ranker\projects\<name>.targets
    c:\users\<login>\appdata\roaming\gsa search engine ranker\projects\<name>.new_targets
  • edited June 8
    By the way, I loaded the entire list of target URLs into RAM (a HashSet in C#) and it is de-duped in less than a minute, whereas GSA SER takes more than an hour.

    A 2 GB text file takes around 2 GB of RAM.
  • SvenSven www.GSA-Online.de
    what hash algo do you use?
  • edited June 8
    Sven said:
    what hash algo do you use?
    I didn't use any hash algorithm, because it would make it slower. It would reduce RAM usage, though: first hash the link, e.g. with SHA-256 (making every entry a constant size, 64 hex characters), then check the hash set; if the hash doesn't exist, append the URL to the output stream and add the hash to the set; if it exists, continue.
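    A minimal sketch of the hashed variant described above (fixed-size SHA-256 digests in the set instead of full URL strings). The class and method names here are illustrative, not from the original code:

        // Store 64-character SHA-256 digests instead of full URL strings,
        // so per-entry memory is bounded regardless of URL length.
        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.Security.Cryptography;
        using System.Text;

        static class UrlDeduper
        {
            public static void DedupeByHash(string inputPath, string outputPath)
            {
                var seen = new HashSet<string>();
                using var sha = SHA256.Create();
                using var writer = new StreamWriter(outputPath);

                foreach (var line in File.ReadLines(inputPath))
                {
                    // 64 hex characters per URL, regardless of URL length.
                    var digest = Convert.ToHexString(
                        sha.ComputeHash(Encoding.UTF8.GetBytes(line)));

                    if (seen.Add(digest))   // Add returns false if already present.
                        writer.WriteLine(line);
                }
            }
        }

    The trade-off is CPU time for hashing versus memory: the set holds one 64-character string per unique URL instead of the URL itself.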

    But as a quick-and-dirty way, I just added it like this:



        int irReadLines = 0;
        HashSet<string> hsReadLines = new HashSet<string>();

        foreach (var vrLine in File.ReadLines(srSelectedFileName))
        {
            hsReadLines.Add(vrLine);
            irReadLines++;
            if (irReadLines % 10000 == 0)
            {
                Dispatcher.BeginInvoke(new Action(delegate ()
                {
                    lblStatus.Content = $"number of lines read : {irReadLines.ToString("N0")} \t hash set count: {hsReadLines.Count.ToString("N0")}";
                }));
            }
        }
        File.WriteAllLines(srSelectedFileName, hsReadLines);
    This of course limits the file size to the amount of RAM you have, since it loads all URLs into memory. However, the input can be split into smaller files (for example by first letter, or by some other partitioning key) and those files can be merged at the end.
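    A sketch of that split-and-merge idea for files larger than RAM: partition lines into buckets by a cheap key (a hash of the line here, rather than the first letter), de-dupe each bucket independently, then concatenate. All names are illustrative:

        using System.Collections.Generic;
        using System.IO;
        using System.Linq;

        static class ExternalDeduper
        {
            public static void Dedupe(string inputPath, string outputPath, int buckets = 16)
            {
                var writers = Enumerable.Range(0, buckets)
                    .Select(i => new StreamWriter($"{outputPath}.bucket{i}"))
                    .ToArray();

                // Pass 1: route each line to a bucket. Identical lines always
                // land in the same bucket, so buckets can be de-duped independently.
                foreach (var line in File.ReadLines(inputPath))
                    writers[(line.GetHashCode() & 0x7FFFFFFF) % buckets].WriteLine(line);
                foreach (var w in writers) w.Dispose();

                // Pass 2: de-dupe each bucket in memory, append to the output.
                using var output = new StreamWriter(outputPath);
                for (int i = 0; i < buckets; i++)
                {
                    var path = $"{outputPath}.bucket{i}";
                    foreach (var line in new HashSet<string>(File.ReadLines(path)))
                        output.WriteLine(line);
                    File.Delete(path);
                }
            }
        }

    Only one bucket's unique lines need to fit in RAM at a time, so raising the bucket count lowers the peak memory footprint.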
  • SvenSven www.GSA-Online.de
    well I do it like this:
    1. read file one line after the other
    2. take url, make a hash (Murmur2)
    3. find hash in memory
    4. add to new file + list if not found
    5. read next line

    As you see, it's almost the same as your way, except that I don't read everything into memory, just one URL at a time plus a hash table that grows. I still think mine should be the same speed, but I'm here to learn. Can you send me the file you used with your algorithm, plus timings, and I'll try to optimize mine against it?
  • Sven said:
    well I do it like this:
    1. read file one line after the other
    2. take url, make a hash (Murmur2)
    3. find hash in memory
    4. add to new file + list if not found
    5. read next line

    As you see, it's almost the same as your way, except that I don't read everything into memory, just one URL at a time plus a hash table that grows. I still think mine should be the same speed, but I'm here to learn. Can you send me the file you used with your algorithm, plus timings, and I'll try to optimize mine against it?
    I have sent you a PM with my URLs file and my source code file.


  • SvenSven www.GSA-Online.de
    Thank you, I will review this tomorrow.