By the way, I have loaded the entire list of target URLs into RAM (a HashSet in C#), and it is deduped in less than a minute, whereas GSA SER takes more than an hour.
I didn't use any hash algorithm because it would make it slower, although it would use less RAM: first hash the URL, e.g. with SHA-256 (which gives a constant size of 64 hex characters), then check the hash set; if it's not there, append the URL to the output stream and add the hash to the set; if it is already there, skip it.
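In case it helps, here is a minimal sketch of that reduced-memory variant, assuming one URL per line; the file names urls.txt and urls_deduped.txt are just placeholders, not anything from the original code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class HashedDedup
{
    static void Main()
    {
        var seen = new HashSet<string>();
        using var sha = SHA256.Create();
        using var writer = new StreamWriter("urls_deduped.txt"); // hypothetical output file

        // File.ReadLines streams the file line by line, so only one URL
        // (plus the hash set) is in memory at a time.
        foreach (var url in File.ReadLines("urls.txt"))           // hypothetical input file
        {
            if (url.Length == 0) continue;

            // Hash the URL to a fixed-size key (32 bytes / 64 hex chars),
            // so the set's memory use doesn't depend on URL length.
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(url));
            string key = Convert.ToHexString(digest);

            // HashSet<T>.Add returns false if the key was already present,
            // so only the first occurrence of each URL gets written out.
            if (seen.Add(key))
                writer.WriteLine(url);
        }
    }
}
```

The trade-off is exactly the one described above: hashing every URL costs CPU time, but the set stores fixed-size keys instead of full URL strings.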
This of course only handles files as large as the RAM you have, since it loads all URLs into memory. However, the input can be split into smaller files (for example by the URL's first letter, or some other scheme), each file deduped separately, and the results merged at the end, as sketched below.
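A rough sketch of that splitting idea, assuming file names and the bucket count are placeholders: partition the URLs into N bucket files (here by a hash of the URL rather than the first letter, which spreads them more evenly), dedupe each bucket on its own so only one bucket is in RAM at a time, then concatenate the results.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class ShardedDedup
{
    const int Buckets = 16; // assumption: tune so each bucket fits in RAM

    static void Main()
    {
        // Pass 1: partition the input into bucket files.
        var writers = new StreamWriter[Buckets];
        for (int i = 0; i < Buckets; i++)
            writers[i] = new StreamWriter($"bucket_{i}.txt");

        foreach (var url in File.ReadLines("urls.txt")) // hypothetical input file
        {
            if (url.Length == 0) continue;
            int bucket = (url.GetHashCode() & int.MaxValue) % Buckets;
            writers[bucket].WriteLine(url);
        }
        foreach (var w in writers) w.Dispose();

        // Pass 2: dedupe each bucket in memory and append to the merged output.
        using var output = new StreamWriter("urls_deduped.txt"); // hypothetical output file
        for (int i = 0; i < Buckets; i++)
        {
            var seen = new HashSet<string>(File.ReadLines($"bucket_{i}.txt"));
            foreach (var url in seen)
                output.WriteLine(url);
            File.Delete($"bucket_{i}.txt");
        }
    }
}
```

Duplicates always land in the same bucket, so deduping buckets independently gives the same result as deduping the whole file at once.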
As you see, it's almost the same as your way, except that I don't read everything into memory, just one URL at a time, with a hash table that grows. I still think mine should be about the same speed, but I'm here to learn. Can you give me the file you run your algorithm against, plus your timings, and I'll try to optimize mine against it?
I have sent you a PM with my URLs file and my source code file.
Comments
A 2 GB text file takes around 2 GB of RAM.
But as a quick and dirty way, I just added it like this:
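The snippet that post refers to isn't shown here, but a minimal sketch of the quick-and-dirty in-memory approach described in the thread would look something like this (urls.txt and urls_deduped.txt are placeholder names):

```csharp
using System.Collections.Generic;
using System.IO;

class QuickDedup
{
    static void Main()
    {
        // Load every line into a HashSet<string>, which drops duplicates
        // automatically, then write the unique URLs back out.
        var unique = new HashSet<string>(File.ReadLines("urls.txt")); // hypothetical input file
        File.WriteAllLines("urls_deduped.txt", unique);               // hypothetical output file
    }
}
```

This keeps every unique URL string in memory at once, which is why a 2 GB input file needs roughly 2 GB of RAM.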