Loaded URLs * from site lists

OK this has been asked before and I have pieces but I'd like to have a definitive understanding of this (and this may help the next people googling for it).

I have lists from a project I consider to be finished, and I keep seeing: Loaded URLs 0/200 from site lists.

From what I understood, it keeps reading the identified lists (I checked the box, so that's fine by me), and the 0 means it has found 0 new URLs.

Question #1: OK, what about the 200? In a thread I found, you (Sven) said that it takes random URLs. Correct? So you're not reading the whole file, just taking 200 random lines?

Question #2: If it takes random lines, why is it 200 in my case? I've seen people report different numbers, and mine has probably varied too, but I didn't pay attention.

Question #3: Is there any way to change that 200, or to have the whole file read? (assuming everything above is correct)

Thanks!

Comments

  • Sven www.GSA-Online.de

    1) Yes

    2) It depends on your defined Max. Threads and the currently free threads. It would make no sense to load 1000 URLs when you have defined just 1 thread; it would be a waste of memory.

    3) You can import your site list directly to the project using right click -> Import Target URLs -> From Site List.

  • OK, got it; that makes sense except for the random part. It happens that I bought Platform Identifier a few days ago, and it adds new targets in real time.

    So if I have a target file of 10,000 lines and it adds 5 fresh targets, the chance that SER picks those 5 is very low. That defeats 75% of the purpose of Platform Identifier, because if I have to go into SER and do right click > Import Target URLs > From Site List, I might as well turn Platform Identifier off and do right click > Import Target URLs > From File and select my GScraper output.

    Not complaining, because there are a lot of haters out there and I get your point for a typical "standalone" setup, but do you see my problem when combining your programs? Not optimal :s

    OK, thanks, and I hope you can consider this!

  • Sven www.GSA-Online.de

    I see your problem, but also note that if I went through the whole file to see where it has new targets, it would almost kill the program in terms of performance and memory usage. That's not something you want.

    Maybe it would be a good idea to add targets from PI directly to selected projects, as SER monitors changes there and would load the file.

  • You mean PI -> project targets? Then I lose all the benefits of PI: sorting platforms directly into my identified directory, even recognizing multiple platforms from one URL, and (what I thought I could do) direct use of these platform lists from SER in multiple projects.

    In the end I might as well do GScraper -> project targets and uninstall PI; same result.

    OK, thanks for the help anyway :s
  • Trevor_Bandura 267,647 NEW GSA SER Verified List
    @Sven, for the site list thing, could you not just remember the last line number SER took a URL from? Then, when SER needs more links, it would move to the next line in the text file to get more.
  • Sven www.GSA-Online.de
    @Trevor_Bandura, that's possible, but what happens if you clean the file up or exchange it completely, as many people do? It would cause issues all over.
  • Trevor_Bandura 267,647 NEW GSA SER Verified List
    Yes, you're right about that, @Sven. Unless, once SER reaches the EOF, it just starts again from the top?
    • @Sven, are you willing to find an acceptable solution, or do you think it's too specific to be solved?
    • Would an option like "[x] Reload entire lists" be OK? In that case it would reread the file and filter by hosts_done.
    • If it's "NO!" and "HELL NO!" to the previous questions :), would you share the checksum (or whatever is used in hosts_done) so that I could write a script myself to add the functionality?
    Thanks!
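
    The resume-from-last-position idea discussed above (with a restart when the file is cleaned up or replaced, the case Sven warns about) could be sketched like this. This is a hypothetical Python sketch, not SER's actual implementation; the function name and approach are my own:

    ```python
    import os

    def read_new_lines(path, offset):
        """Return lines added to `path` since byte `offset`, plus the new offset.

        If the file shrank (cleaned up or replaced entirely), fall back to
        re-reading from the top instead of seeking past EOF.
        """
        if os.path.getsize(path) < offset:  # file was replaced/cleaned: restart
            offset = 0
        with open(path, "rb") as f:
            f.seek(offset)
            chunk = f.read()
        new_offset = offset + len(chunk)
        lines = [ln.strip() for ln in chunk.decode("utf-8", "ignore").splitlines()
                 if ln.strip()]
        return lines, new_offset
    ```

    Calling this periodically with the stored offset would pick up only the fresh targets PI appends, without rescanning the whole 10,000-line file each time.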
  • Sven www.GSA-Online.de
    I might add a solution like the one @Trevor_Bandura suggested... still not sure how.
  • @Sven Oh OK nice. 

    If you're implementing what @Trevor_Bandura suggested, be aware that when using PI there's sometimes a "Remove duplicates" project running that can change the number of lines, so EOF might not be reliable.

    I'll try to look for another solution, but isn't my "Reload entire lists" option acceptable? Yes, there could be a temporary performance hit at the moment it reloads everything and checks against the hosts_done file, but do you think that's critical?

    Anyway, about the hosts_done checksum: is it a secret? Thanks!
  • Sven www.GSA-Online.de

    No secret... http://pastebin.com/9t6Y0BgZ

    A host hash is generated from the domain only, e.g. http://www.blah.com -> hash('blah.com') or http://sub.blah.com/ -> hash('sub.blah.com');

    ---

    A URL is done with hash( lowercase( URL ) );
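
    Put in code, that description might look like the sketch below. Note that hashlib.md5 here is only a stand-in; the actual checksum is whatever the pastebin code defines, and the "www."-only stripping follows Sven's two examples (subdomains other than www are kept):

    ```python
    import hashlib
    from urllib.parse import urlparse

    def host_hash(url):
        """Hash of the host only: scheme and path ignored, 'www.' stripped."""
        host = urlparse(url).hostname.lower()
        if host.startswith("www."):
            host = host[len("www."):]
        return hashlib.md5(host.encode()).hexdigest()  # md5 = stand-in hash

    def url_hash(url):
        """A URL is marked done with hash(lowercase(URL))."""
        return hashlib.md5(url.lower().encode()).hexdigest()
    ```

    With these two functions, the "reload entire lists and filter by hosts_done" idea from earlier in the thread reduces to a set-membership check against the stored hashes.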

  • THANKS!!!