How does URL pattern matching work for proxy sources?

Can you show examples for each pattern matching field?

I am asking about fetching the main page, then extracting the sub-page URLs so those pages can be crawled and the proxies on them extracted.

Ty

Also, what does the "text only" option do?

@Sven

Comments

  • Sven (www.GSA-Online.de)
    There is no pattern matching... it's my own code that checks each string for an IP or domain followed by an integer, which is taken as the port.
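That check can be sketched in Python. This is a hypothetical illustration, not Sven's actual code; the regex and the helper name are my own:

```python
import re

# Hypothetical sketch of a "string looks like host + port" check:
# an IPv4 address or a domain name, followed by a separator and an
# integer port. Not the actual SER implementation.
HOST_PORT = re.compile(
    r"^(?:(?:\d{1,3}\.){3}\d{1,3}"          # IPv4 like 1.2.3.4
    r"|[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+)"  # ...or a domain name
    r"[:\s]+(\d{1,5})$"                     # separator, then the port
)

def looks_like_proxy(s: str) -> bool:
    m = HOST_PORT.match(s.strip())
    return bool(m) and 1 <= int(m.group(1)) <= 65535

print(looks_like_proxy("192.168.1.10:8080"))     # True
print(looks_like_proxy("proxy.example.com 3128"))  # True
print(looks_like_proxy("not a proxy"))           # False
```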

  • edited June 13
    Sven said:
    There is no pattern matching... it's my own code that checks each string for an IP or domain followed by an integer, which is taken as the port.

    Thanks for the answer @Sven, but the part I am asking about is 



    Also, it would be very helpful to show which pages are being fetched and processed right now.

    For example, I am fetching 21 pages of http://www.mfqqx.com/daili/index_%page%.html with the same mask as yours, and it doesn't show me anything.

    I only see "searching for new proxies". I have no idea whether my mask is working, whether unnecessary pages are being fetched and processed, etc.

    There could be another window showing which URLs are being fetched and parsed.

    By the way, there are thousands of proxies (at least 150 pages) here and the software can't extract them: http://www.proxylists.net/us_0.html

    It also can't parse the proxies listed on this site, which is a default site in GSA SER > http://www.proxytm.com/public-http-proxy-server-lists/type-distorting.htm

    It can't parse the proxies listed here either; there are thousands of proxies > http://www.proxz.com/proxy_list_high_anonymous_0.html

    Another one that can't be parsed > https://premproxy.com/socks-list/01.htm . They use some sort of special span class to print the ports.

    It can't parse here either, and it is also in the official list > http://www.cybersyndrome.net/pla6.html
  • And my final question:

    Where are all of these settings saved, so I can back them up?

    I mean the location on the hard drive.


  • Sven (www.GSA-Online.de)
    Don't get me wrong, but this all sounds like you are going to "recode" things. Otherwise, what's the point of all these detailed questions?
  • edited June 13
    Sven said:
    Don't get me wrong, but this all sounds like you are going to "recode" things. Otherwise, what's the point of all these detailed questions?
    I have zero intention of recoding or releasing software like yours.

    Certainly I could code it myself, since I have written many crawlers for my own purposes.

    But right now my only aim is to get more proxies and increase my LPM :)

    I don't think anyone can easily compete with your software right now, since yours has been in development for many years. I know that developing decent software alone takes many, many years.

    By the way, if you ask my opinion, the biggest weakness of your software is that it is 32-bit and supports a maximum of about 2 GB of RAM. But I know that is an intentional marketing practice.
  • Sven (www.GSA-Online.de)
    OK, but if you really aim to get as many proxies as you can, you should have a look at GSA Proxy Scraper, as this tool gives you the best public proxies available: it constantly tests them and gives each proxy a value (e.g. reliability, days being up, and so on).

    I don't know if it makes much sense to use the SER proxy scraper for this task.
  • Sven said:
    OK, but if you really aim to get as many proxies as you can, you should have a look at GSA Proxy Scraper, as this tool gives you the best public proxies available: it constantly tests them and gives each proxy a value (e.g. reliability, days being up, and so on).

    I don't know if it makes much sense to use the SER proxy scraper for this task.
    It is far too expensive for me right now.

    By the way, I have collected over 1 million proxies by fine-tuning GSA SER and testing how many are working at any given moment.

    Could you at least answer my questions in this thread? Thank you.
  • Sven (www.GSA-Online.de)
    Accepted Answer
    Your questions:

    The parameters you see here and the masks work as in every other part of the program:
    mask1|mask2|mask3 means at least one of them has to match. A mask can use * for any run of characters (or none), ? for any single character, and ranges such as [1-9] or [a-z].
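As a rough illustration, Python's fnmatch module implements the same wildcard conventions (* for any run of characters, ? for one character, [1-9] for ranges), so the |-separated mask matching could be sketched like this. The example URL and masks are made up:

```python
from fnmatch import fnmatch

def url_matches(url: str, masks: str) -> bool:
    """Return True if the URL matches at least one |-separated mask.

    * matches any run of characters (or none), ? matches exactly one
    character, and [1-9] / [a-z] match character ranges.
    """
    return any(fnmatch(url, mask) for mask in masks.split("|"))

# Hypothetical mask: match numbered index pages or anything under /proxylist.
masks = "*/daili/index_[1-9].html|*/proxylist*"
print(url_matches("http://example.com/daili/index_7.html", masks))  # True
print(url_matches("http://example.com/about.html", masks))          # False
```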

    Simulate Browser: here you enter a user agent string, e.g. to tell the site that you are Google and want to parse it.
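A minimal sketch of what such a request looks like outside SER, assuming a placeholder URL and using Googlebot's published user agent string:

```python
from urllib.request import Request, urlopen

# Hypothetical sketch of "Simulate Browser": attach a custom
# User-Agent (here Googlebot's) to the request for a proxy source.
req = Request(
    "http://example.com/proxylist.html",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                           "+http://www.google.com/bot.html)"},
)
# urlopen(req, timeout=10).read() would then fetch the page with
# that identity instead of the default Python-urllib one.
print(req.get_header("User-agent"))
```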

    Start/end parsing from/at: just enter some unique parts of the HTML source where the parser should jump in, ignoring the rest.
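A sketch of that jump-in/ignore behavior, with made-up marker strings. This is an illustration, not SER's actual parser:

```python
def clip(html: str, start_marker: str, end_marker: str) -> str:
    """Keep only the part of the page between two unique markers,
    so the proxy parser skips headers, menus, and footers."""
    start = html.find(start_marker)
    if start == -1:
        return html  # start marker missing: fall back to the whole page
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return html[start:]  # end marker missing: keep the rest
    return html[start:end]

page = "<html><table id='proxies'>1.2.3.4:80</table><footer>...</footer>"
print(clip(page, "<table id='proxies'>", "</table>"))  # 1.2.3.4:80
```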

    proxylists.net: this site protects its content with JavaScript. Reload the page with JavaScript turned off and you will see why it cannot be parsed.