How does URL pattern matching work for proxy sources?

Can you show examples for each pattern matching field?

I am asking about fetching the main page, then extracting the sub-page URLs so those pages can be crawled and the proxies on them extracted.

Ty

Also, what does the "text only" option do?

@Sven

Comments

  • Sven (www.GSA-Online.de)
    There is no pattern matching... it's my own code that checks each string for an IP or domain followed by an integer, which is taken as the port.
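That check can be sketched in Python. This is a hypothetical illustration, not Sven's actual code; the regex and the helper name are my own:

```python
import re

# Hypothetical sketch of a "string looks like host + port" check:
# an IPv4 address or a domain name, followed by a separator and an
# integer port. Not the actual SER implementation.
HOST_PORT = re.compile(
    r"^(?:(?:\d{1,3}\.){3}\d{1,3}"          # IPv4 like 1.2.3.4
    r"|[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+)"  # ...or a domain name
    r"[:\s]+(\d{1,5})$"                     # separator, then the port
)

def looks_like_proxy(s: str) -> bool:
    m = HOST_PORT.match(s.strip())
    return bool(m) and 1 <= int(m.group(1)) <= 65535

print(looks_like_proxy("192.168.1.10:8080"))     # True
print(looks_like_proxy("proxy.example.com 3128"))  # True
print(looks_like_proxy("not a proxy"))           # False
```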

  • edited June 13
    Sven said:
    There is no pattern matching... it's my own code that checks each string for an IP or domain followed by an integer, which is taken as the port.

    Thanks for the answer @Sven, but the part I am asking about is 



    Also, it would be very helpful to show which pages are being fetched and processed right now.

    For example, I am fetching 21 pages of http://www.mfqqx.com/daili/index_%page%.html with the same mask as yours, and it doesn't show me anything.

    I only see "searching for new proxies". I have no idea whether my mask is working, whether unnecessary pages are being fetched and processed, etc.

    There could be another window showing which URLs are being fetched and parsed.

    By the way, there are thousands of proxies (at least 150 pages) here and the software can't extract them: http://www.proxylists.net/us_0.html

    It also can't parse the proxies listed on this site, which is a default site in GSA SER > http://www.proxytm.com/public-http-proxy-server-lists/type-distorting.htm

    It can't parse the proxies listed here either; there are thousands of proxies > http://www.proxz.com/proxy_list_high_anonymous_0.html

    Another one that can't be parsed > https://premproxy.com/socks-list/01.htm . They use some sort of special span class to print the ports.

    It can't parse here either, and it is also in the official list > http://www.cybersyndrome.net/pla6.html
  • And my final question:

    Where are all of these settings saved, so I can back them up?

    I mean the location on the hard drive.


  • Sven (www.GSA-Online.de)
    Don't get me wrong, but this all sounds like you are going to "recode" things. Otherwise, what's the point of all these detailed questions?
  • edited June 13
    Sven said:
    Don't get me wrong, but this all sounds like you are going to "recode" things. Otherwise, what's the point of all these detailed questions?
    I have zero intention of recoding or releasing software like yours.

    Certainly I could code it myself, since I have written many crawlers for my own purposes.

    But right now my only aim is to get more proxies and increase my LPM :)

    I don't think anyone can easily compete with your software right now, since yours has been in development for many years. I know that developing decent software alone takes many, many years.

    By the way, if you ask my opinion, the biggest weakness of your software is that it is 32-bit and supports a maximum of about 2 GB of RAM. But I know that is an intentional marketing practice.
  • Sven (www.GSA-Online.de)
    OK, but if you really aim to get as many proxies as you can, you should have a look at GSA Proxy Scraper, as this tool gives you the best public proxies available: it constantly tests them and gives each proxy a value (e.g. reliability, days being up, and so on).

    I don't know if it makes much sense to use the SER proxy scraper for this task.
  • Sven said:
    OK, but if you really aim to get as many proxies as you can, you should have a look at GSA Proxy Scraper, as this tool gives you the best public proxies available: it constantly tests them and gives each proxy a value (e.g. reliability, days being up, and so on).

    I don't know if it makes much sense to use the SER proxy scraper for this task.
    It is far too expensive for me right now.

    By the way, I have collected over 1 million proxies by fine-tuning GSA SER and testing how many are working at any given moment.

    Could you at least answer my questions in this thread? Thank you.
  • Sven (www.GSA-Online.de)
    Accepted Answer
    Your questions:

    The parameters you see here and the masks work as in every other part of the program:
    mask1|mask2|mask3 means at least one of them has to match. A mask can use * for any run of characters (or none), ? for any single character, and ranges such as [1-9] or [a-z].
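As a rough illustration, Python's fnmatch module implements the same wildcard conventions (* for any run of characters, ? for one character, [1-9] for ranges), so the |-separated mask matching could be sketched like this. The example URL and masks are made up:

```python
from fnmatch import fnmatch

def url_matches(url: str, masks: str) -> bool:
    """Return True if the URL matches at least one |-separated mask.

    * matches any run of characters (or none), ? matches exactly one
    character, and [1-9] / [a-z] match character ranges.
    """
    return any(fnmatch(url, mask) for mask in masks.split("|"))

# Hypothetical mask: match numbered index pages or anything under /proxylist.
masks = "*/daili/index_[1-9].html|*/proxylist*"
print(url_matches("http://example.com/daili/index_7.html", masks))  # True
print(url_matches("http://example.com/about.html", masks))          # False
```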

    Simulate Browser: here you enter a user agent string, e.g. to tell the site that you are Google and want to parse it.
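A minimal sketch of what such a request looks like outside SER, assuming a placeholder URL and using Googlebot's published user agent string:

```python
from urllib.request import Request, urlopen

# Hypothetical sketch of "Simulate Browser": attach a custom
# User-Agent (here Googlebot's) to the request for a proxy source.
req = Request(
    "http://example.com/proxylist.html",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                           "+http://www.google.com/bot.html)"},
)
# urlopen(req, timeout=10).read() would then fetch the page with
# that identity instead of the default Python-urllib one.
print(req.get_header("User-agent"))
```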

    Start/end parsing from/at: just enter some unique parts of the HTML source where the parser should jump in, ignoring the rest.
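A sketch of that jump-in/ignore behavior, with made-up marker strings. This is an illustration, not SER's actual parser:

```python
def clip(html: str, start_marker: str, end_marker: str) -> str:
    """Keep only the part of the page between two unique markers,
    so the proxy parser skips headers, menus, and footers."""
    start = html.find(start_marker)
    if start == -1:
        return html  # start marker missing: fall back to the whole page
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return html[start:]  # end marker missing: keep the rest
    return html[start:end]

page = "<html><table id='proxies'>1.2.3.4:80</table><footer>...</footer>"
print(clip(page, "<table id='proxies'>", "</table>"))  # 1.2.3.4:80
```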

    proxylists.net: this site protects its content with JavaScript. Reload the page with JavaScript turned off and you will see why it cannot be parsed.