I've processed only 60k random URLs and got a file with 1.2 million URLs, most of them links to css, js, jpg, png, YouTube, domain registrars, etc., plus other garbage sites with a lot of garbage URLs inside.
- The external links are saved whether the URL is identified or not.
- Filtering would cost too much time; it's better to filter the full URL list afterwards and also de-duplicate its entries based on domain.
It would be a really great feature if you added filtering.
I can parse all URLs from the pages on my own, with other software or with the help of Linux tools (something like the sketch at the end of this post), but I don't want to reinvent the wheel.
Without filtering, the file size gets very big and becomes a problem to process. Disk usage is also quite high when all those css, js, etc. entries are written out every second.
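Until a built-in filter exists, a rough post-processing pass along the lines of the suggestion above could look like this. It is only a minimal sketch: it assumes the tool dumps one URL per line into a plain text file, and the file names and the extension list are my own placeholders, not anything the tool actually produces.

```python
from urllib.parse import urlparse

# extensions treated as asset/garbage links; adjust to taste
ASSET_EXTENSIONS = (".css", ".js", ".jpg", ".jpeg", ".png", ".gif",
                    ".svg", ".ico", ".woff", ".woff2")

def filter_and_dedupe(in_path: str, out_path: str) -> None:
    """Drop asset URLs and keep only one URL per domain."""
    seen_domains = set()
    with open(in_path, encoding="utf-8", errors="ignore") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url:
                continue
            parsed = urlparse(url)
            # skip obvious asset links (css, js, images, fonts)
            if parsed.path.lower().endswith(ASSET_EXTENSIONS):
                continue
            # de-duplicate by domain
            domain = parsed.netloc.lower()
            if not domain or domain in seen_domains:
                continue
            seen_domains.add(domain)
            dst.write(url + "\n")

if __name__ == "__main__":
    # hypothetical file names
    filter_and_dedupe("urls_raw.txt", "urls_filtered.txt")
```

Even for a list of 1.2 million URLs the domain set fits easily in memory, so this runs in a single pass; doing the same filtering inside the crawler would of course also cut the disk writes mentioned above.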