independent tool for importing and sorting URLs

doubleup · October 2012

Would it be possible to create an independent tool for
importing and sorting URLs, rather than doing it via GSA SER as it’s done
currently? I frequently scrape with scrapebox, and as a result, I’m constantly
trying to import these newly discovered URLs into GSA SER, so that they can be
sorted, but this has become a real pain. While importing, the program
constantly freezes, until the importing and sorting process has finished (which
takes some time, due to the size of the files I import and the frequency), and
it’s becoming a nightmare to use in conjunction with actually creating
backlinks with GSA SER.

At the moment, I pretty much have to stop my projects from
running, and set aside a specific time to import and sort my urls, which is
obviously cutting into the time I’m actually able to use GSA SER to create
backlinks. I then have to remove duplicate urls, which again takes time.

Would it not be possible to create an independent tool, in
the same way as GSA Indexer, which can run alongside GSA SER? You specify
where you want the identified URLs stored, and it then imports and sorts them within that folder.
And if you have GSA SER running at the same time, it can then also make use of
the newly imported URLs which have been identified, assuming you’re making use of
the global list. It would also be great if, at the end of an import and sort
run, you could set it up to automatically remove duplicate URLs, rather than me
having to do this manually.

mmtj · October 2012

1.) De-dupe with Scrapebox (and xrumer if you have)

2.) Compare against previous scrapes and your success lists

3.) After 1+2, split your scrapes into part lists with 250,000 lines per text file

We never had a problem with identifying lists. We identify almost 2-3 million lines per day, and it imports just fine if we split them into smaller lists and then import them all at once, instead of importing one big list.
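The splitting in step 3 can also be done outside Scrapebox. Below is a minimal Python sketch, assuming a plain text scrape with one URL per line. The function names and the `_partNNN.txt` naming scheme are made up for illustration and are not part of Scrapebox or GSA SER.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def split_into_chunks(lines: Iterable[str], chunk_size: int = 250_000) -> Iterator[List[str]]:
    """Yield consecutive lists of at most chunk_size non-empty lines."""
    chunk: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, part


def split_file(source: Path, out_dir: Path, chunk_size: int = 250_000) -> List[Path]:
    """Write source as numbered part files in out_dir and return their paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    with source.open("r", encoding="utf-8", errors="ignore") as handle:
        for index, chunk in enumerate(split_into_chunks(handle, chunk_size), start=1):
            # Hypothetical naming scheme: scrape_part001.txt, scrape_part002.txt, ...
            part = out_dir / f"{source.stem}_part{index:03d}.txt"
            part.write_text("\n".join(chunk) + "\n", encoding="utf-8")
            written.append(part)
    return written
```

Calling `split_file(Path("scrape.txt"), Path("parts"))` would produce the part lists, which can then be imported together.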

doubleup · October 2012

Thanks for the info mmtj, but I don't fully understand your 2nd step.

So you have your freshly scraped list of urls. How do you go about comparing against previous scrapes and success lists? I'm probably missing something really obvious.

doubleup · October 2012

Been thinking about it, and I think I may get the gist of
what you mean. At the moment, I leave my ‘identified’ folder pretty much alone.
As a result, after all the scraping I’ve been doing, I have some files which
are massive, a blog comment platform comes in at over 500 MB for example with duplicate urls removed, and I think
this is probably causing an issue, as I assume the program has to write to this
file each time it imports and sorts a new URL into this platform.

When importing and sorting, do you scrape together a fresh
list, which then gets put into your ‘identified’ folder. You then run through
this list, with the program saving ‘verified’ and ‘submitted’ into the correct
folders. When your projects run out of links to create, due to your ‘identified’
folder being exhausted, do you then delete the contents of the ‘identified’
folder, and then import and sort a new list which you previously scraped, and
then run your projects making use of this new list, and on and on?

I have a feeling, my issue is probably due to the fact my
identified folder is massive, which is probably causing the program to hang
frequently, due to writing to the massive files it contains.

Sven · October 2012

What do you call "hang" or "freeze"? Do you get a bug report window, or do you just have a feeling that it freezes? Please provide more details.

doubleup · October 2012

No bugreport, but the actual 'progress' window, and the main GSA SER window will freeze, and if you click on it, it'll report 'not responding'. It freezes this way frequently, say every 20 seconds or so. During this time, it's meant to be importing and sorting, but this takes a very long time.

Sven, how would you best make use of the import and sort feature, bearing in mind the massive file sizes I've accumulated within the 'identified' folder? One is over 500 MB. Do you import and sort, then once this is complete, do a run, and then remove the files contained within the 'identified' folder, as successes and verified would have been sorted into their respective folders? Or do you leave the 'identified' folder alone, and continue to add to it? Surely this would slow importing and sorting in the future?

For me personally, the import and sort process really seems to be struggling, and could be much easier. As an example, it's currently importing and sorting while posting. You'd expect it to constantly write to the files within the 'identified' folder as it progresses through my imported file, which is only 100 MB, yet it hasn't written to any file within the folder in over 2 minutes.

Sven · October 2012

Sorry, but your problems must come from something else. The site list import function is not limited to any file size. It loads about 100 URLs from the file, processes them, and then loads the next 100 until everything is done. I cannot reproduce your behavior in any way.
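To illustrate the batch approach described here, the following is a minimal Python sketch of reading a URL list about 100 lines at a time, so that memory use does not depend on file size. It is only an illustration of the idea, not GSA SER's actual code, and `read_in_batches` is a made-up name.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_in_batches(lines: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield lists of up to batch_size stripped, non-empty lines."""
    iterator = iter(lines)
    while True:
        raw = list(islice(iterator, batch_size))
        if not raw:
            return  # input exhausted
        cleaned = [line.strip() for line in raw if line.strip()]
        if cleaned:
            yield cleaned
```

Because a file object is itself an iterable of lines, `read_in_batches(open("scrape.txt", encoding="utf-8"))` would walk through a large file without loading it whole.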

mmtj · October 2012

Hey doubleup,

Save your previous scrapes into a separate folder.

Then you compare your current scrapes against your previous scrapes and your already identified lists from the identified folder. To de-dupe just use scrapebox -> import URL list -> select url list to compare on domain level (if you want to remove all duplicate domains) or select url list to compare on URL level if you just want to remove duplicate URLs (but that's only useful for blog comments, image comments and guestbooks).
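For anyone who wants to see the difference between the two comparison levels, here is a rough Python sketch. It assumes one URL per line and treats "domain" simply as the host name with a leading `www.` removed, which may not match how Scrapebox defines it. `domain_of` and `remove_known` are illustrative names only.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'."""
    candidate = url.strip()
    if "://" not in candidate:
        candidate = "http://" + candidate  # let urlsplit find the host
    try:
        host = (urlsplit(candidate).hostname or "").lower()
    except ValueError:
        return ""  # malformed URL, e.g. a broken IPv6 literal
    return host[4:] if host.startswith("www.") else host


def remove_known(new_urls: Iterable[str], known_urls: Iterable[str], level: str = "url") -> List[str]:
    """Drop entries already present in known_urls, comparing by URL or by domain."""
    if level not in ("url", "domain"):
        raise ValueError("level must be 'url' or 'domain'")
    key = domain_of if level == "domain" else str.strip
    seen = {key(u) for u in known_urls}
    kept: List[str] = []
    for url in new_urls:
        k = key(url)
        if k and k not in seen:
            seen.add(k)  # also removes duplicates inside the new scrape
            kept.append(url.strip())
    return kept
```

With `level="url"` only exact duplicates are dropped, which suits blog comments and guestbooks; with `level="domain"` one URL per new domain survives.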

doubleup · October 2012

@mmtj - Wasn't even aware of that feature within scrapebox. Learn something new everyday. Cheers mmtj, appreciated.

@Sven - I thought others may have had the same issue, but maybe not. Thanks anyway.

AlexR · November 2012

@mmtj - could you do me a big favour? Could you write up a brief how-to on using Scrapebox with GSA effectively? It seems you have this process very sorted and I can't find too much info on the threads about this.

Like do you use GSA footprints? How often to scrape? etc...

Would help a lot of users here. :-) (I know I'll benefit tremendously...as it's an area I'm quite weak at)

mmtj · November 2012

Sure,

I'll get something together, though I'm using xrumer + hrefer 99% of the time, since they can handle text files with more than 1 million lines and are faster and more stable, but scrapebox works as well.

Here's what we do:

1. We scrape on a daily basis (24/7) with hrefer and over 1700 footprints + a huge keyword list. We scrape roughly 2-4 million unique lines per day. Once we reach that number, we save the file and start a new scrape for the following day.

2. Then we compare the scrape against the previous scrapes and previously identified lists by using xrumer (or scrapebox) - in scrapebox you will have to use "import url list" -> "select the URL list to compare against" (this will just remove exact duplicate URLs), or you use "select the URL list to compare against on domain level" (this will remove all duplicate domains, so if you're scraping comments or guestbooks, this may not be ideal).

3. After we have removed the duplicate urls/domains of our current scrape, we compare it against the current xblack.txt (we update it daily) - this is especially needed if you want to do forum posting. We still use a VPN for that and really good proxies with unmetered bandwidth and a good line for all other platforms; our real IP is never on stopforumspam. Then we always remove sites that contain the following words: xxx, porn, gay, lesbian, adult, erotic, gambling, viagra, etc. (basically we remove domains and urls that contain bad words) - in scrapebox you can do that with "Remove / Filter" -> "Remove URLs containing..", then you enter your preferred bad word.

4. Now the current scrape is clean (de-duped and honeypot sites removed). For some platforms it's a good idea to trim the URLs to their main root (i.e. /index.php) - makes things faster when posting. You can do that in scrapebox with "Trim to root" (this will delete everything after the domain extension), or you can use "Remove/Filter" -> "Trim URL to last folder" (this will trim the URL to the last available folder, e.g. www.domain.com/mediawiki/) - this is good for forums, wiki sites, bookmarks, article sites etc. - don't do it for blog comments where you need the exact URL.

5. Now we identify the scrape in GSA (no proxies, re-download 1x) - we have a good dedi. with a superb line (poweruphosting) and some blazing fast VPS on a private cloud server, and we can identify the 3 million in one night easily.

6. Once GSA is done identifying, we save the unknown and re-run those once. After that the scrape goes into our "Tested Scrapes" folder, cause we may run the raw list in other tools.

7. Now you can either compile all your lists in the global site lists folder (in this case the identified folder), but then you have to constantly de-dupe after posting/using the lists, or do what we do and save them externally in a dropbox, so we can access them from other servers as well. We sort them by platform and compile all identified lists in an individual folder, i.e. "Article Sites", "Bookmarks", "Guestbooks" etc., then we can import them individually whenever we need them. You can add them to your existing lists with Scrapebox by using "Export URL list" -> "Add to existing list".
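The bad-word and blacklist cleaning from step 3 could be sketched in Python as below. This is a hedged illustration, not how Scrapebox or xrumer do it: it assumes the blacklist has been reduced to one host name per line (the real xblack.txt format may differ), and `host_of` and `filter_urls` are made-up names. The word list is the one given in step 3.

```python
from typing import Iterable, List
from urllib.parse import urlsplit

# Words taken from step 3 above; extend as needed.
BAD_WORDS = ("xxx", "porn", "gay", "lesbian", "adult", "erotic", "gambling", "viagra")


def host_of(url: str) -> str:
    """Return the lowercased host name of a URL, or '' if it cannot be parsed."""
    candidate = url.strip()
    if "://" not in candidate:
        candidate = "http://" + candidate
    try:
        return (urlsplit(candidate).hostname or "").lower()
    except ValueError:
        return ""


def filter_urls(urls: Iterable[str], bad_words: Iterable[str], blacklisted_hosts: Iterable[str]) -> List[str]:
    """Keep URLs that contain no bad word and whose host is not blacklisted."""
    words = [w.lower() for w in bad_words]
    blocked = {h.strip().lower() for h in blacklisted_hosts}  # assumed: one host per entry
    kept: List[str] = []
    for url in urls:
        url = url.strip()
        if not url:
            continue
        lowered = url.lower()
        if any(word in lowered for word in words):
            continue  # plain substring match, like "Remove URLs containing.."
        if host_of(url) in blocked:
            continue
        kept.append(url)
    return kept
```

Note that plain substring matching will also drop harmless URLs that happen to contain one of the words, so the word list is worth reviewing for a given niche.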

That's basically it. The problem with scrapebox is, that it can only support text files up to 1 million lines, which is quite annoying, especially with big scrapes. But overall it's still an awesome tool that we use daily.
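For lists too big for scrapebox, the two trimming options from step 4 can be approximated with a few lines of Python. This is a sketch under assumptions: the exact output format of Scrapebox's own trimming (for example whether a trailing slash is kept) is not confirmed here, and both function names are made up.

```python
from urllib.parse import urlsplit


def _split(url: str):
    """Parse a URL, assuming http:// when no scheme is given."""
    candidate = url.strip()
    if "://" not in candidate:
        candidate = "http://" + candidate
    return urlsplit(candidate)


def trim_to_root(url: str) -> str:
    """Drop everything after the host, e.g. for forums or article sites."""
    parts = _split(url)
    return f"{parts.scheme}://{parts.netloc}/"


def trim_to_last_folder(url: str) -> str:
    """Drop the file name and query string, keeping the last folder."""
    parts = _split(url)
    folder = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{folder}"
```

For example, `trim_to_last_folder("http://www.domain.com/mediawiki/index.php?title=Main")` gives `http://www.domain.com/mediawiki/`, while `trim_to_root` on the same URL gives `http://www.domain.com/`. As noted in step 4, neither should be applied to blog comment lists.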

My advice:

1.) Get Hrefer, it's superior to scrapebox. We scrape very very broad with over 1700 footprints, 500+ anon. proxies that are renewed every 35 minutes and a big keyword list (no niche based scrapes).

2.) If you start a niche based project in GSA it's always best to pre-scrape links externally, it's faster and you will have more successful links. You can use s4ntos footprint extractor to grab the footprints you need (add your own if you like), then scrape for a day, clean the scrape, and run it in gsa with your preferred settings.

3.) Use good public proxies for scraping, if you don't have a private source, then check out this post here.

There may be some features that could be included in GSA:

1.) Comparing your global sitelists (all at once) against an external blacklist.txt and badwords.txt

2.) Comparing your global sitelists on URL and Domain level against external lists (i.e. previous scrapes).

AlexR · November 2012

@mmtj - thanks! Post is bookmarked as a reference post!

Couple of questions:

1) Let's say I want blog comments. Why use SB with the GSA footprints? SB only has 3 SE's, while with GSA, I can select multiple SE's...so not sure why I'd need SB?

2) How do you remove honeypot sites?

3) Let's say I want to target "dog training". Do I just run that through the SB keyword tool, generate 1000 keywords, merge with blog footprints and get the 50 000 keywords and then just run SB?

Thanks again in advance!

mmtj · November 2012

1.) That's why we don't use scrapebox for scraping, it's too limited, with hrefer you can add SE externally.

2.) These are basically sites that exist to catch spambots/automated software. You can find a list of these sites on stopforumspam.com -> Contributors.

3.) Personally, I would use a good KW tool + GKWT to grab a bunch of keywords. That's way more specific than scrapebox's keyword tool. The quantity comes from your footprints, not your keywords. If you want a niche based scrape, you have to define your keywords as broad as possible, while keeping them as specific as possible (if that makes sense). For example, we recently did a scrape for a small insurance site (dental) - we used roughly 600 keywords and we included our main anchors, secondary anchors and broad longtails.
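The "merge keywords with footprints" idea is just every footprint combined with every keyword, which is why the query count grows with the footprint list. A minimal Python sketch follows; the query format (footprint followed by the quoted keyword) is an assumption, and `build_queries` is a made-up name rather than a feature of any of the tools mentioned.

```python
from itertools import product
from typing import Iterable, List


def build_queries(footprints: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Combine every footprint with every keyword into one search query each."""
    fps = [f.strip() for f in footprints if f.strip()]
    kws = [k.strip() for k in keywords if k.strip()]
    # Assumed format: the footprint as-is, then the keyword in quotes.
    return [f'{fp} "{kw}"' for fp, kw in product(fps, kws)]
```

So 50 footprints merged with 1000 keywords give 50,000 queries, matching the numbers in the question above, and adding footprints multiplies the total without needing more keywords.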

Ozz · November 2012

http://www.stopforumspam.com/contributors

I've put that url into my blacklist filter in SER. If I'm not wrong then it should work.

thanks, mmtj

mmtj · November 2012

Yep, should work.

I sent you my current xblacklist too via PM, you can slap those on a simple HTML site on your webhost or import directly from GSA, they include a lot more sites.

AlexR · November 2012

@mmtj - can you PM the list too? Would really appreciate it.

SiNeX · November 2012

@mmtj - I would also be appreciative of a better Honeypot List

TIA

collywobbles · November 2012

Could I have your blacklist too

Thanks!

AlexR · November 2012

It's a very good list ;-) Thanks for sharing!

eagleflux · April 2013

PM me also the list

nicerice · April 2013

What an incredible thread. Thanks to @doubleup and @mmtj for the great information and to everyone else for asking these questions. Bookmarked for later reference.