CUSTOM GLOBAL LISTS

NocT · January 2013

Yeah, I know some of you guys have been waiting for me to come out with it here. Now that CB is here and relatively stable, I want to shift the conversation to a feature I think will solve a LOT of the big issues we're having with submission quality.

SER is essentially a plain-text database of lists segmented into platforms. Something I've noticed is that the larger my database gets (now 2.5 million... yiikes!), the more threads I see being used to filter out the low PR targets from my global list each run. Sometimes a half hour goes by until I get anything higher than a PR2.

So here's a no-brainer... What we need is a way to sort and send selections of links to our own separate, named lists. Lists that are totally separate from the global list, because otherwise SER just churns through hundreds of thousands of low PR links in the global list looking for high enough PRs. In my case, it's sometimes an hour before it stumbles on a few decent targets. At least that's what it's like with a big fatty-mcfatterson list.

So is it just me noticing this or does the PR filtering process seem grossly redundant? It wastes all of those threads rechecking PRs from the global list which it should already have recorded from when it first identified the platform (or at least on first submission attempt).

I've been wondering for a while now... why can't I specifically target PR links instead of having to filter through the whole global list over again each time? Why not use a database system like every other submission software instead of just churning through a plain text grab-bag of platform links?

I love the platform segmentation, but now how about quality segmentation?

If I want to submit to my highest quality links efficiently, the only way I know to do this is to filter every one of my global list segments with ScrapeBox and set up an entirely separate instance of SER to churn through it.

jiggsaw · January 2013

I agree. That would be a nice feature.

AlexR · January 2013

I totally agree! I've been pushing for something similar for a while... Let's get quality and efficiency improved.

Have a read here: (Feature 5)

https://forum.gsa-online.de/discussion/1439/feature-requests-please-discuss-add-your-thoughts#Item_29

cre8iveq · January 2013

I totally agree with the concept, but I have a suggestion that might be a little more user friendly, while achieving a slightly better result. You mention the problem being that it has to search for hours to find the PR you are after, but it's also an issue that it crawls these sites over and over again to find the engine type, nofollows and number of outbound links on the page.

The simple solution is that rather than having the global site list simply a list of URLs, it should be a CSV that contains all this information (or better still, a database that can be exported to a CSV if required). Now from a user perspective, you set your project up as normal, and it instantly knows which sites meet your criteria (even with a couple million sites, if it's over a second it won't be by much). It shouldn't be too hard from a programming perspective either (simple select statement).
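A minimal sketch of the "simple select statement" idea, using Python's built-in sqlite3 module. The table name `sites`, its columns (`url`, `engine`, `pr`, `obl`, `dofollow`), and the helper `select_targets` are all hypothetical, made up for illustration; nothing here reflects how SER actually stores its lists.

```python
import sqlite3

# Hypothetical schema: one row per target URL, with the metrics saved next to it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sites ("
    "url TEXT PRIMARY KEY, engine TEXT, pr INTEGER, obl INTEGER, dofollow INTEGER)"
)
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?, ?, ?)",
    [
        ("http://www.domain.com/target", "Engine 1", 2, 40, 1),
        ("http://www.domain.com/another-target", "Engine 1", 3, 15, 1),
        ("http://www.example.org/page", "Engine 2", 0, 300, 0),
    ],
)
# An index on pr lets the lookup avoid scanning every row.
conn.execute("CREATE INDEX idx_sites_pr ON sites (pr)")


def select_targets(conn, min_pr, max_obl, dofollow_only=True):
    """Return URLs whose saved metrics meet the project criteria."""
    sql = "SELECT url FROM sites WHERE pr >= ? AND obl <= ?"
    if dofollow_only:
        sql += " AND dofollow = 1"
    sql += " ORDER BY url"
    return [row[0] for row in conn.execute(sql, (min_pr, max_obl))]


targets = select_targets(conn, min_pr=2, max_obl=50)
print(targets)
```

The same rows could be dumped to a CSV when needed; how fast the lookup is on a couple of million rows would have to be measured rather than assumed.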

AlexR · January 2013

That's exactly what I said in Feature 5! It basically stores all the filter data with it, so you can apply a "pre-filter" since PR and factors may change on a page over time.

Tank · January 2013

Nothing is really simple in programming; as a programmer myself, I can attest to this. Who knows how much code he would have to change around to switch from a flat text file to an embeddable database like SQLite, etc.

Switching to an embeddable database alone brings its own issues, such as concurrency. When you write to any of these databases they get locked, which means you can't have multiple threads writing to the database at once or the writes will fail. GSA does writes so fast that Sven would probably have to build some sort of queue system to keep up, and again that adds way more work than simple database selects and inserts/updates/deletes.
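To make the "queue system" concrete, here is a rough sketch in Python of one common pattern: submission threads never touch the database and only push results onto a queue, while a single writer thread owns the connection and does every write. The table `verified`, the functions `writer` and `worker`, and the sample URLs are hypothetical; this is an illustration of the extra moving parts, not a claim about how SER would or should do it.

```python
import queue
import sqlite3
import threading

results = queue.Queue()
STOP = object()  # sentinel telling the writer thread to finish


def writer(db_path, out):
    """Single thread that owns the connection and performs every write."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS verified (url TEXT PRIMARY KEY, pr INTEGER)"
    )
    while True:
        item = results.get()
        if item is STOP:
            break
        conn.execute("INSERT OR REPLACE INTO verified VALUES (?, ?)", item)
    conn.commit()
    # Copy the rows out so the demo can inspect them after the thread ends.
    out.extend(conn.execute("SELECT url, pr FROM verified ORDER BY url"))
    conn.close()


def worker(worker_id):
    """Submission thread: only enqueues results, never writes to the database."""
    for n in range(3):
        results.put((f"http://www.domain.com/w{worker_id}/page{n}", n))


saved = []
writer_thread = threading.Thread(target=writer, args=(":memory:", saved))
writer_thread.start()
workers = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
results.put(STOP)
writer_thread.join()
print(len(saved))
```

Even in this toy form, the queue, the sentinel and the shutdown order are all extra code that a flat text file append does not need.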

SENuke, MS and UD all use database systems. Now I challenge anyone to try to add "2.5 MILLION ANYTHING" to them and see if they even load. SENuke starts to freeze and lag when you add 30K+ Sites.

There is a reason Xrumer, ScrapeBox and GSA are faster than A LOT of these other programs out here, and that's mainly because they use flat .TXT files the way they do. If Xrumer used an embeddable database for everything it would not be anywhere near as fast.

thisisalex · January 2013

Even for just a few thousand sites, it seems to take hours.

Sven · January 2013

Well said @Tank

KayKay · January 2013

indeed @Tank

Bytefaker · January 2013

But when you take a look at http://www.sqlite.org/limits.html, it should handle a few million records without any problems. I think it depends on the database used. An embedded database like SQLite would be great and very performant for this (SQLite also keeps everything in a single ordinary file on disk; google around and you'll see that 2.5 million records would be a piece of cake for it).

NocT · January 2013

Why not use a database only to index the txt files and just tag the PR and OBL at the end of each text file line?

@Sven How are those metrics currently recorded for verified URLs?

Sven · January 2013

Let the coding part be done by me! Using a database is not required in most cases. It is, however, for massive amounts of data that you have to search through a lot. That's not the case for an SEO program like SER. Programs where this is required are things like our GENOM2005 program, where DNA analysis produced a lot of data. Believe me, it is not needed and I will not add it.

Bytefaker · January 2013

Since you did DNA analysis, I'm sure you know how best to store a lot of data. But @Sven, do you see any possibility to speed up "remove duplicate urls"?

Sven · January 2013

Everything can be sped up, but there is not much that I can get out of that, as everything is very optimized here already.

1. sort lines

2. go from top to bottom and delete things if two lines are the same (same url) or the domain is the same.

I don't see any optimization here.
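The two steps above can be sketched in Python as follows. This is only an illustration of the sort-then-compare-neighbours approach as described; the function names `domain_of` and `remove_duplicates` and the sample URLs are hypothetical, and SER's real implementation is not shown anywhere in this thread.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Lower-cased host part of a URL (hypothetical helper)."""
    return urlsplit(url).netloc.lower()


def remove_duplicates(lines, by_domain=False):
    """Sort, then walk top to bottom, dropping a line that matches the one before it.

    by_domain=False drops exact duplicate URLs only.
    by_domain=True keeps just one URL per domain.
    """
    key = domain_of if by_domain else (lambda u: u)
    cleaned = (line.strip() for line in lines if line.strip())
    kept = []
    previous = None
    # Step 1: sort so that equal keys end up on neighbouring lines.
    for url in sorted(cleaned, key=lambda u: (key(u), u)):
        current = key(url)
        # Step 2: compare with the line above and keep only the first of a run.
        if current != previous:
            kept.append(url)
        previous = current
    return kept


lines = [
    "http://b.example/page2",
    "http://a.example/x",
    "http://b.example/page1",
    "http://a.example/x",
]
unique_urls = remove_duplicates(lines)
unique_domains = remove_duplicates(lines, by_domain=True)
print(unique_urls)
print(unique_domains)
```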

cre8iveq · January 2013

@Sven, you are totally right when you say let the programming part be done by you. I very much doubt that anyone here (including me) is qualified to advise you on the topic... but I still think the feature itself is really valuable, whether done in a DB or flat file. It just doesn't seem to make sense to recrawl all of those sites for every single project, only to find that 90% of them (at times) don't meet the criteria.

Slight side note question while we are on the topic... if a site is verified, but then later fails, is it removed from the verified list? (Or should I leave this for another thread?)

Sven · January 2013

>Slight side note question while we are on the topic... if a site is verified, but then later fails, is it removed from the verified list? (Or should I leave this for another thread?)

No it stays there.

The problem with saving all kind of information to a URL is a proper way to keep them updated. If you save PR1 to it and it actually is now PR0 you have another problem.

AlexR · January 2013

@sven - "The problem with saving all kind of information to a URL is a proper way to keep them updated. If you save PR1 to it and it actually is now PR0 you have another problem."

It's not that we need it updated, but if we can use it as a pre-filter it would be super useful. I.e. if you are targeting PR3 or PR4+ sites and a site was a PR3 and is now a 2, it's no big issue. These numbers are fairly slow moving. What a pre-filter achieves is that you can avoid even loading all the PR N/A's for your projects, since you can set the pre-filter to use the PR2+ sites. This avoids loading hundreds of thousands of URLs from your sitelists, only to reject them based on a filter.

You could load them using a pre-filter and then only apply the actual filter to the pre-filtered list.
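A small Python sketch of this two-stage idea, under stated assumptions: each saved entry is a `(url, saved_pr)` pair where `saved_pr` may be stale or `None` for PR N/A, and `live_pr_check` stands in for whatever real (expensive) PR lookup the program does. All names and sample values here are hypothetical.

```python
def select_targets(entries, prefilter_min_pr, min_pr, live_pr_check):
    """Return the URLs that pass the cheap pre-filter and then the real filter."""
    # Stage 1: cheap pre-filter on the saved, possibly outdated PR.
    candidates = [
        url
        for url, saved_pr in entries
        if saved_pr is not None and saved_pr >= prefilter_min_pr
    ]
    # Stage 2: the expensive live check runs only on the survivors.
    return [url for url in candidates if live_pr_check(url) >= min_pr]


# Fake "live" values standing in for a real PR lookup, plus a call counter.
live_values = {
    "http://a.example/": 4,
    "http://b.example/": 2,
    "http://c.example/": 0,
    "http://d.example/": 3,
}
calls = []


def fake_live_check(url):
    calls.append(url)
    return live_values[url]


entries = [
    ("http://a.example/", 4),
    ("http://b.example/", 3),     # saved as PR3, has since dropped to PR2
    ("http://c.example/", None),  # PR N/A, never reaches the live check
    ("http://d.example/", 2),     # saved as PR2, has since risen to PR3
]
accepted = select_targets(
    entries, prefilter_min_pr=2, min_pr=3, live_pr_check=fake_live_check
)
print(accepted, len(calls))
```

Setting the pre-filter one step below the real target (PR2+ for a PR3+ project) leaves room for values that have drifted since they were saved.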

cre8iveq · January 2013

Obviously the ideal situation is that the data is updated as it changes, but if that's too hard, then I agree with GlobalGoogler... even if the data were not updated, a "rough initial guess" would still save loads of time and be very valuable. Also, there is data that is unlikely to ever change that would be useful, like the engine type, and whether it's follow or no follow (I know these can change, but it won't happen often).

cre8iveq · January 2013

oh, I forgot... as for the sites not being removed from "verified" when they stop working... I'd like to put the option to do this in as a feature request. Or if you were going to implement the feature of this thread, then perhaps this could just be another field of data that is saved... "worked last time submitted", and you could choose to filter out anything that didn't. Thoughts anyone?

MrX · February 2013

if things are only about the PR of a site, why not work with a folder structure rather than an embeddable db?
maybe I am not thinking complex enough here but let me give an example:
Folder PR 1: Sitelist Engine 1, Sitelist Engine 2 ...
Folder PR 2: Sitelist Engine 1, Sitelist Engine 2 ...
...
If the PR of a specific site is changing, you could just delete the entry from the list and add it to the corresponding site list in other PR folder.
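A rough Python sketch of that folder layout, assuming one folder per PR value ("PR 1", "PR 2", ...) holding one text file per engine. The function names and the exact file naming are hypothetical, and the demo writes to a temporary directory only.

```python
import tempfile
from pathlib import Path


def list_path(root, pr, engine):
    """Path of the site list for one engine inside one PR folder (assumed layout)."""
    return Path(root) / f"PR {pr}" / f"{engine}.txt"


def read_list(path):
    return path.read_text().splitlines() if path.exists() else []


def add_url(root, pr, engine, url):
    path = list_path(root, pr, engine)
    path.parent.mkdir(parents=True, exist_ok=True)
    urls = read_list(path)
    if url not in urls:
        urls.append(url)
        path.write_text("".join(u + "\n" for u in urls))


def move_url(root, engine, url, old_pr, new_pr):
    """Delete the entry from the old PR folder and add it to the new one."""
    old_path = list_path(root, old_pr, engine)
    if old_path.exists():
        remaining = [u for u in read_list(old_path) if u != url]
        old_path.write_text("".join(u + "\n" for u in remaining))
    add_url(root, new_pr, engine, url)


def load_min_pr(root, engine, min_pr, max_pr=10):
    """Load only the lists at or above min_pr; lower folders are never opened."""
    urls = []
    for pr in range(min_pr, max_pr + 1):
        urls.extend(read_list(list_path(root, pr, engine)))
    return urls


root = tempfile.mkdtemp()
add_url(root, 1, "Engine 1", "http://www.domain.com/target")
add_url(root, 3, "Engine 1", "http://www.domain.com/another-target")
move_url(root, "Engine 1", "http://www.domain.com/target", old_pr=1, new_pr=2)
pr2_plus = load_min_pr(root, "Engine 1", min_pr=2)
print(pr2_plus)
```

The cost of this layout is that every PR change means rewriting two files, and it only covers PR; OBL or dofollow would each need their own scheme.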

NocT · February 2013

Just to revisit this again... could we have PR information saved with each URL? It prevents wasted threads each time the global list is used with minimum criteria.

If a project is set to use PR 3+ using global list, it is incredibly inefficient without saved records. 80-90% of the threads currently recheck PR values for links that do not meet the criteria. This occurs with any project with minimum PR, OBL, and dofollow requirements.

Apologies @Sven, it was never my intention when opening this thread to criticize programming preferences or debate whether or not to use a specific database technology... that's not my area of expertise.

But I DID want to emphasize the reason we need PR, OBL, dofollow metrics somehow saved with each URL. It avoids rechecks on every URL for each and every global list project we run with specific PR, OBL, and dofollow requirements.

I'm sure we don't want to continue wasting threads, bandwidth, and CPU resources on something that can be recorded for future reference.

Is the solution to this too difficult because of the way GSA SER has been already programmed? It would seem these metrics could be easily tagged onto the end of each text line using pipes to avoid the use of slow database technologies? (I'm speculating, so please don't be offended if this is a stupid proposition).

http://www.domain.com/target|2 (for PR 2)

http://www.domain.com/another-target|3 (for PR 3... etc)
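For what it's worth, here is a minimal Python sketch of reading lines in that pipe-tagged form. It assumes the tag is a single integer PR after the last `|` and treats untagged lines as PR unknown; `parse_line` and `filter_by_pr` are hypothetical names, and SER's actual file format is not something this thread documents.

```python
def parse_line(line):
    """Split 'url|pr' into (url, pr); an untagged line yields pr=None."""
    text = line.strip()
    # Split on the LAST pipe so a pipe inside the URL itself is left alone.
    url, sep, tag = text.rpartition("|")
    if sep and tag.isdigit():
        return url, int(tag)
    return text, None


def filter_by_pr(lines, min_pr):
    """Yield URLs whose saved PR meets the minimum, without any network check."""
    for line in lines:
        if not line.strip():
            continue
        url, pr = parse_line(line)
        if pr is not None and pr >= min_pr:
            yield url


lines = [
    "http://www.domain.com/target|2",
    "http://www.domain.com/another-target|3",
    "http://www.domain.com/untagged",
]
pr3_plus = list(filter_by_pr(lines, min_pr=3))
print(pr3_plus)
```

Further fields such as OBL or dofollow could be appended the same way, as long as old untagged lines keep loading.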

@Sven
