
Scraping and importing lists

I am scraping with GScraper and had a question about importing those URLs into SER.
I currently scrape, then de-dup based on domain (I'm not looking for blogs).
In SER, I tried import target URLs and selected randomize/split between projects.
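For illustration, a minimal Python sketch of the domain-level de-dup step (the file names are hypothetical; this is not GScraper or SER tooling, just the idea):

    from urllib.parse import urlparse

    # Keep only the first URL seen for each domain (hypothetical file names).
    seen = set()
    with open("scraped_urls.txt") as src, open("deduped_urls.txt", "w") as dst:
        for line in src:
            url = line.strip()
            if not url:
                continue
            domain = urlparse(url).netloc.lower()
            if domain.startswith("www."):
                domain = domain[4:]
            if domain and domain not in seen:
                seen.add(domain)
                dst.write(url + "\n")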

If, say, a project doesn't have Articles selected and some of the scraped URLs are article sites, will SER add them to the identified list, or will it ignore those URLs?

@Sven, your input please.

Comments

  • SvenSven www.GSA-Online.de
    It will ignore them if that project does not have that engine checked in its options.
  • @Sven, got that. For that particular project it will be ignored, but is it added to the identified list, to be used later in other projects where that engine is checked?
  • @nitinsy I don't think it ends up in the identified list either, since you didn't run identify/sort on it but imported it directly into projects (which won't post to those sites if the engines aren't checked, so SER has no way to identify them). But I'll let Sven confirm that.

    You should set up a couple of pure spam projects with all engines checked. The only purpose of these projects would be to identify/submit/verify your scraped URLs.
  • @johnmiller, thanks for your input. Still, it would be more efficient for SER to identify the site anyway, since it has already downloaded the HTML to test it against the selected engines.

    @Sven, can you please respond?
  • SvenSven www.GSA-Online.de
    @nitinsy I thought I had answered everything already. No: if you uncheck an engine in a project, SER does not know what kind of site a URL belongs to and will not add it to the identified list, because it did not use that engine and its settings to identify it.
  • @nitinsy I find it is best to scrape for Tier One (contextuals) separately from Tier Two stuff. That way I have two scrape types and can import them efficiently.
  • @Sven thanks. 

    @coneh34d, thanks for the tip. I am still trying to figure out a better way to scrape and identify. If I use SER's identify-and-sort feature, I pretty much have to stop normal link building, since that feature uses the same thread pool. Identifying a million URLs can take over a day, and that is just wasted time (i.e. not building links).

    Right now the best approach seems to be to split by contextuals/non-contextuals and then split further into projects while importing (import target URLs).
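    As an illustration of what randomize/split does (my own sketch, not SER's actual code): shuffle the list, then deal it round-robin into one target file per project. The project count and file names are assumptions:

        import random

        N_PROJECTS = 5  # assumption: how many projects you multi-selected

        with open("scraped_urls.txt") as f:
            urls = [u.strip() for u in f if u.strip()]

        random.shuffle(urls)  # randomize before splitting
        for i in range(N_PROJECTS):
            chunk = urls[i::N_PROJECTS]  # every Nth URL goes to project i
            with open(f"project_{i + 1}_targets.txt", "w") as out:
                out.write("\n".join(chunk) + "\n")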

  • I would suggest not using the sort and identify feature. Just multi-select all your Tier One projects, for example, and then import target URLs. You are given the option to randomize those URLs and split them across the projects you multi-selected. Works like a charm.
  • @coneh34d I tried doing separate scrapes for contextuals/non-contextuals but found that both scrapes contain absolutely every platform. So now I just do one big scrape and import the same list twice: once into T1 (contextuals only) and once into T2 (everything else). That way I don't miss anything.
  • @johnmiller Are your footprints well defined for both scrape types and still producing this behavior? If so, I had no idea. Thanks for the tip.
  • @coneh34d Well, I just take the footprints that come with SER, merge them with niche-related keywords, and put that into SB. But even if I do the above with article footprints only, I end up with all the other platforms in there too.
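    For what it's worth, the merge step is just a cross product of footprints and keywords. A rough Python sketch (file names are assumptions; adjust the query format to whatever SB expects):

        from itertools import product

        with open("ser_article_footprints.txt") as f:
            footprints = [line.strip() for line in f if line.strip()]
        with open("niche_keywords.txt") as f:
            keywords = [line.strip() for line in f if line.strip()]

        # One scrape query per footprint/keyword pair.
        with open("scrapebox_queries.txt", "w") as out:
            for fp, kw in product(footprints, keywords):
                out.write(f'{fp} "{kw}"\n')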
  • Completely agree with @johnmiller. Right now every project is scraping with the same 100K keyword list and the same platforms. No matter how good the footprints are, G will give you mixed results.
    Since every URL is fetched to figure out the engine anyway, I thought it would be better to identify any engine match at that point and store it for later use.

    I like @johnmiller's idea of importing the list twice, once into each type of project. Thanks much.