1) How do I stop the footprint from actually submitting? I just edited the Search term, but it's still doing submissions. Basically I want a footprint that only parses results. (See pastebin http://pastebin.com/2QCXqFxL)
2) I am reviewing Google International and see it has 2500 results. [1281/2129] - Is it possible that the SE has this many results? I thought Google was limited to 1000 only.
Maybe you can help me structure my next test. I want to measure the overlap between keywords and SE parse results.
So if you have
"a blue widget"
"b blue widget"
"c blue widget"
You are likely to get an overlap in SE results. (Obviously these are very, very similar, so the overlap will be high.) But I want to measure the overlap between search terms and thus remove keywords that are too similar, before we waste resources parsing them only to reject the results as already parsed.
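A minimal sketch of one way to pre-filter near-duplicate search terms before parsing. The token-based Jaccard measure and the 0.5 threshold are my own assumptions for illustration, not anything GSA itself does, and the threshold would need tuning against real overlap data:

def jaccard(a, b):
    # Jaccard similarity between the word sets of two keywords.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def filter_similar(keywords, threshold=0.5):
    # Keep a keyword only if it is not too similar to one already kept.
    kept = []
    for kw in keywords:
        if all(jaccard(kw, k) < threshold for k in kept):
            kept.append(kw)
    return kept

print(filter_similar(["a blue widget", "b blue widget", "c blue widget", "red gadget"]))
# -> ['a blue widget', 'red gadget'] with the example terms above

The same Jaccard measure could later be applied to the sets of parsed URLs per keyword, to check how well search-term similarity actually predicts result overlap.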
@Ozz - That gives an error. "No search terms defined and no fixed URL"
It seems the first one you provided is working. Looks like it's all sorted. Now for some macros, to sort data, and some research to solve the keyword overlap issue. :-)
Is this possible: To have a total of 1545 Google.com results?
Google SE results parsed for a single keyword?
What I have done is take "090/093 [Page 001]", remove the first number (090), place the next number (093) into a total-results column, do this for each Page, and then total it all.
I just thought Google only has 1000 results per keyword?
How many results do you get with 'search online for URL'? Save to file and scan the file. The only reason that comes to mind for me is that the Google links (to Google Images, Videos, ...) are saved, but those should be blocked by the SE script.
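To avoid the manual copy-and-total step described above, here is a minimal sketch that pulls the per-page counts out of log lines containing fragments like "090/093 [Page 001]" and sums the second number. The line format is only inferred from that example (and "gsa_log.txt" is a placeholder name), so the regex may need adjusting against a real log:

import re

PAGE_RE = re.compile(r"(\d+)/(\d+)\s+\[Page (\d+)\]")  # format assumed from the example above

def total_results(log_path):
    # Sum the second number of every "xxx/yyy [Page nnn]" fragment in the log.
    total = 0
    with open(log_path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            m = PAGE_RE.search(line)
            if m:
                total += int(m.group(2))
    return total

print(total_results("gsa_log.txt"))  # hypothetical log file name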
1) I don't know exactly how to do it with Notepad++. Maybe it's doable with regex commands and wildcards??
Sometimes I do such things with Excel (or OpenOffice Calc in my case). I copy/paste all lines into Excel and, when I'm asked how to import them, I make sure to split each row on a SPACE or "[" or "]". Once they are imported I delete the first rows and copy/paste all lines back into your editor.
2) For that you need to have the TextFX plugin installed.
@ozz - the reason for stripping this out is so that I can then remove duplicate lines, since the timestamp is unique but adds no value for my tests. Basically, I'm trying to get the file size down from 50 MB so I can import it into Excel and do all the delimiter functions you speak of.
Basically, I need to remove all duplicates and get the file size down...
To delete duplicates: go to the 'TextFX tools' menu, check 'sort ascending' and 'sort output only unique lines', then select your text and go to TextFX tools => 'sort lines case sensitive at column'.
This will sort all your lines case sensitive and will delete any duplicate lines.
Can anyone explain why, when I import the data into Excel, the Excel file is 3 MB, but in .txt format it's over 30 MB? Why are .txt files so inefficient???
+1 for global googler for putting all this time into testing all this.
Reading this post I figured I must have a huge overlap in results, as I just selected all English SEs (156) because someone told me to do this when I was learning GSA. So now I'm changing it up a little bit, but I'm finding it very hard to pick SEs. How many SEs do you guys have selected? And are they all international, or spread out over random countries?
@pietpiet - I did a lot of work on this researching as did @LeeG and a few others. I think we might need to have a separate thread to hash this out.
Google results are identical for a number of American/English properties like the Cook Islands, Bermuda, etc. So you are duplicating efforts when you choose those.
Startpage pulls Google data, so that is a dup. Of course, Bing = Yahoo, but a question might be which is better, Bing or Yahoo.
Then there's the issue of international SEs, where @LeeG says it defaults to the country that your proxy IP resides in.
And then all the metacrawlers appear to be supplied the same data by Infospace according to my research.
So if you think you are confused, well, so am I. There are a lot of variables. The more you research it, you begin to realize there are different reasons for different choices.
Then of course our European friends on this board are going to have a different strategy because they are probably trying to rank in different places than the Americans.
It needs to be hashed out in a thread, and @Ozz and @Sven will need to help guide it as we are also dealing with issues on how GSA retrieves data, and how much the different SE's provide, that needs to be accounted for as well.
@GlobalGoogler I can not really answer your question here. You have to keep in mind that the program sometimes uses a timeframe on Google to e.g. only show results from the last X days/hours and so on. This can change the results as well. Also it might find some promotion URLs (AdWords), and that changes things sometimes as well.
@Sven - I understand. Just doing some testing here for SE's. I'm trying to measure the SE overlap that occurs with the different SE's.
You've answered my question about why Google is giving unique results on a new search. I'd like your thoughts on the below? (Then I can compile my data.)
If you look at Question 3 - It has multiple page 4 "Ask.com" results.
1) Why is this?
2) To measure total results, would I add all of these (0+1+2+4 = 7 unique), or does the highest one include the others, i.e. 4 (since this incorporates the 0, 1, 2 results)?
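To make the two readings concrete, a tiny worked example using the per-page figures from the question (which reading GSA actually uses is exactly what is being asked here):

page_uniques = [0, 1, 2, 4]   # per-page counts taken from Question 3
print(sum(page_uniques))      # 7: each page reports only its own new results
print(max(page_uniques))      # 4: the highest figure already includes the rest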
@GlobalGoogler sorry, I'm a bit lost here and have no clue what you mean with your questions. It would be easier to have the URLs for the lines in your Excel sheets.
enabled=1
default checked=0
engine type=Parser
description=Parsing SEs.
search term=a
add keyword to search=0

enabled=1
default checked=0
engine type=Parser
description=Parsing SEs.
search term=
add keyword to search=0
Do you know of any tools that can handle large log files, and is there a way to delete all entries that do not have "[Page" in the line?
I use PsPad and there you can do...
1) remove all "[16:53:03] ", i.e. the timestamp.
Regex: search for \[..:..:..\] and replace with "".
2) remove all duplicate lines after I've removed the timestamp?
Sort and remove duplicates is an option there.
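For the log cleanup itself (keep only lines containing "[Page", strip the [HH:MM:SS] timestamp, drop duplicates), a minimal sketch that does the same thing as the Notepad++/PsPad steps above; the file names are placeholders:

import re

TIMESTAMP_RE = re.compile(r"\[\d\d:\d\d:\d\d\]\s*")  # e.g. "[16:53:03] "

def clean_log(src, dst):
    # Keep only "[Page" lines, strip timestamps, then drop duplicate lines.
    seen = set()
    with open(src, encoding="utf-8", errors="ignore") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if "[Page" not in line:
                continue                       # discard everything else
            line = TIMESTAMP_RE.sub("", line)  # remove the unique timestamp
            if line not in seen:               # dedupe only after the timestamp is gone
                seen.add(line)
                fout.write(line)

clean_log("gsa_log.txt", "gsa_log_clean.txt")  # hypothetical file names

This should also bring the file size down well below the 50 MB mentioned above, since the timestamps and duplicate lines are gone before Excel ever sees the file.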