1) How do I stop the footprint from actually submitting? I just edited the Search term, but it's still doing submissions. Basically I want a footprint that only parses results. (See pastebin http://pastebin.com/2QCXqFxL)
2) I am reviewing Google International and see it has 2500 results. [1281/2129] - Is it possible that the SE has this many results? I thought Google was limited to 1000 only.
Maybe you can help me structure my next test. I want to measure the overlap between keywords and SE parse results.
So if you have
"a blue widget"
"b blue widget"
"c blue widget"
You are likely to get an overlap in SE results. (Obviously these are very, very similar, so the overlap will be high.) But I want to measure the overlap between search terms and thus remove keywords that are too similar, before we waste resources parsing them only to reject the results as already parsed.
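A minimal sketch of one way to pre-filter near-duplicate search terms before parsing. The token-based Jaccard measure and the 0.5 threshold are my own assumptions for illustration, not anything GSA itself does, and the threshold would need tuning against real overlap data:

def jaccard(a, b):
    # Jaccard similarity between the word sets of two keywords.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def filter_similar(keywords, threshold=0.5):
    # Keep a keyword only if it is not too similar to one already kept.
    kept = []
    for kw in keywords:
        if all(jaccard(kw, k) < threshold for k in kept):
            kept.append(kw)
    return kept

print(filter_similar(["a blue widget", "b blue widget", "c blue widget", "red gadget"]))
# -> ['a blue widget', 'red gadget'] with the example terms above

The same Jaccard measure could later be applied to the sets of parsed URLs per keyword, to check how well search-term similarity actually predicts result overlap.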
@Ozz - That gives an error. "No search terms defined and no fixed URL"
It seems the first one you provided is working. Looks like it's all sorted. Now for some macros, to sort data, and some research to solve the keyword overlap issue. :-)
Is this possible: To have a total of 1545 Google.com results?
Google SE results parsed for a single keyword?
What I have done is take "090/093 [Page 001]", remove the first number (090), place the next number (093) into a total-results column, do this for each Page, and then total it all.
I just thought Google only has 1000 results per keyword?
How many results do you get with 'search online for URL'? Save to file and scan the file. The only reason that comes to mind for me is that the Google links (to Google Images, Videos, ...) are saved, but those should be blocked by the SE script.
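To avoid the manual copy-and-total step described above, here is a minimal sketch that pulls the per-page counts out of log lines containing fragments like "090/093 [Page 001]" and sums the second number. The line format is only inferred from that example (and "gsa_log.txt" is a placeholder name), so the regex may need adjusting against a real log:

import re

PAGE_RE = re.compile(r"(\d+)/(\d+)\s+\[Page (\d+)\]")  # format assumed from the example above

def total_results(log_path):
    # Sum the second number of every "xxx/yyy [Page nnn]" fragment in the log.
    total = 0
    with open(log_path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            m = PAGE_RE.search(line)
            if m:
                total += int(m.group(2))
    return total

print(total_results("gsa_log.txt"))  # hypothetical log file name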
1) I don't know exactly how to do it with Notepad++. Maybe it's doable with regex commands and wildcards??
Sometimes I do such things with Excel (or OpenOffice Calc in my case). I copy/paste all lines into Excel and, when I'm asked how to import them, I make sure to split each row on a SPACE or "[" or "]". Once they are imported I delete the first rows and copy/paste all lines back into your editor.
2) For that you need to have the TextFX plugin installed.
@ozz - the reason for stripping this out is so that I can then remove duplicate lines, since the timestamp is unique but adds no value for my tests. Basically, I'm trying to get the file size down from 50 MB so I can import it into Excel and do all the delimiter functions you speak of.
Basically, I need to remove all duplicates and get the file size down...
To delete duplicates: go to the 'TextFX tools' menu, check 'sort ascending' and 'sort output only unique lines', then select your text and go to TextFX tools => 'sort lines case sensitive at column'.
This will sort all your lines case sensitive and will delete any duplicate lines.
Can anyone explain why, when I import the data into Excel, the Excel file is 3 MB, but in .txt format it's over 30 MB? Why are .txt files so inefficient???
+1 for global googler for putting all this time into testing all this.
Reading this post I figured I must have a huge overlap in results, as I just selected all English SEs (156) because someone told me to do this when I was learning GSA. So now I'm changing it up a little bit, but I'm finding it very hard to pick SEs. How many SEs do you guys have selected? And are they all international, or spread out over random countries?
@pietpiet - I did a lot of work on this researching as did @LeeG and a few others. I think we might need to have a separate thread to hash this out.
Google results are identical for a number of American/English properties like the Cook Islands, Bermuda, etc. So you are duplicating efforts when you choose those.
Startpage pulls Google data, so that is a dup. Of course, Bing = Yahoo, but a question might be which is better, Bing or Yahoo.
Then there's the issue of international SEs, where @LeeG says it defaults to the country that your proxy IP resides in.
And then all the metacrawlers appear to be supplied the same data by Infospace according to my research.
So if you think you are confused, well, so am I. There are a lot of variables. The more you research it, you begin to realize there are different reasons for different choices.
Then of course our European friends on this board are going to have a different strategy because they are probably trying to rank in different places than the Americans.
It needs to be hashed out in a thread, and @Ozz and @Sven will need to help guide it as we are also dealing with issues on how GSA retrieves data, and how much the different SE's provide, that needs to be accounted for as well.
@GlobalGoogler I can not really answer your question here. You have to keep in mind that the program sometimes uses a timeframe on Google to e.g. only show results from the last X days/hours and so on. This can change the results as well. Also it might find some promotion URLs (AdWords), and that changes things sometimes as well.
@Sven - I understand. Just doing some testing here for SE's. I'm trying to measure the SE overlap that occurs with the different SE's.
You've answered my question about why Google is giving unique results on a new search. I'd like your thoughts on the below? (Then I can compile my data.)
If you look at Question 3 - It has multiple page 4 "Ask.com" results.
1) Why is this?
2) To measure total results, would I add all of these (0+1+2+4 = 7 unique), or does the highest one include the others, i.e. 4 (since this incorporates the 0, 1, 2 results)?
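To make the two readings concrete, a tiny worked example using the per-page figures from the question (which reading GSA actually uses is exactly what is being asked here):

page_uniques = [0, 1, 2, 4]   # per-page counts taken from Question 3
print(sum(page_uniques))      # 7: each page reports only its own new results
print(max(page_uniques))      # 4: the highest figure already includes the rest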
@GlobalGoogler sorry, I'm a bit lost here and have no clue what you mean with your questions. It would be easier to have the URLs for the lines in your Excel sheets.
enabled=1
default checked=0
engine type=Parser
description=Parsing SEs.
search term=a
add keyword to search=0

enabled=1
default checked=0
engine type=Parser
description=Parsing SEs.
search term=
add keyword to search=0
Do you know of any tools that can handle large log files, and is there a way to delete all entries that do not have "[Page" in the line?
I use PsPad and there you can do...
1) remove all "[16:53:03] ", i.e. the timestamp.
Regex: search for \[..:..:..\] and replace with "".
2) remove all duplicate lines after I've removed the timestamp?
Sort and remove duplicates is an option there.
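For the log cleanup itself (keep only lines containing "[Page", strip the [HH:MM:SS] timestamp, drop duplicates), a minimal sketch that does the same thing as the Notepad++/PsPad steps above; the file names are placeholders:

import re

TIMESTAMP_RE = re.compile(r"\[\d\d:\d\d:\d\d\]\s*")  # e.g. "[16:53:03] "

def clean_log(src, dst):
    # Keep only "[Page" lines, strip timestamps, then drop duplicate lines.
    seen = set()
    with open(src, encoding="utf-8", errors="ignore") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if "[Page" not in line:
                continue                       # discard everything else
            line = TIMESTAMP_RE.sub("", line)  # remove the unique timestamp
            if line not in seen:               # dedupe only after the timestamp is gone
                seen.add(line)
                fout.write(line)

clean_log("gsa_log.txt", "gsa_log_clean.txt")  # hypothetical file names

This should also bring the file size down well below the 50 MB mentioned above, since the timestamps and duplicate lines are gone before Excel ever sees the file.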