url/meta crawler to respect webapps and fragment meta tag

myhq · December 2015

hi @sven,

we work with webapps, and since these cannot be crawled, we cache html content to serve to search engines.

For this we place the <meta name="fragment" content="!"> tag in the page, following Google's best practice.

The content for that URL is then available to crawlers at: www.mydomain.com?_escaped_fragment_=#

So appending ?_escaped_fragment_=# to the URLs returns the cached HTML version, which includes all the meta needed for GSA SER.

This would save us a whole lot of time, is probably easy to implement, and I guess it is a feature that will be appreciated more and more.

Thanks!

myhq · December 2015

To give a real, live example:

https://getmeinside.com/madrid

see the meta tag, and

https://getmeinside.com/madrid?_escaped_fragment_=#

Check the source: it includes all the required meta.
Note: on https://getmeinside.com?_escaped_fragment_=# the <meta name="fragment" content="!"> tag can be ignored, so the crawler does not go in circles.

>> Otherwise, all URLs will use the default meta set on the homepage...

myhq · December 2015

And with regard to my previous feature request, it would be nice to have a negative filter for the sitemap crawl, so I could filter out languages based on parts of the URL (regex): *es.* *de.* (subdomains in my case, or *https://getm* to filter out the English URLs). This way I could create a project per language and link it with CB & SC4 accordingly.
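A minimal sketch of such a negative filter, assuming the sitemap URLs are already available as a list. The function name filter_urls and the es./de. subdomain URLs are made up for illustration; this is not how SER handles sitemaps.

```python
import re

def filter_urls(urls, exclude_patterns):
    """Drop every URL that matches at least one of the exclude patterns."""
    compiled = [re.compile(p) for p in exclude_patterns]
    return [u for u in urls if not any(c.search(u) for c in compiled)]

# Hypothetical sitemap entries: one English URL and two language subdomains.
urls = [
    "https://getmeinside.com/madrid",
    "https://es.getmeinside.com/madrid",
    "https://de.getmeinside.com/madrid",
]

# Exclude the Spanish and German subdomains, keeping only the English URLs.
english_only = filter_urls(urls, [r"^https?://(es|de)\."])
print(english_only)
```

Running the same list through one exclude pattern per language would give one URL set per project.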

Sven · December 2015

You mean hashbang URLs, as seen with Twitter, for example? They use url/path#!tag, which is then translated to this ?_escaped_fragment_=...

That's already working in SER.
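For reference, the hashbang translation mentioned here can be sketched roughly as follows. This is only an illustration of the mapping, not SER's actual code: the function name is made up, and quote() escapes more characters than the AJAX crawling scheme strictly requires.

```python
from urllib.parse import quote

def hashbang_to_escaped(url):
    """Rewrite url/path#!tag as url/path?_escaped_fragment_=tag."""
    base, sep, fragment = url.partition("#!")
    if not sep:
        return url  # no hashbang, nothing to translate
    # Use '&' when the URL already carries query parameters.
    joiner = "&" if "?" in base else "?"
    # Simplification: percent-escape everything except '=' and '/'.
    return f"{base}{joiner}_escaped_fragment_={quote(fragment, safe='=/')}"

print(hashbang_to_escaped("http://www.example.com/path#!key=value"))
```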

myhq · December 2015

How could I make this work for our website then, since we use the meta tag, not hashbang URLs?

>getmeinside.com/sitemap.xml

Also, I tried to visit Twitter URLs with this addition, but I get a 404. Can you give me an example?

Sven · December 2015

I don't get you at all!? What url in that sitemap is not resolved as it should?

myhq · December 2015

The URLs are, but the keywords are not. Every URL takes the keywords from the homepage, rather than from the cached HTML for that page.

Sven · December 2015

Maybe we are talking about different things. It's hard to follow all of your threads. You speak of something completely different and jump to a next topic it seems. Now keywords are a problem?

myhq · December 2015

Long story short:

I found the solution: just have a tickbox to append
?_escaped_fragment_=
to the end of every URL that is scraped.

>> For the people using the <meta name="fragment" content="!"> option.

NOTE: the ?_escaped_fragment_= should only be appended for scraping the meta. We would not want to build links to [URL]?_escaped_fragment_=

You can ignore previous messages. (i deleted their content)

myhq · December 2015

@sven, can this be supported?

Sven · December 2015

I thought you fixed it?

myhq · December 2015

No, I need an alternative to the #!, since we are using the meta tag instead.

You don't need to adjust the scraper to look for these meta tags. Instead, this could be a simple checkbox in GSA that appends ?_escaped_fragment_= to the end of every URL in the sitemap when scraping the meta.

Note: we don't want the ?_escaped_fragment_= attached to the URLs we are building links to. It appears that is the case now for the #!, so you might want to review that.

myhq · December 2015

Hi @sven,

Can this method be added?

myhq · December 2015

@Sven

here are the specs:

If a page has no hash fragments, but contains <meta name="fragment" content="!"> in the <head> of the HTML, the crawler will transform the URL of this page from domain[:port]/path to domain[:port]/path?_escaped_fragment_= (or domain[:port]/path?queryparams to domain[:port]/path?queryparams&_escaped_fragment_=) and will then access the transformed URL. For example, if www.example.com contains <meta name="fragment" content="!"> in the head, the crawler will transform this URL into www.example.com?_escaped_fragment_= and fetch www.example.com?_escaped_fragment_= from the web server.

Source: https://developers.google.com/webmasters/ajax-crawling/docs/specification?hl=en
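The rule quoted above could be sketched like this, assuming the page HTML has already been downloaded. The names FragmentMetaFinder, has_fragment_meta and crawl_url are invented for the example, and as a simplification the parser accepts the meta tag anywhere in the document rather than only inside <head>.

```python
from html.parser import HTMLParser

class FragmentMetaFinder(HTMLParser):
    """Records whether <meta name="fragment" content="!"> appears."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("name") == "fragment" and a.get("content") == "!":
                self.found = True

def has_fragment_meta(html):
    finder = FragmentMetaFinder()
    finder.feed(html)
    return finder.found

def crawl_url(url, html):
    """Return the URL a crawler following the quoted rule would fetch."""
    # The rule only applies to pages without a hash fragment.
    if "#" in url or not has_fragment_meta(html):
        return url
    joiner = "&" if "?" in url else "?"
    return f"{url}{joiner}_escaped_fragment_="

page = '<html><head><meta name="fragment" content="!"></head><body></body></html>'
print(crawl_url("http://www.example.com", page))
```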

Sven · December 2015

Well, at least I understand now what this meta and "!" is all about. I actually never saw anyone using this.
So which function are you talking about where this should be added? Project edit->urls->edit-> "Crawl URLs"?

myhq · December 2015

Hi @Sven, thanks for your reply. I understand it's not something you hear often; web apps are still rather new, but up and coming. So I am sure more and more people are going to need this.

So either your META crawler, when it crawls the page and sees this meta tag in the header, crawls the alternate URL for the keywords instead:

domain[:port]/path?_escaped_fragment_=

or you add a checkbox in the interface that appends ?_escaped_fragment_= to every URL when scraping the keywords: http://screencast.com/t/gjpMRzlnK1jP
Whatever is easiest.

Also note that currently the #! handling adds ?_escaped_fragment_= to the target URLs for link building, and I don't think you or anybody wants this. You only want to use this parameter for scraping the meta; links should be built to the clean URL, without ?_escaped_fragment_=: http://screencast.com/t/H498Htx5

So although we use domain.com/targeturl/?_escaped_fragment_= for getting the keywords, we want to build backlinks to: domain.com/targeturl/
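To make the split between the scraping URL and the link target concrete, here is a small sketch that strips the parameter again before a URL is used for link building. The function name clean_target is hypothetical, and note that urlencode may re-encode other query parameters slightly differently from the original string.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def clean_target(url):
    """Remove the _escaped_fragment_ parameter, keeping all other parts."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "_escaped_fragment_"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Scrape keywords from the first URL, build backlinks to the second.
scrape_url = "http://domain.com/targeturl/?_escaped_fragment_="
print(clean_target(scrape_url))
```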

I hope it's clear now. Let me know if you have any more questions.

myhq · December 2015

@Sven. is this ok now?

Sven · December 2015

It's on my to-do list.

myhq · December 2015

OK @sven, keep me posted. Do you have any timeline I can work with?

Sven · December 2015

Well, not this year anymore. At the start of next year I am probably busy fixing things here and there, and then I'm back to work on the to-do list... so maybe in the 2nd week of next year.

myhq · March 2016

@sven, did this get included yet?

Sven · April 2016

Will try to add support in the next update.

myhq · April 2016

Thanks @Sven. Can I make an additional suggestion for the SitemapScrape feature?

An average URL looks like:
http://ES.domain.com/maincategory/subcategory/postname

where ES in this case stands for Spanish on a multilingual site.

Could the scraper let us set a "minimum fragment" that must be present in a URL before it is added to the list to build links to, e.g.:
http://ES.domain.com/maincategory/subcategory/

And then let us set how many levels deep we want to build links, e.g.: 1

In this example, this will build links only to the Spanish portion of the website, and only to URLs from the selected subcategory (including the subcategory page itself).

It's a simple filter, but it allows much greater targeting when using the sitemap feature...

Another example:

http://domain.com/maincategory/
levels: 1

>> this would only build links to the ENGLISH pages (if that is default) and only to the subcategories within the selected main category (and the main category page itself)... (BUT not to individual post pages)

I assume these links are updated every day before campaigns resume? That way, updates on the site would be picked up the next day and included in the project.
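A rough sketch of the "minimum fragment plus levels" filter described above, under the assumption that a level is one path segment below the given prefix. The function name within_levels is made up, the comparison is case-sensitive, and query strings are not handled.

```python
def within_levels(url, prefix, levels):
    """True if url sits under prefix, at most `levels` path segments deeper."""
    base = prefix.rstrip("/")
    if not url.startswith(base):
        return False
    rest = url[len(base):]
    # Reject look-alike sections such as /maincategory2 for /maincategory.
    if rest and not rest.startswith("/"):
        return False
    depth = len([segment for segment in rest.split("/") if segment])
    return depth <= levels

prefix = "http://ES.domain.com/maincategory/subcategory/"
candidates = [
    "http://ES.domain.com/maincategory/subcategory/",
    "http://ES.domain.com/maincategory/subcategory/postname",
    "http://ES.domain.com/maincategory/other/postname",
    "http://domain.com/maincategory/subcategory/postname",
]
selected = [u for u in candidates if within_levels(u, prefix, 1)]
print(selected)
```

With the second example, prefix http://domain.com/maincategory/ and levels 1, the subcategory pages pass but individual post pages do not, because they sit two segments below the prefix.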

myhq · April 2016

If we want all URLs in the sitemap, we would simply put:
*.domain.com/
levels: 10

Sven · April 2016

Well please simply use the popup menu to check/uncheck things.

myhq · September 2016

I solved this by creating RSS feeds for the required links instead :)
