url/meta crawler to respect webapps and fragment meta tag

myhq usa
edited December 2015 in Feature Requests
hi @sven,

we work with webapps, and since these cannot be crawled, we cache html content to serve to search engines.

For this we place the <meta name="fragment" content="!"> tag in the content, following Google's best practice.

content for that url is then available for crawlers on:
so appending ?_escaped_fragment_=# to a url returns the cached html version, which includes all the meta needed for GSA SER.

This would save us a whole lot of time, is probably easy to implement, and I guess it is a feature that will be appreciated more and more.



  • myhq usa
    edited December 2015
    to give a real-life example:

    see the meta tag, and
    check source... includes all meta required...
    notes: on the 
    <meta name="fragment" content="!"> can be ignored not to go in circles.

    >> else, all urls will use the default meta set on the homepage...
  • and with regard to my previous feature request, it would be nice to have a negative filter for the sitemap crawl, so i could filter out languages based on parts of the url (regex): *es.* *de.* (subdomains in my case, or *https://getm* to filter out the english urls), and this way create a project per language and link it with CB & SC4 accordingly.
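
A negative filter like the one requested above could look roughly like this. It is only a sketch: the function name `exclude_urls` and the example.com URLs are made up for illustration, and it treats the patterns as shell-style wildcards (the `*es.*` form used in the post) via Python's `fnmatch`, not as full regular expressions. It says nothing about how GSA SER filters sitemaps internally.

```python
from fnmatch import fnmatchcase


def exclude_urls(urls, patterns):
    """Return the URLs that match none of the wildcard patterns."""
    return [u for u in urls if not any(fnmatchcase(u, p) for p in patterns)]


# Hypothetical sitemap entries; the real site's URLs are not shown in the thread.
sitemap = [
    "https://es.example.com/producto",
    "https://de.example.com/produkt",
    "https://example.com/product",
]

# Drop the Spanish and German subdomains, keep everything else.
english_only = exclude_urls(sitemap, ["*//es.*", "*//de.*"])
```

A loose pattern such as `*es.*` also matches any URL that merely contains `es.` (for example a path ending in `pages.html`), so anchoring on `//es.` is the safer choice for subdomains.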
  • Sven
    edited December 2015
    you mean hashbang urls as seen with twitter e.g. . They use url/path#!tag, which is then translated to this ?_escaped_fragment_=...

    That's already working in SER.
  • how could i make this work for our website then, since we use the meta name, not hashbang urls.

    Also, I tried to visit twitter urls with this addition, but i get 404, can you give me an example?

  • Sven
    I don't get you at all!? What url in that sitemap is not resolved as it should be?
  • the urls are, but the keywords are not.. every url takes the keywords from the homepage, rather than from the cached html for that page.
  • Sven
    Maybe we are talking about different things. It's hard to follow all of your threads. You speak of something completely different and jump to the next topic, it seems. Now keywords are a problem?
  • myhq usa
    edited December 2015
    Long story short:

    I found the solution: just have a tickbox to add:
    after every scraped url (at the end).
    >>for the people using the <meta name="fragment" content="!"> option.

    NOTE: the ?_escaped_fragment_= should only be appended for scraping the meta. We would not want to build links to [URL]?_escaped_fragment_=

    You can ignore previous messages. (i deleted their content)
  • @sven, can this be supported?
  • Sven
    I thought you fixed it?
  • no i need an alternative to the #!, since we are using the meta names instead..

    you don't need to adjust the scraper to look for these meta tags; instead this could be a simple checkbox in GSA that appends ?_escaped_fragment_= to every url in the sitemap when scraping the meta.

    note: we don't want the ?_escaped_fragment_= attached to the urls we are building links to. It appears that is currently the case for the #! urls, so you might want to review that.
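
To illustrate that last point, here is a rough sketch of cleaning a URL before it is used as a link target. The function name `clean_link_target` and the example.com URLs are assumptions for illustration only, not anything GSA SER exposes. It simply drops the `_escaped_fragment_` parameter whatever its value, so it covers the meta-tag case (empty value); mapping a non-empty value back to a #! URL is not attempted.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PARAM = "_escaped_fragment_"


def clean_link_target(url: str) -> str:
    """Return the URL without the _escaped_fragment_ query parameter."""
    parts = urlsplit(url)
    # keep_blank_values=True so other empty-valued parameters survive.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != PARAM
    ]
    # Note: urlencode re-encodes the remaining parameters.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

The snapshot URL would then only ever be used for reading the meta, while `clean_link_target(...)` gives the URL to build links to.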
  • Hi @sven

    Can this method be added?
  • @Sven

    here are the specs:

    1. If a page has no hash fragments, but contains <meta name="fragment" content="!"> in the <head> of the HTML, the crawler will transform the URL of this page from domain[:port]/path to domain[:port]/path?_escaped_fragment_= (or domain[:port]/path?queryparams to domain[:port]/path?queryparams&_escaped_fragment_=) and will then access the transformed URL. For example, if contains <meta name="fragment" content="!"> in the head, the crawler will transform this URL into and fetch from the web server.
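
The transformation in that spec can be sketched in a few lines of Python. The function name `to_escaped_fragment_url` and the example.com URLs in the usage note are made up for illustration; this is a minimal reading of the quoted rule, not code from GSA SER or from Google.

```python
from urllib.parse import urlsplit, urlunsplit


def to_escaped_fragment_url(url: str) -> str:
    """Append an empty _escaped_fragment_ parameter, as the quoted rule describes."""
    parts = urlsplit(url)
    param = "_escaped_fragment_="
    # path             -> path?_escaped_fragment_=
    # path?queryparams -> path?queryparams&_escaped_fragment_=
    query = f"{parts.query}&{param}" if parts.query else param
    # The rule covers pages without hash fragments, so none is carried over.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

For instance, `to_escaped_fragment_url("http://example.com:8080/path?lang=es")` gives `http://example.com:8080/path?lang=es&_escaped_fragment_=`.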

  • Sven
    Well at least I understand now what this meta and "!" is all about. I actually never saw anyone using this.
    So which function are you talking about, where should this be added!? Project edit->urls->edit-> "Crawl URLs" ?
  • Hi @Sven, thanks for your reply. I understand it's not something you hear often; web-apps are still rather new, but up and coming. So I am sure more and more people are going to need this..

    so either your META crawler, when it crawls the page and sees this meta name in the head, crawls the alternate url for the keywords instead:

    or you add a checkbox in the interface to append ?_escaped_fragment_= to every url when scraping the keywords.
    whatever is easiest

    Also note that currently the #! handling adds ?_escaped_fragment_= to the target urls for link building, and I don't think you or anybody wants this. You only want to use this parameter for scraping the meta; links should be built to the clean url, without ?_escaped_fragment_= ..:

    So although we use for getting the keywords, we want to build backlinks to:

    I hope it's clear now, let me know if you have any more questions
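
The first option (the crawler spots the meta tag itself) might be sketched as below. Everything named here (`FragmentMetaDetector`, `has_fragment_meta`, `choose_urls`, the example.com URL) is hypothetical and only illustrates the idea; for simplicity the tag is accepted anywhere in the document rather than only inside <head>. Only the HTML of the clean page is inspected, so the tag inside the cached snapshot never triggers a second rewrite.

```python
from html.parser import HTMLParser


class FragmentMetaDetector(HTMLParser):
    """Sets .found when a <meta name="fragment" content="!"> tag is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "fragment" and attr.get("content") == "!":
            self.found = True


def has_fragment_meta(html: str) -> bool:
    detector = FragmentMetaDetector()
    detector.feed(html)
    return detector.found


def choose_urls(page_url: str, html: str) -> tuple[str, str]:
    """Return (url to scrape the meta from, url to build links to)."""
    if not has_fragment_meta(html):
        return page_url, page_url
    separator = "&" if "?" in page_url else "?"
    # The snapshot is used for keywords only; the link target stays clean.
    return f"{page_url}{separator}_escaped_fragment_=", page_url
```

The checkbox variant would skip `has_fragment_meta` and always return the pair with the parameter appended.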

  • @Sven. is this ok now?
  • Sven
    It's on my to-do list.
  • ok @sven, keep me posted. you have any timeline i can work with?
  • Sven
    Well, not this year anymore. At the start of next year I am probably busy fixing things here and there, and then I'm back to work on the to-do list, maybe in the 2nd week of next year.
  • @sven, did this get included yet?
  • Sven
    will try to add support for next update.
  • thanks @Sven. Can I make an additional suggestion for the SitemapScrape feature?

    where ES in this case stands for spanish on a multi lingual site.

    Can you let us set a "minimum fragment" in the scraper that needs to be present in URLs added to the list to build links to, e.g.:

    And then set how many levels deep we want to build links, e.g.: 1

    This will build links, in this example, only to the spanish portion of the website, and only to urls from the selected subcategory.. (including the subcategory page itself)

    It's a simple filter, but allows much greater targeting when using the sitemap feature...

    Another example:
    >> this would only build links to the ENGLISH pages (if that is default) and only to the subcategories within the selected main category (and the main category page itself)... (BUT not to individual post pages)

    I assume that every day these links are updated before resuming campaigns? so updates on the site would be picked up the next day and included in the project..
  • if we want all urls in the sitemap, we would simply put:
    levels: 10
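
The example URLs for this suggestion did not survive in the thread, so the following is only one possible reading of the "minimum fragment" plus "levels" filter. The function name `within_levels` and the example.com URLs are invented for illustration. The assumption is that a URL qualifies if it contains the fragment and has at most `levels` further path segments after it, with the fragment's own page counting as level 0.

```python
def within_levels(url: str, minimum_fragment: str, levels: int) -> bool:
    """True if url contains the fragment and is at most `levels` segments deeper."""
    # Ignore query string and hash when counting path depth.
    base = url.split("?", 1)[0].split("#", 1)[0]
    index = base.find(minimum_fragment)
    if index == -1:
        return False
    remainder = base[index + len(minimum_fragment):]
    # Reject matches that end in the middle of a path segment.
    if remainder and not (remainder.startswith("/") or minimum_fragment.endswith("/")):
        return False
    depth = sum(1 for segment in remainder.split("/") if segment)
    return depth <= levels


# Hypothetical Spanish subcategory on a multilingual site.
fragment = "https://es.example.com/zapatos"
```

With `levels` set to 1 this keeps the subcategory page and its direct children; a large value such as 10 effectively keeps everything under the fragment.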
  • Sven
    Well please simply use the popup menu to check/uncheck things.
  • I solved this by creating RSS feeds for the required links instead :)