Google Indexing

I first posted this Google Sites website nearly a year ago. Despite persistent efforts at inviting a visit from the 'googlebots', and the fact that the site is actually hosted by Google, it has still not been crawled, let alone indexed.

For anyone not familiar with them, Google Sites websites are a useful starting point for anyone who has no experience of web authoring, and they have one key advantage over the competition - they're free to set up.

Before I encourage anyone to set up their own site, though, I thought it would be only fair to offer a 'no holds barred' chronicle of what I've done since June last year in an attempt to get the site onto Google searches. This should help anyone interested to make a more informed choice. If you're happy for your site to get into all the major search engines apart from Google, then go ahead - if not, you'll probably need to buy a domain to stand any chance of getting indexed by Google.

If you are contemplating setting up your own Google site, please read on before deciding whether or not to go with Google....


Google Indexing and its Limitations – A Personal Chronicle

Introduction

I decided in mid-2021 that I needed a website, and wanted to make this as simple as possible while keeping costs to a minimum. While doing some basic research on the options available, I 'discovered' the Google Sites free website offering. I decided to try out a new site using this platform to see if it would work for me.

My main motivation was to provide a mechanism for the occasional sharing of small files with colleagues and friends, avoiding the complexities of providing them with individual links to Google Drive files or sending them as email attachments. As it happened, I’d also recently authored a discussion paper on the design of Covid19 2nd Generation vaccines, and I wanted to share this with the scientific community. Using a website in this way would avoid the protracted delays I’d experienced in the past when publishing scientific papers via the conventional peer-reviewed routes. This is obviously a key requirement in a healthcare area where things are changing so rapidly, and where last week’s headlines quickly become this week’s ‘old news’.

Here's how I got on...

Setting Up

The initial process of creating a Google Sites website was fairly intuitive, and after a bit of getting used to it and some research elsewhere on how best to use the system, I managed to set up an embryonic site with some file sharing capability using GDrive as a source. I then added some additional ‘feed in’ text on a couple of extra pages detailing the titles and a summary of the contents of the downloadable files, since, by that time, I’d realised that GDrive files cannot be indexed by search engines. I also added some more detailed text explaining the objectives behind the Covid19 discussion paper, which I’d already added to the GDrive shared directory. At this point, I thought it would be sensible to ask a colleague to check out the site for me and confirm that it was actually visible via the URL.

It was - so far, so good…..

Indexing

I then turned to the thorny subject of Google Searches. Perhaps naively, I assumed that a site offered by Google would have some degree of priority for the googlebots and would automatically get indexed within a few days of being set up, so I waited……and waited…and waited.

After a couple of weeks, I decided that I should at least check whether something was amiss, on the assumption I’d probably done something wrong during the setup process. A bit of additional research and a lot more time spent going round in circles within Google’s help topic system sadly left me little the wiser. In an attempt to get some more insight from deeper within Google, I registered accounts for Google Analytics and Google Search Console (GSC). This led to a lot more poring over ‘help’ files, but eventually I managed to work out how to get Analytics readouts for my site and inspect URLs for the site and its sub-pages on GSC. I couldn't persuade GA and GSC to connect, though.

That was nearly a year ago….

Despite regular searching on Google in the interim, using the specific ‘site:https://sites.google.com/view/websitename/’ search format, nothing ever came up, and GSC’s URL inspection option has never to this day reported the site as being ‘Known to Google’ or claimed to have crawled it.

Follow-Up

Having become quite unimpressed (to say the least) at the complete lack of progress on indexing after 3 months, I decided to see if I could get some help from the community forums. In the event, this proved quite helpful, in confirming a) that I wasn’t the only one to struggle with Google indexing and b) that it’s a largely, if not completely, opaque process, both in terms of how the indexing ‘bots’ decide which sites to crawl and the time they take to do it.

Some useful insight also emerged on the criteria for crawling – first and foremost, there is no guarantee that any site, however well thought out and professional, will be indexed. There is rumoured to be an emphasis on ‘originality’ and ‘interest’ in the criteria used by the bots' algorithms, although I’ve never been able to pin down a definition of what either term actually means in this context, or how the algorithm assesses when a site is lacking in interest and originality (i.e. is too boring for the bots to bother with).

My own site certainly contains content which appears to be original, in that nothing matching it comes up anywhere in Google or Bing searches. (This could, of course, be because related info exists somewhere, but only on websites also deemed insufficiently ‘interesting’ and ‘original’ to be worthy of the bots’ time…). Since the Covid19 content is newly authored, is regularly updated, and hasn’t been published elsewhere, it would by most normal standards be regarded as original. Whether the content could be classified as ‘interesting’ depends on your point of view. I would, however, submit that anything relating to Covid19 vaccine design which might in future have any chance of helping us ward off repeated waves of this dreadful scourge would be of more than passing interest…to the scientific community if no one else.

Conclusions and Discussion

At the time of writing it does appear increasingly unlikely that my site will be indexed by Google.

Since the criteria for site validity are so opaque, the extensive and time-consuming process of trial and error that would be needed to establish what actually ‘works for the bots’ does not appeal. The urge to junk the site at this stage and find alternative ways of sharing the info is certainly a strong one, although my lifelong training and experience as an experimental scientist does push me towards a few more experiments to see if I can ‘crack’ the problem. I will endeavour to update this page if I discover anything that actually works….

In all fairness, on present evidence I would have to recommend that anyone choosing to start a new website and aiming for global visibility consider carefully the time and effort likely to be involved before embarking on the project, given the risks of their site effectively remaining forever hidden, however much effort they put in. I suspect not even professional web designers will have all the answers to the mysteries of indexing.

In this context it might also be worth considering why you would actually want to go to the trouble of setting up a website at all, given that there are many other ways of sharing data electronically. Indeed, open web access may not be the best solution from a security point of view if any of your content needs to be kept confidential. If you want to share data on a restricted basis via the net, a simple site such as this could be a viable solution, particularly since Google Sites provides a 'no robots' tick box within the settings option. Although selecting this option does not completely prevent crawling, it renders it highly unlikely, since most search engines do adhere to this as a global request. Without indexing, in practice only users who already have the URL will 'see' your site. Even if the site does eventually get indexed, it's widely acknowledged that few 'casual' Google searches progress much beyond the first page of results; only refined (i.e. more specific) searches are likely to surface your site early enough in the vast swathe of results for it actually to be seen.
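For illustration, most search engines honour a robots meta tag placed in a page's head. The snippet below is a minimal sketch, using only the Python standard library, of how you might check whether a page carries a 'noindex' directive of this kind; the sample HTML is hypothetical and is not necessarily the exact markup Google Sites emits when the 'no robots' box is ticked.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            # content is a comma-separated list, e.g. "noindex, nofollow"
            self.directives.extend(
                d.strip().lower() for d in attrs.get("content", "").split(",")
            )

def is_noindex(html: str) -> bool:
    """True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

# Hypothetical page head, for demonstration only
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(is_noindex(page))  # True
```

In practice you would feed this the HTML fetched from your own site's URL, to confirm whether the tick box has had the expected effect.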

There are other issues specific to the new Google Sites format, the main one being the limitations on what you can do to customise the site. Among other things, it’s not possible to suppress the last-update time and ID data on GDrive file listings without a Google Administrator account – for which, of course, you have to pay a regular subscription charge. This could well present a security issue for those who want to make files available to others without revealing their ID. The new format also precludes generating sitemaps – less of a problem for simple sites, but potentially an issue for more complex ones. Neither is it possible to modify the robots.txt file, which is locked to users. In short, if you need anything more than a simple repository for limited file sharing and actually want your efforts to be visible to all on the net within a reasonable timeframe, then Google Sites, with all its associated uncertainties and limitations, may not be the ideal solution. A paid-for site with greater flexibility and its own domain name might prove less frustrating. Although I have no direct evidence for this, my gut feeling is that the new Sites format may actually deter crawling by Googlebots. The fact that all the other major search engines appear to have indexed my site within the first 6 months and Google hasn't, despite there being no obvious blockages reported by GSC, does make you wonder....
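Even though the robots.txt file itself is locked, its effect on crawlers can at least be checked. The sketch below uses Python's standard-library `urllib.robotparser`; the robots.txt content shown is a made-up example, not the file Google actually serves for a Sites site – in practice you would substitute the text fetched from your own site's /robots.txt URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content -- NOT the file Google actually serves;
# replace with the text retrieved from https://<your-site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Would a crawler identifying as Googlebot be allowed to fetch these URLs?
print(rp.can_fetch("Googlebot", "https://example.com/view/page"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))   # False
```

A check like this can at least rule out robots.txt as the reason a site isn't being crawled – which, in my case, it did.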

My experiences to date have been valuable in pointing out the hidden limitations associated with searching the net. If my relatively simple site layout and content fails Google indexing, how many others have already suffered, or will in future suffer, the same fate? How much useful and possibly quite novel info will not be shared with the community as a result?

While the undoubted power and usefulness of the major search engines including Google continues to benefit us all, and we should be grateful for them, the vast and ever-increasing pool of internet knowledge makes these questions ever more relevant.

I will certainly be interpreting my own search results in a slightly different light from now on, and will cast the net a little wider in future searches, particularly when it comes to fast developing topics where new information is of the essence. There are some quite useful 'multi-engine' search sites which allow you to apply a range of different search engines to a single search term without re-entering it each time.

Let’s hope this article will ‘flesh out’ the home page of this site sufficiently to tempt the googlebots to visit - and actually do some indexing…..time will tell.

I hope this brief review of what happened will be helpful to others who may be experiencing similar issues with indexing….please feel free to leave any comments you may have via the Contacts page.


Postscript, Updates and Some Additional Thoughts

17.9.21: I finally found out why I wasn’t able to persuade Google Search Console to link with Google Analytics. (Association is recommended by Google to optimise tracking.) The secret is to make sure you generate a ‘Universal’ property rather than just the default ‘GA4’ property, which is apparently the new standard. (A Universal property ID starts with a UA- prefix rather than a G-.) Without this type of property, you can't get the two accounts to associate.

No wonder I was confused…..

In case others are similarly ‘mired’ in help page-inspired confusion, here’s a link to a page that might help explain things and actually help you get your Analytics and GSC accounts linked:

https://www.analyticsmania.com/other-posts/how-to-create-a-universal-analytics-property/

Surely Google could make life a little simpler for us novices....

18.9.21: At last...a look at today's Analytics demographics revealed hits from Russia and China. A quick follow-up search on Yandex yielded a search result for 'vivweb01'. Nothing yet on Google or Bing despite continued indexing requests...this does beg the question as to why the Asian search engines seem to be getting there first....

25.9.21: Finally, some substantial progress. A quick search on 'vivweb01' reveals entries on 4 more of the major search engines: Yahoo, AOL, DuckDuckGo and Bing. I've so far only requested indexing on Bing, so the crawls on the other engines must have occurred 'naturally'.

11.10.21: Yandex now yields a vivweb01 website entry for the specific text: "SARS-CoV-2: An Alternative Development Strategy for Second Generation Vaccines aimed at Combating Escape Variants". I presume this means indexing is complete for all sub-pages of this site. So far Yandex is the only search engine that has managed to do this.

29.10.21: Still no progress with Google indexing. The more I research other folks' experience with their Google Sites, the more convinced I become that they aren't really indexable in their unmodified free form.... (could it perhaps be because Google want to sell you a domain to go with them !?).

5.11.21: All major search engines except Google, i.e. Yahoo, Bing, and Yandex have now indexed this site. The discrepancy has been reported to Google via the GSC feedback page. No reply, or evidence of indexing, as of 12.12.21.

2.12.21: A bit more digging into some of the community posts revealed some interesting theories about the preferences built into the algorithms used to determine site 'indexability'. There appears to be a perception that a strong bias exists towards older sites. If this is really the case, the policy is a short-sighted one on Google's part and will do two things: it will effectively exclude new websites, thereby depriving Google searchers of new and arguably more up-to-date info, and it will risk retaining old 'zombie' sites which have long since become defunct and are no longer relevant, thus failing to remove 'clutter' from the index. This could put Google at a disadvantage against the competition, as borne out by my experience so far.

Take home message: If you really want your website indexed on Google before you retire, think carefully about the suitability of a Google site!

If you do want to 'sample the delights', however, then head over to my 'Google Site Setup' page for step-by-step instructions on how to set one up.

25.1.22: An interesting discovery - starting a Google Blog yesterday with a single post resulted in overnight indexing for the new blog. I have posted a link to this website on the blog post, and will monitor GSC to see if this has any effect on the bots....this is strong circumstantial evidence for the existence of a 'covert' restriction on indexing new Google Sites.

18.5.22: As we near the first anniversary of this site's first posting with still no Google indexing apparent, it is perhaps time to consider again the reasons why it hasn't happened. This is an important consideration not only for users such as myself who would quite like their sites to be 'visible', but also for Google.

Assuming that Google actually want people to use their sites, the main problem users are likely to have in committing to a Google Site in the first place is the lack of any available information as to why their sites might not get indexed. The cult of secrecy that seems to surround the Google algorithms and how they work certainly doesn't help inspire confidence. Indeed it does beg the question as to whether anyone (at Google or elsewhere) actually fully understands how the search algorithms work in their entirety....

My own experience as a scientist developing and implementing methods provides a valuable clue as to why this might have happened. Scientific methods tend to evolve as time goes on and technology/management changes. Rather than junking the old method and starting again (which in many cases would be the most efficient course), modifications and additions are usually added to adapt to changing circumstances, new technology, and other changes in the world around them. The more people that contribute to the design of a method, the more complex (and less comprehensible) it becomes - and the more likely it is to 'fall flat on its face' when subjected to use in the real world. This in turn makes it much more difficult to troubleshoot.

The same is also true of computer coding, and I suspect that Google's search algorithms may have suffered a similar fate. It would be fascinating to see what happened if a US court subpoenaed 'the Google algorithm' as evidence in a criminal trial......

The 'indexing problem' could, of course, just be the result of the sheer volume of new websites coming online every day, which simply overwhelms the bots' capacity and prevents them crawling all but a small proportion of them. This can't be the whole story, though, since it doesn't explain why all the other search engines picked up my site within 3 months of it being published, and my blog was indexed overnight.

I suspect that, as with many problems, the causes are multiple. Like our dear UK NHS, the indexing system was designed in a much simpler age, and the many 'patches' that have been introduced to keep it afloat may have made it so complex and unwieldy that it is no longer fit for purpose. The inordinately long waiting times for sites to be indexed are reminiscent of NHS treatment waiting lists, which are already driving patients to the private sector, and risk the UK health system declining into a 2-tier one, with a significant privately funded element and a basic 'safety net' funded by the taxpayer. Despite Google's dominance in the internet search field, evolutionary pressure in the form of consumer demand may well leave them behind if they fail to adapt....

22.5.22: To satisfy my curiosity, I've just carried out an in-depth review of forum correspondence, and the results are pretty clear - out of the many aggrieved reports I unearthed from users who set up their sites in good faith, expecting them to be indexed, not one reports actually getting their Google Site indexed. Given the pre-eminence of the 'buy a domain' message throughout Google's write-ups for these sites, the two real-world options are pretty clear: either buy a domain from Google, or your site will stay off the Google index.

In my view, from a PR point of view, Google would do better to be up-front about this when offering 'free' webspace....it's very unlikely that anyone who has been caught out by spending time and effort on a site they expected to be indexed by the host will ever actually buy anything from Google in future. Perhaps more significantly, they're also likely to bad-mouth the organisation whenever the opportunity arises (much easier to do, now that social media are so widely available...and followed). Bad experiences tend to stick in the memory, and 'big tech' organisations are prime targets for adverse publicity nowadays...

5.10.22: Wonder of wonders....GSC is now reporting that the site is indexed - and it only took them 1 year and 4 months! However, don't get too excited (like I just did). A closer inspection reveals that only the Home page has been indexed - none of the others. More to the point, although the date of the crawl was in mid-September, a search on Google today for a specific string I previously embedded on the home page reveals....nothing.

The question of course now is - why ? So apparently random are Google's search manifestations that I suspect we will never know.......

Version date 5.10.22

Specific search key: covvivwebindexlimits099