Note: Parts of this page require fonts for the Assamese script. The fonts and installation instructions can be found here.
Finding Assamese content on the internet -- can search engines eliminate the haystack?
[Originally as this google document]
Popular internet search engines such as Google and Bing are not effective for finding content in the Assamese language. The current situation is essentially equivalent to ‘looking for a needle in a haystack’. Search results for most Assamese terms list content in the Bengali language first, with no Assamese content until several pages into the results. This is a critical problem for the 30 million people spread across Assam, the rest of India, and elsewhere in the world, who use the language regularly and need a solution urgently.
Parallels can be drawn to a language like Dutch, which has a user population comparable in size to that of Assamese. The Dutch language shares a character code chart with other popular languages like English, French, and Spanish. These character code charts are used to represent written text digitally in most of the world’s languages, and are managed by the Unicode Consortium, a global computing standards organization. Search engines produce results by matching the character codes of a search term against the character codes stored in a web-page. Therefore, when a character code chart is shared by multiple languages, search results list web-pages written in all the languages that share the chart, unless the user is given some way to limit the results to one of the languages. Fortunately, for Dutch and its related languages, search engines do provide a way for users to limit results to their preferred language. For Assamese, by contrast, search engines do not currently provide any such option. Users have to painstakingly sift through the combined results in all languages that share the character code chart -- Assamese, Bengali, and Bishnupriya Manipuri -- and manually skip over web-pages in the other languages.
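The character-code matching described above can be illustrated with a short Python sketch. It is only an illustration, not how any particular search engine is implemented. The helper names are hypothetical, and the sample word কবিতা (poetry) is an assumption chosen because it is spelled identically in Assamese and Bengali.

```python
import unicodedata

# The Unicode block shared by Assamese, Bengali, and Bishnupriya Manipuri
# spans U+0980 to U+09FF (named "Bengali" in the Unicode charts).
SHARED_BLOCK = range(0x0980, 0x0A00)


def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


def in_shared_block(text):
    """True if every non-space character comes from the shared block."""
    return all(ord(ch) in SHARED_BLOCK for ch in text if not ch.isspace())


# Assumed sample word: "poetry", written with escapes so the code points
# are unambiguous. The same five codes are stored on Assamese and Bengali pages.
POETRY = "\u0995\u09ac\u09bf\u09a4\u09be"

for code_point, name in describe(POETRY):
    print(code_point, name)

# The two "ra" letters, in contrast, have different codes:
ASSAMESE_RA = "\u09f0"  # the letter used in Assamese
BENGALI_RA = "\u09b0"   # the letter used in Bengali
print(describe(ASSAMESE_RA), describe(BENGALI_RA))
```

Because a search engine compares codes rather than languages, a query made only of shared codes cannot by itself distinguish an Assamese page from a Bengali one.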
To solve this problem, it can be tempting to try to find a way to get Assamese web-pages listed earlier in the combined results. But that will not work, for two reasons. First, the automated algorithms used by internet search engines do not allow for such reordering. For example, Google uses the number of links from other web-pages as a key factor in computing the PageRank of web-pages, a technique named after one of Google’s founders, Larry Page. These ranks are used to determine the importance of search results -- a higher-ranked web-page is listed before a lower-ranked one. The lower current volume of Assamese web-pages, compared to web-pages written in Bengali, causes Assamese web-pages to have fewer links on average, resulting in lower PageRanks on average. These lower ranks cause most Assamese web-pages to be listed late in search results. Bing uses a similar approach with a similar result. The second reason is that any method that ranks Assamese web-pages earlier in the results would make it harder for Bengali users to find their content. Thus, filtering out the Bengali web-pages by limiting the results to the Assamese language is the only sensible solution.
The specifics of the solution to limit results to the language are simple, well-understood, and standards-based: (1) while crawling web-pages, search engines need to recognize values representing Assamese, Bengali, and Bishnupriya Manipuri in the standard tags that web-pages use to identify the natural language of their content (such as the Content-Language meta-tag in HTML), and (2) they need to allow users to specify one of these languages to limit search results (by including the languages as supported values for a 'Search Language' setting). These changes should not be hard to make, since the technique has already been implemented for languages like Dutch, as mentioned earlier.
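The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a description of any search engine's crawler. The function and class names and the example.org pages are hypothetical. The language codes are assumed to be the ISO 639 codes "as" (Assamese), "bn" (Bengali), and "bpy" (Bishnupriya Manipuri). Besides the Content-Language meta-tag named above, the sketch also reads the lang attribute of the html element, which is an additional assumption.

```python
from html.parser import HTMLParser

# Assumed ISO 639 codes for the three languages that share the script.
SCRIPT_SHARING_LANGUAGES = {"as": "Assamese", "bn": "Bengali",
                            "bpy": "Bishnupriya Manipuri"}


class LanguageTagParser(HTMLParser):
    """Step 1: record the language a page declares for itself."""

    def __init__(self):
        super().__init__()
        self.meta_lang = None   # from <meta http-equiv="Content-Language">
        self.html_lang = None   # from <html lang="...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attrs.get("lang")
        elif tag == "meta" and self.meta_lang is None:
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.meta_lang = attrs.get("content")


def declared_language(html_text):
    """Return the primary language code a page declares, or None."""
    parser = LanguageTagParser()
    parser.feed(html_text)
    value = parser.meta_lang or parser.html_lang
    if not value:
        return None
    # "bn-BD" -> "bn"; "as, en" -> "as"
    return value.split(",")[0].strip().lower().split("-")[0]


def limit_to_language(pages, language):
    """Step 2: keep only pages whose declared language matches the setting."""
    return [url for url, html_text in pages.items()
            if declared_language(html_text) == language]


# Hypothetical crawled pages, all containing the same shared-script word.
pages = {
    "https://example.org/a": '<html><head><meta http-equiv="Content-Language" '
                             'content="as"></head><body>\u0995\u09ac\u09bf\u09a4\u09be</body></html>',
    "https://example.org/b": '<html lang="bn-BD"><body>\u0995\u09ac\u09bf\u09a4\u09be</body></html>',
    "https://example.org/c": '<html><body>\u0995\u09ac\u09bf\u09a4\u09be</body></html>',
}

print(limit_to_language(pages, "as"))
```

Note that the third page declares no language at all, so it is dropped by every language filter. A real implementation would have to decide how to treat such pages.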
Until such support is added, millions of Assamese users will continue to look for ad-hoc methods to ‘shrink the haystack’. One such method is presented with this article (see inset). However, it is not intuitive for people to use and has limitations that will miss some relevant web-pages. It is the author’s hope that the development teams of search engines like Google and Bing will eliminate the need for such workarounds by urgently taking on the work to implement the standard solution for Assamese and its related languages.
Even though the Assamese language is not supported in Google and other popular search engines, there is a way to search for Assamese content using Google and possibly other search engines (this has been tested to work on Bing as well). By taking advantage of the encoding of unique Assamese characters in Unicode, search results can be limited to Assamese even without official support. Here is how. Type in the Assamese search term in Google as usual, and then add the search term ‘ৰ’ (the Assamese ra) at the end.
The additional search term prevents Google from listing most web-pages in Bengali and Bishnupriya Manipuri, while still showing most (not all; see the limitations noted below) Assamese web-pages that match the main search term. This is because the letter ‘ৰ’ is unique to the Assamese language and is unlikely to be present in web-pages written in the other languages that share the script. Further, ‘ৰ’ is used very frequently in Assamese writing, making it likely that most Assamese pages will contain it.
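The workaround and the reasoning behind it can be sketched in Python. The helper names are hypothetical, the Google search URL pattern is an assumption about how a query is passed in the address bar, and looks_assamese is only a rough stand-in for the effect the extra term has on results, not something the search engine actually runs.

```python
from urllib.parse import quote_plus

ASSAMESE_RA = "\u09f0"  # the Assamese ra, code point U+09F0


def assamese_query(term):
    """Append the Assamese ra as an extra search term."""
    return f"{term} {ASSAMESE_RA}"


def search_url(term, base="https://www.google.com/search?q="):
    """Build a search URL for the combined query (assumed URL pattern)."""
    return base + quote_plus(assamese_query(term))


def looks_assamese(page_text):
    """Rough stand-in for the filter: does the page contain U+09F0 at all?"""
    return ASSAMESE_RA in page_text


# Assumed sample term: "poetry", spelled the same in Assamese and Bengali.
print(search_url("\u0995\u09ac\u09bf\u09a4\u09be"))
```

A page that happens not to contain the letter is missed by this test, just as it is missed by the search itself.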
Here is an example. To search for Assamese pages about poetry, the screenshot below shows the search needed. Without the ‘ৰ’, the search results from Google, which run to many pages, would be mostly, if not entirely, Bengali-language sites about poetry. But with ‘ৰ’ added, the search results will list only Assamese-language sites about poetry.
Limitations