What Is Latent Semantic Indexing

Latent Semantic Indexing

LSI changed the way that search engines provide results to those who are searching for information on a given topic or theme. Put simply, LSI gives search engines the ability to provide their users with a more relevant list of options to choose from by running a series of smart algorithms over web pages. These algorithms are used in conjunction with semantic analysis to provide meaningful search results to queries.


09-Sep 19, 2006

To try and help you understand LSI better (and more importantly how it affects your business), we need to take a very quick, high-level look at search engine history.

Basic Search History:

Before the world wide web became what it is today, there were text-based bulletin board servers scattered across the globe. The ‘internet’ was a realm dominated by geeks, nerds and academics. In short, it was difficult to get the information you wanted unless you knew where to look. Internet connections were very slow and there were no pictures on webpages.

Around 1993 came the Mosaic Browser which allowed ‘inline’ images to be displayed on the same page as text. From that moment on, the growth of the ‘worldwideweb’ was astronomical. Combined with advances in technology, the average man on the street could have and use a computer in his home.

With more people having the ability to create web pages, the web mushroomed, but trying to locate information was still a chore unless you were in the know.

Although there were ‘search programmes’, they were in their infancy and were struggling to keep up with cataloguing the thousands of new pages that were being added to the new ‘web of sites’ each month. It wasn’t until 1998, when Google started making its mark in the world, that search was forever changed.

Love ’em or hate ’em, we owe a lot to Google because if it weren’t for company founders Larry Page and Sergey Brin we ‘might’ still be in info darkness. I doubt it but it never hurts to give credit where credit is due.

Why? Because the major players of the day took the attitude of ‘Our users don’t really care about search – what we give them is good enough and we’re making money out of that’. Google Inc’s attitude was different and they soon became the benchmark to beat when it came to search.

In the late ’90s, the business model of the search engines of the day (including Google) was ‘If a website is there, we want to know about it, spider it and offer its content to our users’. This was great because there weren’t ‘that’ many website portals focusing on a topic, and being able to offer ANY result for a search was better than offering none.

Fast-forward to the mid noughties – 06-07 and things changed dramatically. We started suffering from information overload.

There are millions of websites, billions of webpages and let’s be honest, search engine spammers have ruined things for the average business owner. The average user wants to find useful and relevant information on the topic or product they are searching for. What are they starting to get? Page upon page of mindless, poorly written garbage surrounded by adverts. Or we could put it another way. Pages of useless inappropriate adverts with useless content.

It’s almost as if there has been a battle of epic proportions between search engine marketers and search engine users. Yahoo, Google, MSN, AOL, everyone – users worldwide are affected.

The result of this information overload was that search engine users were beginning to distrust the search results, and the search engine companies were at risk of decreased profits. The more familiar a user becomes with something, the more they expect from it. This wasn’t good news for the search engine companies’ business (or profits) and things needed to change. One of those changes is the testing, introduction and ongoing improvement of LSI technology.

OK. History Lesson Over:

Now that the history overview is behind you, we can look ahead and begin to plan your web site structure around LSI and Schema.org markup. To do this you need to understand a little about LSI.

I don’t ‘want’ to get technical but … you need to go through ‘some’ of it now so you can ‘begin’ to understand it, because if you can’t understand what the search engines are ‘looking for’, you will be fighting an uphill battle in the search engine listing wars. You will be letting money go to waste or worse, giving it on a plate to your competitors.

LSI Basics:

High Level: LSI involves (don’t get scared – it is easy to grasp) statistical probability and trying to work out the semantic distance (or similarity) between words and/or phrases in relation to a known topic.

In English: LSI software tries to understand the relationships between certain words in a paragraph, and then between the paragraphs in the document. Taken further, LSI will look for the relationships between the pages and your web site theme(s). Ultimately latent semantic indexing will become a part of the process that defines your website in relation to an overall topic within its search base.
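To make ‘semantic distance’ a little less abstract, here is a small Python sketch. It is only an illustration under my own assumptions: the paragraphs are made up, and it uses plain cosine similarity on word counts, which is a building block that LSI starts from rather than LSI itself. Each word is described by which paragraphs it appears in, and two words count as ‘close’ when they tend to appear in the same paragraphs.

```python
import math
from collections import Counter

# Hypothetical mini-corpus: each string stands in for one paragraph.
PARAGRAPHS = [
    "dog puppy training tips",
    "puppy dog breed guide",
    "canine dog breed training",
    "router network internet cable",
    "internet router network setup",
]

def term_vector(term, paragraphs):
    """Describe a term by how often it occurs in each paragraph."""
    return [Counter(p.split())[term] for p in paragraphs]

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity(term_a, term_b, paragraphs=PARAGRAPHS):
    """Similarity of two terms based on the paragraphs they share."""
    return cosine_similarity(term_vector(term_a, paragraphs),
                             term_vector(term_b, paragraphs))
```

With these made-up paragraphs, similarity("dog", "puppy") comes out well above similarity("dog", "router"), simply because ‘dog’ and ‘puppy’ keep turning up together and ‘dog’ and ‘router’ never do.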

So how does it affect you?

Search engine companies employing LSI algorithms are not only studying a document for keywords, they are also studying your documents and learning to recognize and identify the words that are common between these documents. By doing this the search engine databases are indexing the ‘semantic relationship’ between your documents to discover which pages are ‘related’ or are ‘closely relevant’ to an overall context or theme.

The same technology can also then be used to tag your content as ‘too focused’, ‘not diverse enough’, ‘too repetitive’. Either way it might not be considered good enough to be served to a potential client.

Let’s consider an example here.

Let’s think about a website based on the word ‘Dogs’.

Semantically related to the word ‘dogs’ would be words such as ‘canine’ ‘puppies’ ‘puppy’ ‘dog’ ‘doggy’ – and others.

NOT semantically related to the word ‘dog’ but ‘similar’ would be phrases like ‘puppy fat’ or ‘canine teeth’, because these phrases ‘could’ also ‘relate’ to ‘weight or child health issues’ and dentists. Remember that in general, any computer programme is logical. It normally expects a ‘yes’ or ‘no’ response: a perfect match or not.

In effect, what LSI technology is trying to do for search technology is add ‘… but more similar to …’ and ‘most like …’ or ‘complements …’ into the results that get displayed. It’s almost like it is aiming to provide an automated ‘human touch’.

How does it work?

An LSI algorithm scans the document it is working on for other ‘expected’ words or phrases. This allows it to make the assumption that ‘the page’ is probably about ‘dogs’ because it may also mention a ‘breed of dog’ or ‘dog training’.

LSI then takes this a step further by analysing the whole website that the page is a part of as well, like a high-level overview.

You may well have one page that just has the words ‘canine’ ‘puppies’ ‘dog’ within it. That page ‘could’ be about other things but … because the other pages in that site have references to ‘breeds of dog’ or ‘dog training tips’, the LSI algorithm is happy to classify your page under the wider theme of ‘dog’.

The LSI algorithm (unlike Schema.org or semantic markup) doesn’t understand anything about the meaning of a word in a document. It just reads through the patterns and usage of particular words and calculates ‘word relationships’ to an overall theme.
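Here is a rough Python sketch of how ‘word relationships’ can fall out of usage patterns alone. Treat it as a toy under my own assumptions, not as how any search engine actually does it. The pages are made up, and the function names are mine. The maths is the textbook LSI idea (a truncated singular value decomposition of the word-by-page counts), done the slow way with power iteration on the page-by-page overlap matrix so that it runs without any extra libraries.

```python
import math
from collections import Counter

# Made-up pages for illustration. Pages 2 and 3 share no words at all.
PAGES = [
    "dog breed training",
    "dog puppies canine",
    "canine puppies",
    "dog training tips",
    "router network internet",
    "router network setup",
]

def gram_matrix(pages):
    """Page-by-page word overlap: A^T A for a term-by-page count matrix A."""
    counts = [Counter(page.split()) for page in pages]
    return [[sum(ci[w] * cj[w] for w in ci) for cj in counts] for ci in counts]

def top_eigenpairs(matrix, k, iterations=300):
    """Largest k eigenpairs of a symmetric matrix (power iteration + deflation)."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [1.0] * n
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        pairs.append((value, vec))
        # Deflate so the next round finds the next-strongest pattern.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

def latent_coordinates(pages, k=2):
    """Place each page in a k-dimensional 'theme' space (singular value * vector)."""
    pairs = top_eigenpairs(gram_matrix(pages), k)
    return [[math.sqrt(max(value, 0.0)) * vec[j] for value, vec in pairs]
            for j in range(len(pages))]

def cosine(a, b):
    """Cosine similarity between two coordinate lists (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Pages 2 and 3 (‘canine puppies’ and ‘dog training tips’) have no words in common, so a keyword comparison scores them as unrelated. After calling latent_coordinates(PAGES), the cosine between those two pages is very close to 1, because the other pages tie ‘canine’ and ‘puppies’ to ‘dog’, while the router pages end up on a separate axis. At no point does the code know what a dog is. It only counts patterns.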

Latent Semantic Indexing practicalities and how it could be applied by the search engines.

We need to think of LSI as a form of artificial intelligence. With the number of web pages increasing dramatically on a daily basis, the challenge is for the search engines to give their users an ideal search result.

LSI fits in to the search process by enhancing the search engine’s capabilities.

A conventional search engine that bases its results on ‘keyword only’ analysis may not give the best results. This is because the older search engine programs cannot tell the difference between:

  • Similar words with different meanings, e.g. Dice – Die (singular of dice) – Die (as in dead) – Die (as in mould), or Router (wood shaper) – Router (internet connectivity)
  • Words that are similar in meaning but spelled differently, e.g. sickness – vomiting
  • Singular and plural forms of words, e.g. die/dice, dog/dogs
  • Words with similar roots, such as ‘water’, ‘watered’, ‘watering’, ‘waterings’, ‘waterer’
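The last two points in that list are easy to show in a few lines of Python. This is a deliberately crude sketch: the suffix table and function names are my own invention for illustration, and real stemmers are far more careful. It only shows why an exact ‘keyword only’ match misses pages that a root-aware match would find.

```python
# Hypothetical suffix table: (suffix, minimum length of what must remain).
# Longer suffixes are listed first so 'ings' is tried before 'ing' or 's'.
SUFFIXES = (("ings", 4), ("ing", 4), ("ers", 4), ("er", 4), ("ed", 4), ("s", 3))

def naive_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix, min_stem in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def exact_match(query, text):
    """Keyword-only matching: the query must appear as-is in the text."""
    return query.lower() in text.lower().split()

def stemmed_match(query, text):
    """Match on word roots instead of exact spellings."""
    stems = {naive_stem(word) for word in text.split()}
    return naive_stem(query) in stems
```

With this sketch, exact_match("water", "watering the garden") fails while stemmed_match("water", "watering the garden") succeeds. Irregular pairs such as die/dice are beyond a suffix table like this one, which is part of why statistical approaches are attractive.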

The LSI-enabled search platform is more effective because it does not focus only on a bunch of keywords. The best example of this I have seen is a search for Tiger Woods: the search engine will not simply look for web pages that use the keywords ‘tiger’ and ‘woods’. It will present a collection of pages that are related to the theme of golf. This is what is called relevance feedback, i.e. during the past x months most people who searched for ‘Tiger Woods’ clicked on a link to a ‘golf’ related web site.

This is where the ‘general’ opinion of what LSI is goes astray slightly. Many assume it is an algorithm that is bolted on to a search engine. I think it is better to think of LSI as a ‘concept’, and that word is important to remember. If we mentally tie in the phrase ‘artificial intelligence’ to LSI technology, you should begin to see the importance of it.

People want better search results – so give it to them

The users of search engines want better results and users are human beings. Using the 80/20 rule, it would be safe to assume that 80% of users want good information. They don’t want to waste their time. When you put these factors together the logical assumption should be that search engines need a human touch to make them better. Google have even suggested human intervention, so there can be no doubt that things are changing.

As the search engine spammers thought up more and more ways to fool the search engines and ‘catch’ the unsuspecting internet user, so the user has become more adept at ‘spotting spammy sites’. In fact the users have learnt to be more specific in the search terms that they are now using.

Here is a quick example: If you wanted to buy a new wind turbine to provide alternative home power, the chances are you would do, or have already done, some internet-based research using a search engine.

Searching for ‘buy new wind turbine’ does not tell me what I want, so then I might try ‘new wind turbine for my house’ or ‘new wind turbine installation house’. Usually, you will find that the overall number of search results reduces as the number of keywords searched for increases. Then it’s just a case of improving the quality.
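The shrinking result count is easy to reproduce with a toy keyword index in Python. The pages below are invented, and the sketch assumes the simplest possible rule: a page is returned only if it contains every word in the query. Real search engines are far more forgiving than that, so take it as an illustration of the principle only.

```python
from collections import defaultdict

# Invented pages for illustration only.
PAGES = {
    "page1": "buy new wind turbine online",
    "page2": "new wind turbine for my house installation guide",
    "page3": "wind turbine news and industry reports",
    "page4": "new house paint ideas",
}

def build_index(pages):
    """Map each word to the set of pages that contain it (an inverted index)."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the pages containing every query word (logical AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Building the index with build_index(PAGES) and searching for ‘wind turbine’, then ‘new wind turbine’, then ‘new wind turbine house’ returns fewer pages each time, which is exactly the narrowing that searchers have taught themselves to do by hand.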

For years humans have been learning how to refine their own searches on a given topic. It doesn’t take a giant leap of faith or a degree to work out that the search engines have been able to record and dissect all of this free human input. LSI as a concept is ‘giving back improved results’.

LSI focuses on knowing and analysing a document before it gets indexed. Therefore, LSI optimised pages are more archive-friendly and can point towards content that may be relevant but not directly covered within the document. Think of it as a kind of automated grading system.

The key point is simple. Search engines want to be able to provide better, more accurate search results for their customers. LSI is one of the technologies that is being employed to meet this aim. LSI has the power to filter out ‘ineffective’ and unwanted information. If you don’t want to have your business filtered out or overlooked, you need to work in harmony with latent semantic technology and not try to fool or beat it.

SEO is an ever-evolving process and I’m certain that it will change again in the future. For now, learning about, following and implementing simple LSI-orientated optimisation procedures will pay dividends in increased long-term traffic and profits.

How To Build LSI into your ecommerce website design.

I personally am lucky enough to have found a dedicated group of technologists who really do make a lot of sense. More important are the real-world examples of successful sites that have been employing the knowledge and ethos. The results are in. It works.

There are two highly relevant ‘arms’ to the process that I follow within the community. The first is learning about the design, structure and planning of the ultimate LSI compliant web site framework and the other is the research into the relationships between keywords, keyphrases and theme density.

It is quite a lot to get your head round but the community is supportive and attacking this from all sides. LSI is not something that you should rush into. Part of the secret is in the planning and having to learn the basics again will allow you time to get it right.

Even now this community is well ahead of the game. The software I’m using blew my mind apart and forced me to take a different approach. I’m getting results I never dreamed of. Keyword research has taken on a whole new meaning for me. What used to take days or even weeks can now be done in hours. What’s more is that I’m having fun doing it.

What you are about to learn is not an ‘easy fix’ or a ‘quick solution’ and unless you are really prepared to put in the time it takes to do this properly, I suggest that you do not just click on the links below.

Likewise, if you are looking for a free solution to your problems, this is not for you. Wait until a cheap alternative comes along. However, I am fairly sure there won’t be a free alternative until it’s too late.

For those of you who are prepared to learn about something new and worthwhile, something that will, without doubt or reservation, help you significantly from now on with your search engine placement: you need to be involved.

Here is the link to the community that will put you at least two or three steps ahead of the competition when you implement what you learn.

The ThemeZoom project. Getting involved here will save you hours or days. There is stuff going on with this software which is right at the forefront of getting the best from understanding how the web works and how to pull all of the pieces of the puzzle together.

When I first found these folk they were offering a ‘day rate’ ticket to try it out. ThemeZoom is such a powerful research tool that many ‘day trippers’ gave up on it. I think that was because it gave back more information than they knew what to do with. Rather than clog up the resources they stopped that facility.

You can still get a three day pass to use the Krakken software but I realised that this concept is bigger than big. You could rush it. You can play with it, but why would you?

Can I also recommend that you listen to and watch all of the tutorials that are available at ThemeZoom. Learn as much as you can. You will be more than happy.

Good luck in designing and building your latent semantically optimised website. The time you spend learning about and implementing what you discover has the potential to bring real and strategic longer term rewards and profits. If you don’t want to do it yourself, give us a call on 01787 311514 or drop us an email. We can do it for you.