Daniel Brandt argues that Google is dying: its index is failing to keep up with the growth of the web. And he thinks he knows why: Google hit the 4,294,967,296 limit on 4-byte ID numbers in C. (Perhaps coincidentally, perhaps not, Google claims to index more than 4 billion web pages.) If this is true, fixing it isn't trivial when you need to fix a large number of machines working in parallel.
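For scale, 4,294,967,296 is just 2^32, the number of distinct values a 4-byte unsigned integer can hold. Here is a minimal C sketch (nothing to do with Google's actual code, just an illustration) of what hitting that ceiling would look like if docIDs came from a plain 32-bit counter:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Toy illustration only: a docID stored as an unsigned 32-bit integer can
 * name at most 2^32 = 4,294,967,296 distinct documents.  A crawler that
 * hands out IDs sequentially runs dry at exactly that count. */
int main(void) {
    uint32_t next_docid = UINT32_MAX;   /* the very last ID left in a 4-byte space */

    printf("largest 4-byte docID: %" PRIu32 "\n", next_docid);

    next_docid++;                       /* unsigned overflow wraps around to 0... */
    printf("after one more allocation: %" PRIu32 "\n", next_docid);
    /* ...so the "new" page would collide with docID 0 unless the allocator
     * checks for exhaustion before handing out the value. */
    return 0;
}
```

Whether anything like this describes what is happening inside Google is, of course, pure speculation.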
On sites with more than a few thousand pages, Google is not indexing anywhere from ten percent to seventy percent of the pages it knows about. These pages show up in Google's main index as a listing of the URL, which means that the Googlebot is aware of the page. But they do not show up as an indexed page. When the page is listed but not indexed, the only way to find it in a search is if your search terms hit on words in the URL itself. Even if they do hit, these listed pages rank so poorly compared to indexed pages that they are almost invisible. This is true even though the listed pages still retain their usual PageRank.
…this is a problem that I first noticed in April 2003. That was the month when Google underwent a massive upheaval, which I describe in my "Google is broken" essay. When that essay was written two months after the upheaval, it would have been speculative to claim that the listed-URL phenomenon was a symptom of the 4-byte docID problem described in the essay. It was too soon. But sixteen months later, the URL listings are beginning to look very widespread and very suspicious. It's a major fault in Google's index, it is getting worse, and it is much more than a mere temporary glitch.
Google is dying. It broke sixteen months ago and hasn't been fixed. It looks to me as if pages that have been noted by the crawler cannot be indexed until some other indexed page gives up its docID number. Now that Google is a public company, stockholders and analysts should require that Google give a full accounting of its indexing problems, and what it is doing to fix the situation.
If it turns out that Google is missing huge quantities of stuff, there will be a lot of angry IPO buyers. And I will have to change my one-stop-search habits.
Pingback: The Importance of...
There are any number of explanations for this behavior outside of the 4-byte hypothesis. Google does not treat every page equally. Never has. I would put forward the idea that if you have a site with thousands of pages, Google uses a site-specific PageRank technique to figure out what to index completely. Of course, since it’s a “trade secret,” I doubt anyone outside of Google will ever really know why.
It is now official – Daniel Brandt has confirmed: Google is dying
Yet another crippling bombshell hit the beleaguered Google community when Daniel Brandt recently confirmed that Google indexes only between 10% and 70% of all web pages. Coming on the heels of the latest Daniel Brandt survey, which plainly states that Google has lost more market share, this news serves to reinforce what we’ve known all along. Google is collapsing in complete disarray, as fittingly exemplified by failing dead last [samag.com] in the recent Sys Admin comprehensive networking test.
You don’t need to be a Kreskin [amazingkreskin.com] to predict Google’s future. The handwriting is on the wall: Google faces a bleak future. In fact there won’t be any future at all for Google because Google is dying. Things are looking very bad for Google. As many of us are already aware, Google continues to lose market share. Red ink flows like a river of blood. Froogle is the most endangered of them all, having lost 93% of its core developers.
Let’s keep to the facts and look at the numbers.
Froogle leader Theo states that there are 7000 users of Froogle. How many users of GoogleNews are there? Let’s see. The number of Froogle versus GoogleNews posts on Usenet is roughly in ratio of 5 to 1. Therefore there are about 7000/5 = 1400 GoogleNews users. BSD/OS posts on Usenet are about half of the volume of GoogleNews posts. Therefore there are about 700 users of BSD/OS. A recent article put Froogle at about 80 percent of the Google market. Therefore there are (7000+1400+700)*4 = 36400 Froogle users. This is consistent with the number of Froogle Usenet posts.
Due to the troubles of Walnut Creek, abysmal sales and so on, Froogle went out of business and was taken over by MSN who sell another troubled search engine. Now Google is also dead, its corpse turned over to yet another charnel house.
All major surveys show that Google has steadily declined in market share. Google is very sick and its long term survival prospects are very dim. If Google is to survive at all it will be among search engine hobbyist dabblers. Google continues to decay. Nothing short of a miracle could save it at this point in time. For all practical purposes, Google is dead.
Fact: Google is dead
Please pardon me for that moldy old troll, I just couldn’t help myself. And I do expect to see that crapflood on /. when they post the story.
Not sure if this is relevant at all, but Google site searches have been incredibly glitchy the last few days. I use my blog as a storage tool, and used to be able to retrieve info. with Google – but it’s not working now. Hmmm.
I dunno, Google seems to do quite a good job of indexing the pages of my rather obscure blog – usually within 48 hours of my posting a new entry.
That’s just the thing…we know that Google does not treat all pages equally, and we are doing black-box analysis while they keep tweaking what’s inside the box. We don’t know when they tweak it or how they tweak it. I don’t see how anyone can claim to have verified anything.
I think the 4-byte explanation is not very believable. Google was designed to handle huge quantities of data; the volume of data already requires 64-bit indexing, and it would not be a huge problem to extend the index size over a couple of system upgrades.
Issues like the Y2K bug are hard because they are deeply embedded. The Google data store is not deeply embedded in that sense. The source code is all available; it does not depend on frame placement.
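For what it is worth, widening such an ID is conceptually just a record rewrite. A toy sketch, with a record layout I invented purely for illustration:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Toy sketch of a rolling docID widening: read records keyed by a 4-byte ID
 * and rewrite them keyed by an 8-byte ID.  The record layout is invented for
 * the example and says nothing about how Google actually stores anything. */

struct old_rec { uint32_t docid; uint32_t payload; };   /* 4-byte ID */
struct new_rec { uint64_t docid; uint32_t payload; };   /* 8-byte ID */

static struct new_rec widen(struct old_rec in) {
    struct new_rec out = { .docid = in.docid, .payload = in.payload };
    return out;
}

int main(void) {
    struct old_rec r = { 4000000000u, 42 };             /* close to the 2^32 ceiling */
    struct new_rec w = widen(r);
    printf("docid %" PRIu64 ", payload %" PRIu32 "\n", w.docid, w.payload);
    return 0;
}
```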
Google has never indexed the entire Web. The real question is what is deliberately left out. If it is merely variations on the various home-page wannabes, then it is better left out.
What I would like is a bit more control over the search results. In particular, if a page asks to set itself as my home page, then I can be absolutely certain that it does not have useful information. I use an RSS viewer for most browsing these days, which means it is easy to leave JavaScript turned off and only bring up pages where I want to enable JavaScript in IE.
I’ve been trying to sort out why Google went from indexing my obscure blog, to not listing me at all, to listing my page, then back to indexing it – all in the space of a month. The only change in that time was the CSS. Just checked: I’m back to being listed and not indexed. And yes, it’s active, and it’s being hosted by Google’s own Blogger. Surely a sign all is not well with Google?
Joe, isn’t it a bit odd to gauge the health of Google as a whole based on how it indexes your site? Google is always tweaking things to a) improve search results and b) foil PageRank whores. Just because you’re hosted on blogger doesn’t give you a free pass into their index, otherwise it would surely be exploited by the ‘option b’ people. Google search results are hard to analyze from the outside because the mere act of getting the result may change the result 😉
Maybe it’s a coincidence, but in the last week Google seems to have stopped picking up anything but my home page – no other pages on my site – from searches that used to raise every page…
Brandt doesn’t really inspire confidence by making an obvious misreading of the paper he cites in his June 2003 explanation of why Google is supposedly broken.
He says it talks about a 4-byte document ID when in fact it talks about a 4-byte ID for the search term in question (so Google could theoretically run out of space for search terms once there are more than 4 billion “words” to index on). The “integer identifiers” in the paper that seem to correspond to Brandt’s idea of a document ID are described as follows: “The term ids are stored in a variable-length format consisting of a two-bit field giving the length L of the integer plus L bits giving the actual integer. The two-bit length field is actually an index into a table of bit counts stored in the header section of the database file.” So there are four different possible values of docID length, and common sense suggests that the biggest value ain’t 4.
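To make that encoding concrete, here is a rough C sketch of how one such term id could be unpacked; the bit-count table and the MSB-first bit order are my own guesses for illustration, not anything the paper specifies:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Rough sketch of the described term-id encoding: a 2-bit index selects one
 * of four bit counts from a header table, and that many bits follow as the
 * actual term id.  The table values and the MSB-first bit order are
 * assumptions made up for this example. */

/* Read nbits bits, MSB-first, starting at bit offset *pos. */
static uint32_t read_bits(const uint8_t *buf, unsigned *pos, unsigned nbits) {
    uint32_t v = 0;
    for (unsigned i = 0; i < nbits; i++, (*pos)++)
        v = (v << 1) | ((buf[*pos / 8] >> (7 - *pos % 8)) & 1u);
    return v;
}

int main(void) {
    const unsigned bit_counts[4] = { 8, 12, 16, 24 };       /* assumed header table */

    /* One hand-packed term id: length index 0 (-> 8 bits) followed by 0x5A. */
    const uint8_t buf[4] = { 0x16, 0x80, 0x00, 0x00 };

    unsigned pos = 0;
    unsigned idx = read_bits(buf, &pos, 2);                 /* 2-bit length index */
    uint32_t term = read_bits(buf, &pos, bit_counts[idx]);  /* then L bits of id  */

    printf("term id 0x%" PRIX32 " stored in %u bits\n", term, bit_counts[idx]);
    return 0;
}
```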
There are lots of other reasons why Google might be “broken”, including the obvious one that their definition of “working fine” has to meet somebody’s definition of broken somewhere. This pseudonumeric hooha doesn’t seem to be in the running.
Google sucks. It shows a current entry but then doesn’t list an archive number, so anyone thinking they’re getting a certain report gets the one of the day and the subjects run the gamut. I get bummed when this happens with other sites.
I’ve been on the net long enough to witness the birth of Google and now its last dying gasps.
pat: I agree some could take advantage of how Google indexes pages; so long as Google loves the anchor tag and the keywords inside, there will always be that problem … BUT … You would think if you search for http://mywebsite.com you would get a direct hit (at least a listing), no? But half the time I don’t get even that. I’m not an anchor tag junkie, I’m not out to googlebomb or heck even try to get a googlewhack, and I don’t expect top page ranks (I just run a blog and, like anyone out there, I check to see what terms I get hits on. For me it’s a curiosity, not an ego thing), but to go from getting indexed when searching for the URL of my blog to NOTHING, back to listed, then indexed again, that to me is weird. My blog has only been around a month or so and is updated daily. Whereas my business site, which really hasn’t been updated much since it first went live, is indexed by Google and never stopped being indexed.
The reason why I mentioned Blogger is that a lot of the Blogger sites do get indexed by Google (and stay indexed) and curiously some don’t. I’m talking about URL searches – no keywords. I find it odd that they wouldn’t crawl their own services and at least list a URL – but that’s me. Sorry, I should have explained it better; I don’t expect it, but when it was gone I did a “hmmm” and scratched my head.
Joe: It would probably make sense if you could see inside the black box that is Google. We know that they run a huge back-end system with different clusters here and there to allow for better performance depending on where you are. There is a good chance that you are seeing results come back from different clusters. We don’t know exactly how the clusters talk to the index, how many indexes there are, how often they are updated, how long it takes to update, when they tweak the search algorithm, et cetera…
All I’m saying is that while it makes sense to an individual to judge Google based on how their own site shows up in search results, it doesn’t necessarily reflect on the “health” of Google as a whole, weird as it may be…we’re just staring at puppet shadows on a cave wall at this point.
If Google is dying, the effect on kids who want to find things on the internet would be tremendous! THIS ISN’T GOOD
Paul doesn’t inspire confidence when he misinterprets that paper. It says in the paper:
___________
“The main section of a Term Vector Database file is the term vectors section, which contains the actual term vector data. This section is a sequence of fixed-length (128-byte) records in the following format:
* The page id of the vector is stored in the first 4-bytes of the vector’s record.
* The terms in the vector are stored as a sequence of integer identifiers. The term ids are stored in a variable-length format consisting of a two-bit field giving the length L of the integer plus L bits giving the actual integer. The two-bit length field is actually an index into a table of bit counts stored in the header section of the database file.”
[and so forth…]
___________
The “docID” I’m talking about is the same as the “page id” in this paper. It is not the “terms in the vector” as you claim. That’s different. I’m not even interested in the “terms in the vector.”
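To be concrete about what that record looks like, here is a guess at the layout in C. Only the sizes (a 128-byte record with a 4-byte page id up front) come from the quoted text; the field names and the flat byte array for the packed term ids are mine:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* A guess at the fixed 128-byte term-vector record, based only on the quoted
 * description: a 4-byte page id up front, then the rest of the record filled
 * with bit-packed term ids.  Field names and layout details are assumptions. */
struct term_vector_record {
    uint32_t page_id;            /* the 4-byte "page id" (docID): at most 2^32 values */
    uint8_t  packed_terms[124];  /* variable-length term ids, bit-packed              */
};

int main(void) {
    printf("record size: %zu bytes\n", sizeof(struct term_vector_record));
    printf("largest representable page id: %" PRIu32 "\n", UINT32_MAX);
    return 0;
}
```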
But let’s talk about the Brin and Page paper, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” since that one is more directly relevant to Google’s architecture. I quote from the paper that you mention only to make the point that the docID is compressed to the greatest extent possible.
In the Brin and Page paper, they say, “Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page.”
That’s the docID. Every unique URL has its own docID.
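As a toy illustration of that assignment scheme (the lookup table and every name here are made up; the only detail taken from the paper is that each new URL gets the next 4-byte docID):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy illustration of the sentence quoted above: each new URL parsed out of a
 * page gets the next docID.  The lookup table, its size, and every name here
 * are made up; the only borrowed detail is the 4-byte ID. */

#define MAX_URLS 8                 /* tiny table, just for the demo */

static const char *urls[MAX_URLS];
static uint32_t    ids[MAX_URLS];
static uint32_t    next_docid = 0;
static int         n_urls = 0;

/* Return 0 and the docID for a URL, assigning a fresh one on first sight;
 * return -1 if the ID space (or the toy table) is exhausted. */
static int docid_for(const char *url, uint32_t *out) {
    for (int i = 0; i < n_urls; i++)
        if (strcmp(urls[i], url) == 0) { *out = ids[i]; return 0; }

    if (next_docid == UINT32_MAX || n_urls == MAX_URLS)
        return -1;

    urls[n_urls] = url;
    ids[n_urls]  = next_docid++;
    *out = ids[n_urls];
    n_urls++;
    return 0;
}

int main(void) {
    uint32_t id;
    const char *pages[] = { "http://example.com/", "http://example.com/a",
                            "http://example.com/" };
    for (int i = 0; i < 3; i++)
        if (docid_for(pages[i], &id) == 0)
            printf("%s -> docID %" PRIu32 "\n", pages[i], id);
    return 0;
}
```

With only 2^32 values available, an allocator along these lines eventually has to refuse new URLs or start reusing IDs, which is the sort of failure mode being speculated about above.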