As long as there has been written knowledge, there’s been the challenge of quickly accessing it to answer questions. We’re all familiar with the famous Library of Alexandria, but less well known is its librarian and early knowledge scientist, Callimachus of Cyrene. He developed classification techniques that would allow visitors to find the right scroll among the library’s enormous collection. Enlightenment-era polymath Denis Diderot extended this approach into the modern-era with his revolutionary Encyclopedia. And H.G Wells’s Permanent Encyclopedia almost anticipates Google keyword search! As we’ll see, while Google’s algorithms are impressive, they are built on ideas that evolved over 2000 years.
You do remember libraries, right? Those are the places where legacy printed books are stored, organized, and made available to the public for free. Before Google, if you had to answer a question, you’d visit your local library and search “the stacks” on your own. If that was too inconvenient, some libraries, most famously the New York City Public Library, had a “human Google” (or Research Desk) where anyone could call up with a query, and a librarian would do the grunt work for you. Some still do.
Libraries were the original search engines, and they have their own interesting history that spreads out over millennia. While advances in library technology were slow, they have deep connections with modern search engines, notably Google.
Knowledge today is spread out across the Web, but there was a time when all the answers could be found within the home library of one famous Greek philosopher. And a ‘search’ involved scanning through a single collection of paper scrolls!
Wealthy citizens of Athens circa 4th century B.C.E. could afford their own private libraries made up of stacks of papyrus scrolls. One prominent citizen, Aristotle, had a library containing scrolls not only of his own considerable works but also other leading Athenians on topics covering science, rhetoric, and poetry. When Aristotle had to refer to his library, he could grab the right papyrus scroll based on his own prodigious memory or, like the rest of us, could simply scan until he found what he wanted. He may even have had his own system for organizing scrolls, and it was said by later Roman writers that Aristotle taught the Egyptian pharaohs how to manage their libraries.
The Library of Alexandria was the ancient world’s greatest achievement in knowledge technology. After Alexander’s death, this Egyptian city was ruled by a line of Greek pharaohs with vast intellectual ambitions. Alexandria was also a cosmopolitan trading super-power, and Ptolemy I started building something like a university with places for scholars to lecture and study. A public library was an important part of their goals. By the 3rd century B.C.E., Ptolemaic rulers began collecting scrolls – even commandeering them from visiting trading ships – and then copying and translating them into Greek, the language of the city. At its peak, the library contained hundreds of thousands of scrolls and attracted Greek scholars, mathematicians, and scientists who came there to study.
While there are stories that Aristotle’s original library was acquired by Ptolemy II, more likely the ruler purchased copies of Aristotle’s works, along with plays by Euripides, Aeschylus, and Sophocles to stock this new center of learning.
The Alexandria library lasted well into the Roman era, and there are various stories about when and how it was destroyed. But likely a fire was involved, and certainly by the 4th-century C.E. it was no more. A great loss? Of course. But scholars have suggested that the work lost was mostly copied, and classic Greek literature lived on in smaller libraries and collections. One incredible innovation of written words is that they can be transferred the old-fashioned way using professional scribes, and so knowledge at the time was “backed-up,” to use a modern term, to other places.
A greater mystery is how the scrolls in the collection were organized and categorized so that the document could be quickly found. While a small personal library, like Aristotle’s, could be easily navigated, the enormous Alexandrian library would need an ancient version of a technological fix.
Enter Callimachus of Cyrene, a poet and super-star librarian. He may not have been the head librarian at Alexandria – this role existed! – but was likely involved in library administration. In fact, we know of his epic “Tables of Those Who have Distinguished Themselves in Every Form of Culture and What They Wrote,” which was essentially a giant inventory of the Alexandria library, but with an important twist.
Callimachus of Cyrene (Image Credit: Wikimedia Commons)
Callimachus improved on the idea of a straightforward list or table – pinakes in Greek – of the library’s scroll collection. Though Callimachus’s tables have been lost to us, we know from others commentators that he broke down topic categories into various subcategories. Perhaps he looked at non-fiction scrolls and decided they were about either history or science, and then divided science into geometry and astronomy. Ultimately, there was a list of relevant scrolls corresponding to the classification: Callimachus invented a primitive keyword search based on a subject index of the library’s enormous scroll collection!
The pinakes also directed the searcher to a physical location in the library, where the scroll of interest (and possibly related scrolls) could be found and read. As you might suspect, Callimachus’s tables spanned many scrolls. But by using it, a visitor knowing just a few topic areas could narrow down a search to a more human-sized list of relevant scrolls.
His key insight, one central to 20th-century information science, was to use a “divide and conquer” solution to organizing massive amounts of content: Take a larger problem, and break it up into more manageable subsets. In the 21st century, Google manages Internet search in a similar way.
As you might imagine, once you had the scroll in your hands it wasn’t necessarily easy to then find a relevant passage or section. As with a long internet HTML-base page, you had to ‘scroll’ through the document, but back then it was a physical process – and not an easy one at that.
Starting around the 1st-century C.E., there was a shift from the scroll to a codex (Latin for block of wood), which would be resemble what we’d call a pamphlet. The earliest codices were folded rectangular sheets of papyrus – you probably made a codex as an elementary school project by folding colored construction paper – and were available on the streets of Rome. Their convenient portability made them very popular.
A few hundred years later, monasteries became centers of classical learning, and codices completely replaced scrolls. The collections of these monastic libraries were many times smaller than the great Library at Alexandria, and likely the system of organization was simpler – probably something closer to an inventory with some location information directing the reader to a section of the library. However, there were innovations at the book level.
The Great Library of Alexandria (Image Credit: Raphaëlle Deslandes)
A table of contents was nothing new – even the older Greek and Latin scrolls had something of a micro-inventory of contents – but with information now on discrete pages, the reader could be directed to a section of the book based on – you guessed it – a page number. What an innovation!
Before the printing press, books were scarce and expensive. Medieval church libraries and early public libraries chained their books to shelves so as to prevent theft. Atlas Obscura has a wonderful collection of photos showing chained codices from libraries in the U.K., Italy, and the Netherlands.
If you want to get a feel for these medieval manuscripts – both older versions that were copied from scrolls along with best sellers of the time – take a look at the Digital Medieval Manuscripts app, which lets you view images of codexes from libraries all over Europe.
By the time of the Renaissance, the pieces were in place for knowledge to become more widely available. The convenient codex form coupled with the invention of the printing press meant that books could be published quickly and less expensively – you didn’t need a room of scribes to copy and painstakingly inscribe each word.
Libraries began to spring up outside of the Church and could be found in the secular universities, say the Sorbonne in Paris or Oxford’s Boedlian library. But classification of this burst of new knowledge was still based on techniques literally from the Middle Ages. Libraries continued to rely on an inventory approach to their collections, perhaps organizing the catalogues alphabetically by author.
It took two Enlightenment-era French polymaths, Denis Diderot and Jean-Baptiste D’Alembert, to initiate a more formal approach to grappling with the new knowledge-verse. Their ambitious project was to create a single reference source for all knowledge of the time – not only covering long standing Classical-era and medieval wisdom, but also Enlightenment-era advances and the books that it generated. A mini-library of Alexandria, so to speak.
Diderot's Knowledge Map in English (Image Credit: Univ. of Michigan Library)
But first Diderot and D’Alembert had to address the question of what was worth knowing. Their answer was a novel map of human knowledge (above) that guided the contents of their Encyclopedia, or Reasoned Dictionary of the Arts, Sciences, and Trades.
Realizing this enormous task required an army of authors, the two enlisted the help of almost 150 experts, with Voltaire and Rousseau being two of the more notable contributors. However, D’Alembert and Diderot wrote the bulk of the essays. Finally completed in 1780, the project had taken over two decades, eventually spanning 35 volumes. In the end, Diderot had taken all of Western knowledge – or at least high-level descriptions of it – and compressed it to fit into a long bookshelf.
Diderot’s encyclopedia project led him to speculate on artificial intelligence (or AI). He famously quipped that a mind “is a book that reads itself.” In 2005, Google co-founder Larry Page similarly said Google scanned books from university libraries into the search engine because eventually they’d be read by an AI!
One might have thought their knowledge map would have organized the actual encyclopedia into separate topical areas. It was instead ordered alphabetically, like a dictionary. However, once you have a topic keywords, perhaps Philosophy, you’d flip to the P-section and read the entry. The entries also included keywords from the knowledge map, so you knew where the thing being defined fit into the hierarchy.
With Diderot, an educated non-specialist could save a trip to the library and the legwork involved in exploring a library’s stacks. It was a solution to the era’s information overload problem, explaining knowledge in a way the public could understand. However, on a macro-level, there still remained the problem of helping the public physically access and find knowledge within a growing system of public libraries.
Library as search engine: Dewey numbers were like keywords (Image Credit: Manchester Central Library)
Library technology eventually caught up with the hyper-growth of knowledge triggered by the Enlightenment. In the 19th century, academic libraries were typically organized into a pre-Diderot layout of history, philosophy, and poetry stacks. Toward the end of that century, a university librarian named Melville Dewey came up with a solution that still lives today.
Great minds think alike, and Dewey devised a divide-and-conquer approach to classification that would be familiar to Callimachus and Diderot. Dewey’s particular genius was in assigning a block of numbers to different topic areas. His ten knowledge categorizations covered a range from 0 to 1000, in groups of one hundred. (Learning about the Dewey decimal system was a rite of passage for mid-20th century adolescents, similar to kids today memorizing shortcut keys on their cell phones.)
The library stacks also mirrored the Dewey system so that one could find the physical location of a book based on its number. Using this type of classification, visitors only needed the general topic areas – essentially keywords – to zero in on relevant books.
Dewey left open a spot from 0 to 100 for “miscellaneous” and that’s where computer information sciences was placed – it’s in the range 004 to 006. Artificial intelligence is a sub-classification under “special computing methods,” 006, and gets a .3 suffix. So 006.3 is the Dewey number for AI.
It took a visionary early 20th-century science fiction writer, H.G. Wells of War of the Worlds fame, to speculate how a library system might be further improved with more technology. Wells believed the then-new medium of microfilm could be used to efficiently store the massive content required to make all human knowledge available to the public, perhaps even within their own homes.
H. G. Wells (Photo by George Charles Beresford)
Wells’s World Brain project was clearly influenced by Diderot, whom he mentions by name in his talks and essays on this subject. However, unlike Diderot, Wells thought to include detailed content for specialists in science, technology, and other disciplines, as well as more general overviews.
Interestingly, Wells also foresaw the necessity of an active team of human curators to constantly index new information. Indeed the project, which he called the Permanent World Encyclopedia, does start sounding like a low-tech version of Google, or better yet, Wikipedia, which relies heavily on human curators and indexers.
We now know Wells’s vision required a technology boost that he couldn’t possibly have anticipated when he was writing in the 1930s – computers, silicon chips, advanced software and algorithms, laptops, broadband, WiFi, and more. But is the Google search algorithm a realization of Wells’s idea of a Permanent World Encyclopedia?
Not really! Google does make astonishing amounts of “content” available, but its search engine had to accommodate a practical matter that wasn’t an issue for Wells in the early 20th century. With zettabytes – that’s 10 to the 21st power – of professional and general books, articles, online lecture notes, news, analysis, videos, and audio, there isn’t a single resource or even a small number of references that would match a search for say “Aristotle philosophy.” (Try it!)
Instead, Google devised a clever, highly mathematical algorithm – known as PageRank – to order the search results in the order of what it thinks is most relevant. Its ranking system is something of a popularity contest using the “wisdom of the crowd” to find the truth; content that is more popular based on link references generally receives a higher rank.
Google makes no claim the entries on the search page correspond to Knowledge, though it does try to account for what it calls “authority” in ordering the listings – and at least claims it relies on humans to make quality judgments. Overall, though, it is not an encyclopedia curated by experts.
However, it does achieve Wells’s dream of universal access to information. And it also pulls off the not inconsiderable trick of going from a keyword to a specific book based on whether the book has the actual word. Sure, the Dewey system will get you to the right book, but Google effectively takes you to the actual page of the book!
Let’s end our long story with an abbreviated Diderot-like overview of how the Google search engine performs its technological magic.
And the answer is that Goggle uses tricks somewhat similar to … Callimachus’s table from two millennia ago!
The key point is this: When Google searches documents for keywords, it is not scanning the web pages, pdfs, Word documents, or books while you’re waiting for the results. Instead it is processing a special table of index words, which Google is constantly updating – an automated version of Wells’s human indexers.
The index is similar to the inverted index found at the end of any good non-fiction reference work: You look up a word or name in the index that you remember reading to find all the pages where the reference is found. Google does the same but instead points you to the URL,while also recording the exact place on the page of the keyword. Of course, the Google inverted index is not based on scanning a single book but all the contents found on the Internet!
The computer science behind Google search can be understood even by people who are not technically inclined, and if you’re interested in how its algorithm divides and conquers this enormous problem, Google provides this resource.
I suspect Callimachus would have understood the explanation as well!
Most of us associate the name Johannes Gutenberg with the invention of the printing press, but the first models appeared hundreds of years before, in China. Gutenberg improved upon the earlier techniques, but his efforts were not without challenges.