Misplaced Pages:Search engine test: Difference between revisions


Revision as of 16:28, 12 January 2008 by TimidGuy (→Common search engines: add Google News archives search as an example for News and media)		Revision as of 17:21, 12 January 2008 by Jonathan de Boyne Pollard (Have a Frequently Given Answer. This page was understating the problem in several places, and downright wrong in one.)
Line 5:		Line 5:
	* On Misplaced Pages, neutrality trumps popularity. }}		* On Misplaced Pages, neutrality trumps popularity. }}

	]s allow users to examine ]s on the ], which in turn allows checking of when and how certain expressions are used. This is helpful in identifying ], ~~establishing~~ ], checking ~~], and discussing ] for different things (including articles).~~		]s allow users to examine ]s on the ], articles in scholarly journals, (some) news articles, and (parts of some) published books, which in turn allows checking of when and how certain expressions are used. This is helpful in identifying ], ], and discussing ] for different things (including articles). The web pages, articles, and books found may, if they are actually relevant to the topic, be of use in determining ].

	This page documents how to use search tools to best advantage, and covers useful search tools, examples/tutorial, pitfalls and traps to avoid, and common biases and limitations.		This page documents how to use search tools to best advantage, and covers useful search tools, examples/tutorial, pitfalls and traps to avoid, and common biases and limitations.

		⚫	: ''Common search engines include: ] () (including , , and , and ), ] (), ] (The Wayback Machine, ), and ] ()''. Specialist search engines exist for ], ], ] and ] amongst others.

⚫	: ''Common search engines include: ] () (including , , and ), ] (), ] (The Wayback Machine, ), and ] ()''. Specialist search engines exist for ], ], ] and ] amongst others.

	:''This page uses the Google search engine for its examples, but similar principles apply to most others.''		:''This page uses the Google search engine for its examples, but similar principles apply to most others.''
Line 92:		Line 91:

	===Notability===		===Notability===
			Raw hit counts do not, in fact, measure anything at all. Search engines do not in fact give correct hit counts, as scientific researchers trying to use them as research tools have been disappointed to discover. They are estimated using only word frequency tables and the words in the queries, are usually wrong, and can sometimes be ''egregiously'' wrong. These problems apply to Google, Yahoo!, Windows Live, and other search engines, although they are more well known when it comes to Google. Even the total results figures given on the last pages of search results are usually capped or otherwise restricted, and are subject to considerations such as ]; ]; people, places, and things that share the same name; intervening punctuation that the search engine ignores; and the like (see ]). As such, hit counts should not be taken to be a measure of either notability or non-notability — or of anything else for that matter.<ref name=FGA>{{cite web\|
	Raw hit count is a very crude measure of importance. Some unimportant subjects have many "hits", some notable ones have few or none, for reasons discussed further down this page.
			title=Google result counts are a meaningless metric.\|
			url=http://homepages.tesco.net./~J.deBoynePollard/FGA/google-result-counts-are-a-meaningless-metric.html\|
			work=Frequently Given Answers\|
			date=]\|
			author=Jonathan de Boyne Pollard}}</ref>

			] defines notability in terms of significant coverage by reliable, independent, sources. The way to demonstrate notability using a search engine is to employ the search engine merely ''as a tool'' for ''searching for'' those sources. Searching for stuff is, after all, what a search engine is for. But one doesn't know whether one has actually found it until one has read it. The results pages and hit counts given by the search engine don't of themselves tell one whether the web pages, scholarly articles, news articles, or books constitute reliable, independent, sources with significant coverage of a topic. But they do help to ''find'' such things, which one can then investigate by going and reading them.
	Hit count numbers alone can only rarely "prove" anything about ], without further discussion of the type of hits, what's been searched for, how it was searched, and what interpretation to give the results. On the other hand, examining the ''types'' of hit arising (or their lack) often ''does'' provide useful information related to notability.

			If one finds a number of reliable, independent, sources that turn out, upon reading, to have significant coverage, that can be used to demonstrate notability. Conversely, if one doesn't find ''any'' such sources, that, too, provides useful information related to notability.

	==Using search engines==		==Using search engines==
Line 273:		Line 279:
	===Specific uses of search engines in Misplaced Pages===		===Specific uses of search engines in Misplaced Pages===
	:* Google Groups or other date-stamped media, can help establish the timing and context of early references to a word or phrase.		:* Google Groups or other date-stamped media, can help establish the timing and context of early references to a word or phrase.

	:* Google News can help assess whether something is newsworthy. Google News used to be less susceptible to manipulation by self-promoters, but with the advent of pseudo-news sites designed to collect ad revenues or to promote specific agendas, this test is often no more reliable than the others in areas of popular interest, and indexes many "news" sources that reflect specific points of view. The news archive goes back many years but may not be free beyond a limited period.		:* Google News can help assess whether something is newsworthy. Google News used to be less susceptible to manipulation by self-promoters, but with the advent of pseudo-news sites designed to collect ad revenues or to promote specific agendas, this test is often no more reliable than the others in areas of popular interest, and indexes many "news" sources that reflect specific points of view. The news archive goes back many years but may not be free beyond a limited period.

	:* Google Book Search has a pattern of coverage that is in closer accord with traditional encyclopedia content than the Web, taken as a whole, is; if it has systemic bias, it is a very different systemic bias from Google Web searches. Multiple hits on an exact phrase in Google Book Search provide convincing evidence for the real use of the phrase or concept. Google Book Search can locate print-published testimony to the importance of a person, event, or concept. It can also be used to replace an unsourced "common knowledge" fact with a print-sourced version of the same fact.		:* Google Book Search has a pattern of coverage that is in closer accord with traditional encyclopedia content than the Web, taken as a whole, is; if it has systemic bias, it is a very different systemic bias from Google Web searches. Multiple hits on an exact phrase in Google Book Search provide convincing evidence for the real use of the phrase or concept. Google Book Search can locate print-published testimony to the importance of a person, event, or concept. It can also be used to replace an unsourced "common knowledge" fact with a print-sourced version of the same fact.

	:* Topics alleged to be notable by popular reference can have the type of reference, and popularity, checked. An alleged notable issue that only has a few hundred references on the internet may not be very notable; truly popular ]s can have millions or even tens of millions of references. However note that in some areas, a notable subject may have very few references; for example one might only expect a handful of references to some ] matter, and some matters will not be reflected online at all.

	:* Topics alleged to be genuine can be checked to test if they are referenced by reliable independent sources; a good test for hoaxes and the like.		:* Topics alleged to be genuine can be checked to test if they are referenced by reliable independent sources; a good test for hoaxes and the like.

	:* Copyright violations from websites can often be identified (as described above).		:* Copyright violations from websites can often be identified (as described above).

	:* Alternative spellings and usages can have their relative frequencies checked (eg, for a debate which is the more common of two equally neutral and acceptable terms).		:* Alternative spellings and usages can have their relative frequencies checked (eg, for a debate which is the more common of two equally neutral and acceptable terms).

	:* Google Groups (] newsgroups) is a significantly different sample from websites, and represents, ''for the most part,'' conversations in English conducted by people on various topics. Because the sources are very different, hit numbers are not comparable, however Group searches are particularly helpful in identifying matters which might be discussed, or whose presence may have been artificially inflated by promotional techniques; it is suspicious if a phrase gets, say, 100,000 Web hits but only 10 Groups hits.		:* Google Groups (] newsgroups) is a significantly different sample from websites, and represents, ''for the most part,'' conversations in English conducted by people on various topics. Because the sources are very different, hit numbers are not comparable, however Group searches are particularly helpful in identifying matters which might be discussed, or whose presence may have been artificially inflated by promotional techniques; it is suspicious if a phrase gets, say, 100,000 Web hits but only 10 Groups hits.

Line 298:		Line 296:

	===General===		===General===
	A raw hit count should never be relied upon to prove notability, ~~without~~ ~~attention~~ paid to the ~~sources~~, or ~~types~~ of ~~reference~~, or whether ~~this~~ actually ''~~does~~'' ~~evidence~~ notability or non-notability, case by case. It ~~has~~ always been, and very likely always will remain, an extremely ~~inconsistent~~ tool for measuring notability, and should be considered ~~as one part of evidence, and never~~ definitive or conclusive ~~alone~~. A manageable sample of ~~sites~~ found should be opened individually, to actually verify ~~the~~ relevance ~~of the reported pages~~.		A raw hit count should never be relied upon to prove notability. Attention should instead be paid to what (the books, news articles, scholarly articles, and web pages) is found, and whether they actually ''do'' demonstrate notability or non-notability, case by case. Hit counts have always been, and very likely always will remain, an extremely erroneous tool for measuring notability, and should not be considered either definitive or conclusive. A manageable sample of results found should be opened individually and read, to actually verify their relevance.

	Other useful considerations in interpreting results are:		Other useful considerations in interpreting results are:

	:* Article scope: If narrow, fewer references are required. Try to categorize the point of view, whether it is NPOV, or other; e.g., notice the difference between ] and ].		:* Article scope: If narrow, fewer references are required. Try to categorize the point of view, whether it is NPOV, or other; e.g., notice the difference between ] and ].

	:* Article subject: If it's about some historical person, one or two mentions in reliable texts might be enough; if it's some Internet ] or a ], it may be on 700 pages and might still not be considered 'existing' enough to show any notability, for Misplaced Pages's purposes.		:* Article subject: If it's about some historical person, one or two mentions in reliable texts might be enough; if it's some Internet ] or a ], it may be on 700 pages and might still not be considered 'existing' enough to show any notability, for Misplaced Pages's purposes.

Line 345:		Line 341:

	===Google unique page count issues===		===Google unique page count issues===
	Note also, that the number of hits reported by search engines ~~may be~~ only an estimate. For example, Google ~~may~~ only ~~calculate~~ the actual number of hits when the user finally navigates through all pages, to the last ~~page of the results~~, ~~since~~ ~~it's only~~ then ~~that~~ ~~Google~~ ~~applies~~ ~~all~~ ~~criteria~~ ~~to a query (such as eliminating duplicate and spam control)~~. At times, the hit count can be significantly ~~cut~~ (by a ~~factor~~ of 10 or ~~more~~) ~~when~~ the ~~list~~ ~~is fully accessed. A site-specific search may help determine if most~~ of ~~the~~ ~~hits are coming from~~ the ~~same~~ ~~web~~ ~~site;~~ ~~a single web site can account for hundreds of thousands of hits.~~		Note also, that the number of hits reported by search engines is only an estimate. For example, Google only calculates the actual number of hits when the user finally navigates through all result pages, to the last one, and even then it places restrictions on the figure. At times, the hit count estimate can be significantly different (by one or more ]) to the total count of results on the last page.<ref name=FGA />

			A site-specific search may help determine if most of the hits are coming from the same web site; a single web site can account for hundreds of thousands of hits.
⚫	For search terms that return many results, Google uses a process that eliminates results which are "very similar" to other results listed, both by disregarding pages with substantially similar content and by limiting the number of pages that can be returned from any given domain. For example, a search on "Taco Bell" will only give a couple pages from tacobell.com even though many in that domain will certainly match. Further, Google's list of unique results is constructed by first selecting the top 1000 results and then eliminating duplicates without replacements. Hence the list of unique results will always contain fewer than 1000 results regardless of how many webpages actually matched the search terms. For example, from the about 742 million pages related to "Microsoft", Google presently returns 552 "unique" results (as of ~~Jan 9,~~ 2006). Caution must be used in judging the relative importance of websites yielding well over 1000 search results.

		⚫	For search terms that return many results, Google uses a process that eliminates results which are "very similar" to other results listed, both by disregarding pages with substantially similar content and by limiting the number of pages that can be returned from any given domain. For example, a search on "Taco Bell" will only give a couple pages from tacobell.com even though many in that domain will certainly match. Further, Google's list of unique results is constructed by first selecting the top 1000 results and then eliminating duplicates without replacements. Hence the list of unique results will always contain fewer than 1000 results regardless of how many webpages actually matched the search terms. For example, from the about 742 million pages related to "Microsoft", Google presently returns 552 "unique" results (as of ]). Caution must be used in judging the relative importance of websites yielding well over 1000 search results.

	==Search engine limitations - technical notes==		==Search engine limitations - technical notes==
	Many, probably most, of the publicly available web pages in existence are not indexed. Each search engine captures a different percentage of the total. Nobody can tell exactly what portion is captured.		Many, probably most, of the publicly available web pages in existence are not indexed. Each search engine captures a different percentage of the total. Nobody can tell exactly what portion is captured.

	The estimated size of the ] is at least 11.5 billion pages , but a much ], estimated at over 3 trillion pages,{{cite needed}} exists within databases whose contents the search engines do not index. These ]s are formatted by a Web server when a user requests them and as such cannot be indexed by conventional search engines. The ] website is an example; although a search engine can find its main page, one can only search its database of individual patents by entering queries into the site itself. <~~!--Sources: More,~~ Alvin and Brian H. Murray~~. "~~Sizing the Internet." Cyveillance~~-->~~ ~~<!--include link to Google~~.~~com/patents?-->~~		The estimated size of the ] is at least 11.5 billion pages<ref>{{cite paper\|
			url=http://www.cs.uiowa.edu/~asignori/web-size/\|
			title=The Indexable Web is more than 11.5 billion pages\|
			author=Antonio Gulli and Alessio Signorini\|
			date=]\|
			publisher=}}</ref>, but a much ], estimated at over 3 trillion pages,{{cite needed}} exists within databases whose contents the search engines do not index. These ]s are formatted by a Web server when a user requests them and as such cannot be indexed by conventional search engines. The ] website is an example; although a search engine can find its main page, one can only search its database of individual patents by entering queries into the site itself.<ref>{{cite paper\|
			author=Alvin More and Brian H. Murray\|
			title=Sizing the Internet\|
			publisher=Cyveillance Inc.\|
			date=2000\|
			url=\|
			format=PDF}}</ref> <!--include link to Google.com/patents?-->

	Google, as all search engines should, follows the ] and can be by sites that do not wish their content to be indexed or cached by Google. Sites that contain large amounts of copyrighted content (Image galleries, subscription newspapers, webcomics, movies, video, help desks), usually involving membership, will block Google and other search engines. Other sites may also block Google due to the stress or bandwidth concerns on the server hosting the content.		Google, as all search engines should, follows the ] and can be by sites that do not wish their content to be indexed or cached by Google. Sites that contain large amounts of copyrighted content (Image galleries, subscription newspapers, webcomics, movies, video, help desks), usually involving membership, will block Google and other search engines. Other sites may also block Google due to the stress or bandwidth concerns on the server hosting the content.
Line 363:		Line 372:

	Forums, membership-only and subscription-only sites (since Googlebot does not sign up for site access) and sites that cycle their content are not cached or indexed by any search engine. With more sites moving to AJAX/Web 2.0 designs, this limitation will become more prevalent as search engines only simulate following the links on a web page. AJAX page setups (like Google maps) dynamically return data based on realtime manipulation of javascript.		Forums, membership-only and subscription-only sites (since Googlebot does not sign up for site access) and sites that cycle their content are not cached or indexed by any search engine. With more sites moving to AJAX/Web 2.0 designs, this limitation will become more prevalent as search engines only simulate following the links on a web page. AJAX page setups (like Google maps) dynamically return data based on realtime manipulation of javascript.

			== References ==
			<references />

	== Further reading ==		== Further reading ==

Revision as of 17:21, 12 January 2008


This page in a nutshell:

Measuring is easy. What's hard is knowing what it is you're measuring and what your measurement shows.
Or put simply: Search engines are sophisticated research tools, but they are often biased and their results need to be interpreted. The bias can be worked around, but you need to know what you're doing.
On Misplaced Pages, neutrality trumps popularity.

Search engines allow users to examine web pages on the Internet, articles in scholarly journals, (some) news articles, and (parts of some) published books, which in turn allows checking of when and how certain expressions are used. This is helpful in identifying sources, checking facts, and discussing what names to use for different things (including articles). The web pages, articles, and books found may, if they are actually relevant to the topic, be of use in determining notability.

This page documents how to use search tools to best advantage, and covers useful search tools, examples/tutorial, pitfalls and traps to avoid, and common biases and limitations.

Common search engines include Google (including newsgroups, news, scholar, and books), Alexa, Archive.org (The Wayback Machine), and Yahoo!. Specialist search engines exist for medicine, science, news and law, amongst others.

This page uses the Google search engine for its examples, but similar principles apply to most others.

Search engine tests

Uses of search engine tests

A test using a search engine is intended to help with the following research questions:

Popularity - Identifying how popular (or how little-known) something is (often called the "Google test")
Usage - Identifying how and where a term is commonly being used, and by whom
Genuine or hoax - Identifying if something is genuine or a hoax (or spurious, unencyclopedic)
Notability - Confirming whether something is covered by independent sources or just within its own circles
Reliable sources - Identifying what sources (and websites) may exist for something
More information - Unearthing of notable facts and citations which can be used in articles.
Names and terminology - Identifying the names used for things (including alternative names and terminology)
Copyright status checks - Identifying whether text is a direct (or near) copy of material on some web page, and (sometimes) identifying copyright holders and licensing status.

Depending on the subject matter, and how carefully it is used, a search engine test can be very effective and helpful, or produce misleading or non-useful results. In most cases, a search engine test is a first-pass heuristic or "rule of thumb".

Common search engines

Type	Examples
General search engines	Google, Yahoo, etc
Website popularity indexes	Alexa, Hitwise
General information	About.com
Professional research indexes	Medline (medical), science, law, Google Scholar
News and media	Google News archives search
Historical archives of web pages	Archive.org, Google cache (how web pages looked and their contents, at different times or if deleted)
Books and historical literature	Project Gutenberg, Google Books, Amazon.com and a9.com (for book info)

Google Groups (Usenet) and some other sources are date-stamped and have been archived for over twenty years, making them useful as a historical record.

What a search test can do, and what it can't

A search engine can list pages and text which others have placed on the internet.

Search engines can:

Provide information, and leads to pages, that assist with the above goals
Confirm "who's reported to have said what" according to sources (useful for neutral citing)
Often provide full cited copies of source documents
Confirm roughly how popularly referenced an expression is
Search more specifically within certain websites, or for combined and alternative phrases (or excluding certain words and phrases that would otherwise confuse the results).

Search engines cannot:

Guarantee the results are reliable or "true" (search engines index whatever text people choose to put online, true or false).
Guarantee why something is mentioned a lot, and that it isn't due to marketing, reposting as an internet meme, spamming, or self-promotion, rather than importance.
Guarantee that the results reflect the uses you mean, rather than other uses. (E.g., a search for a specific John Smith may pick up many "John Smiths" who aren't the one meant and many pages containing "John" and "Smith" separately, and may also miss all the useful references indexed under "John M. Smith" or "John Michael Smith".)
Guarantee you aren't missing crucial references through choice of search expression.
Guarantee that little mentioned or unmentioned items are automatically unimportant.

and search engines often will not:

Provide the latest research in depth to the same extent as journals and books, for rapidly developing subjects.
Be neutral.

A search engine test cannot help you avoid the work of interpreting your results and deciding what they really show. Appearance in an index alone is not usually proof of anything.

Search engine tests and Misplaced Pages policies

Verifiability

Search engine tests may return results that are fictitious, biased, hoaxes or the like. It is important to consider whether the information used derives from reliable sources before depending upon it. Less reliable sources may be unhelpful, or need their status and basis clarified, so that other readers gain a neutral and informed understanding to judge how much reliance to place upon them.

Neutrality

Google (like other search systems) does not have a neutral point of view. Misplaced Pages does. Google indexes self-created pages and media pages which do not have a neutrality policy. Misplaced Pages has a neutrality policy that is mandatory and applies to all articles, and all article-related editorial activity.

As such, Google is specifically not a source of neutral titles -- only of popular ones. Neutrality is mandatory on Misplaced Pages (including deciding what things are called) even if not elsewhere, and specifically, neutrality trumps popularity.

(See WP:NPOV#Neutrality and Verifiability for information on balancing the policies on verifiability and neutrality, and WP:NPOV#Article naming on how articles should be named)

Notability

Raw hit counts do not, in fact, measure anything at all. Search engines do not give correct hit counts, as scientific researchers trying to use them as research tools have been disappointed to discover. The counts are estimated using only word frequency tables and the words in the queries, are usually wrong, and can sometimes be egregiously wrong. These problems apply to Google, Yahoo!, Windows Live, and other search engines, although they are better known when it comes to Google. Even the total results figures given on the last pages of search results are usually capped or otherwise restricted, and are subject to considerations such as stemming; polysemic words; people, places, and things that share the same name; intervening punctuation that the search engine ignores; and the like (see below). As such, hit counts should not be taken to be a measure of either notability or non-notability, or of anything else for that matter.

Misplaced Pages:Notability defines notability in terms of significant coverage by reliable, independent, sources. The way to demonstrate notability using a search engine is to employ the search engine merely as a tool for searching for those sources. Searching for stuff is, after all, what a search engine is for. But one doesn't know whether one has actually found it until one has read it. The results pages and hit counts given by the search engine don't of themselves tell one whether the web pages, scholarly articles, news articles, or books constitute reliable, independent, sources with significant coverage of a topic. But they do help to find such things, which one can then investigate by going and reading them.

If one finds a number of reliable, independent, sources that turn out, upon reading, to have significant coverage, that can be used to demonstrate notability. Conversely, if one doesn't find any such sources, that, too, provides useful information related to notability.

Using search engines

Search engine expressions (examples and tutorial)

This section covers search expressions for Google web search. Similar approaches will work in many other search engines, and other Google searches, but always read their help pages for further information as search engines' capabilities and operation often differ.

A search engine such as Google offers both a simple and an advanced search. The advanced search makes it easier to enter advanced options that may help your searching. The following collapsible sections cover basic examples and help for using search engines with Misplaced Pages.

Specialized search engines such as medical paper archives have their own specialized search structure not covered here.

Basic searches.

Most searches allow searching for words ('acid'), expressions ('war on terrorism'), and combinations ('war on terror' OR 'war on terrorism'; John AND Smith), as well as excluding certain items (Bush NOT George). An expression is given in "double quote" marks, and expressions can be grouped with parentheses. Expressions are not usually case-sensitive. So the following are all valid texts to search for, on Google:

Search: John Smith

Since this isn't in quotes, Google looks for pages containing all of these terms. It finds all pages that contain "john" and "smith". This will return pages that contain "john smith", "john michael smith" but also pages that contain both terms separately, such as "The secretary, john arnold, and treasurer, mike smith..."

Search: "John Smith"

The name is in double quotes. Google will look for pages containing the exact expression "John Smith", including pages where the two words are merely next to each other across punctuation ("The author was John. Smith was the composer..."). But it won't pick up name variants such as "John M. Smith".
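The difference between the two searches above can be illustrated with a toy matcher. This is only a sketch of the behaviour described here, not of how Google actually works: the function names are hypothetical, and the tokenizer simply lower-cases the text and throws away punctuation, which is an assumption made for illustration.

```python
import re

def tokens(text):
    """Lower-case the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all_terms(query, page):
    """Unquoted search: every query word must appear somewhere on the page."""
    page_words = set(tokens(page))
    return all(term in page_words for term in tokens(query))

def matches_phrase(phrase, page):
    """Quoted search: the query words must appear next to each other, in order."""
    want = tokens(phrase)
    have = tokens(page)
    n = len(want)
    return any(have[i:i + n] == want for i in range(len(have) - n + 1))

separate = "The secretary, john arnold, and treasurer, mike smith"
adjacent = "The author was John. Smith was the composer"
variant = "John M. Smith"

print(matches_all_terms("John Smith", separate))  # words present separately
print(matches_phrase("John Smith", separate))     # but not as a phrase
print(matches_phrase("John Smith", adjacent))     # adjacent across a full stop
print(matches_phrase("John Smith", variant))      # the middle initial breaks the phrase
```

The unquoted test accepts all three pages, while the quoted test accepts only the page where the two words are adjacent.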

Search: "John Smith" OR "John M Smith" OR "John Michael Smith"

Search: "Ahmed Abu-Sayed" OR "Ahmed Abusayed"

Looks for pages with any of these expressions. Note the use of "OR" (which MUST be given in upper case) to find possible alternate spellings when it isn't clear whether or not words are joined by page authors.
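When there are several name variants, the search expression can be assembled mechanically. The following is a minimal sketch; the helper names are hypothetical, and it does no more than quote each variant and join the results with an upper-case OR.

```python
def quote(phrase):
    """Wrap a phrase in double quotes so it is searched as an exact expression."""
    return '"' + phrase.strip() + '"'

def any_of(phrases):
    """Join exact phrases with OR (which must be in upper case)."""
    return " OR ".join(quote(p) for p in phrases)

print(any_of(["John Smith", "John M Smith", "John Michael Smith"]))
print(any_of(["Ahmed Abu-Sayed", "Ahmed Abusayed"]))
```

The printed strings can be pasted into the search box as they are.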

Use of "NOT".

The term "NOT" (in Google represented by "-") means, exclude pages that contain this term. The danger is that pages will be excluded because of a term that actually has nothing to do with the search in hand. NOT always means "AND ALSO, NOT..." in Google. The best use of NOT (or "-" in Google) is in two circumstances:

There is a clear expression or term such that a page containing it will probably NOT be relevant to the meaning you are after.
There are many references and you want to narrow down the search by excluding less likely page suggestions.

Search for a term with a 2nd meaning v1: George Bush NOT President

Search for a term with a 2nd meaning v2: "George Bush" NOT President

Search for a term with a 2nd meaning v3: George Bush NOT President NOT "white house"

You want references to George Bush, but not the one who's the president. Given that the large majority of George Bush references will be about the US president, it makes sense to rule out all pages with that word, or to narrow the search further still, even though some pages may contain both references to non-presidential George Bushes and the word "president".

Two further variations are shown: one looks for the exact expression "George Bush", and one adds a second exclusion to rule out pages with the term "white house".

Narrow down widely used terms: (flavor OR flavour) (quark OR quantum OR physics) -eat -food -drink -cooking -culinary

An example of a more complex search. The author is looking for the term "flavor", in the sense of a property in quantum physics. It is unknown whether this is spelt the US way or the UK way, so the first expression looks for one OR the other. The page must also contain some other word likely to be related to subatomic physics (quark OR quantum OR physics). Last, pages containing words related to food and cooking are explicitly excluded, since many references to "flavor" will be of this kind.
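A search of this shape (groups of alternatives, followed by excluded terms) can be put together with a small helper. This is a sketch under stated assumptions: the function name is hypothetical, exclusions are written with the "-" prefix, and an excluded term containing a space is quoted so that the exclusion applies to the whole expression.

```python
def build_query(groups, excluded=()):
    """Build a query from groups of alternatives plus a list of excluded terms.

    Each group is a list of alternatives joined with OR; a group with more
    than one alternative is wrapped in parentheses.
    """
    parts = []
    for alternatives in groups:
        joined = " OR ".join(alternatives)
        parts.append("(" + joined + ")" if len(alternatives) > 1 else joined)
    for term in excluded:
        # Quote multi-word exclusions so the whole expression is excluded.
        parts.append('-"' + term + '"' if " " in term else "-" + term)
    return " ".join(parts)

print(build_query(
    [["flavor", "flavour"], ["quark", "quantum", "physics"]],
    ["eat", "food", "drink", "cooking", "culinary"],
))
print(build_query([['"George Bush"']], ["President", "white house"]))
```

The first call reproduces the "flavor" search above; the second is the "George Bush" search with both exclusions.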

Advanced searches and copyvio checks.

Google allows all sorts of combinations of words, expressions, OR, NOT, and parentheses, which can be used to make quite detailed searches.

Search: linux (grub OR lilo) (boot OR startup OR "start-up") kernel init process

This search suits a person who wants to write an article on the Linux start-up (or boot) process, but doesn't know where on the net to look for reliable sources.

This search looks for pages that contain references to Linux, references to the two most common boot loaders ("grub" or "lilo"), references to start-up under three common terms that might be used, and other words that hopefully will be commonly related to start-up in Linux.

Copyvio search: ("zytox is the worlds leading producer of widgets" OR "merger with IBM in 1929" OR "exports radar components to over fifty countries") NOT wikipedia

Looks for any of three memorable phrases from a suspected copyright violation, which do not appear on the same page as a reference to "wikipedia".

If this text is copied from a website, a search like this will often help to locate the source.
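A copyvio search string of this shape can be put together mechanically. The following Python sketch is one possible approach, not an established tool: pick_phrases uses a crude heuristic (the longest sentences, truncated to a few words) that is purely an assumption, and both function names are invented for illustration.

```python
import re


def pick_phrases(text, count=3, max_words=8):
    """Heuristic (an assumption): take the `count` longest sentences and
    keep only their first `max_words` words as candidate search phrases."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    longest = sorted(sentences, key=len, reverse=True)[:count]
    return [" ".join(s.split()[:max_words]) for s in longest]


def copyvio_query(phrases, exclude="wikipedia"):
    """Quote each phrase, join the phrases with OR, and exclude one term with '-'."""
    quoted = " OR ".join(f'"{p}"' for p in phrases)
    return f"({quoted}) -{exclude}"


print(copyvio_query([
    "merger with IBM in 1929",
    "exports radar components to over fifty countries",
]))
```

Phrases chosen by hand, as in the example above, will usually be more distinctive than anything a simple heuristic selects.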

Finding vaguely remembered information and unfamiliar terms.

Search for a vaguely known term: biology reproduction cell nucleus chromosome helix

A search for someone who wants to find the name of the molecule that reproduces ("DNA"), and who knows some terms it might be associated with but can't remember the term itself. Associated terms are used to try to find pages that mention it.

Search for a term with unknown spelling: piometra OR pieometra OR pyametra OR pymetra

A search for "Pyometra" by someone who can't remember the spelling. Again, they could equally search using connected terms (Google: bitch womb spay open closed antibiotic - all terms associated with the veterinary condition pyometra). The odds are good that someone else has already spelt it the way you do and that the page has been indexed, and you can look up more information from there.

Search for ambiguous terms: DNA (as in, the cell biology meaning)

An example of a problematic search. The obvious term "DNA" may pull up many unhelpful answers, such as companies whose initials are "D.N.A.". So it is likely that a person who wants to look up this item and doesn't know much already, will have to search like this:

Search DNA - finding that it has many meanings.
Search DNA cell biology spiral - using words commonly associated with that meaning of DNA, to get pages covering that meaning.
Using those pages to find that the correct term is "Deoxyribonucleic acid", sometimes written "Deoxyribo-nucleic acid".
Doing a final search for "Deoxyribonucleic acid" OR "Deoxyribo nucleic acid"

Search: ("she's got" OR "she has") "do right by me" ticket ride lyrics

A search for a song title ("Ticket to Ride"), for a person who knows some phrases and thinks they might know others, including useful words that might help narrow it down.

Searches restricted to news, newsgroups, and other sources.

This section needs expansion.

Specialized options, including searches to include or exclude Misplaced Pages itself.

Google has options to specify web sites to search or not search, and where in the page to search. These can be added to the end of any search and will restrict the locations Google reports matches from. Examples of useful searches, using "(Atom OR Bomb)" as the example text being searched for:

To search like this	Enter a search string like this
Only report pages from websites ending in "en.wikipedia.org", the English Misplaced Pages.	(Atom OR Bomb) site:en.wikipedia.org
Only report pages from websites ending in "wikipedia.org", Misplaced Pages in any language.	(Atom OR Bomb) site:wikipedia.org
Only report pages from websites that do not end with "wikipedia.org", i.e. pages that are NOT on a Misplaced Pages website.	(Atom OR Bomb) -site:wikipedia.org
Avoid pages that mention "Misplaced Pages". (This is a good way to avoid a deluge of results which are all either from Misplaced Pages, or from copies and mirrors of Misplaced Pages articles.)	(Atom OR Bomb) NOT Misplaced Pages
Find pages which link to a particular page, such as Misplaced Pages's Main Page.	link:http://en.wikipedia.org/Main_Page
Specify that the expression must appear in the title of the page.	allintitle: (Atom OR Bomb)
"allintitle" and "site:" (or -site) can be combined, to find pages on a website (or not on the website) with the given expression in a title	allintitle: (Atom NOT bomb) site:en.wikipedia.org

Site inclusion/exclusion is often very useful to get views either from a named website, or from any other websites. For example, it can be used:

To find pages on Microsoft terminology that are not self-published by Microsoft (not ending in microsoft.com),
To find pages that are official US or UK government sources (ending in .gov and .gov.uk respectively),
To find sites from a given country (more likely to end with that country's initials, such as ".fr" for France),
Or to find pages from particular media publishers (e.g., "cnn.com" or "bbc.co.uk").
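Site restrictions like those in the table can also be appended programmatically. The Python sketch below is illustrative only: restricted_query and search_url are invented names, and the base address https://www.google.com/search with a q parameter is assumed here rather than taken from this page.

```python
from urllib.parse import urlencode


def restricted_query(expression, include_site=None, exclude_site=None):
    """Append site: and/or -site: restrictions (no space after the colon)."""
    parts = [expression]
    if include_site:
        parts.append(f"site:{include_site}")
    if exclude_site:
        parts.append(f"-site:{exclude_site}")
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Percent-encode the query into a URL; `base` is an assumed default."""
    return base + "?" + urlencode({"q": query})


query = restricted_query("(Atom OR Bomb)", exclude_site="wikipedia.org")
print(query)              # (Atom OR Bomb) -site:wikipedia.org
print(search_url(query))
```

Keeping the expression and the restriction separate makes it easy to run the same expression against a named site and against everything except that site, then compare what is found.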

Specialized searches work on the same principles and same basic search expressions as the above, but might be used to check in specialized archives, or with unusual options.

This section needs expansion.

Common traps, mistakes and pitfalls.

This section needs expansion.

Specific uses of search engines in Misplaced Pages

Google Groups, or other date-stamped media, can help establish the timing and context of early references to a word or phrase.
Google News can help assess whether something is newsworthy. Google News used to be less susceptible to manipulation by self-promoters, but with the advent of pseudo-news sites designed to collect ad revenues or to promote specific agendas, this test is often no more reliable than the others in areas of popular interest, and indexes many "news" sources that reflect specific points of view. The news archive goes back many years but may not be free beyond a limited period.
Google Book Search has a pattern of coverage that is in closer accord with traditional encyclopedia content than the Web, taken as a whole, is; if it has systemic bias, it is a very different systemic bias from Google Web searches. Multiple hits on an exact phrase in Google Book Search provide convincing evidence for the real use of the phrase or concept. Google Book Search can locate print-published testimony to the importance of a person, event, or concept. It can also be used to replace an unsourced "common knowledge" fact with a print-sourced version of the same fact.
Topics alleged to be genuine can be checked to test if they are referenced by reliable independent sources; a good test for hoaxes and the like.
Copyright violations from websites can often be identified (as described above).
Alternative spellings and usages can have their relative frequencies checked (e.g., for a debate over which is the more common of two equally neutral and acceptable terms).
Google Groups (USENET newsgroups) is a significantly different sample from websites, and represents, for the most part, conversations in English conducted by people on various topics. Because the sources are very different, hit numbers are not comparable. However, Groups searches are particularly helpful in identifying matters which might be discussed, or whose presence may have been artificially inflated by promotional techniques; it is suspicious if a phrase gets, say, 100,000 Web hits but only 10 Groups hits.
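The Web-versus-Groups comparison in the last item amounts to a simple ratio. The Python sketch below only illustrates that arithmetic: the function names are invented, and the threshold of 1000 is an arbitrary assumption, not a figure from any guideline. A high ratio is a prompt to read the actual results, never a verdict.

```python
def web_to_groups_ratio(web_hits, groups_hits):
    """Web hits per Groups hit; a Groups count of zero is treated as one."""
    return web_hits / max(groups_hits, 1)


def looks_promoted(web_hits, groups_hits, threshold=1000):
    """Flag a lopsided ratio. The threshold is an arbitrary, assumed value."""
    return web_to_groups_ratio(web_hits, groups_hits) >= threshold


print(web_to_groups_ratio(100_000, 10))  # 10000.0
print(looks_promoted(100_000, 10))       # True
```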

Specialized search engines

Google Scholar works well for fields that (1) are paper-oriented and (2) have an online presence in all (or nearly all) respected venues. Most papers written by computer scientists will show up, but for less technologically current fields, representation in Google Scholar is less reliable. Even the journal Science only puts articles online back to 1996. Thus, Google Scholar should rarely be used as proof of non-notability.

Medline, now part of PubMed, is the original broadly-based search engine, originating over four decades ago and indexing even earlier papers. Thus, especially in biology and medicine, PubMed "associated articles" is a Google Scholar proxy for older papers with no on-line presence. For example, the journal Stroke puts papers on-line back through the 1970s. For this 1978 paper, Google Scholar lists 100 citing articles, while PubMed lists 89 associated articles.

There are a large number of law libraries online, in many countries, including: Library of Congress, Library of Congress (THOMAS), Indiana Supreme Court, FindLaw (USA); Kent University Law Library and sources (UK).

Interpreting results

General

A raw hit count should never be relied upon to prove notability. Attention should instead be paid to what is found (the books, news articles, scholarly articles, and web pages), and whether it actually demonstrates notability or non-notability, case by case. Hit counts have always been, and very likely always will remain, an extremely error-prone tool for measuring notability, and should not be considered either definitive or conclusive. A manageable sample of the results found should be opened individually and read, to actually verify their relevance.

Other useful considerations in interpreting results are:

Article scope: If narrow, fewer references are required. Try to categorize the point of view, whether it is NPOV, or other; e.g., notice the difference between Ontology and Ontology (computer science).
Article subject: If it's about some historical person, one or two mentions in reliable texts might be enough; if it's some Internet neologism or a pop song, it may be on 700 pages and might still not be considered 'existing' enough to show any notability, for Misplaced Pages's purposes.

Biases to be aware of

In most cases search results should be reviewed with a careful and aware skepticism before relying upon them. Common biases include:

General biases

General (the internet or people as a whole)

Personal bias - Tendency to be slightly more receptive to beliefs that one is familiar with, holds, or that are common in one's daily culture, and also to be more doubtful about beliefs and views that contradict one's preferred views.
Cultural and computer-usage bias - Biased towards information from internet-using developed countries and affluent parts of society (internet access). Countries where computer use is not so common will often have lower rates of reference to equally notable material, which may therefore appear (mistakenly) non-notable.
Undue weight - May disproportionally represent some matters, especially those related to popular culture (some matters may be given far more space, and others far less, than fairly represents their standing): popularity is not notability.
Sources not readily accessible - Some sources are accessible to all, but many are payment only, or not reported online.

General web search engines (Google, Yahoo web search etc)

Dark net - Search engines exclude a vast number of pages, and this may include systematic bias so that some matters are excluded disproportionately (for example, because they are commonly visible on sites that do not allow Google indexing, or because the content cannot be indexed for technical reasons, as with Flash or image-based websites).
Search engines as promotion tool - An industry exists seeking to influence site position, popularity, and ratings in such searches, or to sell advertising space related to searches and search positions. Some subjects, such as pornographic actors, are so dominated by these that searches cannot be reliably used to establish popularity.
Review process varies - Some sites accept any information; others have some form of review or checking system in place.
Self-mirroring - Sometimes other sites pick up Misplaced Pages content, which is then passed around the internet, and more pages are built up based upon it (often without citation), meaning that in reality the source of much of the search engine's findings is actually just copies of Misplaced Pages's own previous text, not genuine sources.
Popular usage bias - Popular usage and urban legend are often reported over correctness. Examples: 1/ A search for the incorrect Charles Windsor gives 10 times more results than the correct Charles Mountbatten-Windsor, 2/ A search for the most common spelling of El Niño will often report it spelt "El Nino", without the diacritic, 3/ Urban legends are often reported widely, for example hundreds of sites report that the USS Constitution set sail in 1779, although the correct date is 1797.
Popular views and perceptions - These are likely to be more reported. For example, there may be many references to acupuncture, and many confirming that people are often allergic to animal fur, but it may only be with careful research that it is revealed that there are medical peer-reviewed assessments of the former, and that people are usually not allergic to fur, but to the sticky skin particles ("dander") within the fur.
Language selection bias - For example, an Arabic speaker searching for information on homosexuality in Arabic will likely find pages which reflect a different bias than an English speaker searching in English on the same subject, since popular and media views and beliefs about homosexuality can differ widely between English speaking countries (USA, UK, Australasia) that tend to include a relatively higher proportion of homosexuality-accepting groups, and Arabic speaking countries (Middle East) that tend to include a relatively lower proportion.
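The El Niño example under popular usage bias suggests searching for both the accented and the unaccented spelling, so that neither form hides the other. A minimal Python sketch of that idea follows, using the standard unicodedata module; the function names are invented for illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def spelling_variants(term):
    """Return the term as given plus, if different, its unaccented form."""
    term = unicodedata.normalize("NFC", term)
    plain = strip_diacritics(term)
    return [term] if plain == term else [term, plain]


def variants_query(term):
    """Join the variants as quoted alternatives with OR."""
    return " OR ".join(f'"{v}"' for v in spelling_variants(term))


print(variants_query("El Niño"))  # "El Niño" OR "El Nino"
```

Stripping combining marks handles accents of this kind only; letters that do not decompose (such as "ø" or "ß") are left unchanged and would need a separate mapping.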

Other

Note that other Google searches, particularly Google Book Search, have a different systemic bias from Google Web searches and give an interesting cross-check and a somewhat independent view.

Alexa ratings

In some cases, it is helpful to estimate the relative popularity of a website. Alexa Internet is a tool for this (Hitwise is another). To test Alexa's ranking for a particular web site, visit alexa.com and enter the URL.

The Alexa measuring system is based on a toolbar that users must choose to install, and which can only be installed on the Internet Explorer browser and Microsoft Windows. Sources of bias include websites whose users disproportionately do not install such toolbars, or are less often users of Windows and Internet Explorer, as well as webmasters who install the Alexa Toolbar for the sole purpose of enhancing their ratings. Specifically, Alexa rankings are not part of the notability guidelines for web sites for several reasons:

Below a certain level, Alexa rankings are essentially meaningless, because of the limited sample size. Alexa itself says ranks worse than 100,000 are not reliable, and some critics feel it is worse than that.
Alexa rankings vary and include significant systematic bias, which means the ratings often do not reflect popularity, but only popularity amongst certain groups of users (see Alexa Internet#Concerns). Broadly, Alexa rates based upon measurements by a user-installed toolbar, but this is a highly variable tool, and there are large parts of the internet user community (especially corporate users, many advanced users, many open-source and non-Windows users) who do not use it and whose internet reference use is therefore ignored.
Alexa rankings do not reflect encyclopedic notability or the existence of reliable source material. A highly ranked web site may well have nothing written about it, and a poorly ranked web site may well have a lot written about it.
A number of unquestionably notable topics have web sites with poor Alexa rankings.

Foreign languages, non-Latin scripts, and old names

Often for items of non-English origin, or in non-Latin scripts, a considerably larger number of hits result from searching in the correct script or for various transcriptions. An Arabic name, for instance, needs to be searched for in the original script, which is easily done with Google (provided one knows what to search for), but problems may arise if - for example - English, French and German webpages transcribe the name using different conventions. Even for English-only webpages there may be many variants of the same Arabic or Russian name. Personal names in other languages (Russian, Anglo-Saxon) may have to be searched for both including and excluding the patronymic, and searches for names and other words in strongly inflected languages should take into account that arriving at the total number of hits may require searching for forms with varying case-endings or other grammatical variations not obvious to someone who does not know the language. Names from many cultures are traditionally given together with titles that are considered part of the name, but may also be omitted (as in Gazi Mustafa Kemal Pasha).

Even in Old English, the spelling and rendering of older names may allow dozens of variations for the same person. A simplistic search for one particular variant may underrepresent the web presence by an order of magnitude.

A search like this requires a certain linguistic competence which not every individual Wikipedian possesses, but the Misplaced Pages community as a whole includes many bilingual and multilingual people, and it is important for nominators and voters on AfD at least to be aware of their own limitations and not to make untoward assumptions when language or transcription bias may be a factor.
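The need to cover forms with and without optional titles can be illustrated with a small Python sketch. It only enumerates combinations that the searcher already knows about; it cannot supply transcriptions or inflections, which still require knowledge of the language. The function name and its parameters are invented for illustration.

```python
from itertools import product


def name_variants(optional_prefixes, core_spellings, optional_suffixes):
    """Combine each core spelling with every optional title, present or absent."""
    variants = []
    for prefix, core, suffix in product(
        [""] + list(optional_prefixes), core_spellings, [""] + list(optional_suffixes)
    ):
        name = " ".join(part for part in (prefix, core, suffix) if part)
        if name not in variants:
            variants.append(name)
    return variants


variants = name_variants(["Gazi"], ["Mustafa Kemal"], ["Pasha"])
print(" OR ".join(f'"{v}"' for v in variants))
```

Adding further known transcriptions to core_spellings multiplies the number of variants, which is why a search for a single form can so easily under-represent a name.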

Google unique page count issues

Note also that the number of hits reported by search engines is only an estimate. For example, Google only calculates the actual number of hits when the user finally navigates through all result pages, to the last one, and even then it places restrictions on the figure. At times, the hit count estimate can differ significantly (by one or more orders of magnitude) from the total count of results on the last page.

A site-specific search may help determine if most of the hits are coming from the same web site; a single web site can account for hundreds of thousands of hits.
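Once a list of result URLs has been collected by hand, checking whether one site dominates is a matter of counting host names. The Python sketch below assumes such a list is already available (the example.com and example.org addresses are placeholders) and uses invented function names.

```python
from collections import Counter
from urllib.parse import urlparse


def hits_per_site(urls):
    """Count results per host name (lower-cased so that case does not split counts)."""
    return Counter(urlparse(u).netloc.lower() for u in urls)


def dominant_share(urls):
    """Return the most frequent host and the fraction of results it accounts for."""
    counts = hits_per_site(urls)
    site, n = counts.most_common(1)[0]
    return site, n / len(urls)


# Placeholder result list for illustration
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/c",
    "http://EXAMPLE.com/d",
]
print(dominant_share(urls))  # ('example.com', 0.75)
```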

For search terms that return many results, Google uses a process that eliminates results which are "very similar" to other results listed, both by disregarding pages with substantially similar content and by limiting the number of pages that can be returned from any given domain. For example, a search on "Taco Bell" will only give a couple of pages from tacobell.com even though many in that domain will certainly match. Further, Google's list of unique results is constructed by first selecting the top 1000 results and then eliminating duplicates without replacements. Hence the list of unique results will always contain fewer than 1000 results regardless of how many webpages actually matched the search terms. For example, from the about 742 million pages related to "Microsoft", Google presently returns 552 "unique" results (as of 2006-01-09). Caution must be used in judging the relative importance of websites yielding well over 1000 search results.
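The effect of "select the top 1000, then eliminate duplicates without replacement" can be modelled in a few lines. The Python sketch below is a toy model of the process as described above, not Google's actual algorithm; the data, the content_id field, and the function name are all invented for illustration.

```python
def unique_results(ranked_results, key, cap=1000):
    """Keep the first `cap` results, then drop later items whose key repeats.
    Nothing is pulled in from beyond the cap to replace the dropped items."""
    seen = set()
    unique = []
    for result in ranked_results[:cap]:
        k = key(result)
        if k not in seen:
            seen.add(k)
            unique.append(result)
    return unique


# Invented data: 5000 matching pages that fall into 600 groups of near-duplicates
results = [
    {"url": f"http://example.com/page{i}", "content_id": i % 600}
    for i in range(5000)
]
unique = unique_results(results, key=lambda r: r["content_id"])
print(len(results), len(unique))  # 5000 600
```

However many pages match, the unique list in this model can never exceed the cap, which is why the count of "unique" results says little about the true number of matching pages.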

Search engine limitations - technical notes

Many, probably most, of the publicly available web pages in existence are not indexed. Each search engine captures a different percentage of the total. Nobody can tell exactly what portion is captured.

The estimated size of the World Wide Web is at least 11.5 billion pages, but a much deeper (and larger) Web, estimated at over 3 trillion pages, exists within databases whose contents the search engines do not index. These dynamic web pages are formatted by a Web server when a user requests them and as such cannot be indexed by conventional search engines. The United States Patent and Trademark Office website is an example; although a search engine can find its main page, one can only search its database of individual patents by entering queries into the site itself.

Google, as all search engines should, follows the robots.txt protocol and can be blocked by sites that do not wish their content to be indexed or cached by Google. Sites that contain large amounts of copyrighted content (image galleries, subscription newspapers, webcomics, movies, video, help desks), usually involving membership, will block Google and other search engines. Other sites may also block Google because of stress or bandwidth concerns on the server hosting the content.
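How a robots.txt file shuts a crawler out can be seen with Python's standard urllib.robotparser module. The rules and the example.com addresses below are made up for illustration; a real site's file would be fetched from its /robots.txt address.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: Googlebot is kept out of /members/, all other crawlers are kept out entirely
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /members/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "http://example.com/articles/1"))       # True
print(parser.can_fetch("Googlebot", "http://example.com/members/profile"))  # False
print(parser.can_fetch("OtherBot", "http://example.com/articles/1"))        # False
```

Pages under a disallowed path will not appear in results from a search engine that honours the file, no matter how relevant they are, so their absence says nothing about the topic.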

Google has also been the victim of redirection exploits that may return more results for a specific search term than there are actual content pages.

Google and other popular search engines are also a target for "search result enhancement" by search engine optimizers, so there may also be many results returned that lead to a page that only serves as an advertisement. Sometimes pages contain hundreds of keywords designed specifically to attract search engine users to that page, but in fact serve an advertisement instead of content related to the keyword.

Search engines also might not be able to read links or metadata that normally require a browser plugin, Adobe PDF, or Macromedia Flash, or where a website is displayed as part of an image. Search engines also cannot listen to podcasts or other audio streams, or even video mentioning a search term.

Forums, membership-only and subscription-only sites (since Googlebot does not sign up for site access) and sites that cycle their content are not cached or indexed by any search engine. With more sites moving to AJAX/Web 2.0 designs, this limitation will become more prevalent, as search engines only simulate following the links on a web page. AJAX page setups (like Google Maps) dynamically return data based on realtime manipulation of JavaScript.

References

Jonathan de Boyne Pollard (2008-01-01). "Google result counts are a meaningless metric". Frequently Given Answers.
Antonio Gulli and Alessio Signorini (2005-08-29). "The Indexable Web is more than 11.5 billion pages".
Alvin More and Brian H. Murray (2000). "Sizing the Internet". Cyveillance Inc.

External links

CustomizeGoogle, an open-source Firefox extension, which includes Google Suggest.
