This is the current revision of this page, as edited by Lowercase sigmabot III (talk | contribs) at 18:36, 11 January 2025 (Archiving 5 discussion(s) to User talk:John of Reading/Archive 28) (bot). The present address (URL) is a permanent link to this version.
This is John of Reading's talk page, where you can send him messages and comments.
Archives: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28
Auto-archiving period: 21 days
Hi John!
You like typofixing? I got tens of thousands of typos and I can't fix em all alone. Perhaps we can combine our forces? User:Polygnotus/typos. Polygnotus (talk) 16:21, 8 September 2024 (UTC)
- @Polygnotus: Interesting. I'm finding typos by running regular expressions on a database dump; how are you creating your work list? What's your false positive rate?
- I confess I'm so used to working with AWB and my 4000+ regular expressions that I'm unlikely to switch to a radically different method. -- John of Reading (talk) 16:47, 8 September 2024 (UTC)
- I take a list of the most frequently used words, create typos with a Levenshtein distance of 1, and check which occur in the dump. Then I do a bunch of filtering and I check which exist in the live version of Misplaced Pages.
- Which programming languages, if any, are you familiar with?
- We could use a custom AWB module in C# or perhaps just use some custom Selenium-based tool (which would be pretty damn similar, not radically different). Or perhaps a JWB-like interface on wiki. Haven't really decided how to approach that yet.
- I never really bothered to create stats of the amount of skips vs the amount of fixes but that is a good idea to have.
- I use a lot of regex to avoid typos that shouldn't be fixed, see User:Polygnotus/typo.js.
- I have at least 60.000 potential typos left to fix so it is probably worth it to create a decent tool for that.
- Polygnotus (talk) 17:14, 8 September 2024 (UTC)
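The candidate-generation step described above can be sketched roughly as follows. This is a minimal Python illustration, not Polygnotus's actual code (which the thread later says is written in Java). The function names `edits1` and `candidate_typos`, the lowercase ASCII alphabet, and the idea of passing the dump as a plain set of words are all assumptions made for the example; the later filtering and live-site checks are not shown.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string at Levenshtein distance exactly 1 from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    # Substituting a letter with itself reproduces the word, so drop it.
    return (deletes | substitutions | inserts) - {word}


def candidate_typos(frequent_words, dump_words):
    """Map each dump word that is one edit away from a frequent word
    (and is not itself a frequent word) to the word it may be a typo of."""
    known = set(frequent_words)
    found = {}
    for word in frequent_words:
        for variant in edits1(word):
            if variant in dump_words and variant not in known:
                found.setdefault(variant, word)
    return found
```

Note that plain Levenshtein distance 1 covers only single insertions, deletions and substitutions. A swap of two adjacent letters, such as "chirstmas" for "christmas", counts as distance 2 and would need a separate transposition step.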
- @Polygnotus: Languages? Assembler, BCPL, C, C++ - all unused for a decade, I'm afraid. But I've used regular expressions on a copy of User:Polygnotus/typos to extract the 3000+ article names and the alleged typos, and have begun an AWB run to detect those words in those articles. So far I've saved 23 edits and have skipped 25 other articles - not a bad hit rate, by my standards, so I'll press on with this over the next few days. "Gettig" is a surname; "protectin" is a kind of protein; Supremme de Luxe is a stage name; and so on. -- John of Reading (talk) 18:08, 8 September 2024 (UTC)
- Yeah that is 3489 typos and then we got 2800 here and 9300 there and 1200 here. When my Raspberry Pi is done I will have another ~60.000. The typos have already had very similar regex run on them, as you saw in typo.js, so much of the WONTFIX stuff has been filtered out already. Polygnotus (talk) 18:15, 8 September 2024 (UTC)
- In an ideal world, AWB would accept lists in this format (christmas|chirstmas|My Christmas) as a list generator source. And AWB would contain code (very similar to typo.js) to not fix typos in certain situations. Do you know how we can get closer to that goal? WP:AWB lists some developers in the infobox. Polygnotus (talk) 18:44, 8 September 2024 (UTC)
- AWB has two checkboxes at the top left of the "Find & Replace" configuration, which aim to cover the "certain situations". I run with those turned off, though, so that I do fix errors in quotations, references, foreign-language text and so on - with appropriate care and checking. -- John of Reading (talk) 18:50, 8 September 2024 (UTC)
- I boldly created the WP:QUOTETYPO shortcut at some point and it hasn't been reverted yet. It doesn't really make sense to faithfully reproduce simple mistakes made by others when they are irrelevant and only distract imo. Your approach does affect the hit rate tho. Are there others who I should contact? I assume the 16789 typos above will keep you busy for a while but you know where to find me when you want more. Perhaps I should stick the lists in a subpage of WP:TYPO? I'll dive in the AWB code, thanks. Polygnotus (talk) 19:40, 8 September 2024 (UTC)
- Misplaced Pages:Quotations is marked as an essay; the authoritative guide is at MOS:QUOTE. Fortunately they say the same thing! I do fix typos in quotations if I think they are "insignificant" or are likely to have been copying errors. See User:John of Reading/Typo fixing with AutoWikiBrowser#Editing quotes, book titles and such like.
- If you post your links at Misplaced Pages talk:Typo Team you may attract more helpers. Oh, and are you aware of the Misplaced Pages:Typo Team/moss project? That's another attempt at co-ordinated checking using data-crunching techniques. -- John of Reading (talk) 20:14, 8 September 2024 (UTC)
- Thank you, redirect target improved. I combined typolist, typolist2 and typolist3 above (but not User:Polygnotus/typos, which you imported into AWB) into User:Polygnotus/Data/Typolist. If you want some, please delete them from the list so that it's clear that they've been handled.
- I added Moss and the (code behind the) AWB checkboxes to my todolist, thanks again! Polygnotus (talk) 04:30, 9 September 2024 (UTC)
- @Polygnotus: I've restarted the list after telling AWB not to sort the pages alphabetically, so I'm now processing them in the same order as they were listed in User:Polygnotus/typos. This makes it easier for me, as the fixes for the same target word turn up together, and perhaps for you, since you can compare my contribution list with the list I'm working from.
- Two of your "don't fix" tests aren't working correctly:
- In many cases the typo is embedded within a URL - example "mmiller" within Merle Miller
- In some cases the typo is embedded within a file name - example "distribuion" within Lesser blue-eared starling. I exclude those by peeking ahead for a known image suffix - (?!*\.(?i:(?:gif|jpe?g|ogg|ogv|pdf|png|svg|tiff?|webm))\b) - this regular expression isn't perfect, I know.
- -- John of Reading (talk) 07:26, 9 September 2024 (UTC)
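The "peek ahead for a known image suffix" idea can be sketched in Python as below. The suffix list is the one quoted above, but the character class in front of it is an assumption, because the quoted expression has lost that part: here a file name is taken to run to the end of the line or to the next pipe or closing bracket. The helper name `find_typo` is hypothetical.

```python
import re

# Assumption: a file name continues until a newline, a pipe or a closing
# bracket. The suffix list is taken from the expression quoted above.
IMAGE_SUFFIX_AHEAD = (
    r"(?![^\n|\]]*\.(?i:gif|jpe?g|ogg|ogv|pdf|png|svg|tiff?|webm)\b)"
)


def find_typo(typo, text):
    """Return the start offsets of `typo` as a whole word, skipping any
    occurrence that is followed on the same line by a known media suffix."""
    pattern = re.compile(r"\b" + re.escape(typo) + r"\b" + IMAGE_SUFFIX_AHEAD)
    return [m.start() for m in pattern.finditer(text)]
```

As with the original, this is not perfect: a genuine typo in running prose is also skipped if a media file name happens to appear later on the same line.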
- I make the lists with Java and then I use Javascript to actually make the edits. When I improved the url regex in Javascript I forgot to add it to the Java code as well. I had a bunch of ideas to improve my workflow so I am cooking up a fresh batch for you. Might take a while, even on a modern pc. Polygnotus (talk) 03:33, 10 September 2024 (UTC)
- Originally I used ((http|https)://)(www.)?{2,256}\.{2,26}\b(*) for URLs but a lot of them escaped the wrath of the regex.
- I am considering using something like: \b((?:https?://|www\.)(?:\S+(?::\S*)?@)?(?:(?:{1,3}\.){3}{1,3}|(?:(?:-*)*+)(?:\.(?:-*)*+)*(?:\.(?:{2,})))(?::\d{2,5})?(?:\S*)?\b) instead unless you have a better idea.
- For files I used:
- File:(.*?)(\\.|\\|)"
- Image:(.*?)\\.
- Category:(.*?)\\.
- and I haven't really decided how to improve on that. Not all of them have file extensions. Perhaps Commons Special:MediaStatistics and the local one can be used?
- My todolist is steadily growing. Polygnotus (talk) 03:41, 10 September 2024 (UTC)
- Are the URL regexes running with "ignore case" turned on? If not, the first URL regex fails to match the whole URL in the Merle Miller example because parts of it are uppercase.
- The filename in the Lesser blue-eared starling has no "File:" prefix because it is being used as an infobox parameter. To exclude those, you'll either have to look backwards for "range_map =" or similar, or look forwards for ".png" or similar. -- John of Reading (talk) 07:01, 10 September 2024 (UTC)
- I use Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE. I have added range_map to the list of disallowed parameters. I am currently trying to figure out whether Ollama can help identify typos better than a coinflip. Polygnotus (talk) 07:47, 10 September 2024 (UTC)
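The two checks discussed here (case-insensitive URL matching, and looking backwards for a file-name parameter such as range_map) might be combined along the following lines. This is only a Python sketch under stated assumptions: the URL pattern is deliberately loose and is not either of the expressions quoted earlier, the parameter list `FILE_PARAMS` contains just the one name mentioned in this thread plus a hypothetical `image`, and `should_skip` is an invented helper name.

```python
import re

# Deliberately loose URL pattern (an assumption), matched case-insensitively
# so that URLs containing uppercase letters are still recognised.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Hypothetical list of infobox parameters whose values are file names.
FILE_PARAMS = ("range_map", "image")
PARAM_LINE = re.compile(
    r"[ \t]*\|[ \t]*(?:%s)[ \t]*=" % "|".join(FILE_PARAMS), re.IGNORECASE
)


def should_skip(text, start, end):
    """Return True if text[start:end] lies inside a URL, or on a line that
    begins with one of the file-name parameters."""
    if any(m.start() <= start and end <= m.end() for m in URL.finditer(text)):
        return True
    # Look backwards to the start of the current line, then test its prefix.
    line_start = text.rfind("\n", 0, start) + 1
    return PARAM_LINE.match(text, line_start) is not None
```

A caller would pass the offsets of each suspected typo and leave the page alone whenever `should_skip` returns True.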
- @Polygnotus: Thank you very much for preparing these lists; I've had an enjoyable couple of months working through them. If my record-keeping and arithmetic can be trusted, I've made about 4,500 corrections from your 11,700 candidates. I'd be happy to work on another list like this one, but not immediately: I feel I've neglected some of my other self-assigned tasks and would like to spend some time on those.
- I ran into one more problem further down the list: by the time the data was saved in User:Polygnotus/Data/Typolist, any special characters in page names had got corrupted. For example 2014â15 Presbyterian Blue Hose men's basketball team should have been an ndash, and History of à land should have been History of Åland. I managed to guess the intended article names in most cases. -- John of Reading (talk) 16:49, 12 November 2024 (UTC)
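Corruption of this shape is typical of UTF-8 text being decoded with a single-byte code page somewhere along the way, although the thread does not say what actually happened. Assuming that cause, and assuming Windows-1252 as the wrong code page, a repair could be sketched in Python as follows (`repair_mojibake` is a hypothetical helper name):

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252.

    Returns the input unchanged when it does not round-trip, which is the
    case for text that was never corrupted in this particular way.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

This only works while every byte of the original sequence survives. In the examples quoted above some characters appear to have been dropped already, so titles like those would still need to be guessed by hand.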
Misplaced Pages:Talk page guidelines has an RfC
Misplaced Pages:Talk page guidelines has an RfC for possible consensus. A discussion is taking place. If you would like to participate in the discussion, you are invited to add your comments on the discussion page. Thank you. Gnomingstuff (talk) 18:14, 16 October 2024 (UTC)
ArbCom 2024 Elections voter message
Hello! Voting in the 2024 Arbitration Committee elections is now open until 23:59 (UTC) on Monday, 2 December 2024. All eligible users are allowed to vote. Users with alternate accounts may only vote once.
The Arbitration Committee is the panel of editors responsible for conducting the Misplaced Pages arbitration process. It has the authority to impose binding solutions to disputes between editors, primarily for serious conduct disputes the community has been unable to resolve. This includes the authority to impose site bans, topic bans, editing restrictions, and other measures needed to maintain our editing environment. The arbitration policy describes the Committee's roles and responsibilities in greater detail.
If you wish to participate in the 2024 election, please review the candidates and submit your choices on the voting page. If you no longer wish to receive these messages, you may add {{NoACEMM}} to your user talk page. MediaWiki message delivery (talk) 00:20, 19 November 2024 (UTC)
Greetings of the season
~ ~ ~ Greetings of the season ~ ~ ~ Hello John of Reading: Enjoy the holiday season and winter solstice if it's occurring in your area of the world, and thanks for your work to maintain, improve and expand Misplaced Pages. Cheers, Spread the love; use {{subst:User:Dustfreeworld/Xmas3}} to send this message. --Dustfreeworld (talk) 11:49, 24 December 2024 (UTC)
- @Dustfreeworld: Thank you! -- John of Reading (talk) 17:11, 24 December 2024 (UTC)
Redirect listed at Redirects for discussion
A redirect or redirects you have created has been listed at redirects for discussion to determine whether its use and function meets the redirect guidelines. Anyone, including you, is welcome to comment on this redirect at Misplaced Pages:Redirects for discussion/Log/2024 December 27 § "Musican" Redirects until a consensus is reached. User:Someone-123-321 (I contribute, Talk page so SineBot will shut up) 12:36, 27 December 2024 (UTC)