Wmatrix corpus analysis and comparison tool
Wmatrix is a software tool for corpus analysis and comparison. It provides a web interface to natural language processing tools such as the USAS and CLAWS corpus annotation tools for English, plus the multilingual semantic tagger PyMUSAS, and to standard corpus linguistic methodologies such as frequency lists, keyness statistics, n-grams, collocations and concordances. It extends the keywords method to key grammatical categories and key semantic domains.
Wmatrix5 is live but Wmatrix4 was retired on 1st March 2023, so please switch to Wmatrix5 as soon as possible.
If you already have an account, log in to Wmatrix6.
Wmatrix6 is currently only available for invited beta testers.
If you already have an account, log in to Wmatrix5.
Wmatrix5 first went live for the CL2021 conference on 12th July 2021.
Wmatrix4 was mainly suitable for English texts and also introduced a beta version of a Spanish tag wizard.
Wmatrix4 ran in Lancaster University's cloud infrastructure on a new, faster disk and used the more secure HTTPS connection method.
Wmatrix4 first went live for the IWODA2018 conference and was retired on 1st March 2023.
Existing folders in Wmatrix3 were transferred to Wmatrix4 on 16th December 2018.
Wmatrix3 was suitable for English texts only and was retired at 8am GMT/UTC on 20th December 2018.
Differences between Wmatrix5 and Wmatrix6 (last updated 22nd June 2023)
- Wmatrix6 has a completely new back-end indexing system built on SQLite, named matrixdb, which will be made available as open source alongside the PyMUSAS Python semantic tagger
- POS frequency lists are more accurate than before, because two issues with POS lists in previous Wmatrix versions have been fixed:
(a) Semantic MWEs were previously counted in the POS frequency list only under the POS tag of the first word, but now POS frequency counts are included for all parts of MWEs
(b) Ditto tags on CLAWS POS tags are now removed
e.g. "in_II31 terms_II32 of_II33" (semantically tagged as Z5 for the whole MWE) would previously have been counted as "II31", separately from "II", but now these words are correctly counted under the "II" tag in the POS frequency list.
- A corpus in one folder can consist of multiple files, and the filename is displayed for each line on the right-hand side of concordances
- The concordance display is much improved: results are paginated and load very quickly, even for much larger corpora (currently tested up to 11 million words)
- Frequency lists are paginated, which improves loading speed in the browser
- Concordances can be sorted by up to 3 words on the left and right, corpus order, filename, and node/key word as well as POS/semantic tag of node/key word, and sort positions are highlighted in red
- Semantic frequency lists include a range link which shows the breakdown of hits per file
- There is a new lemma frequency list, with lemma concordances
- The tag wizard now runs on the Wmatrix server without requiring you to keep the browser window open during the tagging process.
- A queuing system (qpym) has been created to load balance the tag wizard and process multiple files in parallel (with thanks to John Vidler for design discussions on qpym).
- The tag wizard allows you to load new data in English using the CLAWS and USAS pipeline as per Wmatrix5 and previous versions
- N-grams (for N between 2 and 5) are now counted as part of running the tag wizard
- Collocation tables are calculated as part of running the tag wizard
- The tag wizard has been extended to allow zip file uploads for multiple files to be used in one folder
- All words are now reduced to lowercase for the frequency lists, n-grams and collocation calculations.
This removes apparent duplicates which could appear in some Wmatrix5 keyword lists where proper nouns were kept in their original case and had differing capitalisation between two files.
- Zip file uploads with up to three levels of subdirectories are now supported by the tag wizard.
- The semantic taggers in PyMUSAS for Chinese, Dutch, Finnish, French, Italian, Portuguese, Spanish, and Welsh are now available in the tag wizard.
- Items which previously had no semantic tag (e.g. punctuation and XML tags), and were therefore removed from Wmatrix5 analysis, now appear in the frequency lists as Z9 (indicating no semantic content).
- Reference corpora have been completely retagged and indexed for comparability with newly tagged data. Initially the following have been included: British English 2021 (BE21), British English 2006 (BE06), American English 2006 (AmE06), BNC 1994 Sampler Written and BNC 1994 Sampler Spoken
- The first two statistics have been added to the collocation table: Mutual Information and Log Likelihood
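
The ditto-tag fix to POS frequency lists described in the list above can be illustrated with a short Python sketch. This is not Wmatrix or matrixdb code. The function names (strip_ditto, pos_frequencies) and the word_TAG input format are hypothetical, and the sketch assumes that a ditto tag is a CLAWS tag followed by two digits: the number of words in the MWE, then the position of the word within it.

```python
import re
from collections import Counter

# Assumption: ditto tag = base CLAWS tag + MWE length (2-9) + position (1-9),
# e.g. II31 = first word of a three-word unit tagged II.
DITTO = re.compile(r"(?P<base>.+?)(?P<length>[2-9])(?P<pos>[1-9])")


def strip_ditto(tag: str) -> str:
    """Return the base POS tag, dropping a trailing ditto suffix if present."""
    m = DITTO.fullmatch(tag)
    # Only treat the digits as a ditto suffix if the position fits the length.
    if m and int(m.group("pos")) <= int(m.group("length")):
        return m.group("base")
    return tag


def pos_frequencies(tagged_text: str) -> Counter:
    """Count base POS tags over whitespace-separated word_TAG tokens."""
    counts = Counter()
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            continue  # skip tokens that carry no tag
        # Every word of an MWE is counted, under its base tag.
        counts[strip_ditto(tag)] += 1
    return counts


if __name__ == "__main__":
    freqs = pos_frequencies("in_II31 terms_II32 of_II33 the_AT price_NN1")
    # II is counted three times; II31, II32 and II33 no longer appear.
    print(freqs)
```

Tags that merely end in a single digit, such as NN1, are left untouched because the pattern requires two trailing digits; a ditto-tagged NN1 (NN121) reduces to NN1.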
Things currently in the pipeline
- MWEs are not yet included in the word frequency list, i.e. each part of an MWE is currently counted separately in the word list
- In the semantic tag frequency list, MWEs are correctly counted once under each tag.
There are additional entries with tags such as N5_MWE, which represent the non-first words of MWEs. You can view concordances for these entries, but they can otherwise be ignored for frequency purposes.
- The previous two items mean that keyness comparisons in Wmatrix6 will differ in some respects from the results of the same comparison in Wmatrix5
- The tag wizard will be extended to allow Domain and My Tag Wizard versions
- The tag wizard will be extended to support other languages included in PyMUSAS as they become available, and will be updated on a regular basis as lexicons and disambiguation methods improve.
- Metaphor features such as 'broad sweep' will be re-incorporated
- Add a file or range count directly in the frequency list, along with dispersion measures like DP and DPnorm. Also add range links in word and POS frequency lists
- Keyness lists will be paginated
- Add a variety of other collocation statistics to the collocation table along with user controlled settings for filters
- Reimplementation of c-grams
- Sharing of folders will be re-implemented by an invitation/accept process
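
As an illustration of the range count and the DP and DPnorm dispersion measures mentioned in the list above, here is a minimal Python sketch. It uses the commonly cited definitions (DP as half the summed absolute difference between observed and expected proportions per corpus part, DPnorm as DP divided by one minus the smallest expected proportion). The function name dispersion and its arguments are hypothetical, and the eventual Wmatrix implementation may differ.

```python
from typing import Sequence, Tuple


def dispersion(freqs: Sequence[int], sizes: Sequence[int]) -> Tuple[float, float, int]:
    """Return (DP, DPnorm, range) for one item across the files of a corpus.

    freqs[i] is the item's frequency in file i; sizes[i] is that file's
    size in words. Both sequences are hypothetical inputs for this sketch.
    """
    if len(freqs) != len(sizes):
        raise ValueError("freqs and sizes must have the same length")
    if len(sizes) < 2:
        raise ValueError("dispersion needs at least two files")
    total_f, total_s = sum(freqs), sum(sizes)
    if total_f == 0 or total_s == 0:
        raise ValueError("item frequency and corpus size must be non-zero")

    # Expected proportion of hits per file (file's share of the corpus)
    expected = [s / total_s for s in sizes]
    # Observed proportion of hits per file
    observed = [f / total_f for f in freqs]

    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    dp_norm = dp / (1 - min(expected))
    file_range = sum(1 for f in freqs if f > 0)  # number of files with a hit
    return dp, dp_norm, file_range


if __name__ == "__main__":
    # An item spread evenly over three equal files, then one confined to a single file.
    print(dispersion([10, 10, 10], [100, 100, 100]))
    print(dispersion([30, 0, 0], [100, 100, 100]))
```

Values near 0 indicate an item spread across files in proportion to their sizes, and values near 1 indicate an item concentrated in a small part of the corpus.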
©2000-23 UCREL, Lancaster University.
For technical queries please contact Paul Rayson: email@example.com