Rosella, IA 3.0 Beta
release date: 22/03/2018
Bug Fixes And New Features
- File types have been changed to be more intuitive. 'Hybrid' is now 'Plain Text'. You will still be able to use the features that apply to 'hybrid' texts. You can now import tagged text as either 'XML and HTML' or 'TEI'. This allows for ongoing development at various levels of generality: you can import any text, you can take advantage of it being tagged as XML, or we can build in features that make assumptions about TEI texts (eg: counting by characters marked up as such in TEI).
- Split by 'grapheme' allows you to count individual 'characters' (ie: letters or the unique symbols of a character set) instead of words. The term 'grapheme' has been used to avoid confusion with counting speaking 'characters' in plays. This will be useful for Chinese, especially where the text does not have space-separated words.
- Re-arranged the layout of various input options in the user interface into more coherent groupings. This change will unavoidably confuse established users. Sorry about that.
- Split texts by tags (eg: enclose parts of your text in <div> tags </div>, or count paragraphs with the <p> tag </p>, etc). This can be done in any of the text types, including plain text. Any text outside the tag will be ignored.
- Import metadata for texts from a csv file.
- Include metadata, author, title, and character in output.
- List any XML tags to be included or excluded from the process.
- Split by character provides summary output of words for characters.
- Compatible with the latest version of TEI, which allows for <stage> tags inside <sp> tags.
- Some inline documentation: hovering the mouse over some elements shows brief explanations.
- Sort results by clicking the top of the column.
- Variant spelling no longer supported. The complexity of handling variant spelling made it impossible to give an account of it, so it could no longer be included. Variant spelling should be handled in your own preprocessing.
- Punctuation that may not be part of a word in English is no longer treated as part of a word when there are no spaces. All punctuation except the apostrophe (') and the hyphen (-) will split words. Eg: the colon in "They included:Mary, John,..." splits 'included' and 'Mary' into two words, whereas in "It was Mary's choice." the word 'Mary's' remains one word.
- Merge sets, or remove one set from another set.
- Ability to use the orig and sic attributes of the <reg> tag to count either form. Eg: option to count either scilens or silence in <reg orig="scilens">silence</reg>
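
The word-splitting rule described in the punctuation note above can be illustrated with a short sketch. This is not the tool's own code: the function name split_words and the regular expression are hypothetical, and the handling of leading or trailing apostrophes and hyphens (dropped here) is an assumption, since the notes only describe punctuation inside a word.

```python
import re

# Assumption: a word is a run of letters or digits that may contain
# internal apostrophes or hyphens. Every other punctuation mark, and
# any apostrophe or hyphen at the edge of a word, acts as a separator.
WORD_PATTERN = re.compile(r"[^\W_]+(?:['-][^\W_]+)*")


def split_words(text):
    """Return the words in text, splitting on all punctuation except ' and -."""
    return WORD_PATTERN.findall(text)


print(split_words("They included:Mary, John,..."))
# ['They', 'included', 'Mary', 'John']
print(split_words("It was Mary's choice."))
# ['It', 'was', "Mary's", 'choice']
```

The colon separates 'included' and 'Mary' even without a space, while the apostrophe keeps 'Mary's' together, matching the two examples in the note.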
Planned Enhancements
- PCA plot. Since this is among the most popular stylometry tools we would like to include a basic PCA Plot, so that there is no need to move to a statistics package to get basic results. There is no intention to duplicate statistics packages beyond this, except where, perhaps, a particular method is desired in the overwhelming majority of cases.
- Better user feedback while processing, giving an indication that processing is continuing and, where possible, how far along it is.
- Segmentation by percentages.
- More advanced and generalised XML handling, such as showing attribute values as output metadata when splitting by tag.
- "Include words" option for n-grams.
- Upgrade to Java 8.0
- Fix bug where a split by tag conflicts with handling of head and stage tags in TEI.
- Fix bug where a word with a tag in the middle (eg: straw
berry) is treated as two words.