
wink-nlp

winkjs · 159k · MIT · 2.3.2 · TypeScript support: included

Developer friendly Natural Language Processing ✨

NLP, natural language processing, tokenize, SBD, sentence boundary detection, negation handling, sentiment analysis, POS Tagging, NER, named entity extraction, custom entity detection, word vectors, visualization, pattern matching, stemmer, bm25, vectorizer, Embeddings, Word Vectors, winkNLP, winkjs, wink

readme

winkNLP

Developer friendly Natural Language Processing ✨

WinkNLP is a JavaScript library for Natural Language Processing (NLP). Designed specifically to make development of NLP applications easier and faster, winkNLP is optimized for the right balance of performance and accuracy.

Its word embedding support unlocks deeper text analysis. Represent words and text as numerical vectors with ease, bringing higher accuracy in tasks like semantic similarity, text classification, and beyond – even within a browser.

It is built from the ground up with no external dependencies and has a lean code base of ~10Kb minified & gzipped. A test coverage of ~100% and compliance with the Open Source Security Foundation best practices make winkNLP the ideal tool for building production-grade systems with confidence.

WinkNLP, with full TypeScript support, runs on Node.js, web browsers and Deno.

Build amazing apps quickly

  • Wikipedia article timeline
  • Context aware word cloud
  • Key sentences detection

Head to live examples to explore further.

Blazing fast

WinkNLP can easily process large amounts of raw text at speeds over 650,000 tokens/second on an M1 MacBook Pro in both browser and Node.js environments. It even runs smoothly on a low-end smartphone's browser.

Environment | Benchmarking Command
Node.js     | node benchmark/run
Browser     | How to measure winkNLP's speed on browsers?

Features

WinkNLP has a comprehensive natural language processing (NLP) pipeline covering tokenization, sentence boundary detection (sbd), negation handling, sentiment analysis, part-of-speech (pos) tagging, named entity recognition (ner) and custom entities recognition (cer). It offers a rich feature set:

🐎 Fast, lossless & multilingual tokenizer: For example, the multilingual text string "¡Hola! नमस्कार! Hi! Bonjour chéri" is tokenized as ["¡", "Hola", "!", "नमस्कार", "!", "Hi", "!", "Bonjour", "chéri"]. The tokenizer processes text at a speed close to 4 million tokens/second on an M1 MBP's browser.
✨ Developer friendly and intuitive API: With winkNLP, process any text using a simple, declarative syntax; most live examples have 30-40 lines of code.
🖼 Best-in-class text visualization: Programmatically mark tokens, sentences, entities, etc. using HTML mark or any other tag of your choice (see the sketch after this list).
♻️ Extensive text processing features: Remove and/or retain tokens with specific attributes such as part-of-speech, named entity type, token type, stop word, shape and many more; compute Flesch reading ease score; generate n-grams; normalize, lemmatise or stem. Check out how, with the right kind of text preprocessing, even a Naive Bayes classifier achieves impressive (≥90%) accuracy in sentiment analysis and chatbot intent classification tasks.
🔠 Pre-trained language models: Compact sizes starting from ~1MB (minified & gzipped) reduce model loading time drastically, down to ~1 second on a 4G network.
↗️ Word vectors: 100-dimensional English word embeddings for over 350K English words, optimized for winkNLP, allow easy computation of sentence or document embeddings.
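
For instance, marking detected entities for visualization looks like the minimal sketch below; it uses the entity-level markup() call and the its.markedUpText helper from the API reference, and the sample text is illustrative.

// Instantiate winkNLP as usual.
const winkNLP = require( 'wink-nlp' );
const model = require( 'wink-eng-lite-web-model' );
const nlp = winkNLP( model );
const its = nlp.its;

// Wrap every detected entity in an HTML <mark> tag (the default markers).
const doc = nlp.readDoc( 'See you at 5pm on 24th June.' );
doc.entities().each( ( entity ) => entity.markup() );
console.log( doc.out( its.markedUpText ) );
// -> the original text with entities wrapped in <mark>…</mark>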

Utilities & Tools 💼

Documentation

  • Concepts — everything you need to know to get started.
  • API Reference — explains usage of APIs with examples.
  • Change log — version history along with the details of breaking changes, if any.
  • Examples — live examples with code to give you a head start.

Installation

Use npm install:

npm install wink-nlp --save

In order to use winkNLP after installation, you also need to install a language model according to the Node.js version used. The table below outlines the version-specific installation command:

Node.js Version | Installation
16 or 18        | npm install wink-eng-lite-web-model --save
14 or 12        | node -e "require('wink-nlp/models/install')"

The wink-eng-lite-web-model is designed to work with Node.js version 16 or 18. It can also work on browsers as described in the next section. This is the recommended model.

The second command installs the wink-eng-lite-model, which works with Node.js version 14 or 12.

How to configure TypeScript project

Enable esModuleInterop and allowSyntheticDefaultImports in the tsconfig.json file:

"compilerOptions": {
    "esModuleInterop": true,
    "allowSyntheticDefaultImports": true,
    ...
}
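
With these options enabled, the package can be imported with default-style imports. A minimal sketch, whose syntax is the same in TypeScript and ESM JavaScript:

import winkNLP from 'wink-nlp';
import model from 'wink-eng-lite-web-model';

const nlp = winkNLP( model );
const doc = nlp.readDoc( 'Typed and ready!' );
console.log( doc.tokens().out() );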

How to install for Web Browser

If you’re using winkNLP in the browser, use the wink-eng-lite-web-model. Learn about its installation and usage in our guide to using winkNLP in the browser. Explore winkNLP recipes on Observable for live browser-based examples.

How to run on Deno

Follow the example on replit.

Get started

Here is the "Hello World!" of winkNLP:

// Load wink-nlp package.
const winkNLP = require( 'wink-nlp' );
// Load the English language model.
const model = require( 'wink-eng-lite-web-model' );
// Instantiate winkNLP.
const nlp = winkNLP( model );
// Obtain "its" helper to extract item properties.
const its = nlp.its;
// Obtain "as" reducer helper to reduce a collection.
const as = nlp.as;

// NLP Code.
const text = 'Hello   World🌎! How are you?';
const doc = nlp.readDoc( text );

console.log( doc.out() );
// -> Hello   World🌎! How are you?

console.log( doc.sentences().out() );
// -> [ 'Hello   World🌎!', 'How are you?' ]

console.log( doc.entities().out( its.detail ) );
// -> [ { value: '🌎', type: 'EMOJI' } ]

console.log( doc.tokens().out() );
// -> [ 'Hello', 'World', '🌎', '!', 'How', 'are', 'you', '?' ]

console.log( doc.tokens().out( its.type, as.freqTable ) );
// -> [ [ 'word', 5 ], [ 'punctuation', 2 ], [ 'emoji', 1 ] ]

Experiment with winkNLP on RunKit.

Speed & Accuracy

WinkNLP processes raw text at ~650,000 tokens per second with its wink-eng-lite-web-model, when benchmarked using "Ch 13 of Ulysses by James Joyce" on an M1 MacBook Pro with 16GB RAM. The processing included the entire NLP pipeline — tokenization, sentence boundary detection, negation handling, sentiment analysis, part-of-speech tagging, and named entity extraction. This is well ahead of prevailing speed benchmarks.

The benchmark was conducted on Node.js versions 16 and 18.

It POS tags a subset of the WSJ corpus with an accuracy of ~95% — this includes tokenization of raw text prior to POS tagging. The present state of the art is ~97% accuracy, but at lower speeds, and is generally computed using a gold-standard pre-tokenized corpus.

Its general purpose sentiment analysis delivers an f-score of ~84.5% when validated using the Amazon Product Review Sentiment Labelled Sentences Data Set at the UCI Machine Learning Repository. The current benchmark accuracy for specifically trained models is around 95%.

Memory Requirement

Wink NLP delivers this performance with minimal load on RAM. For example, it processes the entire History of India Volume I with a total peak memory requirement of under 80MB. The book has around 350 pages, which translates to over 125,000 tokens.

Need Help?

Usage query 👩🏽‍💻

Please ask at Stack Overflow or discuss at Wink JS GitHub Discussions or chat with us at Wink JS Gitter Lobby.

Bug report 🐛

If you spot a bug and the same has not yet been reported, raise a new issue or consider fixing it and sending a PR.

New feature 🌟

Looking for a new feature, request it via the new features & ideas discussion forum or consider becoming a contributor.

About winkJS

WinkJS is a family of open source packages for Natural Language Processing, Machine Learning, and Statistical Analysis in Node.js. The code is thoroughly documented for easy human comprehension and has a test coverage of ~100%, making it reliable for building production-grade solutions.

Wink NLP is copyright 2017-24 GRAYPE Systems Private Limited.

It is licensed under the terms of the MIT License.

changelog

Operational update

Version 2.3.2 November 30, 2024

⚙️ Updates

  • Updated test cases for new model release 🤓
  • Added a word vector example link in the README. ✅

Fixed some type definitions

Version 2.3.1 Nov 24, 2024

🐛 Fixes

  • Updated some BM25Vectorizer method types to match the implementation — thanks to @pavloDeshko ✅

Enabled more special space characters handling

Version 2.3.0 May 19, 2024

✨ Features

  • Detokenization now restores em/en, third/quarter, thin/hair and medium math space characters & narrow non-breaking space characters, besides the regular nbsp. 👏 🙌 🛰️

Improved error handling in contextual vectors

Version 2.2.2 May 08, 2024

✨ Features

  • .contextualVectors() now throws an error if (a) word vectors are not loaded, or (b) with lemma: true, "pos" is missing from the NLP pipe. 🤓

🐛 Fixes

  • Refined TypeScript definitions further. ✅

Added missing TypeScript definitions

Version 2.2.1 May 06, 2024

🐛 Fixes

  • Added missing TypeScript definitions for word embeddings, besides a few other TypeScript fixes. ✅

Added non-breaking space handling capabilities

Version 2.2.0 April 03, 2024

✨ Features

  • Detokenization restores both regular and non-breaking spaces to their original positions. 🤓

Introducing cosine similarity for word vectors

Version 2.1.0 March 24, 2024

✨ Features

  • You can now use similarity.vector.cosine( vectorA, vectorB ) to compute similarity between two vectors on a scale of 0 to 1. 🤓
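
A minimal sketch, assuming the similarity utility is required from wink-nlp/utilities/similarity.js as in the API documentation; the vectors here are plain illustrative arrays:

const similarity = require( 'wink-nlp/utilities/similarity.js' );

const vectorA = [ 0.12, 0.34, 0.56 ];
const vectorB = [ 0.11, 0.30, 0.60 ];
// Returns a similarity score on a scale of 0 to 1.
console.log( similarity.vector.cosine( vectorA, vectorB ) );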

Word embeddings have arrived!

Version 2.0.0 March 24, 2024

✨ Features

  • Seamless word embedding integration enhances winkNLP's semantic capabilities. 🎉 👏 🙌
  • Pre-trained 100-dimensional word embeddings for over 350,000 English words released: wink-embeddings-sg-100d. 💯
  • API remains unchanged — no code updates needed for existing projects. The new APIs include: 🤩
    • Obtain vector for a token: Use the .vectorOf( token ) API.
    • Compute sentence/document embeddings: Employ the as.vector helper: use .out( its.lemma, as.vector ) on tokens of a sentence or document. You can also use its.value or its.normal. Tokens can be pre-processed to remove stop words etc. using the .filter() API. Note that the as.vector helper uses an averaging technique (see the sketch after this list).
    • Generate contextual vectors: Leverage the .contextualVectors() method on a document. Useful for pure browser-side applications! Generate custom vectors contextually relevant to your corpus and use them in place of larger pre-trained wink embeddings.
  • Comprehensive documentation along with interesting examples is coming up shortly. Stay tuned for updates! 😎
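
A minimal sketch of these APIs, assuming word embeddings have already been loaded as described in the word-vectors documentation, and that nlp, its and as are initialized as in the Get started section above; the receiver of .vectorOf() and the exact output shapes are per the API reference.

// Vector of a single word (shown here on the nlp instance as an assumption).
const wordVector = nlp.vectorOf( 'magic' );

// Sentence/document embedding via averaging of token vectors.
const doc = nlp.readDoc( 'The quick brown fox jumps over the lazy dog.' );
const docVector = doc.tokens().out( its.lemma, as.vector );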

Added Deno example

Version 1.14.3 July 21, 2023

✨ Features

  • Added a live example for how to run winkNLP on Deno. 👍

Fixed a bug

Version 1.14.2 July 1, 2023

🐛 Fixes

Squashed a bug

Version 1.14.1 June 11, 2023

🐛 Fixes

Introducing helper for extracting important sentences from a document

Version 1.14.0 May 20, 2023

✨ Features

  • You can now use the its.sentenceWiseImportance helper to obtain the sentence-wise importance (on a scale of 0 to 1) of a document, if it is supported by the language model. 📚📊🤓
  • Check out the live example How to visualize key sentences in a document? 👀
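
A minimal sketch, assuming nlp and its are initialized as in the Get started section above and that the helper is used at the document level; someLongText is an illustrative placeholder:

const doc = nlp.readDoc( someLongText );
// Importance of each sentence on a 0–1 scale; exact output shape per the API reference.
console.log( doc.out( its.sentenceWiseImportance ) );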

Operational update

Version 1.13.1 March 27, 2023

⚙️ Updates

  • Some behind the scene model improvements. 😎 🤓
  • Add clarity on typescript configuration in README. ✅

Improving mark's functionality in custom entities

Version 1.13.0 December 09, 2022

✨ Features

  • Mark allows marking w.r.t. the last element of the pattern. For example, if a pattern matches "a fluffy cat", then mark: [-2, -1] will extract "fluffy cat" — especially useful when the match length is unknown. 💃
  • Improved error handling while processing mark's arguments. 🙌
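
A sketch along the lines of the "a fluffy cat" illustration above, assuming nlp is initialized as in the Get started section; the entity name and POS-based pattern are illustrative:

nlp.learnCustomEntities( [
  // Matches sequences like "a fluffy cat"; mark: [ -2, -1 ] keeps only the last two tokens.
  { name: 'adjNounPhrase', patterns: [ '[DET] [ADJ] [NOUN]' ], mark: [ -2, -1 ] }
] );
const doc = nlp.readDoc( 'I adopted a fluffy cat yesterday.' );
console.log( doc.customEntities().out() );
// -> [ 'fluffy cat' ] (assuming the pattern matches as described)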

Operational update

Version 1.12.3 November 18, 2022

⚙️ Updates

  • README is now more informative and links to examples and benchmarks 👍
  • Benchmarked on latest machine, browser versions 🖥

Ready for Node.js version 18

Version 1.12.2 October 13, 2022

🐛 Fixes

  • Fixed incorrect install command in README ✅

Ready for Node.js version 18

Version 1.12.1 October 13, 2022

⚙️ Updates

  • Ready for future — we have tested winkNLP on Node.js version 18 including its models. 🙌 🎉

Some enhancements plus earned OpenSSF best practices passing badge

Version 1.12.0 May 13, 2022

✨ Features

Enhancing custom entities & BM25Vectorizer

Version 1.11.0 January 30, 2022

✨ Features

  • Obtain the bag-of-words for a tokenized text from BM25Vectorizer using the .bowOf() API — useful for bow-based similarity computation (see the sketch after this list). 👍
  • learnCustomEntities() displays a console warning if a complex shorthand pattern is likely to cause a learning/execution slowdown. 🤞❗️
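
A sketch of .bowOf(), assuming BM25Vectorizer is required from wink-nlp/utilities/bm25-vectorizer.js and that learn() accepts normalized tokens, as in the utilities documentation; nlp and its are initialized as in the Get started section:

const BM25Vectorizer = require( 'wink-nlp/utilities/bm25-vectorizer.js' );
const bm25 = BM25Vectorizer();
// Learn from a tiny, illustrative corpus of normalized tokens.
bm25.learn( nlp.readDoc( 'The cat sat on the mat.' ).tokens().out( its.normal ) );
// Bag-of-words of a tokenized text, usable for bow-based similarity.
console.log( bm25.bowOf( nlp.readDoc( 'a cat on a mat' ).tokens().out( its.normal ) ) );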

Enabling loading of BM25Vectorizer model

Version 1.10.0 November 18, 2021

✨ Features

  • Easily load a BM25Vectorizer's model using the newly introduced .loadModel() API. 🎉

Enhancing TypeScript support

Version 1.9.0 November 06, 2021

✨ Features

  • We have enhanced TypeScript support to allow easy addition of new TypeScript-enabled language models. 👏

⚙️ Updates

  • Added naive wikification showcase in README. 😎

Operational update

Version 1.8.1 September 22, 2021

⚙️ Updates

  • Included NLP Pipe details in the README file. 🤓

Introducing TypeScript support

Version 1.8.0 July 31, 2021

✨ Features

  • We have added support for TypeScript. 🙌🎉

Operational update

Version 1.7.2 July 15, 2021

⚙️ Updates

  • Some behind the scene updates & fixes. 😎🤓

Operational update

Version 1.7.1 July 09, 2021

⚙️ Updates

  • Improved documentation. 📚🤓

Adding more similarity methods & an as helper

Version 1.7.0 July 01, 2021

✨ Features

  • Supported similarity methods now include cosine for bag-of-words, and Tversky & Otsuka-Ochiai (oo) for sets. 🙌
  • Obtain a JS Set via the as.set helper. 😇
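
A sketch using the same assumed similarity utility path as in the earlier word-vector example, with nlp, its and as initialized as in the Get started section; the texts are illustrative:

const similarity = require( 'wink-nlp/utilities/similarity.js' );
const setA = nlp.readDoc( 'the cat sat on the mat' ).tokens().out( its.value, as.set );
const setB = nlp.readDoc( 'the cat ran on the mat' ).tokens().out( its.value, as.set );
console.log( similarity.set.tversky( setA, setB ) );
console.log( similarity.set.oo( setA, setB ) );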

Enabling configurable annotation pipeline

Version 1.6.0 June 27, 2021

✨ Features

  • No need to run the entire annotation pipeline: now you can select only what you want, or even run just tokenization by specifying an empty pipe. 🤩🎉
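
For example, a minimal sketch of selecting annotations, where the second argument to winkNLP() is the pipe:

// Run only sentence boundary detection and pos tagging.
const nlpLite = winkNLP( model, [ 'sbd', 'pos' ] );
// Run tokenization only, via an empty pipe.
const nlpTokensOnly = winkNLP( model, [] );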

Operational update

Version 1.5.0 June 22, 2021

⚙️ Updates

  • Exposed its and as helpers via the instance of winkNLP as well. 🤓

Introducing cosine similarity & readability stats helper

Version 1.4.0 June 15, 2021

✨ Features

  • Cosine similarity is now available on bag-of-words. 🛍🔡🎉
  • You can now use the its.readabilityStats helper to obtain a document's readability statistics, if it is supported by the language model. 📚📊🤓
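
A minimal sketch of the readability helper, used at the document level, with nlp and its initialized as in the Get started section:

const doc = nlp.readDoc( 'Short sentences read easily. Convoluted ones, with many clauses, do not.' );
// Returns readability statistics, including the Flesch reading ease score.
console.log( doc.out( its.readabilityStats ) );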

Adding long pending lemmatizer support

Version 1.3.0 May 22, 2021

✨ Features

  • Now use the its.lemma helper to obtain the lemma of words. 👏 🎉
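
For example, with nlp and its initialized as in the Get started section:

const doc = nlp.readDoc( 'The mice were running.' );
// Lemmas of each token, e.g. 'mouse' for 'mice' and 'run' for 'running'.
console.log( doc.tokens().out( its.lemma ) );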

Introducing support for browser ready language model

Version 1.2.0 December 24, 2020

✨ Features

  • We have added support for a browser-ready language model. 🤩 🎉
  • Now easily vectorize text using the BM25-based vectorizer. 🤓 👏

⚙️ Updates

  • Examples in the README now run on RunKit using the web model! ✅

Enabling add-ons to support new language model

Version 1.1.0 September 18, 2020

✨ Features

  • We have enabled add-ons to support enhanced language models, paving the way for new its helpers. 🎉
  • Now use the its.stem helper to obtain stems of words using the Porter Stemmer Algorithm V2. 👏
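
For example, with nlp and its initialized as in the Get started section and a language model that supports the stem add-on:

const doc = nlp.readDoc( 'Winning matters.' );
// Porter2 stems, e.g. 'win' for 'Winning'.
console.log( doc.tokens().out( its.stem ) );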

Operational update

Version 1.0.1 August 24, 2020

⚙️ Updates

  • Also benchmarked on Node.js v12 & v14, and updated the speed to the minimum observed. 🏃‍♀️

Announcing the stable version 1.0.0

Version 1.0.0 August 21, 2020

⚙️ Updates

  • Happy to release version 1.0.0 for you! 💫👏
  • You can optionally include custom entity detection while running speed benchmark. 😇

Operational update

Version 0.4.0 August 9, 2020

⚙️ Updates

  • Getting ready to move to version 1.0.0 — almost there! 💫

Operational updates

Version 0.3.1 August 3, 2020

⚙️ Updates

  • Some behind the scene updates to test cases. 😎
  • Updated the version of English light language model to the latest — 0.3.0. 🙌

Simplified language model installation

Version 0.3.0 July 29, 2020

✨ Features

  • No need to remember or copy/paste a long GitHub URL for language model installation. The new script installs the latest version for you automatically. 🎉

Improved custom entities

Version 0.2.0 July 21, 2020

✨ Features

  • We have added the .parentCustomEntity() API to the .tokens() API. 👏

🐛 Fixes

  • Accessing custom entities was failing whenever there were no custom entities. Now things are as they should be — it tells you that there are none! ✅

Improved interface with language model

Version 0.1.0 June 24, 2020

✨ Features

  • We have improved the interface with the language model — it now supports the new format. 👍