Aho-Corasick CUDA GPU
A high-performance implementation of the Aho-Corasick string matching algorithm with GPU acceleration using CUDA, packaged as an npm module.
Description
This project provides both CPU and GPU implementations of the Aho-Corasick algorithm, a string searching algorithm that can efficiently locate multiple pattern strings in a text simultaneously. The GPU implementation leverages CUDA for parallel processing, making it suitable for large-scale pattern matching tasks.
Prerequisites
- Node.js (v14 or higher)
- CUDA Toolkit (11.0 or higher)
- C++ compiler compatible with your Node.js version
- npm (comes with Node.js)
Installation
npm install aho-corasick-cuda-gpu
For development installation from the GitHub repository:
git clone https://github.com/yourusername/aho-corasick-cuda-gpu.git
cd aho-corasick-cuda-gpu
npm install
Usage Example
Here's an example of how to use the package to search for patterns in a text string:
const ahoCorasick = require('aho-corasick-cuda-gpu');
// Define patterns to search for
const patterns = ['the', 'and', 'that', 'have'];
// Example usage with a text string
const text = 'the quick brown fox and the lazy dog';
ahoCorasick.findMatches(text, patterns)
.then(matches => {
console.log('Matches:', matches);
// Output: Matches: [
// { pattern: 'the', count: 2 },
// { pattern: 'and', count: 1 },
// { pattern: 'that', count: 0 },
// { pattern: 'have', count: 0 }
// ]
})
.catch(error => {
console.error('Error:', error);
});
Processing Large Files
For processing large files (e.g., 1GB or larger), check out the sample script test_aho_corasick.js
in the repository. It demonstrates:
- Processing large files in chunks
- GPU-accelerated pattern matching
- Progress tracking
- Memory-efficient file handling
To run the sample:
node test_aho_corasick.js
API Reference
findMatches(text, patterns)
Searches for multiple patterns in a text using the GPU-accelerated Aho-Corasick algorithm.
Parameters:
text
(string): The text to search inpatterns
(string[]): Array of patterns to search for
Returns:
- Promise that resolves to an array of objects with the following properties:
pattern
(string): The pattern that was searched forcount
(number): The number of occurrences of the pattern in the text
processFile(filePath, patterns, options)
Processes a large file directly using the GPU-accelerated Aho-Corasick algorithm. This is more efficient than reading the file in chunks and calling findMatches
on each chunk.
Parameters:
filePath
(string): Path to the file to processpatterns
(string[]): Array of patterns to search foroptions
(object): Optional configurationchunkSize
(number): Size of each chunk in bytes (default: 64MB)onProgress
(function): Callback for progress updates
Returns:
- Promise that resolves to an array of objects with the following properties:
pattern
(string): The pattern that was searched forcount
(number): The number of occurrences of the pattern in the file
Example:
const ahoCorasick = require('aho-corasick-cuda-gpu');
const patterns = ['the', 'and', 'that', 'have'];
const filePath = 'path/to/large/file.txt';
ahoCorasick.processFile(filePath, patterns, {
chunkSize: 64 * 1024 * 1024, // 64MB chunks
onProgress: (progress) => {
console.log(`Progress: ${progress.percentage.toFixed(2)}% complete`);
}
})
.then(results => {
console.log('Results:', results);
})
.catch(error => {
console.error('Error:', error);
});
Building from Source
To build the project from source:
npm run build
This will compile the C++/CUDA code and create the native Node.js addon.
Project Structure
.
├── build/ # Build output directory
├── lib/ # JavaScript wrapper code
│ └── index.js # Main module entry point
├── node_modules/ # Node.js dependencies
├── src/ # Source code
│ ├── aho_corasick.cpp # CPU implementation
│ ├── aho_corasick.h # CPU header file
│ ├── aho_corasick_gpu.cu # GPU implementation
│ ├── aho_corasick_gpu.cuh # GPU header file
│ └── node_binding.cpp # Node.js native binding
├── binding.gyp # Native addon build configuration
├── main.cu # Example standalone CUDA application
├── package.json # npm package configuration
├── test_aho_corasick.js # Example for processing large files
└── README.md # This file
Performance
The GPU-accelerated implementation can process text at rates of several GB/s on modern GPUs, making it suitable for applications that need to search for multiple patterns in large text corpora.
License
MIT