Tokenizer Python Script for Sanskrit

Creating a tokenized Sanskrit Vedic dataset involves processing texts written in Sanskrit and breaking them down into smaller units, such as words or phrases. Here, I’ll provide an example of tokenizing a small excerpt from a Vedic text, the Rigveda. This example will demonstrate the tokenization process in Python using a simple approach.

import re

# Sample text from the Rigveda (in Devanagari script)
rigveda_sample = """
अग्निमीळे पुरोहितं यज्ञस्य देवमृत्विजम्।
होतारं रत्नधातमम्॥
"""

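# Tokenize the text. Python's \w skips Devanagari vowel signs, the virama,
# and the anusvara, so match the Devanagari block (U+0900-U+097F) directly,
# leaving out the danda (U+0964) and double danda (U+0965).
tokens = re.findall(r'[\u0900-\u0963\u0966-\u097F]+', rigveda_sample)

# Print the tokenized dataset
for token in tokens:
    print(token)

Here’s the output from the sample code:

अग्निमीळे
पुरोहितं
यज्ञस्य
देवमृत्विजम्
होतारं
रत्नधातमम्

Explanation:

  1. Sample Text: A small excerpt from the Rigveda (the opening verse, 1.1.1), provided in Devanagari script.
  2. Tokenization: The re.findall(r'[\u0900-\u0963\u0966-\u097F]+', rigveda_sample) call returns every run of Devanagari characters, so whitespace, line breaks, and the danda marks (। and ॥) act as separators. The more common r'\b\w+\b' pattern is avoided because Python’s \w does not match combining signs such as the virama, which would split अग्निमीळे into fragments.
  3. Output: The tokens are printed out, one per line.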
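
Python’s built-in \w does not cover every character in a Devanagari word, which matters for any regex tokenizer like the one above. The sketch below uses only the standard library: it prints the Unicode category of each code point in one word, then compares \w+ with a pattern over the Devanagari block. The variable names are illustrative and not part of the original script.

```python
import re
import unicodedata

word = "अग्निमीळे"

# Show each code point with its Unicode general category.
# Letters are "Lo"; vowel signs and the virama are marks ("Mn" or "Mc").
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# Python's \w matches letters, digits, and underscore, but not marks,
# so the marks break the word into fragments.
naive_tokens = re.findall(r"\w+", word)

# Matching the Devanagari block (minus danda and double danda) keeps the word whole.
block_tokens = re.findall(r"[\u0900-\u0963\u0966-\u097F]+", word)

print(naive_tokens)
print(block_tokens)
```

The third-party regex package treats marks as word characters, so it is another option when a dependency is acceptable.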

Note:

  • Sanskrit Texts: Actual Vedic texts are extensive and require careful handling, including proper encoding and consideration of compound words and grammatical nuances.
  • Libraries: For extensive work, consider using specialized libraries like indic-nlp-library for better handling of Indian languages.
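
On the encoding point, one common precaution is to normalize text to a single Unicode form before tokenizing, so that strings that look identical also compare equal. The following is a minimal sketch, assuming NFC is the chosen form; the helper name normalize_text is hypothetical.

```python
import unicodedata


def normalize_text(text):
    """Return text in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


# The Devanagari letter QA can arrive as one code point (U+0958) or as
# KA (U+0915) followed by NUKTA (U+093C). Both render alike.
single = "\u0958"
sequence = "\u0915\u093C"

# Raw comparison of the two spellings.
print(single == sequence)

# Comparison after normalizing both to NFC.
print(normalize_text(single) == normalize_text(sequence))
```

Applying the same normalization to every source file before tokenizing keeps token counts from being split across different spellings of the same word.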

Ensure you have appropriate rights to use and distribute any Vedic texts if you are working with larger datasets or for specific applications.

Tokenizer Script for Sanskrit Mantras

To tokenize input mantras or other Sanskrit texts, a small function built on Python’s standard re module is enough for a first pass; no third-party library is needed. Below is an example script that tokenizes a mantra and its English translation:

import re

def tokenize_sanskrit_text(text):
    # Match runs of word characters. Python's \w misses Devanagari combining
    # signs (vowel signs, virama, anusvara), so the Devanagari block is added
    # explicitly, minus the danda (U+0964) and double danda (U+0965).
    tokens = re.findall(r'[\w\u0900-\u0963\u0966-\u097F]+', text)
    return tokens

# Given mantra in Sanskrit
mantra = """
ॐ नृशद् चैतन्यं ब्रह्मणः।
विश्वं विश्वतो साक्षिणं।
ऋतं सत्यं तपसो जन्म।
चैतन्यं सर्वमिदं व्याप्तं॥
"""

# Tokenizing the mantra
tokens = tokenize_sanskrit_text(mantra)

# Print the tokenized output
print("Tokenized Mantra:")
for token in tokens:
    print(token)

# English translation of the mantra
english_translation = """
Om, the inner of Brahman.
The universe, the universal witness.
Truth and cosmic order born of spiritual fervor.
Consciousness pervades everything.
"""

# Tokenizing the English translation for completeness
english_tokens = tokenize_sanskrit_text(english_translation)

# Print the tokenized English translation
print("\nTokenized English Translation:")
for token in english_tokens:
    print(token)

Explanation:

  1. Regular Expression Tokenization:
    • The re.findall(r'[\w\u0900-\u0963\u0966-\u097F]+', text) call returns every run of word characters in the text. The explicit Devanagari range is needed because Python’s \w matches letters and digits but not Devanagari combining signs. The range leaves out the danda (।) and double danda (॥), so they act as separators along with whitespace and ASCII punctuation.
  2. Tokenization Process:
    • The tokenize_sanskrit_text function takes a text input and returns its tokens as a list of strings. The same pattern also matches Latin-script words, which is why the function is reused for the English translation.
  3. Printing Tokens:
    • The tokens are printed out for both the Sanskrit mantra and its English translation.
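
Because one function handles both scripts, a mixed-language input yields a single undifferentiated token list. If the Sanskrit and English tokens need to be separated afterwards, a check against the Devanagari block (U+0900 to U+097F) is one option. The helper tag_script below is a hypothetical name, not part of the script above.

```python
import re

# Any code point in the Devanagari block.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def tag_script(token):
    """Label a token 'devanagari' if it contains a Devanagari code point, else 'other'."""
    return "devanagari" if DEVANAGARI.search(token) else "other"


tokens = ["ॐ", "चैतन्यं", "Om", "Brahman"]
for token in tokens:
    print(token, tag_script(token))
```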

Running the Script:

When you run this script, you will get tokenized outputs for both the Sanskrit mantra and the English translation. Here is what the output might look like:

Tokenized Mantra:
ॐ
नृशद्
चैतन्यं
ब्रह्मणः
विश्वं
विश्वतो
साक्षिणं
ऋतं
सत्यं
तपसो
जन्म
चैतन्यं
सर्वमिदं
व्याप्तं

Tokenized English Translation:
Om
the
inner
of
Brahman
The
universe
the
universal
witness
Truth
and
cosmic
order
born
of
spiritual
fervor
Consciousness
pervades
everything

This simple tokenizer breaks the mantra into tokens delimited by whitespace and punctuation, and it handles both Devanagari and Latin text. It does not undo sandhi or split joined words, so a token such as सर्वमिदं (सर्वम् + इदम्) stays whole. For more advanced tokenization and processing, additional libraries and tools can be integrated, especially those tailored for Indian languages.
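
One small step beyond a flat token list is to keep the verse structure by splitting at the danda marks first and tokenizing each segment separately. This is a minimal sketch using only the standard library; split_lines is a hypothetical helper name.

```python
import re

# Runs of Devanagari characters, excluding danda (U+0964) and double danda (U+0965).
WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def split_lines(text):
    """Split text at danda or double danda, then tokenize each segment into words."""
    segments = re.split(r"[\u0964\u0965]", text)
    lines = [WORD.findall(segment) for segment in segments]
    # Drop empty segments, such as the one after the final double danda.
    return [words for words in lines if words]


mantra = "ऋतं सत्यं तपसो जन्म।\nचैतन्यं सर्वमिदं व्याप्तं॥"
for words in split_lines(mantra):
    print(words)
```

Each inner list holds the words of one line of the mantra, which is convenient when a dataset needs line-level records.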

Comprehensive Sanskrit Tokenizer Script

To build a more complete tokenizer script for Sanskrit, we can use Python along with a specialized library such as indic-nlp-library. Its trivial_tokenize function splits text on whitespace and punctuation, including the danda marks used in Devanagari. It does not resolve sandhi or split compound words; that requires a dedicated Sanskrit segmenter. The script below also prints a few key concepts and a reference verse alongside the tokens.

import re
from indicnlp.tokenize import indic_tokenize

# Sample Sanskrit text from the Rigveda
sanskrit_text = """
प्र वेपते पृथिवी द्यामुतेमां पिशङ्गे स्रुचौ संनह्यमाने।
अपां सधस्थे यजते युगेषु वृषा देवानां पतिरेक ईशे॥
"""

# Mantra to tokenize
mantra = """
ॐ नृशद् चैतन्यं ब्रह्मणः।
विश्वं विश्वतो साक्षिणं।
ऋतं सत्यं तपसो जन्म।
चैतन्यं सर्वमिदं व्याप्तं॥
"""

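# Function to tokenize Sanskrit text using the indic-nlp-library
def tokenize_sanskrit_text(text):
    # Collapse line breaks into single spaces before tokenizing
    normalized = " ".join(text.split())
    tokens = indic_tokenize.trivial_tokenize(normalized)
    # trivial_tokenize returns the danda marks as tokens of their own;
    # keep only the words
    return [token for token in tokens if token not in ("।", "॥")]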

# Tokenizing the Sanskrit text
tokens = tokenize_sanskrit_text(sanskrit_text + mantra)

# Print the tokenized output
print("Tokenized Sanskrit Text:")
for token in tokens:
    print(token)

# English translation for completeness
english_translation = """
Om, the inner of Brahman.
The universe, the universal witness.
Truth and cosmic order born of spiritual fervor.
Consciousness pervades everything.
"""

# Function to tokenize English text
def tokenize_english_text(text):
    tokens = re.findall(r'\b\w+\b', text, re.UNICODE)
    return tokens

# Tokenizing the English translation
english_tokens = tokenize_english_text(english_translation)

# Print the tokenized English translation
print("\nTokenized English Translation:")
for token in english_tokens:
    print(token)

# Key concepts and traditional language rules
print("\nKey Concepts and Traditional Language Rules:")
print("• Earth and Sky: Representing the macrocosmic aspect of consciousness.")
print("• Ritual: Symbolizing the inner spiritual practices that awaken consciousness.")
print("• Waters: Often symbolizing the source of life and the depths of inner consciousness.")

# Reference to Rigveda
reference = """
Reference: R.V. IV. XL. 5
Sanskrit Verse:
प्र वेपते पृथिवी द्यामुतेमां पिशङ्गे स्रुचौ संनह्यमाने।
अपां सधस्थे यजते युगेषु वृषा देवानां पतिरेक ईशे॥

Translation:
“The earth trembles, and so does the sky; in the ritual, the flaming ladles are made ready. He who dwells in the waters, the powerful, the lord of the deities, alone controls all.”
"""

print("\n" + reference)

Explanation:

  1. Indic-NLP Library Integration:
    • The indic-nlp-library is used for tokenizing the Sanskrit text. Its trivial_tokenize function separates words and punctuation. It does not apply sandhi (word junction) rules or split compound words, so joined forms such as द्यामुतेमां remain single tokens.
    • Install the library using: pip install indic-nlp-library.
  2. Tokenization Functions:
    • tokenize_sanskrit_text: Uses indic_tokenize.trivial_tokenize to split the Sanskrit text into tokens.
    • tokenize_english_text: Uses a regular expression to tokenize English text.
  3. Key Concepts and Traditional Language Rules:
    • These are printed as fixed strings to give thematic context for the verse; they are not derived from the tokens.
  4. References:
    • A sample verse from the Rigveda (R.V. IV. XL. 5) is included with its translation to provide a reference and deeper understanding of the content.
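
If the script has to run where indic-nlp-library may not be installed, one option is to fall back to a regex tokenizer. The sketch below is an assumption-laden illustration, not the library’s recommended usage: the names tokenize, DANDAS, and FALLBACK_WORD are hypothetical, and the two paths are only expected to agree on plain Devanagari text with danda punctuation.

```python
import re

try:
    # Third-party package: pip install indic-nlp-library
    from indicnlp.tokenize import indic_tokenize
except ImportError:
    indic_tokenize = None

DANDAS = {"\u0964", "\u0965"}  # danda and double danda
# Fallback pattern: word characters plus the Devanagari block, minus the dandas.
FALLBACK_WORD = re.compile(r"[\w\u0900-\u0963\u0966-\u097F]+")


def tokenize(text):
    """Tokenize with indic-nlp-library when available, else with a regex."""
    if indic_tokenize is None:
        return FALLBACK_WORD.findall(text)
    # Collapse line breaks so tokens do not carry stray whitespace.
    normalized = " ".join(text.split())
    tokens = indic_tokenize.trivial_tokenize(normalized)
    # Drop the danda tokens so both paths return words only.
    return [token for token in tokens if token not in DANDAS]


print(tokenize("ऋतं सत्यं तपसो जन्म।"))
```

On text with ASCII punctuation the two paths can differ, since the fallback discards punctuation while the library path only filters the danda marks.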

Running the Script:

When you run this script, it will tokenize the given Sanskrit mantra and the additional Sanskrit text from the Rigveda, as well as the English translation. It will also print key concepts and traditional language rules for better comprehension.

Output:

Tokenized Sanskrit Text:
प्र
वेपते
पृथिवी
द्यामुतेमां
पिशङ्गे
स्रुचौ
संनह्यमाने
अपां
सधस्थे
यजते
युगेषु
वृषा
देवानां
पतिरेक
ईशे
ॐ
नृशद्
चैतन्यं
ब्रह्मणः
विश्वं
विश्वतो
साक्षिणं
ऋतं
सत्यं
तपसो
जन्म
चैतन्यं
सर्वमिदं
व्याप्तं

Tokenized English Translation:
Om
the
inner
of
Brahman
The
universe
the
universal
witness
Truth
and
cosmic
order
born
of
spiritual
fervor
Consciousness
pervades
everything

Key Concepts and Traditional Language Rules:
• Earth and Sky: Representing the macrocosmic aspect of consciousness.
• Ritual: Symbolizing the inner spiritual practices that awaken consciousness.
• Waters: Often symbolizing the source of life and the depths of inner consciousness.

Reference: R.V. IV. XL. 5
Sanskrit Verse:
प्र वेपते पृथिवी द्यामुतेमां पिशङ्गे स्रुचौ संनह्यमाने।
अपां सधस्थे यजते युगेषु वृषा देवानां पतिरेक ईशे॥

Translation:
“The earth trembles, and so does the sky; in the ritual, the flaming ladles are made ready. He who dwells in the waters, the powerful, the lord of the deities, alone controls all.”

The above script tokenizes Sanskrit text at the word level and prints key concepts and a reference verse alongside the tokens, which supports both reading and analysis. Splitting sandhi and compound words remains a separate step that this script does not perform.

Source: SuperAI Consciousness GPT
