In the previous tutorial we created a data frame called q that holds the Arabic Quran:
str(q)
## 'data.frame': 6236 obs. of 3 variables:
## $ sura: int 1 1 1 1 1 1 1 2 2 2 ...
## $ aya : int 1 2 3 4 5 6 7 1 2 3 ...
## $ text: chr "بسم الله الرحمن الرحيم" "الحمد لله رب العالمين" "الرحمن الرحيم" "مالك يوم الدين" ...
I wanted to experiment a bit with the tm package. Please install and load it.
library(tm)
## Loading required package: NLP
The first step is to create a corpus from the raw Arabic verses, using VectorSource:
qCorpus = Corpus(VectorSource(q$text))
Let's inspect the content of this corpus:
inspect(qCorpus[1:5])
## <<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## بسم الله الرحمن الرحيم
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## الحمد لله رب العالمين
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## الرحمن الرحيم
##
## [[4]]
## <<PlainTextDocument (metadata: 7)>>
## مالك يوم الدين
##
## [[5]]
## <<PlainTextDocument (metadata: 7)>>
## إياك نعبد وإياك نستعين
We will do some more annotation work using meta later. For now, let us create a document-term matrix:
qTerms = DocumentTermMatrix(qCorpus)
qTerms
## <<DocumentTermMatrix (documents: 6236, terms: 14766)>>
## Non-/sparse entries: 63181/92017595
## Sparsity : 100%
## Maximal term length: 11
## Weighting : term frequency (tf)
This produces a large matrix of documents (i.e., verses) against Quranic terms. For example, let us look at a portion of this matrix: documents 1 to 7 (i.e., sura Fateha) and terms 1000 to 1005:
inspect(qTerms[1:7,1000:1005])
## <<DocumentTermMatrix (documents: 7, terms: 6)>>
## Non-/sparse entries: 0/42
## Sparsity : 100%
## Maximal term length: 6
## Weighting : term frequency (tf)
##
## Terms
## Docs أعيدوا أعيذها أعين أعينكم أعينهم أعينهن
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
This tells us that none of these six terms appears in any of the first 7 documents. Sparsity is a known issue in document-term matrices.
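To see where the "Sparsity : 100%" figure comes from, the arithmetic can be reproduced from the numbers tm printed for the full matrix. This is only a sketch in base R; the variable names are mine, and the values are copied from the output above.

```r
# Dimensions and non-zero count as reported by tm for qTerms
n_docs     <- 6236
n_terms    <- 14766
non_sparse <- 63181

# Every document/term pair is one cell of the matrix
total_cells <- n_docs * n_terms
sparse      <- total_cells - non_sparse

# Share of cells that are zero
sparsity <- sparse / total_cells
round(100 * sparsity)
```

The share of empty cells is so close to 1 that it rounds to 100%, even though more than sixty thousand cells are non-zero.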
Let us find some common terms in the Quran. Which terms are used 100 or more times?
findFreqTerms(qTerms,100)
## [1] "إذا" "إلا" "الأرض" "الحق" "الدنيا" "الذي"
## [7] "الذين" "السماء" "السماوات" "الكتاب" "الله" "النار"
## [13] "الناس" "إلى" "آمنوا" "إنا" "إنما" "إنه"
## [19] "إني" "أولئك" "أيها" "بالله" "بعد" "بما"
## [25] "حتى" "خير" "ذلك" "ربك" "ربكم" "ربنا"
## [31] "ربهم" "شيء" "عذاب" "على" "عليكم" "عليم"
## [37] "عليه" "عليهم" "عند" "فإن" "فلا" "فلما"
## [43] "فيه" "فيها" "قال" "قالوا" "قبل" "قوم"
## [49] "كان" "كانوا" "كفروا" "كنتم" "لكم" "لله"
## [55] "لهم" "مما" "منكم" "منهم" "موسى" "هذا"
## [61] "وإذا" "والأرض" "والذين" "والله" "وإن" "ولا"
## [67] "ولقد" "ولكن" "ولو" "وما" "ومن" "وهم"
## [73] "وهو" "يشاء" "يوم"
It is interesting to see the prophet Musa (Moses), “موسى”, in the list.
Note that since we did not do any stemming, root words are repeated with various affixes as different words.
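To make the thresholding behind findFreqTerms more concrete, here is a small base R sketch on a toy matrix. The matrix and its term names (term_a, term_b, term_c) are hypothetical placeholders, and the sketch only mimics the idea of keeping terms whose total count reaches a lower bound; it is not tm's actual implementation.

```r
# Toy document-term matrix: rows are documents, columns are terms.
# Term names are hypothetical placeholders, not actual Quranic terms.
toy_dtm <- matrix(
  c(2, 0, 1,
    1, 0, 1,
    3, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs  = c("1", "2", "3"),
                  Terms = c("term_a", "term_b", "term_c"))
)

# Total count of each term across all documents
term_totals <- colSums(toy_dtm)

# Keep the terms whose total count is at least 2
frequent_terms <- names(term_totals)[term_totals >= 2]
frequent_terms
```

On the real data the lower bound was 100 instead of 2.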
We can even create a list of the most frequent terms and store it in a data frame:
freq = sort(colSums(as.matrix(qTerms)), decreasing = TRUE)
head(freq, 10)
## الله الذين على إلا ولا وما قال إلى لهم ومن
## 2153 810 670 664 658 646 416 405 373 342
wf = data.frame(word=names(freq), freq=freq)
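One thing to keep in mind is that as.matrix(qTerms) expands the sparse matrix into all of its roughly 92 million cells before summing. A document-term matrix is stored as triplets (row index i, column index j, value v) holding only the non-zero cells, so the same totals can be computed from those alone. The sketch below shows the idea in base R on made-up triplets; the data and names are hypothetical. On the real object, the slam package that tm builds on offers col_sums for this purpose.

```r
# Made-up triplets: document index, term index, and count of each non-zero cell
i <- c(1, 1, 2, 3, 3)
j <- c(1, 2, 1, 3, 1)
v <- c(3, 1, 1, 4, 1)
terms <- c("term_a", "term_b", "term_c")  # hypothetical term names

# Sum the counts per term index without ever building the dense matrix
totals <- as.vector(tapply(v, factor(j, levels = seq_along(terms)), sum))
names(totals) <- terms

# Most frequent terms first, as in freq above
freq_toy <- sort(totals, decreasing = TRUE)
freq_toy
```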
Why not plot them using the ggplot2 package?
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
# take the most frequent terms into a separate data frame
wfplot = subset(wf, freq > 300)
ggplot(wfplot, aes(word, freq)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=45, hjust=1))
No wonder, Allah (الله) is the most frequent word. May He be exalted!
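By default ggplot2 places the bars in alphabetical order of the words, not in order of frequency. If you prefer the bars sorted by frequency, one option is to turn the word column into a factor whose levels follow the counts. The sketch below shows that step in base R on a hypothetical three-row data frame (wf_toy and its values are made up for illustration).

```r
# Hypothetical stand-in for wfplot
wf_toy <- data.frame(
  word = c("w1", "w2", "w3"),
  freq = c(5, 20, 10),
  stringsAsFactors = FALSE
)

# Order the factor levels by decreasing frequency;
# ggplot2 draws discrete axes in level order
wf_toy$word <- factor(
  wf_toy$word,
  levels = wf_toy$word[order(wf_toy$freq, decreasing = TRUE)]
)
levels(wf_toy$word)
```

Applying the same idea to wfplot before calling ggplot should give bars that run from the most to the least frequent word.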
Now let us do some more cool visualization with a word cloud using the package wordcloud. Please review the package and adjust the various parameters to choose the right scale, color palette, and percentage of words to rotate.
library(wordcloud)
## Loading required package: RColorBrewer
#I will set a seed so you can reproduce this result
set.seed(114)
wordcloud(names(freq), freq, min.freq=50, scale=c(5,.5),colors=brewer.pal(6,"Dark2"), rot.per=0.2)
Always the beautiful name ALLAH (الله) pops up in your face! Exalted be He.
We would like to know more about word length in the Quran.
First let's get all words in a matrix words and the word lengths in a data frame wLen:
words = as.matrix(colnames(qTerms))
wLen = data.frame(nletters=nchar(words))
Let us produce some visualization out of this.
ggplot(wLen, aes(x=nletters))+
geom_histogram(binwidth=1) +
geom_vline(xintercept=mean(nchar(words)),
colour="green", size=1, alpha=.5)+
labs(x="Number of Letters", y="Number of Words")
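This shows that the average word length is close to 5 letters, counting each distinct term once. Remember that we are not talking about root words here, but raw words with all their prefixes and suffixes. Note also that tm's default settings drop terms shorter than three characters, so one- and two-letter words are not counted.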
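The mean above is taken over the distinct terms, so a word used two thousand times counts the same as a word used once. If you want the average length of words as they occur in the text, each length has to be weighted by its frequency. The sketch below illustrates the difference in base R with made-up words and counts; on the real data one could presumably weight nchar(names(freq)) by freq in the same way.

```r
# Made-up terms and how often each one occurs
toy_words <- c("aa", "bbbb", "cccccc")
toy_freq  <- c(10, 3, 1)

# Average over distinct terms (what the histogram line shows)
type_mean <- mean(nchar(toy_words))

# Average over occurrences: frequent short words pull the mean down
token_mean <- weighted.mean(nchar(toy_words), toy_freq)

c(type_mean = type_mean, token_mean = token_mean)
```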
Since we have come this far, let us conclude by analyzing the frequency of letters. First, a number of packages need to be installed and loaded.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
##
## Attaching package: 'qdapRegex'
##
## The following objects are masked from 'package:dplyr':
##
## escape, explain
##
## The following object is masked from 'package:ggplot2':
##
## %+%
##
## Loading required package: qdapTools
##
## Attaching package: 'qdapTools'
##
## The following object is masked from 'package:dplyr':
##
## id
##
## WARNING: Rtools is required to build R packages, but is not currently installed.
##
## Please download and install Rtools 3.1 from http://cran.r-project.org/bin/windows/Rtools/ and then run find_rtools().
##
## Attaching package: 'qdap'
##
## The following object is masked from 'package:dplyr':
##
## %>%
##
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
##
## The following object is masked from 'package:base':
##
## Filter
letter = str_split(words,"")
letter = unlist(letter)
# drop empty strings: older stringr versions return a leading "" when splitting on ""
letter = letter[letter != ""]
letter = dist_tab(letter)
So, letter is a nice data frame that gives a list of letters with their frequencies and cumulative frequency percentages. Let us produce a graph out of it:
# order the letters by frequency so the bars are sorted
letterMutate = mutate(letter, Letter=factor(interval, levels=interval[order(freq)]))
ggplot(letterMutate, aes(Letter, weight=percent)) +
  geom_bar() +
  coord_flip() +
  ylab("Proportion") +
  xlab("Letter") +
  scale_y_continuous(breaks=seq(0,12,2),
                     labels=function(x) paste0(x,"%"),
                     expand=c(0,0), limits=c(0,12))
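For readers who do not want to install qdap, a frequency table like the one used for this plot can be approximated in base R. The sketch below runs on made-up ASCII words. It reuses the column names interval, freq and percent from the code above, and assumes cum.freq and cum.percent for the cumulative columns. It also sorts by frequency for convenience, which may differ from how dist_tab orders its rows.

```r
# Made-up words standing in for the Quranic terms
toy_words <- c("abc", "abd", "aa")

# Split every word into single characters and count them
chars  <- unlist(strsplit(toy_words, ""))
counts <- sort(table(chars), decreasing = TRUE)

# Build a frequency table with counts, percentages and running totals
letter_tab <- data.frame(
  interval = names(counts),
  freq     = as.vector(counts),
  stringsAsFactors = FALSE
)
letter_tab$cum.freq    <- cumsum(letter_tab$freq)
letter_tab$percent     <- 100 * letter_tab$freq / sum(letter_tab$freq)
letter_tab$cum.percent <- cumsum(letter_tab$percent)
letter_tab
```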