Machine Learning with Weka

Revision as of 13:02, 26 January 2010 by Admin (Talk | contribs)

Weka is a collection of data mining algorithms. Data mining differs from text mining in that text mining works on unstructured text, whereas data mining deals with structured data. In this experiment, however, I define a problem that is closer to data mining than to text mining, simply to illustrate the use of Weka.

The Problem

The chapters of the Qur'an that were revealed in the early period of Prophet Muhammad's mission in Makkah differ in style and content from the chapters revealed later in Medina. A detailed description of their distinguishing characteristics is given on this page: Makki_and_Madani_Surahs.

The problem is: can we use these characteristics as the feature set for training our classifiers?

Feature Set for Classification

The following table lists the features we used to create the Weka ARFF file. The search for these features was done through the Quranic Arabic Corpus (QAC).

Feature No. | Attribute/keyword | Classify as | Total found | Corpus search term
--- | --- | --- | --- | ---
1 | Reference to the lemma "prostration سجد" | Makki | 92 | ROOT:sjd
2 | Reference to the word "Never كلا" as an aversion particle | Makki | 31 | POS:AVR LEM:kal~aA
3 | 'O mankind' | Makki | 20 | يا أيها الناس
4 | 'O you who believe' | Madani | 89 | يا أيها الذين آمنوا
5 | Initial letters | Makki | 30 | POS:INL
6 | Prophets' names | Makki | 581 | ["<ibora`hiym", "<isoma`Eiyl", "yaEoquwb", "<isoHa`q", "muwsaY", "EiysaY", "daAwud", "nuwH", "zakariy~aA", "yaHoyaY", "yuwnus", "ha`ruwn", "sulayoma`n", "yuwsuf", "<iloyaAs", "yasaE", "luwT", "Sa`liH", "huwd", "Adam", "$uEayob", "<idoriys"]
7 | Reference to the story of creation | Makki | 5 | Find both "Adam" and "<iboliys"
8 | Use of emphasis, exhortation, aversion and certainty | Makki | 1478 | "CERT", "SUP", "EXH", "AVR", "EMPH"
9 | Average length of verses | Makki (shorter verses) | Avg: 10.32 | For each chapter: count total words and divide by total verses
10 | Reference to hell, fire, paradise, day of judgment | Makki | 822 | "jahan~am", "LEM:jan~ap", "naAr", "saEiyr", "qiya`map", "Ea*aAb", "aAxirap"
11 | Reference to jihad and fighting | Madani | 211 | "ROOT:qtl", "ROOT:jhd"
12 | Reference to marriage, divorce, women and wife | Madani | 181 | "ROOT:nkH", "ROOT:Tlq", "LEM:zawoj", "LEM:nisaA"
13 | Reference to Jews, Christians, Bible, children of Israel | Madani | 97 | "LEM:<isora`^", "ROOT:yhwd", "LEM:t~aworaY`p", "<injiyl", "LEM:naSoraAniy"
14 | Pillars of Islam: prayer, fasting, zakat and hajj | Madani | 133 | "Salaw`p", "zakaw`p", "LEM:SiyaAm", "LEM:Haj~"

Detailed explanation of the above features

The following discussion justifies the selection and classification of these attributes. The first attribute, prostration, is an act of worship in Islam that involves placing the most honourable part of one's body, the forehead, on the ground as a symbol of submission to the one God. The people of Makkah used to refuse to perform this act of humility, and hence Makki chapters repeatedly urge the Meccans to submit fully to Allah through prostration. A sample verse is 25:60: "And when it is said to them, 'Prostrate to the Most Merciful,' they say, 'And what is the Most Merciful? Should we prostrate to that which you order us?' And it increases them in aversion." Verses were searched for the root ROOT:sjd, which captures the various inflectional forms of this verb.

The second attribute is based on our empirical observation, supported by scholarly observation as well, that the aversion particle "kal~a", meaning 'never' or 'no', is used only in Makki chapters, as in verse 70:39: "Never! Indeed, We have created them from that which they know." This particle is used in the dialogue with the people of Makkah, arguing against their refusal to submit to God and their denial of the Day of Judgment.

The third attribute is not exclusive to either Makki or Madani chapters; it is based on the majority of cases. The vocative expression "O mankind", followed by some message, is more often a feature of Makki chapters, revealed when most of the people were not yet believers, although it also occurs in a few Madani chapters, as in 2:21. Here is an example from 10:57: "O mankind, there has come to you instruction from your Lord and healing for what is in the breasts and guidance and mercy for the believers."

Later in Medina, however, the Muslim population grew and the Qur'an started to address the believers with the expression used as the fourth attribute, "O you who believe!", which appears only in Madani chapters. Here is an example of such a verse, 61:10: "O you who have believed, shall I guide you to a transaction that will save you from a painful punishment?"

There are 29 chapters that open with initial letters, which are tagged as INL in the QAC. All except chapters 2 and 3 are Makki. We take this as our fifth attribute. Here is an example from 3:1-2: "Alif, Lam, Meem. Allah - there is no deity except Him, the Ever-Living, the Sustainer of existence."

The sixth attribute is based on the stories of previous prophets and messengers mentioned in the Qur'an, for example the stories of Moses, Abraham and Noah. According to scholars, these stories serve two purposes. First, they warn the people of Makkah that if they reject Prophet Muhammad, some punishment will befall them, as happened to earlier peoples who rejected their messengers, such as Noah and Moses. Second, they give motivation and steadfastness to Prophet Muhammad, who was often frustrated when the people of Makkah continued to deny his message. In both cases the subject is closely related to Makkah, and hence this is a good candidate attribute. Chapter 2, which is Madani, is an exception in which the stories of Moses and Abraham are mentioned, but in all other instances these stories occur in Makki chapters. See for example 41:13, warning the people of Makkah: "But if they turn away, then say, 'I have warned you of a thunderbolt like the thunderbolt [that struck] 'Aad and Thamud.'" In my search, I included the names of a number of prophets mentioned in the Qur'an, excluding Prophet Muhammad.

The seventh attribute concerns one special story: the story of creation, in which Adam was created and Allah ordered the angels to prostrate to Adam, and all submitted except Iblis (Satan), who refused. This story is mentioned only five times, and our search produced a Boolean value (YES or NO) indicating whether both names (Adam and Iblis) are mentioned in the chapter. These instances are again all Makki, except chapter 2, which is Madani.
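The Boolean seventh attribute can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the representation of a chapter as a flat list of Buckwalter-transliterated lemmas are hypothetical and are not the interface of the Quranic Arabic Corpus.

```python
def mentions_creation_story(chapter_lemmas):
    """Return True when both Adam and Iblis occur among a chapter's lemmas.

    chapter_lemmas is assumed (hypothetically) to be an iterable of
    Buckwalter-transliterated lemma strings for one chapter.
    """
    lemmas = set(chapter_lemmas)
    return "Adam" in lemmas and "<iboliys" in lemmas


# Toy inputs for illustration only, not real corpus data.
print(mentions_creation_story(["Adam", "naAr", "<iboliys"]))  # True
print(mentions_creation_story(["Adam", "naAr"]))              # False
```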

The eighth attribute is based on the rhetorical style of Makki chapters, where the language of certainty, surprise, exhortation, aversion and emphasis is used in the course of arguments with the people of Makkah. The QAC tags certain particles as 'certainty', 'surprise', 'exhortation', 'aversion' and 'emphasis', and I exploited these tags when counting for this attribute. Hence, my results are drawn entirely from the usage of these particles; once semantic roles are annotated, the results should become more accurate. The following table gives example verses for these particles.

QAC tag | Particle | Example verse
--- | --- | ---
AVR (aversion) | كلا | 70:39 "Never! Indeed, We have created them from that which they know."
CERT (certainty) | قد | 6:97 "And it is He who placed for you the stars that you may be guided by them through the darknesses of the land and sea. Certainly, We have detailed the signs for a people who know."
SUP (surprise) | إذا | 17:73 "And indeed, they were about to tempt you away from that which We revealed to you in order to [make] you invent about Us something else; and then they would have taken you as a friend."
EXH (exhortation) | لولا | 10:20 "And they say, 'Why is a sign not sent down to him from his Lord?' So say, 'The unseen is only for Allah [to administer], so wait; indeed, I am with you among those who wait.'"
EMPH (emphasis) | لام التوكيد | 11:10 "But if We give him a taste of favor after hardship has touched him, he will surely say, 'Bad times have left me.' Indeed, he is exultant and boastful."

The ninth attribute is straightforward: the average length of a verse, measured in words. Makki verses are shorter than Madani verses, which tend to be long when discussing legislation and rulings.
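As a minimal sketch of this computation (total words divided by total verses, per chapter), assuming each chapter is available as a list of verse strings with whitespace-separated words; the function name and input format are assumptions made for illustration:

```python
def average_verse_length(verses):
    """Average number of words per verse for one chapter.

    verses is assumed (hypothetically) to be a list of strings, one per
    verse, with words separated by whitespace.
    """
    if not verses:
        raise ValueError("chapter has no verses")
    total_words = sum(len(verse.split()) for verse in verses)
    return total_words / len(verses)


# Toy chapter with placeholder words: 9 words over 3 verses.
toy_chapter = ["w1 w2 w3", "w4 w5", "w6 w7 w8 w9"]
print(average_verse_length(toy_chapter))  # 3.0
```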

The tenth attribute counts references to unseen facts and events of the hereafter, which the people of Makkah used to deny and which Makki chapters therefore emphasize repeatedly. The search covered a few concepts: hellfire (including some alternative names for hellfire in the Qur'an), punishment, paradise, and the word "aAxirap", meaning 'hereafter'.

Military conflict with the people of Makkah occurred only after the migration of Prophet Muhammad to Medina, and hence our eleventh attribute, "jihad" in its 'fighting' sense, appears only in Madani chapters. The search used the roots 'qtl', meaning 'to fight', and 'jhd', meaning 'to struggle'. This is an approximation, since some references to struggle by earlier prophets may wrongly be included; nevertheless, it gives a close estimate.

The twelfth attribute concerns family legislation such as marriage, divorce, pregnancy and breastfeeding. These details are mentioned only in Madani chapters. Our search included the roots for 'marriage' and 'divorce', as well as the lemmas for 'wife' and 'women'. The latter two lemmas bring in some false positives, since 'zawoj' is also used to mean a 'pair' of anything, like the word 'mates' in 51:49, which is in a Makki chapter: "And of all things We created two mates; perhaps you will remember."

Attribute number 13 searches for references to other divine religions, mainly Judaism and Christianity. It was only when Prophet Muhammad migrated to Medina that he encountered the Jewish tribes who lived there, and later some Christian delegates came to Medina to debate the nature of Jesus. However, Jesus is referred to in a number of Makki chapters, which is why my search included only the terms 'children of Israel', 'Torah', 'Gospel', 'Jew' and 'Christian'.

Apart from the first pillar of Islam, the other four (prayer, fasting, zakat and hajj) were all made obligatory in the Medina period. Hence, for the fourteenth attribute I searched for the lemmas of these four pillars. Again there could be false positives, like the following Makki verse, 19:55, referring to Prophet Ishmael: "And he used to enjoin on his people prayer and zakah and was to his Lord pleasing."

Machine Learning

Can we make a machine 'learn' the Qur'an as a human being does? Or, even before that, what is learning? Dictionary definitions of the verb 'to learn' say various things, such as:

  • to get knowledge or skill in a new subject or activity
  • to make yourself remember a piece of writing by reading it or repeating it many times
  • to start to understand that you must change the way you behave
  • to be told facts or information that you did not know
  • to acquire knowledge of or skill in (something) through study or experience or by being taught
  • to gain knowledge or experience of something, for example by being taught
  • to gain new information about a situation, event, or person
  • to improve your behaviour as a result of gaining greater experience or knowledge of something

Not all of the above definitions fit machines, especially the notion of 'understanding'; whether machines can or cannot understand remains a hard philosophical problem. What we propose instead for machine learning is training from examples. That is, human domain experts first need to identify a set of features and work out as many examples of those features as possible. This constitutes the 'training set', which is then fed to machine learning algorithms. The algorithms 'learn' rules from the training set and, ideally, are able to 'predict' the correct classification when they encounter new examples.

Given the selection of the above feature set, we can define structural patterns from this set. For example we can say:

* If (sura_starts_with_initial_letters = Yes AND sura_no NOT IN {2,3,13}) then sura_category = Makki
* If (sura_contains_story_of_Adam_and_Iblis = Yes AND sura_no NOT IN {2}) then sura_category = Makki
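The two hand-written rules above can be expressed as small Python functions. This is a sketch for illustration, not the output of any classifier; the function names and the convention of returning None when a rule does not apply are assumptions.

```python
def rule_initial_letters(sura_no, starts_with_initial_letters):
    """Rule 1: a sura opening with initial letters is Makki,
    unless it is one of the listed exceptions (2, 3, 13)."""
    if starts_with_initial_letters and sura_no not in {2, 3, 13}:
        return "Makki"
    return None  # the rule does not fire


def rule_adam_and_iblis(sura_no, contains_story):
    """Rule 2: a sura containing the story of Adam and Iblis is Makki,
    unless it is sura 2."""
    if contains_story and sura_no not in {2}:
        return "Makki"
    return None  # the rule does not fire


print(rule_initial_letters(19, True))  # Makki
print(rule_initial_letters(2, True))   # None
print(rule_adam_and_iblis(7, True))    # Makki
```

A rule that returns None simply makes no prediction, which mirrors the fact that these patterns only cover the cases named in their conditions.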

Weka in Operation

After designing this attribute set, a spreadsheet was populated with the data and converted into a Weka ARFF file. The following is a snapshot of the ARFF file.

Weka ARFF file
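For readers unfamiliar with the format, here is a hedged sketch of how such a file could be generated from tabular data. The ARFF layout (`@relation`, `@attribute`, `@data`) is Weka's standard one; the helper `to_arff`, the relation name, the choice of only four attributes, and the data rows are illustrative assumptions and do not reproduce the actual file used in this experiment.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of nominal values.
    rows: list of value lists, one per instance.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"  # nominal attribute
        lines.append("@attribute {} {}".format(name, kind))
    lines.extend(["", "@data"])
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical subset of attributes; K and D are the class labels
# that appear in the Weka output on this page.
attributes = [
    ("believer", "numeric"),
    ("emphasis", "numeric"),
    ("averageLength", "numeric"),
    ("initials", ["yes", "no"]),
    ("class", ["K", "D"]),
]
# Toy rows, not real counts.
toy_rows = [
    [0, 12, 8.5, "yes", "K"],
    [6, 3, 21.0, "no", "D"],
]
arff_text = to_arff("makki_madani", attributes, toy_rows)
print(arff_text)
```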

Weka created a model for this data as shown in the figure below.

Weka offers various convenient visualizations of these data. Blue denotes Makki suras and red denotes Madani suras.

Snapshot of features

Next, the C4.5 classifier was used to create the following decision tree.

C4.5 Decision tree

Following is the summary result of the classification:

Correctly Classified Instances         104       91.2281 %
Incorrectly Classified Instances        10        8.7719 %
Kappa statistic                          0.7449
Mean absolute error                      0.1297
Root mean squared error                  0.2866
Relative absolute error                 34.758 %
Root relative squared error             66.544 %
Total Number of Instances              114

I experimented with other classifiers, and only REP Tree produced a more fine-grained decision tree, as follows:

REP decision tree

And this is the summary analysis of this classifier:

Correctly Classified Instances         106       92.9825 %
Incorrectly Classified Instances         8        7.0175 %
Kappa statistic                          0.7904
Mean absolute error                      0.1222
Root mean squared error                  0.2571
Relative absolute error                 32.746 %
Root relative squared error             59.6798 %
Total Number of Instances              114

Note that most Makki chapters are short and often show no occurrence of a given attribute, resulting in many 'zero' values. To overcome this problem, I normalized the data by adding one and dividing the absolute count by the total number of words in that chapter. With that I produced another, normalized model. Here are the attribute plots after normalization.
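One possible reading of this normalization is sketched below. It assumes that one is added to the raw count before dividing by the chapter's word count; the text does not spell out the exact order of operations, so treat the formula and the function name as assumptions.

```python
def normalize_count(count, total_words):
    """Add-one normalization of a raw attribute count for one chapter.

    Assumed formula: (count + 1) / total_words, so a chapter with no
    occurrences still gets a small non-zero value.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return (count + 1) / total_words


# Toy numbers: a short chapter with no hits and a longer one with 3 hits.
print(normalize_count(0, 50))   # 0.02
print(normalize_count(3, 200))  # 0.02
```

Under this reading, a short chapter with zero occurrences and a long chapter with a few occurrences can end up with the same value, which may be part of why the normalized model behaves differently from the raw-count model.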

I re-ran the C4.5 classifier, and the decision tree now looks better, although the accuracy is lower.

Correctly Classified Instances          95       83.3333 %
Incorrectly Classified Instances        19       16.6667 %
Kappa statistic                          0.4956
Mean absolute error                      0.1853
Root mean squared error                  0.3808
Relative absolute error                 49.6296 %
Root relative squared error             88.4005 %
Total Number of Instances              114
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.942     0.5       0.853       0.942    0.895       K
0.5       0.058     0.737       0.5      0.596       D
=== Confusion Matrix ===
  a   b   <-- classified as
 81   5 | a = K
 14  14 | b = D
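The summary figures above follow directly from the confusion matrix. The sketch below recomputes accuracy, per-class precision, recall and F-measure, and the kappa statistic using their standard definitions; the function names are my own and are not Weka's API.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def class_metrics(matrix, k):
    """Precision, recall and F-measure for class index k.

    Rows are actual classes, columns are predicted classes.
    """
    tp = matrix[k][k]
    actual = sum(matrix[k])                    # row total
    predicted = sum(row[k] for row in matrix)  # column total
    precision = tp / predicted
    recall = tp / actual
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def kappa(matrix):
    """Cohen's kappa: agreement corrected for chance agreement."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    expected = sum(
        sum(matrix[k]) * sum(row[k] for row in matrix)
        for k in range(len(matrix))
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Confusion matrix from the normalized C4.5 run (rows: a = K, b = D).
cm = [[81, 5], [14, 14]]
print(round(100 * accuracy(cm), 4))                 # 83.3333
print(round(kappa(cm), 4))                          # 0.4956
print([round(x, 3) for x in class_metrics(cm, 0)])  # [0.853, 0.942, 0.895]
print([round(x, 3) for x in class_metrics(cm, 1)])  # [0.737, 0.5, 0.596]
```

The low recall for class D (0.5) shows where the normalized model loses accuracy: half of the 28 D chapters are classified as K.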

Observations and Improvements

From this experiment we can note the following observations and areas for improvement:

  1. Richer annotation of the Qur'an is likely to produce better results. As noted, we used only keywords and morphological and POS features to distinguish Makki from Madani chapters. Ontological or semantic annotation would likely produce much better results.
  2. The machine learning algorithms produced interesting empirical findings that would interest Qur'anic scholars. For example, from the decision trees we learned that verses in Makki surahs are 18 words long on average.
  3. The convenient visualization in Weka enables quick validation of observations made by early scholars. For example, in the figures above we can see that there are Madani chapters (red) that contain the construct "O mankind", which refutes claims by some scholars that this construct appears only in Makki chapters.
  4. This experiment showed which attributes are likely to help in learning classifiers. For example, only 4 of the 15 attributes were chosen by our classifiers: believer, emphasis, averageLength and initials.
  5. This experiment can be further improved by considering verses instead of chapters. This would result in 6240 lines of data. The obvious problem would be the abundance of 'zero' values for certain attributes.
  6. Beyond the success shown here in classifying Makki and Madani chapters, other text mining problems could be modelled similarly, for example finding similar verses and finding associations and patterns.

References

  • Quranic Arabic Corpus [1]
  • PhD first year transfer report of Abdul-Baquee Sharaf at Leeds University [2]