Tuesday, December 21, 2010

Work done in the period of November 20th – December 3rd

Within this period I managed to serialize the whole Drools rules file as a knowledge base object and observed a significant improvement (approximately 10 s), but the performance still falls far short of the current RelEx2Frame's execution time. The serialization also required increasing the JVM stack size to 2 MB. Since we felt the performance still did not meet the requirement, we decided to split the Drools rules file into files of 100 rules each, and Nisansa carried out that task. Danaja came up with a design focused on applying concurrency and parallelism to the RelEx2Frame system. All members of the team accepted it as the basic design, which will be altered and improved after further analysis.
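To illustrate the 100-rules-per-file split, here is a minimal Python sketch. It is not the procedure Nisansa actually followed. It assumes every rule starts with `rule` at the beginning of a line and finishes with `end` on a line of its own, and that everything before the first rule (package, imports, globals) is a shared header to copy into each output file. The names `split_rules` and `RULE_BLOCK` are hypothetical.

```python
import re

# Assumption: a rule block runs from a line starting with `rule`
# to the next line that contains only `end`.
RULE_BLOCK = re.compile(r"^rule\b.*?^end[ \t]*$", re.MULTILINE | re.DOTALL)


def split_rules(drl_text, rules_per_file=100):
    """Return a list of DRL strings, each holding at most rules_per_file rules."""
    first = RULE_BLOCK.search(drl_text)
    if first is None:
        # No rule blocks found: leave the text as a single chunk.
        return [drl_text]

    # Everything before the first rule is treated as the shared header.
    header = drl_text[:first.start()].rstrip()
    rules = [m.group(0) for m in RULE_BLOCK.finditer(drl_text)]

    chunks = []
    for start in range(0, len(rules), rules_per_file):
        batch = rules[start:start + rules_per_file]
        parts = ([header] if header else []) + batch
        chunks.append("\n\n".join(parts) + "\n")
    return chunks
```

Each returned string can then be written to its own file. A rule whose body has a bare `end` on a line of its own would be cut short by this simple pattern, so a real splitter would need a proper parser.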

In the current RelEx2Frame, the set of concepts or words that can be detected is significantly limited. Statistical learning methods can be used to reduce this limitation. One approach is to use an existing application; the other is to implement a statistical learner ourselves. Google Sets [1] is one of the existing applications that we are considering. During this period I implemented an application that accesses Google Sets and generates a new set of words for a given combination of words (fewer than four). I used an existing library called ‘XGoogle’ [2], written in Python, which provides an interface to Google Sets. Since I was not familiar with Python, I had to learn it, which I managed to do successfully. We will keep the results from this application and compare them with the results from our statistical learner to choose the most appropriate set of words.
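We have not yet fixed how the two result sets will be compared. One simple possibility, sketched below in Python, is to look at which suggested words both sources agree on and how large that overlap is (Jaccard similarity). The function name `compare_word_sets` and the shape of its result are assumptions made for illustration, and the word lists are passed in as plain lists, so the sketch does not depend on XGoogle.

```python
def compare_word_sets(seed_words, google_sets_words, learner_words):
    """Compare two lists of suggested words, ignoring case and the seed words."""

    def normalize(words):
        return {w.strip().lower() for w in words if w.strip()}

    seeds = normalize(seed_words)
    from_google = normalize(google_sets_words) - seeds
    from_learner = normalize(learner_words) - seeds
    union = from_google | from_learner

    return {
        "shared": sorted(from_google & from_learner),
        "only_google_sets": sorted(from_google - from_learner),
        "only_learner": sorted(from_learner - from_google),
        # Jaccard similarity: size of the overlap divided by size of the union.
        "jaccard": len(from_google & from_learner) / len(union) if union else 0.0,
    }
```

Words that appear in "shared" would be the safest candidates to keep, while the other two lists show what each source contributes on its own.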

Preparing the design document was the major work we did during this period, since it was due on 3rd December. All of us contributed to the design document in several ways; I contributed by writing the design constraints and design decisions and by designing the rule learning component. The design constraints part had three subsections: Operating Environment, End-user Environment and Performance Requirements. The design decisions part consisted of the major decisions, some of which had already been taken and others that were yet to be taken. Programming language selection, rule engine selection, caching of knowledge bases, statistical learning of concepts and selection of the best-suited data mining algorithm were the main design considerations discussed there.

Designing the rule learner was the most challenging task for me. I read many documents [3-5] on existing rule learners, existing rule induction algorithms, data mining techniques, etc. After a considerable literature survey I came up with an architecture for the statistical rule learner using data mining techniques, which will be altered and improved as required. Chamilka reviewed it and made a few suggestions.

We successfully submitted the design document on 3rd December.
  

[1] “Google sets labs,” [Online]. Available: http://labs.google.com/sets
[2] “XGoogle,” [Online]. Available: http://www.catonmat.net/blog/python-library-for-google-sets
[3] K. Mhashilkar, “Data Mining Technology,” [Online]. Available: http://www.executionmih.com/data-mining/technology-architecture-application-frontend.php
[4] J. Grzymala-Busse, “Three strategies to rule induction from data with numerical attributes,” presented at the International Workshop on Rough Sets in Knowledge Discovery (RSKD 2003), associated with the European Joint Conferences on Theory and Practice of Software 2003, Warsaw, Poland, April 5–13, 2003.
[5] “Rule Learner,” [Online]. Available: http://openrules.com/RuleLearner.htm
