Abstract
The N-gram model is a valuable and powerful tool in information retrieval and speech recognition. It can be used to estimate the probability that a given sequence of words occurs in some context. However, the massive growth of textual datasets exposes the computational and storage costs of building such models. This dissertation proposes a Hadoop-based framework and a Cloudera Impala framework to overcome these challenges.
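To make the probability estimate concrete, the following is a minimal single-machine sketch of a count-based (maximum-likelihood) N-gram estimate. It is an illustration only and not the Hadoop or Impala implementation proposed in the dissertation; the function names `ngrams` and `conditional_probability` and the toy sentence are assumptions introduced here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def conditional_probability(tokens, context, word):
    """Maximum-likelihood estimate of P(word | context) from raw counts.

    Hypothetical helper: divides the count of (context + word) by the
    number of times the context is followed by any word.
    """
    n = len(context) + 1
    grams = ngrams(tokens, n)
    ngram_counts = Counter(grams)
    # Count each context only where it is followed by another token.
    context_counts = Counter(g[:-1] for g in grams)
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return ngram_counts[context + (word,)] / context_counts[context]


# Toy corpus (an assumption for illustration, not data from the dissertation).
tokens = "the cat sat on the mat the cat ran".split()
p = conditional_probability(tokens, ["the"], "cat")
print(p)
```

In this toy corpus "the" is followed by a word three times and by "cat" twice, so the estimate is 2/3. On large corpora these two counting passes are the expensive part in both computation and storage, which is the kind of cost a distributed framework is meant to address.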