We implement the webpage classification algorithm by
combining the three techniques mentioned previously: 1)
Segmenting Visual Boundaries, 2) Breadth First Search, 3)
Ontology. First of all, we identify the visual boundaries
of HTML tags using information provided by the browser
rendering engine. We parse and traverse the HTML page
using the Breadth First Search algorithm. If a particular
level of the tree contains at least five HTML tags with
sufficient visual boundaries (e.g. having an area of more
than 500), we take these HTML tags as regions. Once the
segmentation is done, we tokenize the TextNodes into words.
We then select the first two regions, merge them, and group
the same words together. When a word matches another, the
first word will form a cluster of size one.
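
The segmentation step described above can be sketched as a
level-by-level traversal. This is only an illustration under
stated assumptions: the Node class and the find_regions
function are hypothetical stand-ins (the real geometry comes
from the browser rendering engine), and returning the
shallowest qualifying level is an assumption, since the text
does not say which level is chosen when several qualify.

```python
from dataclasses import dataclass, field

MIN_TAGS = 5    # "at least five HTML tags" on one level
MIN_AREA = 500  # "area more than 500"; the text does not state units


@dataclass
class Node:
    """Hypothetical stand-in for a rendered HTML element."""
    tag: str
    width: float = 0.0
    height: float = 0.0
    children: list["Node"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def find_regions(root: Node) -> list[Node]:
    """Return the large-enough nodes of the shallowest qualifying level."""
    level = [root]
    while level:
        big = [node for node in level if node.area > MIN_AREA]
        if len(big) >= MIN_TAGS:
            return big  # these tags are taken as regions
        # Breadth-first step: descend to all children of this level.
        level = [child for node in level for child in node.children]
    return []  # no level met the criterion
```

Processing one whole level at a time keeps the "at least five
tags per level" test simple; a queue-based Breadth First
Search that records each node's depth would behave the same.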
After the segmentation and the merging of the first two
regions are carried out, we tokenize the TextNodes of each
of the remaining regions and obtain the root word of each
tokenized word. For example, the root word of "oxen" is
"ox", the root word of "fishes" is "fish", and so on. After
that, we measure the semantic similarity of each word in
the remaining regions with the words in the merged region
using Lin's algorithm. If a pair of words obtains a
semantic similarity score of more than 0.7 on a scale of
0.0 to 1.0, the words will be grouped into their respective
cluster. The counter of the cluster group will be increased
by one each time a match is found. A pair of words which
returns a value of less than 0.7 will be ignored. Finally,
we will have a list of clusters with their own words. We
will then match these keywords with the predefined keywords
to
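
The thresholded grouping described above can be sketched as
follows. This is a minimal illustration, not the authors'
implementation: cluster_words is a hypothetical name, the
similarity function is passed in as a parameter rather than
being an actual implementation of Lin's measure, the words
are assumed to be already reduced to their root form, and
seeding one cluster per distinct merged-region word is an
assumption about how the clusters start.

```python
from typing import Callable

THRESHOLD = 0.7  # a pair must score more than this to count as a match


def cluster_words(
    merged_words: list[str],
    remaining_words: list[str],
    similarity: Callable[[str, str], float],
) -> dict[str, list[str]]:
    """Group words from the remaining regions around merged-region words."""
    # Assumption: each distinct merged-region word seeds a cluster of size one.
    clusters = {word: [word] for word in dict.fromkeys(merged_words)}
    for candidate in remaining_words:
        for seed, members in clusters.items():
            if similarity(seed, candidate) > THRESHOLD:
                members.append(candidate)  # cluster counter grows by one
            # Pairs scoring 0.7 or less are ignored (the text leaves
            # the exact-0.7 case unspecified).
    return clusters
```

The counter of a cluster is then simply the length of its
word list. In practice the similarity argument would be
backed by a WordNet-based implementation of Lin's measure
that returns scores between 0.0 and 1.0.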
