Thursday, November 6, 2008

Site Keyword Crawler

Recently I have been interested in PHP array functions and Regular Expressions (abbreviated as RegExp). I have found that it is very easy to count how often each word occurs in a string of text, which is a first step toward finding its major keywords, using the following code:

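function getKeyWords($content){
    // Split the content on single spaces; every chunk between spaces is a token.
    $tokens = explode(" ", $content);
    // 'num_words' holds the running token total; it starts at 0 so it matches the token count.
    // Note: a token that is literally "num_words" would collide with this counter.
    $keywords = array("num_words" => 0);
    foreach ($tokens as $word) {
        if (!array_key_exists($word, $keywords)) {
            // echo "word not found! $word\n";
            $keywords[$word] = 1; // first time this token is seen
        } else {
            $keywords[$word]++; // token seen before, bump its count
        }
        $keywords['num_words']++;
    }
    return $keywords;
}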


Basically it breaks the content apart into an array of tokens and counts how many times each token appears. As it stands, it treats anything separated by spaces as a token, so fragments like 'alt="text' will be counted as words. To work on an HTML document it needs some modifications, such as some fancy RegExp searches =). I will keep posting any updates to this function...
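
One possible direction for the HTML case, as a rough sketch only: strip the markup first, then split on a regular expression instead of a single space. The function name getHtmlKeyWords, the return layout, and the sample string are all made up for illustration and are not part of the function above. It also assumes plain ASCII English text, and strip_tags() leaves the contents of script and style blocks in place, so a real page would need more cleanup.

```php
<?php
// Hypothetical helper (an assumption, not the original function): counts words in an HTML string.
function getHtmlKeyWords($html) {
    // Put a space before every tag so words in adjacent elements do not run together,
    // then drop the markup so attributes such as alt="text" never become tokens.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Decode entities such as &amp; back into plain characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Lowercase so "Cat" and "cat" count as the same word.
    $text = strtolower($text);
    // Split on any run of characters that is not a letter, digit, or apostrophe.
    $tokens = preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);
    // Count how often each token occurs.
    $counts = array_count_values($tokens);
    // Most frequent words first.
    arsort($counts);
    // Keep the total separate from the word counts so no word can collide with it.
    return array('num_words' => count($tokens), 'keywords' => $counts);
}

// Sample input invented for this example.
$result = getHtmlKeyWords('<p>The cat saw <img src="c.png" alt="a cat"> the other cat.</p>');
print_r($result);
```

With the sample string, "cat" and "the" are each counted twice, and nothing from inside the img tag shows up as a word. Keeping the total in its own 'num_words' entry, next to a separate 'keywords' array, avoids mixing the counter in with the words themselves.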
