@cedias
Last active December 16, 2015 07:59
Map reduce to count DF

#Map-Reduce

##Goal
Compute the document frequency (DF) of every word in the collection.
Document frequency: the number of documents in which a given word appears.

##Principle
Map: JSON document -> list of (key, value) pairs.
Reduce: list of (key, value) pairs -> aggregation over the values (i.e. addition).

##Example
Map: {A,A,B,C} -> (A,1),(A,1),(B,1),(C,1)
Reduce -> A:2 B:1 C:1
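
The example can be traced in plain JavaScript, without any database. This is only a minimal sketch: `mapPhase`, `reducePhase`, and the sample input are illustrative names that do not come from the original notes.

```javascript
// Map: turn each input item into a (key, value) pair with value 1.
function mapPhase(items) {
	return items.map(function (item) {
		return { key: item, value: 1 };
	});
}

// Reduce: group the pairs by key and add up their values.
function reducePhase(pairs) {
	var totals = {};
	pairs.forEach(function (pair) {
		totals[pair.key] = (totals[pair.key] || 0) + pair.value;
	});
	return totals;
}

var pairs = mapPhase(["A", "A", "B", "C"]);
var counts = reducePhase(pairs);
console.log(counts); // { A: 2, B: 1, C: 1 }
```

Note that this counts occurrences. For a document frequency, the map step would emit each distinct word only once per document.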

##Implementation

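function map(){
	// `this` is the document being processed; its text is in the `comment` field
	var text = this.comment;
	// regex literal (not a string) so that every word is matched
	var words = text.match(/\w+/g);

	if(words == null)
		return;

	// record each distinct word once per document
	// (a plain object, so words such as "length" are safe keys)
	var df = {};
	for(var i = 0; i < words.length; i++)
		df[words[i]] = 1;

	// emit a count of 1 for every distinct word of this document
	for(var mot in df){
		emit(mot, {df: 1});
	}
}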

function reduce(key, values){
	// sum the partial counts emitted for this word
	var total = 0;
	for (var i = 0; i < values.length; i++)
	{
		total += values[i].df;
	}
	return {df: total};
}
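
Functions written in this style (a global `emit`, `this` bound to the current document) can be tried without a database by using a small in-memory harness. What follows is a sketch under assumptions: `runMapReduce`, `dfMap`, `dfReduce`, and the sample documents are hypothetical, and the harness mimics only the basics (collecting `emit` calls and reducing once per key), not a real engine's batching or repeated reduce calls.

```javascript
// Hypothetical in-memory stand-in for a map-reduce engine.
function runMapReduce(docs, mapFn, reduceFn) {
	var groups = {};
	// Map functions in this style call a global emit(key, value).
	globalThis.emit = function (key, value) {
		if (!Object.prototype.hasOwnProperty.call(groups, key)) groups[key] = [];
		groups[key].push(value);
	};
	// Run the map function with `this` bound to each document.
	docs.forEach(function (doc) { mapFn.call(doc); });
	delete globalThis.emit;

	// Reduce the values collected for each key.
	var results = {};
	Object.keys(groups).forEach(function (key) {
		results[key] = reduceFn(key, groups[key]);
	});
	return results;
}

// Compact copies of DF map/reduce functions, for illustration only.
function dfMap() {
	var words = this.comment.match(/\w+/g);
	if (words == null) return;
	var seen = {};
	words.forEach(function (w) { seen[w] = 1; });
	for (var w in seen) emit(w, { df: 1 });
}

function dfReduce(key, values) {
	var total = 0;
	values.forEach(function (v) { total += v.df; });
	return { df: total };
}

var docs = [
	{ _id: 1, comment: "the cat sat" },
	{ _id: 2, comment: "the cat and the dog" },
	{ _id: 3, comment: "..." } // no words: map emits nothing
];
var df = runMapReduce(docs, dfMap, dfReduce);
console.log(df.the.df, df.cat.df, df.dog.df); // 2 2 1
```

The word "the" occurs twice in the second document but still counts once for it, which is what distinguishes DF from a plain word count.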

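Term Frequency: TF(w,d), the number of times word w occurs in document d.

Relevance score: Value(d,w) := TF(w,d)*log(N/DF(w)), where N is the total number of documents (+ PR(d), the PageRank of d, as Google does).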
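
Once TF, DF, and the document count are known, the score is a one-line computation. A minimal sketch, assuming N is the total number of documents and that the natural logarithm is intended (the note does not state the base); `relevanceScore` and the sample numbers are illustrative, and the PR(d) term is left out.

```javascript
// Score of word w in document d: TF(w,d) * log(N / DF(w)).
// tf: occurrences of w in d
// df: number of documents containing w
// n:  total number of documents (assumed meaning of N)
function relevanceScore(tf, df, n) {
	if (df <= 0 || n <= 0) return 0; // word absent from the collection
	return tf * Math.log(n / df);
}

console.log(relevanceScore(3, 1, 100));   // rare word: high score
console.log(relevanceScore(3, 100, 100)); // word in every document: 0
```

A word that appears in every document gets a score of 0 regardless of its TF, so very common words do not dominate the ranking.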

MySQL: store the results in 2 tables

word | doc | tf

word | df

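function map(){
	var text = this.comment;
	var words = text.match(/\w+/g);

	if(words == null)
		return;

	// count the occurrences of each word in this document
	var tf = {};
	for(var i = 0; i < words.length; i++){
		if(!Object.prototype.hasOwnProperty.call(tf, words[i]))
			tf[words[i]] = 1;
		else
			tf[words[i]]++;
	}

	// emit one (document, count) entry per distinct word, wrapped in an array
	// so that the emitted value has the same shape as what reduce returns
	for(var mot in tf){
		emit(mot, {tfs: [{doc: this._id, tf: tf[mot]}]});
	}
}

function reduce(key, values){
	// concatenate the per-document entries for this word; the result keeps the
	// emitted shape because MongoDB may call reduce again on its own output
	var tfs = [];
	for(var i = 0; i < values.length; i++)
		tfs = tfs.concat(values[i].tfs);
	return {tfs: tfs};
}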