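SimHash-style algorithms are better suited to judging similarity between long texts. For short texts, consider the following approaches:
1. Edit distance + Jaccard distance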
import com.hankcs.hanlp.HanLP
import com.hankcs.hanlp.seg.common.Term
import java.util.Properties
import scala.collection.JavaConverters._
object Test extends Serializable {

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Override the default minimum Jaccard coefficient (0.1) used by isSimilar
    props.setProperty("deduplicateMinJaccardDistance", "0.2")
    val s1 = "山东今天大雨"
    val s2 = "云南今天大雨"
    println(isSimilar(props, s1, s2))
  }
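  /**
   * Extracts the effective entities: nouns longer than one character.
   *
   * @param text the text to segment
   * @return the words that HanLP tags as nouns
   */
  def getEfficientNorms(text: String): List[String] = {
    val terms: List[Term] = HanLP.newSegment.seg(text).asScala.toList
    terms
      .filter(term => term.word.length > 1 && term.nature.startsWith("n")) // natures starting with "n" are nouns
      .map(term => term.word)
  }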
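  /**
   * Computes the Jaccard coefficient |A ∩ B| / |A ∪ B| of two word lists.
   *
   * @param array1 words from the first text
   * @param array2 words from the second text
   * @return a value in [0, 1]; 0 when both lists are empty
   */
  def getJaccardCoefficient(array1: Seq[String], array2: Seq[String]): Double = {
    val s1 = array1.toSet
    val s2 = array2.toSet
    val unionSize = s1.union(s2).size
    // Guard against 0 / 0 (NaN) when neither text yields any noun
    if (unionSize == 0) 0.0 else s1.intersect(s2).size.toDouble / unionSize
  }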
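  /**
   * Computes the Levenshtein edit distance: the minimum number of
   * insertions, deletions and substitutions needed to turn word1 into word2.
   *
   * @param word1 the first string
   * @param word2 the second string
   * @return the edit distance
   */
  def getLevenshtein(word1: String, word2: String): Int = {
    val m = word1.length
    val n = word2.length
    // dp(i)(j) is the distance between the first i chars of word1 and the first j chars of word2
    val dp = Array.ofDim[Int](m + 1, n + 1)
    for (i <- 0 to m) dp(i)(0) = i
    for (j <- 0 to n) dp(0)(j) = j
    for (i <- 1 to m; j <- 1 to n) {
      if (word1(i - 1) == word2(j - 1)) dp(i)(j) = dp(i - 1)(j - 1)
      else dp(i)(j) = (dp(i - 1)(j - 1) + 1).min((dp(i - 1)(j) + 1).min(dp(i)(j - 1) + 1))
    }
    dp(m)(n)
  }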
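  /**
   * Decides whether two strings are similar.
   *
   * @param props thresholds: deduplicateMaxEditDistance, deduplicateMinDistanceRate, deduplicateMinJaccardDistance
   * @param text1 the first text
   * @param text2 the second text
   * @return true only if all three thresholds are satisfied
   */
  def isSimilar(props: Properties, text1: String, text2: String): Boolean = {
    val maxDistance = props.getProperty("deduplicateMaxEditDistance", "10").toInt
    val minDistanceRate = props.getProperty("deduplicateMinDistanceRate", "0.3").toFloat
    val minJaccardDistance = props.getProperty("deduplicateMinJaccardDistance", "0.1").toFloat
    val lDis = getLevenshtein(text1, text2)
    // Normalized edit similarity in [0, 1]; 1 means the strings are identical
    val score = 1 - lDis.toDouble / Math.max(text1.length, text2.length)
    // Despite the property name, this is the Jaccard coefficient (a similarity), not a distance
    val jDis = getJaccardCoefficient(getEfficientNorms(text1), getEfficientNorms(text2))
    jDis > minJaccardDistance && score > minDistanceRate && lDis < maxDistance
  }
}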
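To see how the three thresholds interact without installing HanLP, here is a minimal dependency-free sketch. The object name `ThresholdDemo`, the helper `similar`, and the noun sets passed to it are hypothetical; the real noun sets depend on HanLP's segmentation and tagging. For the sample pair the edit distance is 2, so the score is 1 - 2/6 ≈ 0.67.

```scala
object ThresholdDemo {
  // Levenshtein distance with two rolling rows instead of a full matrix.
  def levenshtein(a: String, b: String): Int = {
    var prev: Array[Int] = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(prev(j) + 1, curr(j - 1) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Jaccard coefficient; defined as 0 when both sets are empty.
  def jaccard(a: Set[String], b: Set[String]): Double = {
    val unionSize = a.union(b).size
    if (unionSize == 0) 0.0 else a.intersect(b).size.toDouble / unionSize
  }

  // Same three-way check as isSimilar, with the noun sets passed in explicitly.
  def similar(t1: String, nouns1: Set[String], t2: String, nouns2: Set[String],
              maxDistance: Int = 10, minRate: Double = 0.3, minJaccard: Double = 0.2): Boolean = {
    val dist = levenshtein(t1, t2)
    val score = 1 - dist.toDouble / math.max(t1.length, t2.length)
    jaccard(nouns1, nouns2) > minJaccard && score > minRate && dist < maxDistance
  }

  def main(args: Array[String]): Unit = {
    val (s1, s2) = ("山东今天大雨", "云南今天大雨")
    // Hypothetical noun sets, written by hand for illustration only.
    val (n1, n2) = (Set("山东", "大雨"), Set("云南", "大雨"))
    println(levenshtein(s1, s2))
    println(jaccard(n1, n2))
    println(similar(s1, n1, s2, n2))
  }
}
```

With these assumed noun sets the pair passes all three checks. If the two texts shared no noun at all, the Jaccard condition alone would reject them, however small the edit distance.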
For a DataFrame, getLevenshtein can be replaced by Spark's built-in levenshtein function:
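import org.apache.spark.sql.functions.{col, greatest, length, levenshtein, lit}
// df is assumed to be an existing DataFrame with string columns text1 and text2
df.withColumn("editDistance", levenshtein(col("text1"), col("text2")))
  .withColumn("score", lit(1) - col("editDistance") / greatest(length(col("text1")), length(col("text2"))))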
2. MD5
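The post gives only the heading for this approach. A minimal sketch, assuming the idea is exact-duplicate detection: hash each text with MD5 and compare the digests. The object name `Md5Dedup` and the whitespace normalization step are assumptions for illustration, not something the post specifies.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object Md5Dedup {
  // Assumption: trim and collapse runs of whitespace so that trivial spacing
  // differences do not change the digest.
  def normalize(text: String): String = text.trim.replaceAll("\\s+", " ")

  // Hex-encoded MD5 digest of the normalized text.
  def md5Hex(text: String): String = {
    val bytes = normalize(text).getBytes(StandardCharsets.UTF_8)
    MessageDigest.getInstance("MD5").digest(bytes)
      .map(b => "%02x".format(b & 0xff))
      .mkString
  }

  // Two texts count as duplicates only if their digests are equal.
  def isDuplicate(a: String, b: String): Boolean = md5Hex(a) == md5Hex(b)

  def main(args: Array[String]): Unit = {
    println(isDuplicate("山东今天大雨", " 山东今天大雨 "))
    println(isDuplicate("山东今天大雨", "云南今天大雨"))
  }
}
```

Note that a digest comparison only catches texts that are identical after normalization; changing a single character produces an unrelated digest, so this is a cheap first-pass filter and not a similarity measure.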
3. Semantic vector models
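The post does not say which model to use. Whatever model produces the sentence vectors, they are commonly compared with cosine similarity, so here is a minimal sketch of that comparison step only. The object name `CosineSimilarity` and the vectors in `main` are made up for illustration; real vectors would come from an embedding model.

```scala
object CosineSimilarity {
  // Cosine of the angle between two vectors; 0 if either vector is all zeros.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical 3-dimensional "embeddings", not produced by any real model.
    val v1 = Array(0.2, 0.7, 0.1)
    val v2 = Array(0.25, 0.65, 0.05)
    println(cosine(v1, v2))
  }
}
```

A threshold on this value would then play the same role as the Jaccard and edit-distance thresholds in approach 1; a suitable value depends on the model and has to be tuned on real data.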
Other approaches
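import difflib
# As a rule of thumb, a ratio() value above 0.6 means the two sequences are close matches; 1 means they are identical
# quick_ratio() returns a fast upper bound on ratio(); its principle is similar to the Jaccard distance
difflib.SequenceMatcher(None, str1, str2).quick_ratio()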
References
Using Python's difflib