Prerequisites:
- Install Spark 1.5.x (see http://prasi82.blogspot.com/2015/11/setup-spark-on-windows.html for details)
- Install ZooKeeper and Solr 5.3.x
- If you have Solr 4.x, you will need to upgrade your Solr 4.x indexed data to Solr 5.x.
Steps:
1) Download the spark-solr sources from the following link and unzip them to a location, e.g. d:\spark-solr-master. Let's call this location SS_HOME:
https://github.com/LucidWorks/spark-solr
set SS_HOME=d:\spark-solr-master
2) Download Maven and set the following environment variables:
set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
3) Open a command prompt, make sure M2_HOME is set and added to PATH, then run:
cd %SS_HOME%
mvn install -DskipTests
This builds spark-solr-1.0-SNAPSHOT.jar in the %SS_HOME%\target folder.
4) Download the Spark-Solr Java sample from the link below (you will need Eclipse with the 'm2eclipse' Maven plugin):
https://github.com/freissmann/SolrWithSparks
Import it into Eclipse via Import... > Existing Maven Projects > Browse, selecting the downloaded and unzipped example directory, e.g. D:\SolrWithSparks-master.
The SparkSolrJobApp.java sample needs JDK 1.8 because it uses lambda expressions.
Open pom.xml and add the following if not already specified (a sketch follows):
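For example, something like this (a sketch: the compiler level matches the JDK 1.8 requirement above, and the spark-solr coordinates are an assumption based on the jar built in step 3; adjust to your sample's actual pom):

    <!-- compile with JDK 1.8 for the lambda expressions used in SparkSolrJobApp -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>

    <!-- assumption: the spark-solr jar built and installed locally in step 3 -->
    <dependency>
      <groupId>com.lucidworks.spark</groupId>
      <artifactId>spark-solr</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>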
If you are running Eclipse with JAVA_HOME pointing to a JRE path instead of a JDK path, you will get a "1.6 tools.jar missing" error. To resolve it, add a -vm entry to your eclipse_home\eclipse.ini as sketched below, then restart Eclipse.
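For example (a sketch; the JDK path reuses the one from step 6 and is an assumption, so adjust it to your install; the -vm entry must span two lines and appear before any -vmargs line):

    -vm
    D:\apps\Java\jdk1.8.0_65\bin\javaw.exe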
See the link below for additional info:
http://stackoverflow.com/questions/11118070/buiding-hadoop-with-eclipse-maven-missing-artifact-jdk-toolsjdk-toolsjar1
package de.blogspot.qaware.spark;

import com.lucidworks.spark.SolrRDD;
import de.blogspot.qaware.spark.common.ContextMaker;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import java.util.Arrays;

public class SparkSolrJobApp {

    private static final String ZOOKEEPER_HOST_AND_PORT = "zkhost:zkport";
    private static final String SOLR_COLLECTION = "collection1";
    private static final String QUERY_ALL = "*:*";

    public static void main(String[] args) throws Exception {
        String zkHost = ZOOKEEPER_HOST_AND_PORT;
        String collection = SOLR_COLLECTION;
        String queryStr = QUERY_ALL;

        JavaSparkContext javaSparkContext = ContextMaker.makeJavaSparkContext("Querying Solr");

        SolrRDD solrRDD = new SolrRDD(zkHost, collection);
        final SolrQuery solrQuery = SolrRDD.toQuery(queryStr);

        // Query all shards of the collection and get the results back as an RDD of documents
        JavaRDD<SolrDocument> solrJavaRDD = solrRDD.queryShards(javaSparkContext, solrQuery);

        // Extract the (optional) title field from each document, dropping empty ones
        JavaRDD<String> titleNumbers = solrJavaRDD.flatMap(doc -> {
            Object possibleTitle = doc.get("title");
            String title = possibleTitle != null ? possibleTitle.toString() : "";
            return Arrays.asList(title);
        }).filter(s -> !s.isEmpty());

        System.out.println("\n# of found titles: " + titleNumbers.count());

        // Now use schema information in Solr to build a queryable SchemaRDD
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Pro Tip: SolrRDD will figure out the schema if you don't supply a list of field names in your query
        DataFrame tweets = solrRDD.asTempTable(sqlContext, queryStr, "documents");

        // SQL can be run over RDDs that have been registered as tables.
        DataFrame results = sqlContext.sql("SELECT * FROM documents where id LIKE 'one%'");

        // The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
        // The columns of a row in the result can be accessed by ordinal.
        JavaRDD<Row> resultsRDD = results.javaRDD();

        System.out.println("\n\n# of documents where 'id' starts with 'one': " + resultsRDD.count());

        javaSparkContext.stop();
    }
}
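The ContextMaker helper ships with the SolrWithSparks sample and is not shown above; if you need to adapt it, a minimal equivalent sketch (assuming a local-mode master, which the actual sample may configure differently) looks like:

    package de.blogspot.qaware.spark.common;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ContextMaker {
        // Minimal sketch: build a local-mode JavaSparkContext with the given app name.
        public static JavaSparkContext makeJavaSparkContext(String appName) {
            SparkConf conf = new SparkConf()
                    .setAppName(appName)
                    .setMaster("local[*]"); // assumption: run locally using all cores
            return new JavaSparkContext(conf);
        }
    }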
5) Build an assembly jar of your code:
Modify pom.xml to build an assembly jar of your example code, i.e. a jar with all dependencies included, as sketched below.
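A typical way to do this is the standard maven-assembly-plugin (a sketch; your sample's pom may already contain an equivalent):

    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <!-- produces the *-jar-with-dependencies.jar used in step 6 -->
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>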
Open a command prompt, make sure M2_HOME is set as in step 2) and %M2_HOME%\bin is included in the PATH.
cd D:\SolrWithSparks-master
mvn clean
mvn package -DskipTests
This builds the jar of your code, with all its dependency classes included, in the target folder.
6) Run the Spark-Solr example:
Before you run this program, you will need to start your ZooKeeper ensemble and SolrCloud shards, for example:
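(The install paths below are illustrative assumptions; adjust them to your own ZooKeeper and Solr locations.)

    REM start ZooKeeper in one prompt
    cd /d D:\apps\zookeeper-3.4.6\bin
    zkServer.cmd

    REM start a SolrCloud node in a second prompt, pointing it at ZooKeeper
    cd /d D:\apps\solr-5.3.1\bin
    solr.cmd start -c -z localhost:2181 -p 8983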
Then go to a command prompt and run the following:
set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
set java_home=d:\apps\Java\jdk1.8.0_65
set jre_home=%java_home%\jre
set jdk_home=%JAVA_HOME%
set path=%java_home%\bin;%path%
set SPARK_HOME=D:\udu\hk\spark-1.5.1
set SPARK_CONF_DIR=%SPARK_HOME%\conf
call %SPARK_HOME%\bin\load-spark-env.cmd
%SPARK_HOME%\bin\spark-submit.cmd --class de.blogspot.qaware.spark.SparkSolrJobApp file://d:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Note that you must specify the path to your Spark job assembly jar starting with "file://", using '/' as the path separator instead of '\'.
After the jar path you may specify any additional parameters required by your Spark job.
7) Troubleshooting :
If you get a connection timeout error, make sure SPARK_LOCAL_IP is set in your %SPARK_HOME%\conf\spark-env.cmd:
set SPARK_LOCAL_IP=127.0.0.1
REM set ip address of spark master
REM set SPARK_MASTER_IP=127.0.0.1
If you get an error of the following type:
Exception Details:
  Location:
    org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient; @62: areturn
  Reason:
    Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, stack[0]) is not assignable to 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
  Current Frame:
    bci: @62
    flags: { }
    locals: { 'org/apache/solr/common/params/SolrParams', 'org/apache/http/conn/ClientConnectionManager', 'org/apache/solr/common/params/ModifiableSolrParams', 'org/apache/http/impl/client/DefaultHttpClient' }
    stack: { 'org/apache/http/impl/client/DefaultHttpClient' }
  Bytecode:
    0000000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
    0000010: 0099 001e b200 05bb 0007 59b7 0008 1209
    0000020: b600 0a2c b600 0bb6 000c b900 0d02 00bb
    0000030: 0011 592b b700 124e 2d2c b800 102d b0
  Stackmap Table:
    append_frame(@47,Object127)
Download the following jars (just google for them) and copy them to your %HADOOP_HOME%\share\hadoop\common\lib folder:
httpclient-4.4.1.jar
httpcore-4.4.1.jar
See here for details : https://issues.apache.org/jira/browse/SOLR-7948
8) FAQ
How to build Solr queries in a Spark job (a sketch follows the link):
https://svn.apache.org/repos/asf/lucene/solr/tags/release-1.3.0/client/java/solrj/test/org/apache/solr/client/solrj/SolrExampleTests.java
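A minimal SolrJ sketch (the field names here are illustrative):

    SolrQuery query = new SolrQuery("*:*");
    query.addFilterQuery("title:*");            // illustrative filter query
    query.setFields("id", "title");             // fields to return
    query.setStart(0);
    query.setRows(100);
    query.setSort("id", SolrQuery.ORDER.asc);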
How to sort a JavaRDD (a sketch follows the link):
http://stackoverflow.com/questions/27151943/sortby-in-javardd
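For example, to sort the titleNumbers RDD from the sample above (sortBy takes a key function, an ascending flag, and a partition count):

    JavaRDD<String> sortedTitles = titleNumbers.sortBy(title -> title, true, 1);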
How to invoke a Spark job from a Java program:
https://github.com/spark-jobserver/spark-jobserver
http://apache-spark-user-list.1001560.n3.nabble.com/Programatically-running-of-the-Spark-Jobs-td13426.html