Spark Solr Integration on Windows

Prerequisites:
  • Install ZooKeeper and Solr 5.3.x
  • If you have Solr 4.x, you will need to upgrade your Solr 4.x indexed data to Solr 5.x
    (see here for details: http://prasi82.blogspot.com/2015/11/migrating-solr-4x-index-data-to-solr-5x.html)


1) Download the spark-solr sources from the following link and unzip them to a location, e.g. d:\spark-solr-master. Let's call this location SS_HOME.

https://github.com/LucidWorks/spark-solr

set SS_HOME=d:\spark-solr-master

2) Download Maven and set the following environment variables:

set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;

3) Open a command prompt, make sure M2_HOME is set and added to the PATH, and run:

cd %SS_HOME%
mvn install -DskipTests

This will build spark-solr-1.0-SNAPSHOT.jar in the %SS_HOME%\target folder.


4) Download the Spark-Solr Java sample from the link below (you will need Eclipse with the 'm2eclipse' Maven plugin):

https://github.com/freissmann/SolrWithSparks

Import it into Eclipse via Import... > Existing Maven Projects and browse to the downloaded and unzipped example directory, e.g. D:\SolrWithSparks-master.

The SparkSolrJobApp.java sample needs JDK 1.8 for its lambda expressions.

Open the pom.xml and add the following if not already specified (see the sketch below):

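A minimal sketch of the compiler settings, assuming all you need is to force Java 8 source/target (the plugin version is only an example):

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.3</version>
      <configuration>
        <!-- compile the sample with JDK 1.8 so the lambda expressions work -->
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>
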
If you are running Eclipse with JAVA_HOME pointing to a JRE path instead of a JDK path, you will get a "missing jdk.tools / tools.jar (1.6)" error. To resolve this error, restart Eclipse after making the following change in your eclipse_home\eclipse.ini file (see the sketch below):

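The usual fix is to point Eclipse at a JDK VM via the -vm option in eclipse.ini. The JDK path below is only an assumption (it reuses the JDK path from step 6); the option must appear before the -vmargs line, with the option and the path on separate lines:

-vm
D:\apps\Java\jdk1.8.0_65\bin\javaw.exe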

See the link below for additional info:
http://stackoverflow.com/questions/11118070/buiding-hadoop-with-eclipse-maven-missing-artifact-jdk-toolsjdk-toolsjar1


package de.blogspot.qaware.spark;

import com.lucidworks.spark.SolrRDD;
import de.blogspot.qaware.spark.common.ContextMaker;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import java.util.Arrays;

public class SparkSolrJobApp {

    private static final String ZOOKEEPER_HOST_AND_PORT = "zkhost:zkport";
    private static final String SOLR_COLLECTION = "collection1";
    private static final String QUERY_ALL = "*:*";

    public static void main(String[] args) throws Exception {
        String zkHost = ZOOKEEPER_HOST_AND_PORT;
        String collection = SOLR_COLLECTION;
        String queryStr = QUERY_ALL;

        JavaSparkContext javaSparkContext = ContextMaker.makeJavaSparkContext("Querying Solr");

        SolrRDD solrRDD = new SolrRDD(zkHost, collection);
        final SolrQuery solrQuery = SolrRDD.toQuery(queryStr);
        JavaRDD<SolrDocument> solrJavaRDD = solrRDD.query(javaSparkContext.sc(), solrQuery);

        JavaRDD<String> titleNumbers = solrJavaRDD.flatMap(doc -> {
            Object possibleTitle = doc.get("title");
            String title = possibleTitle != null ? possibleTitle.toString() : "";
            return Arrays.asList(title);
        }).filter(s -> !s.isEmpty());

        System.out.println("\n# of found titles: " + titleNumbers.count());

        // Now use schema information in Solr to build a queryable SchemaRDD
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Pro Tip: SolrRDD will figure out the schema if you don't supply a list of field names in your query
        DataFrame documents = solrRDD.asTempTable(sqlContext, queryStr, "documents");

        // SQL can be run over RDDs that have been registered as tables.
        DataFrame results = sqlContext.sql("SELECT * FROM documents where id LIKE 'one%'");

        // The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
        // The columns of a row in the result can be accessed by ordinal.
        JavaRDD<Row> resultsRDD = results.javaRDD();

        System.out.println("\n\n# of documents where 'id' starts with 'one': " + resultsRDD.count());

        javaSparkContext.stop();
    }
}


5) Build an assembly jar of your code:

Modify the pom.xml to build an assembly jar of your example code, i.e. a jar with all the dependencies included (see the sketch below):

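A minimal sketch using the maven-assembly-plugin with the jar-with-dependencies descriptor (this matches the "-jar-with-dependencies" suffix of the jar used in step 6; the plugin version and mainClass are assumptions):

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.6</version>
      <configuration>
        <descriptorRefs>
          <!-- produces target\artifactId-version-jar-with-dependencies.jar -->
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
          <manifest>
            <mainClass>de.blogspot.qaware.spark.SparkSolrJobApp</mainClass>
          </manifest>
        </archive>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
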
Open a command prompt, make sure M2_HOME is set as indicated in step 2) and "%M2_HOME%\bin" is included in the PATH.

cd D:\SolrWithSparks-master
mvn clean
mvn package -DskipTests

This will build the jar of your code, with all its dependency classes included, in the "target" folder.

6) Run the Spark-Solr example:

Before you run this program, you will need to start your ZooKeeper ensemble and SolrCloud shards.

Open a command prompt and run the following:

set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
set java_home=d:\apps\Java\jdk1.8.0_65
set jre_home=%java_home%\jre
set jdk_home=%JAVA_HOME%
set path=%java_home%\bin;%path%
set SPARK_HOME=D:\udu\hk\spark-1.5.1
set SPARK_CONF_DIR=%SPARK_HOME%\conf
call %SPARK_HOME%\bin\load-spark-env.cmd

Now that the Spark environment is set, run the following command to execute your Spark job:

%spark_home%\bin\spark-submit.cmd --class de.blogspot.qaware.spark.SparkSolrJobApp file://D:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Note that you will need to specify the path to your Spark job assembly jar starting with "file://" and use '/' as the path separator instead of '\'.
After the jar path you may specify any additional parameters required by your Spark job.

7) Troubleshooting:


If you get a connection timeout error, make sure that SPARK_LOCAL_IP is set in your SPARK_HOME\conf\spark-env.cmd:

set SPARK_LOCAL_IP=127.0.0.1
REM set ip address of spark master
REM set SPARK_MASTER_IP=127.0.0.1


If you get an error of the following type:

Exception Details:
Location:
org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient; @62: areturn
Reason:
Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, stack[0]) is not assignable to 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
Current Frame:

bci: @62
flags: { }
locals:
{ 'org/apache/solr/common/params/SolrParams', 'org/apache/http/conn/ClientConnectionManager', 'org/apache/solr/common/params/ModifiableSolrParams', 'org/apache/http/impl/client/DefaultHttpClient' }
stack:
{ 'org/apache/http/impl/client/DefaultHttpClient' }
Bytecode:
0000000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
0000010: 0099 001e b200 05bb 0007 59b7 0008 1209
0000020: b600 0a2c b600 0bb6 000c b900 0d02 00bb
0000030: 0011 592b b700 124e 2d2c b800 102d b0
Stackmap Table:
append_frame(@47,Object127)

Download the following jars (just google for them) and copy them to your %HADOOP_HOME%\share\hadoop\common\lib folder.

httpclient-4.4.1.jar
httpcore-4.4.1.jar

See here for details : https://issues.apache.org/jira/browse/SOLR-7948

8) FAQ

How to build Solr queries in a Spark job:

https://svn.apache.org/repos/asf/lucene/solr/tags/release-1.3.0/client/java/solrj/test/org/apache/solr/client/solrj/SolrExampleTests.java
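For example, a query can be built programmatically with the SolrJ SolrQuery API before passing it to SolrRDD.query() (the field names below are just placeholders):

SolrQuery q = new SolrQuery("*:*");      // main query
q.addFilterQuery("type:book");           // filter query, cached separately by Solr
q.setFields("id", "title");              // restrict the returned fields
q.addSort("id", SolrQuery.ORDER.asc);    // sort clause
q.setRows(100);                          // page size per request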


How to sort a JavaRDD:

http://stackoverflow.com/questions/27151943/sortby-in-javardd
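For instance, with the titleNumbers RDD from the sample above, a simple alphabetical sort could look like this (a sketch, assuming the Spark 1.x Java API):

JavaRDD<String> sortedTitles = titleNumbers.sortBy(s -> s, true, 1);  // ascending, 1 partition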


How to invoke a Spark job from a Java program:

https://github.com/spark-jobserver/spark-jobserver

http://apache-spark-user-list.1001560.n3.nabble.com/Programatically-running-of-the-Spark-Jobs-td13426.html
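Besides the job-server approach above, Spark 1.4+ also ships a SparkLauncher API; a minimal sketch (the paths and class name are taken from the earlier steps and are only assumptions) could look like this:

import org.apache.spark.launcher.SparkLauncher;

public class JobLauncher {
    public static void main(String[] args) throws Exception {
        // Launches spark-submit as a child process with the given jar and main class
        Process spark = new SparkLauncher()
                .setSparkHome("D:\\udu\\hk\\spark-1.5.1")
                .setAppResource("D:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar")
                .setMainClass("de.blogspot.qaware.spark.SparkSolrJobApp")
                .setMaster("local[2]")
                .launch();
        spark.waitFor();  // block until the job process exits
    }
}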

Migrating Solr 4.X Index data to Solr 5.X index

To upgrade your Solr 4.x indexed data to Solr 5.x, run the following command:

java -cp D:\solr-5.3.1\server\solr-webapp\webapp\WEB-INF\lib\* org.apache.lucene.index.IndexUpgrader D:\solr-4.4.0\example\solr\collection1\data\index

Here it is assumed that "D:\solr-4.4.0\example\solr\collection1\data\index" is the directory that contains the indexed data you want to upgrade to Solr 5.3.x.

After this you can copy your Solr 4.4.x collection / core directory (e.g. D:\solr-4.4.0\example\solr\collection1 in the above command) to the Solr 5.3.x home directory.

After copying it to the Solr 5.3.x home directory, you will need to make a few changes in the schema.xml and solrconfig.xml of your collection:

In your collection\conf\solrconfig.xml, comment out the following:

In your collection\conf\schema.xml, change the field type classes to include the "Trie" prefix (see the sketch below):

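A minimal sketch, assuming your 4.x schema still uses the legacy numeric/date classes that were removed in Solr 5 (the fieldType names are just examples from a default schema):

<!-- Solr 4.x legacy classes (no longer supported in Solr 5) -->
<fieldType name="int"  class="solr.IntField"/>
<fieldType name="long" class="solr.LongField"/>
<fieldType name="date" class="solr.DateField"/>

<!-- Solr 5.x replacements using the Trie classes -->
<fieldType name="int"  class="solr.TrieIntField"  precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>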

Setup Apache Spark On Windows


Download Spark and unzip it, e.g.:
set SPARK_HOME=D:\udu\hk\spark-1.5.1

Spark needs Hadoop jars. Download Hadoop binaries for Windows (Hadoop 2.6.0) from:

    http://www.barik.net/archive/2015/01/19/172716/

Unzip Hadoop at some location, e.g.:

set HADOOP_HOME=D:\udu\hk\hadoop-2.6.0

If your JAVA_HOME or HADOOP_HOME path contains space characters (' '), you will need to convert the paths to short (8.3) paths:

  • Create a batch script with the following contents

@ECHO OFF
echo %~s1

  • Run the above batch script with your JAVA_HOME directory as its argument to get the short path for JAVA_HOME (see the usage example below)
  • Run the above batch script with your HADOOP_HOME directory as its argument to get the short path for HADOOP_HOME

set java_home=<short path obtained from the above command>
set hadoop_home=<short path obtained from the above command>

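For example, assuming the script is saved as shortpath.cmd (the name and the JDK path are arbitrary examples):

shortpath.cmd "D:\Program Files\Java\jdk1.8.0_65"
REM prints the short (8.3) form of the given path, e.g. D:\PROGRA~1\...
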
Run the following command and copy the classpath it prints; you will need it in the next step:

         %HADOOP_HOME%\bin\hadoop classpath

Under %SPARK_HOME%\conf, create a file named "spark-env.cmd" like the one below:

@echo off
set HADOOP_HOME=D:\udu\hk\hadoop-2.6.0
set PATH=%HADOOP_HOME%\bin;%PATH%
set SPARK_DIST_CLASSPATH=<paste the classpath output of "%HADOOP_HOME%\bin\hadoop classpath" copied in the previous step>

On a command prompt:

          cd %spark_home%\bin
          set SPARK_CONF_DIR=%SPARK_HOME%\conf
          load-spark-env.cmd
          spark-shell.cmd //To start spark shell
          spark-submit.cmd   //To submit spark job

Refer to the link below to create a Spark word count example (a minimal sketch also follows the link):
          http://www.robertomarchetto.com/spark_java_maven_example
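A minimal sketch of such a word count job, assuming Spark 1.5.x and the class/package name used in the spark-submit command below (this is not the exact code from the linked article):

package org.sparkexample;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Word Count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);               // input file/directory
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")))    // split lines into words
                .mapToPair(word -> new Tuple2<>(word, 1))           // (word, 1) pairs
                .reduceByKey((a, b) -> a + b);                      // sum counts per word

        counts.saveAsTextFile(args[1]);                             // output directory
        sc.stop();
    }
}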

To run the Java word count example from the above URL as a Spark job:

          spark-submit --class org.sparkexample.WordCount --master local[2]  your_spark_job_jar  Any_additional_parameters_needed_by_your_job_jar



References:

http://stackoverflow.com/questions/30906412/noclassdeffounderror-com-apache-hadoop-fs-fsdatainputstream-when-execute-spark-s

https://blogs.perficient.com/multi-shoring/blog/2015/05/07/setup-local-standalone-spark-node/

http://nishutayaltech.blogspot.com/2015/04/how-to-run-apache-spark-on-windows7-in.html