Spark Solr Integration on Windows

Prerequisites:
  • Install ZooKeeper and Solr 5.3.x.
  • If you have Solr 4.x, you will need to upgrade your Solr 4.x indexed data to Solr 5.x.
( See here for details http://prasi82.blogspot.com/2015/11/migrating-solr-4x-index-data-to-solr-5x.html )


1) Download the spark-solr sources from the following link and unzip them to a location, e.g. d:\spark-solr-master. Let's call this location SS_HOME.

https://github.com/LucidWorks/spark-solr

set SS_HOME=d:\spark-solr-master

2) Download Maven and set the following environment variables:

set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;

3) Open a command prompt, make sure M2_HOME is set and added to PATH, and run:

cd %SS_HOME%
mvn install -DskipTests

This will build the spark-solr-1.0.SNAPSHOT.jar in the %SS_HOME%\target folder.


4) Download the Spark-Solr Java sample from the link below (you will need Eclipse with the Maven 'm2eclipse' plugin):

https://github.com/freissmann/SolrWithSparks

Import it into Eclipse using Import... > Existing Maven Projects > Browse, and point it to the downloaded and unzipped example directory, e.g. D:\SolrWithSparks-master.

The SparkSolrJobApp.java sample needs JDK 1.8 for lambda expressions.

Open the pom.xml and add the following compiler settings if they are not already specified:
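A minimal sketch, assuming the project uses the standard maven-compiler-plugin to target JDK 1.8:

<build>
  <plugins>
    <!-- illustrative: compile the sample with JDK 1.8 so the lambda expressions build -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>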

 
 

If you are running Eclipse with JAVA_HOME pointing to a JRE path instead of a JDK path, you will get a "1.6 tools.jar missing" error. To resolve this error, restart Eclipse after making the following change in your eclipse_home\eclipse.ini file:
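For example (assuming the JDK install path used in step 6; adjust it to your own JDK), add a -vm entry before the -vmargs line in eclipse.ini:

-vm
D:\apps\Java\jdk1.8.0_65\bin\javaw.exe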


See the link below for additional info:
http://stackoverflow.com/questions/11118070/buiding-hadoop-with-eclipse-maven-missing-artifact-jdk-toolsjdk-toolsjar1


package de.blogspot.qaware.spark;

import com.lucidworks.spark.SolrRDD;
import de.blogspot.qaware.spark.common.ContextMaker;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import java.util.Arrays;

public class SparkSolrJobApp {

    private static final String ZOOKEEPER_HOST_AND_PORT = "zkhost:zkport";
    private static final String SOLR_COLLECTION = "collection1";
    private static final String QUERY_ALL = "*:*";

    public static void main(String[] args) throws Exception {
        String zkHost = ZOOKEEPER_HOST_AND_PORT;
        String collection = SOLR_COLLECTION;
        String queryStr = QUERY_ALL;

        JavaSparkContext javaSparkContext = ContextMaker.makeJavaSparkContext("Querying Solr");

        SolrRDD solrRDD = new SolrRDD(zkHost, collection);
        final SolrQuery solrQuery = SolrRDD.toQuery(queryStr);
        JavaRDD<SolrDocument> solrJavaRDD = solrRDD.query(javaSparkContext.sc(), solrQuery);

        JavaRDD<String> titleNumbers = solrJavaRDD.flatMap(doc -> {
            Object possibleTitle = doc.get("title");
            String title = possibleTitle != null ? possibleTitle.toString() : "";
            return Arrays.asList(title);
        }).filter(s -> !s.isEmpty());

        System.out.println("\n# of found titles: " + titleNumbers.count());

        // Now use schema information in Solr to build a queryable SchemaRDD
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Pro Tip: SolrRDD will figure out the schema if you don't supply a list of field names in your query
        DataFrame tweets = solrRDD.asTempTable(sqlContext, queryStr, "documents");

        // SQL can be run over RDDs that have been registered as tables.
        DataFrame results = sqlContext.sql("SELECT * FROM documents where id LIKE 'one%'");

        // The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
        // The columns of a row in the result can be accessed by ordinal.
        JavaRDD<Row> resultsRDD = results.javaRDD();

        System.out.println("\n\n# of documents where 'id' starts with 'one': " + resultsRDD.count());

        javaSparkContext.stop();
    }
}


5) Build an assembly jar of your code:

Modify pom.xml to build the assembly jar of your example code, i.e. a jar with all the dependencies included:
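One way to do this is with the maven-assembly-plugin and its jar-with-dependencies descriptor; the snippet below is a sketch (the mainClass shown is the sample's class, so adjust it if you renamed the package):

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
          <manifest>
            <mainClass>de.blogspot.qaware.spark.SparkSolrJobApp</mainClass>
          </manifest>
        </archive>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>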


   

Open a command prompt, make sure M2_HOME is set as indicated in step 2) and "%M2_HOME%\bin" is included in the PATH.

cd D:\SolrWithSparks-master
mvn clean
mvn package -DskipTests

This will build the jar of your code, with all its dependency classes included, in the "target" folder.

6) Run the Spark-Solr example:

Before you run this program, you will need to start your ZooKeeper and SolrCloud shards.

Open a command prompt and run the following:

set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
set java_home=d:\apps\Java\jdk1.8.0_65
set jre_home=%java_home%\jre
set jdk_home=%JAVA_HOME%
set path=%java_home%\bin;%path%
set SPARK_HOME=D:\udu\hk\spark-1.5.1
set SPARK_CONF_DIR=%SPARK_HOME%\conf
call %SPARK_HOME%\bin\load-spark-env.cmd

Now that the Spark environment is set, run the following command to execute your Spark job:

%spark_home%\bin\spark-submit.cmd --class de.blogspot.qaware.spark.SparkSolrJobApp file://d:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Note that you will need to specify the path to your Spark job assembly jar starting with "file://" and use '/' as the path separator instead of '\'.
After the jar path you may specify any additional parameters required by your Spark job.

7) Troubleshooting :


If you get a connection timeout error, make sure that SPARK_LOCAL_IP is set in your SPARK_HOME\conf\spark-env.cmd:

set SPARK_LOCAL_IP=127.0.0.1
REM set ip address of spark master
REM set SPARK_MASTER_IP=127.0.0.1


If you get an error of the following type:

Exception Details:
Location:
org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient; @62: areturn
Reason:
Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, stack[0]) is not assignable to 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
Current Frame:

bci: @62
flags: { }
locals:
{ 'org/apache/solr/common/params/SolrParams', 'org/apache/http/conn/ClientConnectionManager', 'org/apache/solr/common/params/ModifiableSolrParams', 'org/apache/http/impl/client/DefaultHttpClient' }
stack:
{ 'org/apache/http/impl/client/DefaultHttpClient' }
Bytecode:
0000000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
0000010: 0099 001e b200 05bb 0007 59b7 0008 1209
0000020: b600 0a2c b600 0bb6 000c b900 0d02 00bb
0000030: 0011 592b b700 124e 2d2c b800 102d b0
Stackmap Table:
append_frame(@47,Object127)

Download the following jars (just search the web for them) and copy them to your %hadoop_home%\share\hadoop\common\lib folder.

httpclient-4.4.1.jar
httpcore-4.4.1.jar

See here for details : https://issues.apache.org/jira/browse/SOLR-7948

8) FAQ

How to build Solr queries in a Spark job:

https://svn.apache.org/repos/asf/lucene/solr/tags/release-1.3.0/client/java/solrj/test/org/apache/solr/client/solrj/SolrExampleTests.java
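For quick reference, here is a minimal SolrJ sketch of building a query programmatically (the field names are illustrative, not from any particular schema):

import org.apache.solr.client.solrj.SolrQuery;

public class QueryBuilderExample {
    public static SolrQuery buildQuery() {
        SolrQuery query = new SolrQuery("*:*");          // q parameter
        query.addFilterQuery("title:[* TO *]");          // fq: keep only documents that have a title
        query.setFields("id", "title");                  // fl: fields to return
        query.setRows(100);                              // page size
        query.setSort("id", SolrQuery.ORDER.asc);        // sort parameter
        return query;
    }
}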


How to sort a JavaRDD

http://stackoverflow.com/questions/27151943/sortby-in-javardd
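In short, JavaRDD.sortBy takes a key function, an ascending flag, and a partition count. A minimal sketch (assuming a JavaSparkContext named javaSparkContext, as in the sample above):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// Sort a small RDD of titles alphabetically.
JavaRDD<String> titles = javaSparkContext.parallelize(Arrays.asList("zebra", "apple", "mango"));
JavaRDD<String> sorted = titles.sortBy(title -> title, true, 1); // key function, ascending, numPartitions
System.out.println(sorted.collect()); // [apple, mango, zebra]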


How to invoke a Spark job from a Java program:

https://github.com/spark-jobserver/spark-jobserver

http://apache-spark-user-list.1001560.n3.nabble.com/Programatically-running-of-the-Spark-Jobs-td13426.html
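Besides the job server above, Spark 1.4+ also ships org.apache.spark.launcher.SparkLauncher, which wraps spark-submit from plain Java. A minimal sketch (the paths, master, and class name reuse the earlier steps and are assumptions):

import org.apache.spark.launcher.SparkLauncher;

public class LaunchSparkSolrJob {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                .setSparkHome("D:\\udu\\hk\\spark-1.5.1")
                .setAppResource("file://d:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar")
                .setMainClass("de.blogspot.qaware.spark.SparkSolrJobApp")
                .setMaster("local[2]")
                .launch();
        int exitCode = spark.waitFor(); // block until the job finishes
        System.out.println("Spark job finished with exit code " + exitCode);
    }
}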

Migrating Solr 4.X Index data to Solr 5.X index

To upgrade your Solr 4.x indexed data to Solr 5.x, run the following command:

java -cp D:\solr-5.3.1\server\solr-webapp\webapp\WEB-INF\lib\* org.apache.lucene.index.IndexUpgrader D:\solr-4.4.0\example\solr\collection1\data\index

Here we assume "D:\solr-4.4.0\example\solr\collection1\data\index" is the directory that contains the indexed data that you want to upgrade to Solr 5.3.x.

After this you can copy your Solr 4.4.x collection / core directory (e.g. D:\solr-4.4.0\example\solr\collection1 in the above command) to the Solr 5.3.x home directory.

After copying it to the Solr 5.3.x home directory, you will need to make a few changes in the schema.xml and solrconfig.xml of your collection:

In your collection\conf\solrconfig.xml, comment out entries that are no longer supported in Solr 5.x, for example:
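A hedged illustration of the kind of entries that typically have to go (check your own solrconfig.xml for the exact handlers that fail to load on 5.x):

<!-- AdminHandlers were removed in Solr 5.x -->
<!-- <requestHandler name="/admin/" class="solr.admin.AdminHandlers" /> -->

<!-- the <admin> section is no longer supported
<admin>
  <defaultQuery>*:*</defaultQuery>
</admin>
-->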









In your collection\conf\schema.xml, change the numeric/date fieldType classes to include the "Trie" prefix, for example:
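The pre-Trie numeric classes were removed in Solr 5.x, so fieldType definitions like these need the Trie variants (a sketch; apply it to whichever numeric/date types your schema uses):

<!-- Solr 4.x style (no longer supported):
<fieldType name="int"  class="solr.IntField"/>
<fieldType name="long" class="solr.LongField"/>
-->

<!-- Solr 5.x style -->
<fieldType name="int"  class="solr.TrieIntField"  precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>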


Setup Apache Spark On Windows


Download Spark and unzip it, e.g.:
SPARK_HOME=D:\udu\hk\spark-1.5.1

Spark needs Hadoop jars. Download Hadoop binaries for Windows (Hadoop 2.6.0) from:

    http://www.barik.net/archive/2015/01/19/172716/

Unzip Hadoop at some location, e.g.:

HADOOP_HOME=D:\udu\hk\hadoop-2.6.0

If your JAVA_HOME or HADOOP_HOME path contains space characters in it, you will need to convert the path to its short (8.3) form:

  • Create a batch script with the following contents

@ECHO OFF
echo %~s1

  • Run the above batch script with the JAVA_HOME path as its argument to get the short path for JAVA_HOME
  • Run the above batch script with the HADOOP_HOME path as its argument to get the short path for HADOOP_HOME

set java_home=short path obtained from above command
set hadoop_home=short path obtained from above command.

Run the following command and copy the classpath it generates; you will need it in the next step:

         %HADOOP_HOME%\bin\hadoop classpath

Under spark_home\conf, create a file named "spark-env.cmd" like the one below:

@echo off
set HADOOP_HOME=D:\udu\hk\hadoop-2.6.0
set PATH=%HADOOP_HOME%\bin;%PATH%
set SPARK_DIST_CLASSPATH=<paste the classpath copied in the previous step>

On a command prompt:

          cd %spark_home%\bin
          set SPARK_CONF_DIR=%SPARK_HOME%\conf
          load-spark-env.cmd
          spark-shell.cmd //To start spark shell
          spark-submit.cmd   //To submit spark job

Refer to the link below to create a Spark word count example:
          http://www.robertomarchetto.com/spark_java_maven_example

To run the Spark word count example (written in Java) from the above URL:

          spark-submit --class org.sparkexample.WordCount --master local[2]  your_spark_job_jar  Any_additional_parameters_needed_by_your_job_jar



References :

http://stackoverflow.com/questions/30906412/noclassdeffounderror-com-apache-hadoop-fs-fsdatainputstream-when-execute-spark-s

https://blogs.perficient.com/multi-shoring/blog/2015/05/07/setup-local-standalone-spark-node/

http://nishutayaltech.blogspot.com/2015/04/how-to-run-apache-spark-on-windows7-in.html

How to split a Solr core into multiple shards on different machines

Download ZooKeeper.
Make a copy of conf\zoo_sample.cfg as zoo.cfg.
Create a zkdata folder.
In zoo.cfg, uncomment dataDir and provide the path to the zkdata folder (replace '\' in the path with '/').
Modify zk_home\bin\zkEnv.cmd to replace the %~dp0.. references with actual paths, like below:

REM assuming Zookeeper is installed at : D:\zk1
set ZOOCFGDIR=D:\zk1\conf
set ZOO_LOG_DIR=D:\zk1
set ZOO_LOG4J_PROP=INFO,CONSOLE
set CLASSPATH=%ZOOCFGDIR%
SET CLASSPATH=D:\zk1\*;D:\zk1\lib\*;%CLASSPATH%
SET CLASSPATH=D:\zk1\build\classes;D:\zk1\build\lib\*;%CLASSPATH%
set ZOOCFG=%ZOOCFGDIR%\zoo.cfg


Let's assume you have a standalone Solr with a "collection1" core, and we want to convert this into a SolrCloud cluster and split this core into multiple shards on different machines.

1) First start ZooKeeper; run the following command from the zk_home\bin path:

zkServer.cmd

2) Next, upload & link your standalone core config to ZooKeeper (assuming Solr is installed at D:\solr-4.4.0\mySolr1).
   Make sure to use '/' as the path separator instead of '\' for the confdir path below.
 
cd D:\solr-4.4.0\mySolr1\cloud-scripts
zkcli.bat -zkhost localhost:2181 -cmd upconfig -confdir D:/solr-4.4.0/mySolr1/solr/collection1/conf -confname collection1cfg
zkcli.bat -zkhost localhost:2181 -cmd linkconfig -collection collection1 -confname collection1cfg

3) Next, start your Solr with the following command (assuming ZooKeeper is running on localhost:2181 and Solr is installed at D:\solr-4.4.0\mySolr1):

java -Xmx2g -Djetty.port=8983 -DzkHost=localhost:2181 -Dsolr.solr.home=D:\solr-4.4.0\mySolr1\solr -jar start.jar

4) Go to localhost:8983/solr and verify the # of documents in your core with the following URL:

http://localhost:8983/solr/collection1/select?q=*:*&rows=0

Make a note of the numFound value in the response returned.
   
    Also verify that the following URL shows the "collection1" core we have as "shard1":
   
    http://localhost:8983/solr/#/~cloud

5) Next, in the browser, run the following command to split your core "collection1" into multiple shards (assuming shard1 is the name of your shard and collection1 is the name of your core):

http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1&async=5

This should generate 2 shards / cores:
collection1_shard1_1_replica1
collection1_shard1_0_replica1

6) Next, on another machine (e.g. machine2), start Solr with the following command (assuming ZooKeeper is running on machine1:2181; replace "machine1" with the name or IP address of machine1,
   and Solr on machine2 is installed at D:/solr-4.4.0/mySolr2):
 
java -Xmx2g -Djetty.port=7574 -DzkHost=machine1:2181 -Dsolr.solr.home=D:/solr-4.4.0/mySolr2/solr -jar start.jar

   This will start solr on port 7574 on machine 2
 
7) On machine2, run the following command to create a new core (collection1_shard1_1_core2), specifying collection=collection1 (the same as the original core name on machine1 before splitting into shards)
   and shard=shard1_1 (the same as shown on the http://localhost:8983/solr/#/~cloud page for the shard that you want to move to machine2):
 
    http://localhost:7574/solr/admin/cores?action=CREATE&collection=collection1&shard=shard1_1&name=collection1_shard1_1_core2

8) Now shut down the Solr instance (not ZooKeeper) running on machine1 by pressing Ctrl+C in the running Solr window on machine1.
   Move the collection1_shard1_1_replica1 folder out of the Solr installation, to some other location outside the SOLR_HOME\solr directory,
   and re-run Solr on machine1:
 
    java -Xmx2g -Djetty.port=8983 -DzkHost=localhost:2181 -Dsolr.solr.home=D:\solr-4.4.0\mySolr1\solr -jar start.jar
 
   Verify that the shard on machine2 is now shown as the shard leader for shard1_1.
 
   We have now successfully converted a single core named 'collection1' into a distributed collection 'collection1', with shard 'shard1_0' on the original machine1 and shard1_1 moved to another machine, machine2.
 
   

Java Regex to Parse RGBA Color Values



String keywords_color_regex = "^[a-z]*$";
String hex_color_regex = "^#[0-9a-f]{3}([0-9a-f]{3})?$";
String rgb_color_regex = "^rgb\\(\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*\\)$";
String rgba_color_regex = "^rgba\\(\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*((0.[1-9])|[01])\\s*\\)$";
String hsl_color_regex = "^hsl\\(\\s*(0|[1-9]\\d?|[12]\\d\\d|3[0-5]\\d)\\s*,\\s*((0|[1-9]\\d?|100)%)\\s*,\\s*((0|[1-9]\\d?|100)%)\\s*\\)$";

Source : http://stackoverflow.com/questions/12385500/regex-pattern-for-rgb-rgba-hsl-hsla-color-coding
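A minimal usage sketch with java.util.regex (the sample values are illustrative):

import java.util.regex.Pattern;

public class ColorRegexDemo {
    public static void main(String[] args) {
        // hex_color_regex from above; the other patterns are compiled and used the same way
        Pattern hexColor = Pattern.compile("^#[0-9a-f]{3}([0-9a-f]{3})?$");
        System.out.println(hexColor.matcher("#a1b2c3").matches()); // true
        System.out.println(hexColor.matcher("#zzz").matches());    // false
    }
}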





Mock it..

If you try to mock a method that returns a primitive type, Eclipse gives the following compiler error:

The expression of type boolean is boxed into Boolean

To get rid of this error:

Globally: Window > Preferences > Java > Compiler > Errors/Warnings > Potential programming problems > Boxing and unboxing conversions.
For a single project: Properties > Java Compiler > Errors/Warnings, then the same setting.
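For reference, a minimal Mockito sketch of the kind of stub that triggers the boxing diagnostic (the Validator interface here is made up for illustration):

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class MockItDemo {
    interface Validator {
        boolean isValid(String input);
    }

    public static void main(String[] args) {
        Validator validator = mock(Validator.class);
        // thenReturn(T) takes a Boolean, so the boolean literal is auto-boxed;
        // this is exactly what the "boxed into Boolean" diagnostic flags.
        when(validator.isValid("abc")).thenReturn(true);
        System.out.println(validator.isValid("abc")); // true
    }
}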

Source :

http://translate.google.co.in/translate?hl=en&sl=ru&u=http://www.sql.ru/forum/1059132/mockito-strannaya-konstrukciya-vnutri-when&prev=search