Data analysis using pivot faceting

Pivot faceting (also termed as Decision tree faceting) can be defined as tool that let us do certain automated operations such as sorting, computing count, sum and average of the information stored in a table. As a result, it displays the summarized information in another table. If you are looking for the summarized data (similar to statistics) probably for the purpose of data analysis or reporting, use of pivot faceting is the one for you. 

Before we understand pivot faceting using an example, let us go through the parameters that would help us experiment this concept better. These parameters are:

  • facet.pivot – It defines the fields that we intend to use for the pivot. We can include multiple values to this parameter, each list separated by comma, which would result in availability of multiple “facet_pivot” sections in the response.
  • facet.pivot.mincount – It defines the minimum document count that needs to match so that facet can be included in the result. This also means that if the number of documents present is less than the value assigned to this parameter, the facet wouldn’t be included in the search result. It holds the value 1 by default.

Let us consider an example of a music composition online store. Assume that there are two categories of music listed in the store, orchestra and vocal categories. The field that stores the category name is wm_cat. Moreover, we add one more field wm_stock which is a flag that determines whether the product is available in the stock or not. 

To start with, let us define the following index structure in our schema.xml file wherein we have chosen wm_cat and wm_stock fields for faceting:

<field name="wm_id" type="string" indexed="true" stored="true" required="true" />
<field name="wm_name" type="text" indexed="true" stored="true" />
<field name="wm_cat" type="string" indexed="true" stored="true" />
<field name="wm_stock" type="boolean" indexed="true" stored="true" required="true" />

Let us index the following sample data:

<add>
  <doc>
    <field name="wm_id">1</field>
    <field name="wm_name">Orchestra A</field>
    <field name="wm_cat">orchestra</field>
    <field name="wm_stock">true</field>
  </doc>
  <doc>
    <field name="wm_id">2</field>
    <field name="wm_name">Orchestra B</field>
    <field name="wm_cat">orchestra</field>
    <field name="wm_stock">true</field>
  </doc>
  <doc>
    <field name="wm_id">3</field>
    <field name="wm_name">Vocal A</field>
    <field name="wm_cat">vocal</field>
    <field name="wm_stock">false</field>
  </doc>
  <doc>
    <field name="wm_id">4</field>
    <field name="wm_name">Vocal B</field>
    <field name="wm_cat">vocal</field>
    <field name="wm_stock">true</field>
  </doc>
</add>

Let us assume that we are only interested in retrieving the product count of each category that are in stock and the count that are out-of-stock. In order to achieve our purpose, we run the following query:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.pivot=wm_cat,wm_stock

The preceding query results in the following response:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">34</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="indent">true</str>
      <str name="facet.pivot">wm_cat,wm_stock</str>
      <str name="q">*:*</str>
      <str name="rows">0</str>
    </lst>
  </lst>
  <result name="response" numFound="4" start="0">
  </result>
  <lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields"/>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
  <lst name="facet_pivot">
    <arr name="wm_cat,wm_stock">
      <lst>
        <str name="field">wm_cat</str>
        <str name="value">orchestra</str>
        <int name="count">2</int>
        <arr name="pivot">
          <lst>
            <str name="field">wm_stock</str>
            <bool name="value">true</bool>
            <int name="count">2</int>
          </lst>
        </arr>
      </lst>
      <lst>
        <str name="field">wm_cat</str>
        <str name="value">vocal</str>
        <int name="count">2</int>
        <arr name="pivot">
          <lst>
            <str name="field">wm_stock</str>
            <bool name="value">false</bool>
            <int name="count">1</int>
          </lst>
          <lst>
            <str name="field">wm_stock</str>
            <bool name="value">true</bool>
            <int name="count">1</int>
          </lst>
        </arr>
      </lst>

Now let us understand what we did and how it worked.

We have defined four fields (wm_id, wm_name, wm_cat and wm_stock) at schema.xml file. Since our index structure and data is quite similar, we will not discuss them in details and skip to our query.

In our query, we instructed Solr to match all the documents (q=*.* parameter), on the other hand, we don’t want any document (rows=0 parameter) to be returned in the response. Moreover, we want the faceting calculation to be activated (facet=true parameter) that uses pivot faceting. Additionally, we want the pivot faceting to work on wm_cat and wm_stock fields such that wm_cat falls at the top level of the tree and wm_stock field gets nested within wm_cat field. We can even add any number of fields in the nested tree by appending the desired field in facet.pivot parameter, separated by comma. For instance, we want to add another field (say, wm_price) to pivot facets, specify section in our query would look like facet.pivot=wm_cat,wm_stock,wm_price instead of facet.pivot=wm_cat,wm_stock.

In the response, we can see that each nested faceting calculation falls within <arr name="pivot"> XML tag. As we can see in the response structure, the first level of our facet pivot tree is based on wm_cat field where we can find two products (<int name="count">2</int>) that fall under orchestra category (<str name="value">orchestra</str>). At the second level, we find that these products have wm_stock field (<str name="field">wm_stock</str>) set to true (<bool name="value">true</bool>).

On the other hand, in case of vocal category, the situation seems to slightly differ. We can see two different sections; one for the products with wm_stock field value as false, and the other as true.

By the end of this section, we can conclude that the statistics we retrieved as a response to our query is absolutely appropriate.