Robert A Decker Programming Repository


Notes and articles that will reduce the pain

Basic Apache Sling Development Patterns: Configurations

Robert Decker - Wednesday, July 01, 2015

Apache Sling Configurations

Here's a real-world example of using Sling's OSGi configurations to control the garbage collection schedule that I presented in the last entry.

I had a project where we were processing millions of short pieces of text per day. The documents were generated in another system and saved to a folder served over WebDAV by Apache Sling where we processed the text with various engines, saved the results, and then deleted the document and its intermediary documents. It was a bit of a hack but surprisingly stable and speedy and worked well enough for what we needed.

One problem we found is that in Sling / Jackrabbit (and Adobe CQ5) the data is not removed from the hard drive when a node is deleted. This was compounded by the way we modified documents over WebDAV: we had to delete the original and write a new document containing the modifications, so every change to a document generated new nodes in Jackrabbit while unlinking the original node.

This caused our system to quickly build up gigabytes of unused data on the filesystem, so we had to set up a Jackrabbit repository garbage collection to run periodically. Normally you would run the garbage collection as infrequently as once per week, even on a large CQ5 installation with hundreds of authors. However, because we were creating, modifying, and deleting millions of documents a day, we found we had to run garbage collection several times a day.
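Because the right frequency differs so much between installations, the schedule is a natural candidate for configuration. Here is a minimal, dependency-free sketch of reading a schedule expression from an OSGi-style property map with a fallback default. The class name GcScheduleConfig, the property name gc.scheduler.expression, and both cron values are assumptions made for illustration; none of them come from the original project. In a real bundle the map would be handed to the component by the OSGi framework on activation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of reading a schedule from an OSGi-style configuration map.
final class GcScheduleConfig {
    // Assumed property name and default; neither comes from the original project.
    static final String PROP_EXPRESSION = "gc.scheduler.expression";
    static final String DEFAULT_EXPRESSION = "0 0 3 ? * SUN"; // Quartz syntax: Sundays at 03:00

    private final String expression;

    GcScheduleConfig(Map<String, Object> properties) {
        Object raw = properties.get(PROP_EXPRESSION);
        // Fall back to the default when the property is missing or blank.
        String value = raw == null ? "" : raw.toString().trim();
        this.expression = value.isEmpty() ? DEFAULT_EXPRESSION : value;
    }

    String expression() {
        return expression;
    }
}

final class GcScheduleConfigDemo {
    public static void main(String[] args) {
        // A busy WebDAV installation overrides the weekly default.
        Map<String, Object> busy = new HashMap<String, Object>();
        busy.put(GcScheduleConfig.PROP_EXPRESSION, "0 0 0/6 * * ?"); // every six hours
        System.out.println(new GcScheduleConfig(busy).expression());

        // An installation without configuration keeps the weekly default.
        System.out.println(new GcScheduleConfig(new HashMap<String, Object>()).expression());
    }
}
```

Keeping the default in code means a fresh installation still gets a sensible weekly run, while a heavily loaded one can be tuned without a rebuild.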

 

 

pom.xml maven-bundle-plugin settings:

sling-initial-content: text here

 


Basic Apache Sling Development Patterns: Handlers, Services, Servlets, Schedulers

Robert Decker - Thursday, April 24, 2014

Basic Apache Sling Development Patterns

After about a one-year hiatus I’m starting a new Apache Sling project. In order to prepare I’m reviewing some basic patterns that we found in the past to work well for Apache Sling development and OSGi development in general.

These patterns fall into the area of implementation strategy patterns, since they are more focused on program organization and parallel execution. Following common patterns makes your project more predictable and easier for other developers to understand.

A caveat - here I am describing very basic patterns. I won’t be talking about things like the whiteboard pattern even though it’s something that is used frequently in Sling. What I’m writing about here are the most basic bits of code that we found ourselves repeating many times in our projects. This is code that any Sling/CQ5 developer is probably already familiar with, but for someone new to Sling/CQ5 this will hopefully make your initial study much shorter.

Also, I will write only briefly about Sling events in this entry because I plan on covering this topic in more detail later. In future entries I will describe other patterns we commonly use in Apache Sling development.

Handlers, Services, Servlets, Schedulers


The most basic components that we write over and over again in Apache Sling are Handlers, Services, Servlets, and Schedulers. The more that you can spread functionality between these and the more you take advantage of the Sling eventing system to communicate between these components the better off you will be in the long run.

Handlers: Called through the Sling eventing system. Handlers may interact directly with Sling Services. By using the Sling eventing system we are able to take advantage of threading, thread pools, distributed processing, and other features that come from an event-driven architecture.
Services: Standard OSGi services that do most of our work and will usually be called from the Handlers.
Servlets: Components that can be interacted with over HTTP, usually in a REST style but not always. Your servlet code can do work directly but it’s best to create an event and have the work handled somewhere else. However, because a Servlet often has to answer the caller right away, this isn’t always possible.
Schedulers: Jobs that are executed automatically on a schedule. In the examples below they use the Sling Commons Scheduler to fire jobs into the Apache Sling event system.

When adding a piece of new functionality to a Sling/CQ5 project you must take some time to plan how you can divide its functionality among these component types.

Example - Apache Sling Repository Garbage Collection

For this example we will create a repository garbage collection system for Sling which will remove Jackrabbit nodes that are no longer in use. This is a feature in Sling that doesn't come out-of-the-box unless you buy Adobe's CQ5 product.

The garbage collection will run automatically on a schedule using a Scheduler, and we will also allow it to be run manually through a Servlet URL. Neither the Scheduler nor the Servlet will run the garbage collection directly; instead each will send a job to an event Handler, which in turn calls the garbage collection Service.

The DatastoreGCServiceImpl does the actual garbage collection work; its Java interface is DatastoreGCService. The DatastoreGCService is called directly from the DatastoreGCHandler. The DatastorePeriodicGC periodically fires off the <<gc>> event, which is handled by the DatastoreGCHandler. The DatastoreGCServlet is a servlet that can either interact directly with the DatastoreGCService or indirectly through the DatastoreGCHandler, depending on how responsive the servlet must be. In my code examples below I went with the indirect method.

You will never call a handler directly from your own code.

 Also, the Apache Sling event management system recently diverged from the base OSGi event system. My examples use the new Apache Sling system.
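Before the real classes, here is a stripped-down model of the flow just described, written in plain Java so it runs without Sling. TopicJobQueue and JobFlowDemo are hypothetical names and this is not the Sling JobManager API. It only shows the shape of the design: producers (the scheduler and the servlet) name a topic, the handler registers for that topic, and only the handler touches the service. It leaves out everything else the real job system does, such as persisting jobs.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Simplified stand-in for a topic-based job queue (hypothetical, not Sling's API).
final class TopicJobQueue {
    private final Map<String, Consumer<Map<String, Object>>> consumers = new ConcurrentHashMap<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // A handler registers itself for a topic.
    void register(String topic, Consumer<Map<String, Object>> consumer) {
        consumers.put(topic, consumer);
    }

    // Producers only name the topic; they never reference the handler.
    // Returns false when nothing is registered for the topic.
    boolean addJob(String topic, Map<String, Object> properties) {
        Consumer<Map<String, Object>> consumer = consumers.get(topic);
        if (consumer == null) {
            return false;
        }
        worker.execute(() -> consumer.accept(properties));
        return true;
    }

    // Stops accepting jobs and waits for the queued ones to finish.
    boolean shutdownAndWait(long seconds) {
        worker.shutdown();
        try {
            return worker.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

final class JobFlowDemo {
    static final String TOPIC = "com/astracorp/core/datastore/gc/requested";

    // Requests two runs, one as the scheduler would and one as the servlet would,
    // and returns how many times the "service" was actually invoked.
    static int runDemo() {
        AtomicInteger gcRuns = new AtomicInteger(); // stands in for the GC service
        TopicJobQueue queue = new TopicJobQueue();
        queue.register(TOPIC, props -> gcRuns.incrementAndGet()); // the handler
        queue.addJob(TOPIC, Collections.<String, Object>emptyMap()); // scheduler tick
        queue.addJob(TOPIC, Collections.<String, Object>emptyMap()); // servlet request
        queue.shutdownAndWait(5);
        return gcRuns.get();
    }

    public static void main(String[] args) {
        System.out.println("gc runs: " + runDemo());
    }
}
```

Because the producers never hold a reference to the handler, either side can be replaced or redeployed without the other noticing, which is the main reason for routing the work through a job instead of a direct method call.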



Java Interface DatastoreGCService:
package com.astracorp.examples.patterns.services;

import java.util.HashMap;
import java.util.Map;

/*
 * Interface class to the DatastoreGCService 
*/
public interface DatastoreGCService {

    // The event topic that is used to request a garbage collection
    public static final String TOPIC_DATASTORE_GC_REQUESTED = "com/astracorp/core/datastore/gc/requested";

    // Properties of the datastore garbage collection job. For now this is empty but would normally contain
    // fields describing the thread priority, job queue name, etc
    public static final Map<String, Object> DATASTORE_GC_REQUESTED_JOB_PROPERTIES = JobConstants.datastoreGCJobProperties();

    // the method that does the garbage collection work
    public void runDatastoreGarbageCollection();

    // Inner class for providing job properties
    public class JobConstants {
        public static Map<String, Object> datastoreGCJobProperties() {
            Map<String, Object> props = new HashMap<String,Object>();
            return props;
        }
    }
}


Implementation of DatastoreGCService:
package com.astracorp.examples.patterns.services.impl;

import com.astracorp.examples.patterns.services.DatastoreGCService;
import org.apache.felix.scr.annotations.*;
import org.apache.jackrabbit.api.management.DataStoreGarbageCollector;
import org.apache.jackrabbit.api.management.RepositoryManager;
import org.osgi.service.component.ComponentContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.jcr.RepositoryException;

/*
 * Implementation of the DatastoreGCService
*/
@Component(immediate = true, metatype = false, label = "Astracorp astra-example Datastore Garbage Collector Service", description = "Provides methods for datastore garbage collection and other repository cleanups")
@Service(value = DatastoreGCService.class)
public class DatastoreGCServiceImpl implements DatastoreGCService {
    private static final Logger LOGGER = LoggerFactory.getLogger(DatastoreGCServiceImpl.class);

    @Reference
    private RepositoryManager repositoryManager = null;

    //Runs a datastore garbage collection to clean up old files in the repository. Should be run periodically or more frequently if you are doing a lot of WebDAV operations
    @Override
    public void runDatastoreGarbageCollection() {
        LOGGER.debug("DatastoreGCService gc called. repositoryManager:" + repositoryManager);
        long time = System.currentTimeMillis();
        DataStoreGarbageCollector gc = null;
        try {
            gc = repositoryManager.createDataStoreGarbageCollector();
            LOGGER.debug("gc:" + gc);
            gc.mark();
            gc.sweep();
        } catch (RepositoryException e) {
            LOGGER.error("Error running the garbage collection", e);
        } finally {
            if (gc != null) {
                gc.close();
            }
        }
        LOGGER.debug("DatastoreGCService ran gc in " + ((System.currentTimeMillis() - time)/1000) + " seconds");
    }
}
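
The mark() and sweep() calls in the service above follow the general mark-and-sweep idea: first note everything that is still referenced, then delete the rest. Here is a toy in-memory sketch of that idea. ToyDataStore is a hypothetical class and a deliberate simplification; it does not reproduce how Jackrabbit tracks records internally.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Toy model of mark-and-sweep over a binary store (illustration only).
final class ToyDataStore {
    private final Map<String, byte[]> records = new HashMap<>(); // id -> stored bytes
    private final Set<String> marked = new HashSet<>();

    void put(String id, byte[] data) {
        records.put(id, data);
    }

    int size() {
        return records.size();
    }

    // Mark phase: remember every record that a live node still points to.
    void mark(Set<String> idsReferencedByLiveNodes) {
        marked.clear();
        marked.addAll(idsReferencedByLiveNodes);
    }

    // Sweep phase: delete every record that was not marked; returns how many were removed.
    int sweep() {
        int removed = 0;
        Iterator<String> it = records.keySet().iterator();
        while (it.hasNext()) {
            if (!marked.contains(it.next())) {
                it.remove();
                removed++;
            }
        }
        marked.clear();
        return removed;
    }
}

final class ToyDataStoreDemo {
    public static void main(String[] args) {
        ToyDataStore store = new ToyDataStore();
        store.put("v1", new byte[] {1}); // original upload
        store.put("v2", new byte[] {2}); // rewritten copy after a WebDAV save
        Set<String> live = new HashSet<>();
        live.add("v2"); // only the latest version is still referenced
        store.mark(live);
        System.out.println("removed=" + store.sweep() + " remaining=" + store.size());
    }
}
```

In this model the unlinked "v1" record stays in the store until a sweep runs, which mirrors why deleting nodes alone did not free any disk space for us.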


The DatastoreGCHandler:
package com.astracorp.examples.patterns.handlers;

import com.astracorp.examples.patterns.services.DatastoreGCService;
import org.apache.felix.scr.annotations.*;
import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.consumer.JobConsumer;
import org.osgi.service.component.ComponentContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/*
 * Handles datastore garbage collection-related events
*/
@Component(enabled = true, immediate = true, metatype = false, label = "Astracorp astra-example Datastore GC Handler", description = "Datastore garbage collector for the Jackrabbit repository")
@Service(value = JobConsumer.class)
@Property(name = JobConsumer.PROPERTY_TOPICS, value = DatastoreGCService.TOPIC_DATASTORE_GC_REQUESTED)
public class DatastoreGCHandler implements JobConsumer {
    public static final Logger LOGGER = LoggerFactory.getLogger(DatastoreGCHandler.class);

    @Reference
    private DatastoreGCService datastoreGCService = null;

    @Override
    public JobResult process(final Job job) {
        LOGGER.debug("process job called. about to call the gcservice:" + datastoreGCService);
        datastoreGCService.runDatastoreGarbageCollection();
        LOGGER.debug("finished calling the gcservice.");
        return JobResult.OK;
    }
}


The DatastorePeriodicGC:
The schedule is currently set to fire at seconds 0, 25, and 50 of every minute (roughly every 25 seconds) for debugging purposes. You would normally run this only once per week, or more frequently if you use WebDAV in your system.
package com.astracorp.examples.patterns.schedulers;

import com.astracorp.examples.patterns.services.DatastoreGCService;
import org.apache.felix.scr.annotations.*;
import org.apache.sling.commons.scheduler.Job;
import org.apache.sling.commons.scheduler.JobContext;
import org.apache.sling.event.jobs.JobManager;
import org.osgi.service.component.ComponentContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/*
 * Periodically sends the garbage collection event
*/
@Component(enabled = true, immediate = true, metatype = false, label = "Astracorp Datastore GC", description = "Creates a datastore GC event periodically")
@Service(value = Job.class)
@Properties({@Property(name="scheduler.expression", value="0/25 * * * * ?"), @Property(name="scheduler.concurrent", boolValue=false)})
public class DatastorePeriodicGC implements Job { // this is a scheduler.job
    public static final Logger LOGGER = LoggerFactory.getLogger(DatastorePeriodicGC.class);

    @Reference
    private JobManager jobManager = null;

    @Override
    public void execute(JobContext jobContext) {
        this.sendGarbageCollectEvent();
    }

    private void sendGarbageCollectEvent() {
        LOGGER.debug("sendGarbageCollectEvent called. sending " + DatastoreGCService.TOPIC_DATASTORE_GC_REQUESTED + ":" + DatastoreGCService.DATASTORE_GC_REQUESTED_JOB_PROPERTIES);
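        org.apache.sling.event.jobs.Job job = jobManager.addJob(DatastoreGCService.TOPIC_DATASTORE_GC_REQUESTED, DatastoreGCService.DATASTORE_GC_REQUESTED_JOB_PROPERTIES);
        // addJob returns null when the job could not be created
        if (job == null) {
            LOGGER.warn("Could not create the datastore garbage collection job");
        }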
    }
}
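
The scheduler.expression value above uses Quartz-style cron syntax, where the first field is seconds and "0/25" means "start at second 0, then every 25 seconds within the minute". Here is a small sketch that expands such a field so you can see exactly when it fires. CronSecondsField is a hypothetical helper written for this note and it handles only the start/step form, not full cron syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Expands a "start/step" seconds field into the seconds of the minute at which it fires.
final class CronSecondsField {
    static List<Integer> firingSeconds(String field) {
        String[] parts = field.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected start/step but got: " + field);
        }
        int start = Integer.parseInt(parts[0]);
        int step = Integer.parseInt(parts[1]);
        if (start < 0 || start > 59 || step <= 0) {
            throw new IllegalArgumentException("start must be 0-59 and step positive: " + field);
        }
        List<Integer> seconds = new ArrayList<>();
        // The step restarts at the top of every minute, so it never carries past second 59.
        for (int s = start; s <= 59; s += step) {
            seconds.add(s);
        }
        return seconds;
    }

    public static void main(String[] args) {
        System.out.println(firingSeconds("0/25")); // prints [0, 25, 50]
    }
}
```

Note the uneven spacing: after second 50 the next run is at second 0 of the following minute, only 10 seconds later. For a debugging schedule that does not matter, but it is worth remembering when you pick a step that does not divide 60 evenly.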


The DatastoreGCServlet:
package com.astracorp.examples.patterns.servlets;

import com.astracorp.examples.patterns.services.DatastoreGCService;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.sling.SlingServlet;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.apache.sling.event.jobs.JobManager;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/*
 * Servlet for starting the datastore garbage collection through a url
*/
@SlingServlet(paths={DatastoreGCServlet.DATASTORE_GC_URL_PATH}, methods = {"GET"})
public class DatastoreGCServlet extends SlingSafeMethodsServlet {
    public static final Logger LOGGER = LoggerFactory.getLogger(DatastoreGCServlet.class);

    public static final String DATASTORE_GC_URL_PATH = "/bin/util/gc";

    @Reference
    private JobManager jobManager = null;

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) {
        // fire off event that a datastore garbage collection was requested
        LOGGER.debug("doGet called. sending " + DatastoreGCService.TOPIC_DATASTORE_GC_REQUESTED + ":" + DatastoreGCService.DATASTORE_GC_REQUESTED_JOB_PROPERTIES);
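        org.apache.sling.event.jobs.Job job = jobManager.addJob(DatastoreGCService.TOPIC_DATASTORE_GC_REQUESTED, DatastoreGCService.DATASTORE_GC_REQUESTED_JOB_PROPERTIES);
        // addJob returns null when the job could not be created
        if (job == null) {
            LOGGER.warn("Could not create the datastore garbage collection job");
        }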
    }
}

You can trigger the garbage collection through the servlet at http://localhost:8080/bin/util/gc

 

Conclusion

This example provides you with the ability to do datastore garbage collections, something normally not automatically handled by Apache Sling. You are able to run the garbage collection on a schedule and you are also able to manually launch the garbage collection through a servlet action.

As a side note, I found that if you use Apache Sling's WebDAV feature you can potentially end up with a large number of unused nodes in your repository. It looks like every save to a file opened through WebDAV produces new nodes (as seen by some of the events being sent in Apache Sling), and so if you're doing a lot of automated WebDAV actions you will probably need to run this garbage collection more than once a week. When we were doing automated file processing over WebDAV we ended up having to run the garbage collection every hour.

Further reading:
https://sling.apache.org/documentation/bundles/apache-sling-eventing-and-job-handling.html
http://en.wikipedia.org/wiki/Event-driven_architecture