Sunday, 22 May 2011

Spring Batch in a Web Container

In this post I will show how to use Spring Batch in a web container (Tomcat). I will upload vacancy-related data from a flat file to a database using Spring Batch. Before showing how I did this, a brief introduction to Spring Batch is in order.

Spring Batch - An Introduction

Spring Batch is a lightweight batch processing framework designed for the bulk processing needed to perform business operations. It also provides logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. The diagram below shows the processing strategy provided by Spring Batch (source: http://static.springsource.org/spring-batch/reference/html/whatsNew.html)


A batch Job has one or more Steps.

A JobInstance represents a logical run of a Job. JobInstances are distinguished from each other by their JobParameters, the set of parameters used to start a batch job. Each run of a JobInstance is a JobExecution.
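
As a rough illustration of this identity rule, here is a plain-Java sketch. It does not use the Spring Batch API; the class JobInstanceKey and its fields are hypothetical and only model the idea that a job name plus its parameters identify a JobInstance.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical model: a JobInstance is identified by job name + parameters. */
final class JobInstanceKey {
    private final String jobName;
    private final Map<String, Object> parameters;

    JobInstanceKey(String jobName, Map<String, Object> parameters) {
        this.jobName = jobName;
        this.parameters = Collections.unmodifiableMap(new HashMap<String, Object>(parameters));
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof JobInstanceKey)) return false;
        JobInstanceKey that = (JobInstanceKey) other;
        return jobName.equals(that.jobName) && parameters.equals(that.parameters);
    }

    @Override
    public int hashCode() {
        return Objects.hash(jobName, parameters);
    }

    public static void main(String[] args) {
        Map<String, Object> monday = new HashMap<String, Object>();
        monday.put("date", "2011-05-16");
        Map<String, Object> tuesday = new HashMap<String, Object>();
        tuesday.put("date", "2011-05-17");

        JobInstanceKey first = new JobInstanceKey("vacancyjob", monday);
        JobInstanceKey again = new JobInstanceKey("vacancyjob", monday);
        JobInstanceKey next = new JobInstanceKey("vacancyjob", tuesday);

        System.out.println(first.equals(again)); // prints true: same name and parameters
        System.out.println(first.equals(next));  // prints false: different parameters
    }
}
```

This is why the controller later in this post passes the current date as a job parameter: every request then starts a new JobInstance.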

A Step contains all of the information necessary to define and control the actual batch processing. In our case the "vacancy_step" is responsible for uploading vacancy data from a flat file to the database.

ItemReader is responsible for retrieving the input for a Step, one item at a time, whereas ItemWriter represents the output of a Step, one batch or chunk of items at a time.
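
The read-one, write-a-chunk rhythm can be sketched in plain Java. This is only an illustration under simplifying assumptions, not the Spring Batch implementation: the class ChunkLoopSketch is hypothetical, an Iterator stands in for the ItemReader, and a list of lists records what the ItemWriter would receive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java sketch of chunk-oriented processing; not the Spring Batch classes. */
final class ChunkLoopSketch {

    /** Reads one item at a time and hands items to the writer one chunk at a time. */
    static List<List<String>> process(Iterator<String> reader, int commitInterval) {
        List<List<String>> writtenChunks = new ArrayList<List<String>>();
        List<String> chunk = new ArrayList<String>();
        while (reader.hasNext()) {
            chunk.add(reader.next());             // "ItemReader": one item per call
            if (chunk.size() == commitInterval) {
                writtenChunks.add(chunk);         // "ItemWriter": one chunk per call
                chunk = new ArrayList<String>();
            }
        }
        if (!chunk.isEmpty()) {
            writtenChunks.add(chunk);             // final, possibly smaller chunk
        }
        return writtenChunks;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("V001", "V002", "V003", "V004", "V005");
        System.out.println(process(items.iterator(), 2));
        // prints [[V001, V002], [V003, V004], [V005]]
    }
}
```

With a commit interval of 1, as configured for this post's step, every chunk holds a single item.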

JobLauncher is used to launch a Job with a given set of JobParameters.

JobRepository is used to store runtime information related to the batch execution.

A tasklet is an object containing any custom logic to be executed as a part of a job.

I have used SpringSource Tool Suite (STS) and Spring Roo to develop a simple web application that initiates the batch processing when it receives a request from a user. The figure below shows how batch processing is started upon receiving the request (source: http://static.springsource.org/spring-batch/reference/html/)




Spring Roo is very good for developing a prototype application in a short period of time using Spring best practices. You can also use Eclipse to implement this.

If you have STS, open it and create a Spring Roo project:

File -> New -> Spring Roo Project.

Give the project a name and a top-level package name.

Now open the Roo shell in STS and execute the commands below:

roo > persistence setup --database MYSQL --provider HIBERNATE
roo > entity --class ~.model.Vacancy --testAutomatically
roo > field string --fieldName referenceNo
roo > field string --fieldName title
roo > field string --fieldName salary

Here is my Vacancy entity class:

@RooJavaBean
@RooToString
@RooEntity
public class Vacancy {

      private String referenceNo;

      private String title;

      private String salary;
}

I have used MySQL as my backend database (you can use any database) and created a database called "batchsample". Please create a database and enter the details below in the "database.properties" file:

database.password=admin
database.url=jdbc\:mysql\://localhost\:3306/batchsample
database.username=root
database.driverClassName=com.mysql.jdbc.Driver

I have also written a simple integration test to check whether my database configuration is correct.

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = "classpath:/META-INF/spring/applicationContext.xml")
@Transactional
public class VacancyIntegrationTest {

    private SimpleJdbcTemplate jdbcTemplate;

    @Autowired
    public void initializeJdbcTemplate(DataSource ds) {
        jdbcTemplate = new SimpleJdbcTemplate(ds);
    }

    @Test
    public void testBatchDbConfig() {
        // Expects an empty vacancy table; the query mainly proves the database is reachable.
        Assert.assertEquals(0, jdbcTemplate.queryForInt("select count(0) from vacancy"));
    }
}

Run this test. If it passes, execute the Roo command below to create the web infrastructure for this application.

roo > controller all --package ~.web

Roo will create the necessary web structure, including a controller called "VacancyController" to handle requests.

I have slightly modified the VacancyController to meet my needs. Here is the controller:


@Controller
@RequestMapping("/vacancy/*")
public class VacancyController {
   
    private static Log log = LogFactory.getLog(VacancyController.class);
   
    @Autowired
    private ApplicationContext context;
   
    @RequestMapping("list")
    public String list(Model model) {
       
        model.addAttribute("vacancies", Vacancy.findAllVacancys());
       
        return "vacancy/list";
    }
   
    @RequestMapping("handle")
    public String jobLauncherHandle() {

        try {
            // Look the beans up inside the try block so the BeansException handler below applies.
            JobLauncher jobLauncher = (JobLauncher) context.getBean("jobLauncher");
            Job job = (Job) context.getBean("vacancyjob");

            log.info(jobLauncher);
            log.info(job);

            // A fresh "date" parameter on every request makes each run a new JobInstance.
            JobExecution jobExecution = jobLauncher.run(
                    job,
                    new JobParametersBuilder()
                            .addDate("date", new Date())
                            .toJobParameters());

            // The launcher uses an asynchronous task executor, so run() can return before
            // the job finishes and the exit code logged here may not be the final one.
            ExitStatus exitStatus = jobExecution.getExitStatus();
            log.info(exitStatus.getExitCode());
        }
        catch (JobExecutionAlreadyRunningException jobExecutionAlreadyRunningException) {
            log.info("Job execution is already running.");
        }
        catch (JobRestartException jobRestartException) {
            log.info("Job restart exception happens.");
        }
        catch (JobInstanceAlreadyCompleteException jobInstanceAlreadyCompleteException) {
            log.info("Job instance is already completed.");
        }
        catch (JobParametersInvalidException jobParametersInvalidException) {
            log.info("Job parameters invalid exception");
        }
        catch (BeansException beansException) {
            log.info("Bean is not found.");
        }

        return "vacancy/handle";
    }
}


Now it is time to include the batch configuration in applicationContext.xml.

applicationContext.xml

<context:property-placeholder location="classpath*:META-INF/spring/*.properties"/>

<context:spring-configured/>

<context:component-scan base-package="com.mega">
<context:exclude-filter expression=".*_Roo_.*" type="regex"/>
<context:exclude-filter expression="org.springframework.stereotype.Controller" type="annotation"/>
</context:component-scan>

<bean class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close" id="dataSource">
<property name="driverClassName" value="${database.driverClassName}"/>
<property name="url" value="${database.url}"/>
<property name="username" value="${database.username}"/>
<property name="password" value="${database.password}"/>
<property name="validationQuery" value="SELECT 1 FROM DUAL"/>
<property name="testOnBorrow" value="true"/>
</bean>

<bean class="org.springframework.orm.jpa.JpaTransactionManager" id="transactionManager">
<property name="entityManagerFactory" ref="entityManagerFactory"/>
</bean>

<tx:annotation-driven mode="aspectj" transaction-manager="transactionManager"/>

<bean class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean" id="entityManagerFactory">
<property name="dataSource" ref="dataSource"/>
</bean>

<import resource="classpath:/META-INF/spring/batch-context.xml"/>

<bean class="org.springframework.batch.core.launch.support.SimpleJobLauncher" id="jobLauncher">
<property name="jobRepository" ref="jobRepository"/>
<property name="taskExecutor">
<bean class="org.springframework.core.task.SimpleAsyncTaskExecutor"/>
</property>
</bean>

<bean class="org.springframework.batch.core.repository.support.JobRepositoryFactoryBean" id="jobRepository" p:dataSource-ref="dataSource" p:tablePrefix="BATCH_" p:transactionManager-ref="transactionManager">
<property name="isolationLevelForCreate" value="ISOLATION_DEFAULT"/>
</bean>

I have kept the batch job configuration in a separate file, "batch-context.xml".

batch-context.xml

<description>Batch Job Configuration</description>

<job id="vacancyjob" xmlns="http://www.springframework.org/schema/batch">
<step id="vacancy_step" parent="simpleStep">
<tasklet>
<chunk reader="vacancy_reader" writer="vacancy_writer"/>
</tasklet>
</step>
</job>

<bean id="vacancy_reader" class="org.springframework.batch.item.file.FlatFileItemReader">
<property name="resource" value="classpath:META-INF/data/vacancies.csv"/>
<property name="linesToSkip" value="1" />
<property name="lineMapper">
<bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
<property name="lineTokenizer">
<bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
<property name="names" value="reference,title,salary"/>
</bean>
</property>
<property name="fieldSetMapper">
<bean class="com.mega.batch.fieldsetmapper.VacancyMapper"/>
</property>
</bean>
</property>
</bean>

<bean id="vacancy_writer" class="com.mega.batch.item.VacancyItemWriter" />

<bean id="simpleStep"
class="org.springframework.batch.core.step.item.SimpleStepFactoryBean"
abstract="true">
<property name="transactionManager" ref="transactionManager" />
<property name="jobRepository" ref="jobRepository" />
<property name="startLimit" value="100" />
<property name="commitInterval" value="1" />
</bean>
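
To make the reader configuration more concrete, here is a plain-Java sketch of what it amounts to: skip the header line, split each remaining line on commas, and map the three named fields onto an object. It is not the VacancyMapper from the ZIP file and does not use the Spring Batch classes; VacancyLineSketch, VacancyRow and the sample row are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of what the reader configuration does; all names are hypothetical. */
final class VacancyLineSketch {

    /** Stand-in for the Vacancy entity. */
    static final class VacancyRow {
        final String referenceNo;
        final String title;
        final String salary;

        VacancyRow(String referenceNo, String title, String salary) {
            this.referenceNo = referenceNo;
            this.title = title;
            this.salary = salary;
        }
    }

    /** Skips the header (linesToSkip=1), splits on commas, and maps fields by position. */
    static List<VacancyRow> parse(List<String> lines) {
        List<VacancyRow> rows = new ArrayList<VacancyRow>();
        for (String line : lines.subList(Math.min(1, lines.size()), lines.size())) {
            String[] fields = line.split(",", -1);   // naive split: no quoted-field support
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields: " + line);
            }
            rows.add(new VacancyRow(fields[0].trim(), fields[1].trim(), fields[2].trim()));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "reference,title,salary",            // header line, skipped
                "V001,Java Developer,40000");        // made-up sample row
        VacancyRow row = parse(lines).get(0);
        System.out.println(row.referenceNo + " / " + row.title + " / " + row.salary);
        // prints V001 / Java Developer / 40000
    }
}
```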

I have written VacancyItemWriter to save the vacancy data in the database.

public class VacancyItemWriter implements ItemWriter<Vacancy> {

    private static final Log log = LogFactory.getLog(VacancyItemWriter.class);

    /**
     * Persists every vacancy in the chunk handed over by the step.
     *
     * @see ItemWriter#write(List)
     */
    public void write(List<? extends Vacancy> vacancies) throws Exception {

        for (Vacancy vacancy : vacancies) {
            log.info(vacancy);
            vacancy.persist();
            log.info("Vacancy is saved.");
        }
    }
}

You will find the additional helper classes such as VacancyMapper, ProcessorLogAdvice and SimpleMessageApplicationEvent in the attached ZIP file. Once the configuration is complete, run the application in your tc Server or Tomcat server.

In this article I have demonstrated Spring Batch in a web container by building a simple Spring application. Additional information is available in the Spring Batch reference documentation. Please download the application by clicking the link below and have fun!


Note: The Spring Batch monitoring tables can be created by running the commands in the "schema-mysql.sql" file, available in spring-batch-core-2.1.1.RELEASE.jar, at your MySQL command prompt.

References:

1. http://static.springsource.org/spring-batch/reference/html/
2. http://java.dzone.com/news/spring-batch-hello-world-1
3. http://static.springsource.org/spring-roo/reference/html/

 

Sunday, 15 May 2011

Java Garbage Collection Process

Efficient memory management is important for a software system to run smoothly. In this article I will describe my understanding of the Java garbage collection process. Any feedback is welcome. I hope that you will enjoy reading it.

So, what is garbage collection?

An application can create a large number of short-lived objects during its life span. These objects consume memory, and memory is not unlimited. Garbage collection (GC) is the process of reclaiming memory occupied by objects that are no longer needed (garbage) and making it available for new objects. An object is considered garbage when it can no longer be reached from any reference in the running program.
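
A small sketch can make reachability visible. It uses the standard java.lang.ref.WeakReference, which does not keep its referent alive; the class name ReachabilitySketch is hypothetical. Note that System.gc() is only a hint, so the last line is not guaranteed to print false.

```java
import java.lang.ref.WeakReference;

/** Sketch: an object becomes garbage once no strong reference can reach it. */
final class ReachabilitySketch {

    /** Returns true while the referent of the weak reference has not been collected. */
    static boolean isStillReachable(WeakReference<?> ref) {
        return ref.get() != null;
    }

    public static void main(String[] args) {
        Object data = new byte[1024];
        WeakReference<Object> watcher = new WeakReference<Object>(data);

        // 'data' still points at the array, so it cannot be collected yet.
        System.out.println(isStillReachable(watcher));
        System.out.println(data != null);   // keeps 'data' in use up to this point

        data = null;    // the array is now unreachable and eligible for collection
        System.gc();    // only a hint; the JVM may or may not collect right away
        System.out.println(isStillReachable(watcher)); // usually false, but not guaranteed
    }
}
```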

The heap plays a very important role in this process, because objects are allocated on the heap. To understand how Java garbage collection works, we need to know how the heap is designed in the Java Virtual Machine (JVM).

Heap

The heap is the memory area within the virtual machine where objects are born, live and die. It is divided into two parts:

(a) First part - the young space contains recent objects, also called children.
(b) Second part - the tenured space holds objects with a long life span, also called ancestors.

Next to the heap there is another memory area called Perm, in which the binary code of each loaded class is stored.

Heap layout (figure):

    Young:   Eden | Survivor | Survivor | Virtual
    Tenured: Objects | Virtual
    Perm:    Virtual

The young space is divided into Eden and two survivor spaces. Whenever a new object is allocated on the heap, the JVM puts it in Eden. The GC treats the two survivor spaces as temporary storage buckets. The young space/generation is for recent objects and the tenured space/generation is for old objects. Both the young and tenured spaces contain a virtual space - a zone of memory available to the JVM but free of any data. This means those spaces might grow and shrink with time.

How the garbage collection (GC) process works:

Memory is managed in “generations”, or memory pools holding objects of different ages. Garbage collection occurs in each generation when the generation fills up. The vast majority of objects are allocated in a pool dedicated to young objects (the young generation/space), and most objects die there. When the young generation fills up, it causes a “minor collection”. During a minor collection, the GC runs through every object in both Eden and the occupied survivor space to determine which ones are still alive, in other words which ones are still referenced. Each live object is then copied into the empty survivor space.

At the end of a minor collection, both Eden and the explored survivor space are considered empty. As minor collections are performed, living objects move from one survivor space to the other. When an object reaches a given age, dynamically defined at runtime by HotSpot, or when the survivor space gets too small, it is copied to the tenured space. Yet most objects are still born and die right in the young space.

Eventually, the tenured space/generation will fill up and must be collected, resulting in a major collection, in which the entire heap is collected. It is done with the help of the Mark-Sweep-Compact algorithm. During this process the GC will run through all the objects in the heap, mark the candidates for memory reclaiming and run through the heap again to compact remaining objects and avoid memory fragmentation. At the end of this cycle, all living objects exist side by side in the tenured space.
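
The copying and promotion rules described above can be modelled with a toy simulation. This is only a sketch: the class GenerationalHeapSketch, the "reachable" flag and the fixed tenuring threshold of 2 are assumptions made for illustration, whereas a real JVM traces references and chooses the threshold at runtime.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a generational heap; it only illustrates the copying and promotion rules. */
final class GenerationalHeapSketch {

    static final class Obj {
        final String name;
        boolean reachable = true;
        int age = 0;                        // number of minor collections survived
        Obj(String name) { this.name = name; }
    }

    static final int TENURING_THRESHOLD = 2;    // assumed fixed; HotSpot adjusts it at runtime

    final List<Obj> eden = new ArrayList<Obj>();
    List<Obj> occupiedSurvivor = new ArrayList<Obj>();
    final List<Obj> tenured = new ArrayList<Obj>();

    Obj allocate(String name) {
        Obj obj = new Obj(name);
        eden.add(obj);                      // new objects always start in Eden
        return obj;
    }

    /** Copies live objects out of Eden and the occupied survivor space, then empties both. */
    void minorCollection() {
        List<Obj> emptySurvivor = new ArrayList<Obj>();
        List<Obj> scanned = new ArrayList<Obj>(eden);
        scanned.addAll(occupiedSurvivor);
        for (Obj obj : scanned) {
            if (!obj.reachable) continue;   // garbage is simply not copied
            obj.age++;
            if (obj.age >= TENURING_THRESHOLD) {
                tenured.add(obj);           // old enough: promote to the tenured space
            } else {
                emptySurvivor.add(obj);
            }
        }
        eden.clear();
        occupiedSurvivor = emptySurvivor;   // the two survivor spaces swap roles
    }

    public static void main(String[] args) {
        GenerationalHeapSketch heap = new GenerationalHeapSketch();
        heap.allocate("cache");             // long-lived object
        Obj temp = heap.allocate("temp");
        temp.reachable = false;             // short-lived object dies young

        heap.minorCollection();
        heap.minorCollection();
        System.out.println(heap.tenured.size() + " tenured, "
                + heap.occupiedSurvivor.size() + " in survivor, "
                + heap.eden.size() + " in Eden");
        // prints 1 tenured, 0 in survivor, 0 in Eden
    }
}
```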

Performance

Throughput and pauses are the two primary measures of garbage collection performance.

Throughput is the percentage of total time not spent in garbage collection, considered over long periods of time. Throughput includes time spent in allocation.

Pauses are the times when an application appears unresponsive because garbage collection is happening.

For example, in an interactive graphics program even short pauses may negatively affect the user experience, whereas pauses during garbage collection may be tolerable in a web server.

Two other issues should be taken into consideration: footprint and promptness.

Footprint is the working set of a process, measured in pages and cache lines.

Promptness is the time between when an object becomes dead and when the memory becomes available.

A very large young generation may maximize throughput at the expense of footprint, promptness and pause times. On the other hand young generation pauses can be minimized by using a small young generation at the expense of throughput.

There is no one right way to size generations. The best choice is determined by the way the application uses memory as well as user requirements.
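
Throughput as defined above can be roughly estimated for a running JVM with the standard java.lang.management beans. The sketch below is an approximation under stated assumptions: the class GcThroughputSketch is hypothetical, the reported collection times are themselves approximate, and for concurrent collectors they may include work done while the application was running.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: estimate GC throughput for the current JVM from the standard management beans. */
final class GcThroughputSketch {

    /** Throughput as a percentage: time not spent in GC divided by total time. */
    static double throughputPercent(long gcMillis, long totalMillis) {
        if (totalMillis <= 0) return 100.0;
        return 100.0 * (totalMillis - gcMillis) / totalMillis;
    }

    /** Sums the approximate accumulated collection time reported by every collector. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();   // -1 if the collector does not report it
            if (millis > 0) total += millis;
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC throughput so far: %.2f%%%n",
                throughputPercent(totalGcMillis(), uptime));
    }
}
```

For instance, 50 ms of collection time over 1000 ms of uptime gives a throughput of 95%.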

Available Collectors

The Java HotSpot VM includes three different collectors, each with different performance characteristics:

(1) Serial Collector

  • it uses a single thread to perform all garbage collection work.
  • there is no communication overhead between threads.
  • it is best suited to single-processor machines.
  • it can be useful on multiprocessors for applications with small data sets (up to approximately 100MB).
  • it can be explicitly enabled with the option -XX:+UseSerialGC.

(2) Parallel/Throughput Collector

  • it performs minor collections in parallel, which can significantly reduce garbage collection overhead.
  • it is useful for applications with medium- to large-sized data sets that are run on multiprocessor or multithreaded hardware.
  • it can be explicitly enabled with the option -XX:+UseParallelGC

(3) Concurrent Collector

  • it performs most of its work concurrently (i.e. while the application is still running) to keep garbage collection pauses short.
  • it is designed for applications with medium- to large-sized data sets for which response time is more important than overall throughput.
  • it can be explicitly enabled with the option -XX:+UseConcMarkSweepGC.

Default Settings

By default the following selections were made in the J2SE platform version 1.4.2

    - Serial Garbage Collector
    - Initial heap size of 4 Mbyte
    - Maximum heap size of 64 Mbyte
    - Client runtime compiler

In the J2SE platform version 1.5 a class of machine referred to as a server-class machine has been defined as a machine with

    1.  >= 2 physical processors
    2.  >= 2 Gbytes of physical memory

Default settings for this type of machine:

    - Throughput Garbage Collector
    - Initial heap size of 1/64 of physical memory up to 1Gbyte
    - Maximum heap size of ¼ of physical memory up to 1 Gbyte
    - Server runtime compiler

Some of the HotSpot VM options can be used for tuning:

-Xms<size>

    - specifies the initial (minimum) size of the heap.
    - this option is used to avoid frequent resizing of the heap when the application needs a lot of memory.

-Xmx<size>

    - specifies the maximum size of the heap.
    - this option is used mainly by server-side applications that sometimes need several gigabytes of memory.

So the heap is allowed to grow and shrink between these two values defined by -Xms and -Xmx.
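
To see the effect of these two options, the running JVM can be asked for its current and maximum heap sizes through java.lang.Runtime. This is a minimal sketch; the class name HeapLimitsSketch is hypothetical, and the reported values are typically close to, but not exactly, the sizes given on the command line.

```java
/** Sketch: inspect the heap limits the running JVM was started with. */
final class HeapLimitsSketch {

    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();
        // totalMemory() is the heap currently reserved; it starts near the -Xms value.
        System.out.println("Current heap: " + toMegabytes(runtime.totalMemory()) + " MB");
        // maxMemory() is the ceiling the heap may grow to; it reflects the -Xmx value.
        System.out.println("Maximum heap: " + toMegabytes(runtime.maxMemory()) + " MB");
        // freeMemory() is the unused part of the current heap.
        System.out.println("Free in current heap: " + toMegabytes(runtime.freeMemory()) + " MB");
    }
}
```

Running it as, for example, java -Xms64m -Xmx256m HeapLimitsSketch should show the two limits moving with the options.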

-XX:NewRatio=<a number>

    - specifies the size ratio between the tenured and young space.

For example: -XX:NewRatio=2 would yield a 64 MB tenured space and a 32 MB young space, together a 96 MB heap.

-XX:SurvivorRatio=<a number>

    - specifies the size ratio between Eden and one survivor space.

For example: with a ratio of 2 and a young space of 64 MB, Eden will need 32 MB of memory whereas each survivor space will use 16 MB.
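
The arithmetic behind both ratio examples can be written down in a few lines. This is a simplified sketch with a hypothetical class name, GenerationSizingSketch; a real JVM also rounds and aligns the sizes, so actual values can differ slightly.

```java
/** Sketch of the sizing arithmetic behind -XX:NewRatio and -XX:SurvivorRatio. */
final class GenerationSizingSketch {

    /** Young size when tenured:young = newRatio:1 within the given heap. */
    static long youngSize(long heapSize, int newRatio) {
        return heapSize / (newRatio + 1);
    }

    /** Size of one survivor space when eden:survivor = survivorRatio:1 (two survivors). */
    static long survivorSize(long youngSize, int survivorRatio) {
        return youngSize / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        long young = youngSize(96, 2);
        System.out.println("young=" + young + " MB, tenured=" + (96 - young) + " MB");
        // prints young=32 MB, tenured=64 MB

        long survivor = survivorSize(64, 2);
        System.out.println("eden=" + (64 - 2 * survivor) + " MB, survivor=" + survivor + " MB each");
        // prints eden=32 MB, survivor=16 MB each
    }
}
```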

-XX:+PrintGCDetails

    - causes additional information about the collections to be printed.

-XX:MaxPermSize=<N>

    - using this option the maximum permanent generation size can be increased

References:

1. Know Your Worst Friend, the Garbage Collector by Romain Guy.

2. Virtual Machine Garbage Collection Tuning

3. Ergonomics in the 5.0 Java™ Virtual Machine