Linear list sorting using weighted digraph possible?

In the shower this morning revisited the sorting subject. The prior post presented a non-efficient, though interesting (to me)) approach using a topological ordering graph algorithm. Is it possible to make it efficient?

Linear graph creation
So I went back to my initial idea of creating the graph with three types of edges: less than (LT), greater than (GT), and equal (EQ). Thus, it would be a linear process to iterate the list of objects and create the graph based on the relationship between neighbors. That is, compare the first with the second, the second with the third and so forth. With the original list [9, 2, 12, 40, 12, 3, 1]

The resulting graph is:

list transform

Embedded sub graphs
Is there any way to take that and break and merge it so that the algorithmic efficiency is high? With the labeling of edges we created a weighted digraph.
What we have is potentially three digraphs. Listed by taking the inverse of each edge (sink,source):
The less than graph LT: (2,9),(12,40),(3,12),(1,3)

separate less than digraph

The greater than graph GT: (2,12),(12x,40)

separate greater than digraph
separate greater than digraph

The equals graph EQ: in this example empty, but the 12 -> 12x equality is in the original data.

Sub digraph sorting and merge
One approach to sorting the complete list is taking each subgraph (LT, GT, EQ), sorting each and then merging the results.

The LT graph sorted is:

sorted LT DAG
sorted LT DAG

The GT graph sorted is:

sorted GT subgraph
sorted GT subgraph

Then we just apply a merge algorthm. The gain with this approach is that we broke the graph into three or two subgraphs. We can reapply this recursively, but this is just a rediscovery of existing sorts like Merge Sort?

Continued graph augmentation?
Combined into one graph we get:

embedded DAGs

Note that if we take the sinks end of each edge we have:
less direction: 2, 12, 3, 1
greater direction: 12x, 2

Compare this to the original graph from previous blog post where we we compared every node to each other:

presorted digraph
presorted digraph

Just noticed that in the presorted digraph above there is a link from vertex 1 to 2 then to 3, but there is also a link from 1 to 3. If we remove the shortest link and keep the longest multiple links (the reverse of some graph algorithm goals) we are left with the correct sequence 1, 2, 3. Hmmm.

No sort, just links between subgraphs?
Could be possible to do a merge of the later separated graphs? Or can the original graph be augmented to capture extra information? Lets see, if I walk GT, and compare each source with LT, and the first greater than or equal to source vertex create a link to it and to the sink of that node? If the vertexes already appear skip? Something like:

DAG with Internal links
DAG with Internal links

But this one shows an ambiguity ‘2’ points to both ‘9’ and ’12’.

Updates

Further reading

  1. Weighted graphs and networks
  2. M. Jessup’s sample code on StackOverflow
  3. Topological.java
  4. topological_sort
  5. 4.2 Directed Graphs. Text book, part of course content at Princeton
  6. Directed graph
  7. “JUnit Tests as Inner Classes”
  8. An informal introduction to O(N) notation
  9. Bogosort
  10. Algorithm of the Week: Dijkstra Shortest Path in a Graph
  11. Design Distributed Digraph Algorithms using MapReduce; Journal of Computational Information Systems 7: 7 (2011) 2267-2276. http://www.jofcis.com/publishedpapers/2011_7_7_2267_2276.pdf.

Simple Java Data File

In a prior post “A very simple data file metaformat” I presented an idea for an INI type of file and used an example implementation in the Groovy language.

I rewrote it in Java. One change was the use of section terminators. I’m still not sure about specifying of the data type in the section header.

Note the only testing of this is the embedded JUnit tests as shown in the source.
Update: A version of this is being successfully used to store test data for a suite of unit tests. Each file holds several sections with hundreds of lines each of XML and other types.

API

  • load(): Load all sections into a map.
  • load(String id, String subsection): Load a specific section.
  • T loadAs(final Class type, final String id, final String subsection): load a section as the given class type
  • next(): Pull event style reading.
  • void parse(final BufferedReader reader,
    final Callable callable)
    : Push event style reading.
  • void parse(final String filePath, final Runnable runnable): Push event style reading.

Update 1 20121208T1809-5
Java Jar manifest files have a very useful property. Attributes which are in the ‘main’ section are inherited by individual entries unless redefined in the entry. For the application I envisioned this metaformat, that would be ideal. One way of doing this is that any data contained in a section ‘x’ is also applicable or in scope for any section x/y. And, so forth, x/y/z would have the data in x/y available.

Update 2 20130107T2000-5
Added method Map load() throws IOException. This is more easy to use method of accessing inix file section data. This is only in the source Gist repo.

Update 3 20130113T1300-5
Added loadAs(…) to make loading specific types easier.

Update 4 20131028T0939-5
Doh, forgot about mentioning Apache Commons Configuration. It does support hierarchical INI files. However, the sections in that format only support property mappings.

The test data file is:

# Example very simple data file
! Another comment

#[>map:/junk]
#[<]

[>list:credit/tests]
one
two
three
[<]

[>credit/report]
one,two,three
[<]

[>properties:credit/config]
one=alpha
two=beta
three=charlie
[<]
    [>xml:credit/beans]
	<description>
	    <item>one</item>
	    <item>two</item>
	    <item>three</item>
	</description>
	[<]
	
[>notype/sub]
[<]	

[>credit/alerts]
["one","two","three"]
[<]
[>credit]
one
two
three
[<]

[>credit/coverages]
["one","two","three"]
[<]


 

Further Reading

  1. INI file
  2. Java Properties file format
  3. Data File Metaformats
  4. Here document
  5. JSON configuration file format
  6. Groovy Object Notation (GrON) for Data
    Interchange
  7. Cloanto Implementation of INI File Format
  8. http://groovy.codehaus.org/Tutorial+5+-+Capturing+regex+groups
  9. URI
  10. RFC822
  11. MIME
[expand title=”Click to expand implementation source”]

[/expand]

A very simple data file metaformat

What is the simplest data file metaformat you can create and yet be able to handle future complexity? I started puzzling about this yet again.

Also see follow up post: Simple Java Data File
An example application is given here: Java testing using XStream, AspectJ, Vsdf

Scenario
I had some maintenance to do. So, to reduce the big ball of mud I decided to use external data files. This is where the complexity came in. If I have d data files for each component, and c “components” then the total number of data files is d*c. Future maintenance of so many files is not optimal.

One thing about the required data files in this scenario, some would contain lists, others would be key, value pairs, and so forth. Could these be combined into one data file? I looked at JSON, YAML, XML, and even GRON. Though good they seemed excessive. What if, for example, I needed a simple list? In a simple text file this could be stored with an item per line, or using simple separators. In the aforementioned metaformats not so.

Solution
I revisited the Windows INI file format and just added metadata to a section. A section, indicated with a header “[…]”, also indicates what its data format is. Also, we allow subsections: [type:identifier/section]. This is similar to a URI. The subsection, which can be a hierarchical ‘path’, is optional. The type is optional, default being list (update: text). If the file has no sections, it is just a line oriented file of data in a list. (Update: line oriented string data).

In the original ini file format, the section data were key=value pairs. Here we follow the freedom of a HEREDOC.

The data type indication is practical when standard collections are being created such as lists, map, arrays, and so forth. We can use a generic “text’ type for a non-typed string payload. Since a host application will know what data it is extracting from a data file, the higher level types such as XML, JSon, and others are of limited value.

The use of subsections in the section name allows scoping, but this was also possible in the original INI file format, just not “formalized”. True subsections should probably be nested sections, i.e. hierarchy. But, then we are now losing the simplicity.

Subsections (though not nested) allow the use of cascading data. Data in a section is automatically reused or available in matching subsections.
See Cascading Configuration Pattern.

Example

# Example very simple data file
#
[>list:credit/tests]
one
two
three
[>csv:credit/report]
one,two,three
[>properties:credit/config]
one=alpha
two=beta
three=charlie
[>xml:credit/beans]
<description>
	<item>one</item>
	<item>two</item>
	<item>three</item>
</description>
[>json:credit/alerts]
["one","two","three"]
[>credit]
one
two
three
[>gron:credit/coverages]
["one","two","three"]

 

Section Production Rule
**** Note: this is an incorrect production *****
file ::= section* ;
section ::= ‘[>’ (type:)? identifier (‘/’ subsection)? ‘]’ (data)+ sectionTerminator;
data ::= (line lineEnd)*;
identifier ::= name
subsection ::= name [/name]* ;
sectionTerminator ::= ‘[<' identifier ('/' subsection)? name ::= [a-zA-Z0-9-_]+

 

What we have now is a line oriented data file that can contain other data formats, and with no sections the file is just a line oriented list. Listing three is a demo in the form of a Groovy language JUnit 4 test.

[expand title=”Listing 2, Groovy language JUnit test as a demo”]

import com.octodecillion.vsdf.*
import static com.octodecillion.vsdf.Vsdf.EventType.*
import org.junit.Test
import static org.junit.Assert.assertEquals

/** Test Vsdf */
class VsdfTest /*extends GroovyTestCase*/ {
	def LINESEP =  System.properties.get("line.separator")
	
	@Test
	void testshouldGetListData(){
		def reader = new Vsdf()
		reader.reader = new BufferedReader(
			new FileReader(new File("data-1.vsdf")))
		
		def theEvent = reader.next()		
		
		while(theEvent != Vsdf.EventType.END){
			def event = reader.getEvent()
			
			if(isSectionCreditWithList(event)){
				def data = event.text.split(LINESEP)				
				assert data.size() == 3				
				assert ( (data[0] == 'one') && 
					     (data[1] == 'two') && 
					     (data[2] == 'three') )				
			}
			
			theEvent = reader.next()
		}
			
	}	
	
	/** */
	def isSectionCreditWithList(evt){		
		return evt.id == 'credit' && evt.dataType == 'list'
	}	

}

[/expand]

 

Sample run:

groovy -cp . VsdfTest.groovy
JUnit 4 Runner, Tests: 1, Failures: 0, Time: 281

Limitations
Not quite correct yet. One issue is that file encoding format. If we want to include other formats they have their own requirements, Java properties, JSON, XML, and so forth. For example, JSON is Unicode. I don’t think this is a major issue, this solution is meant for config data, so ASCII files are adequate.

Also, should the sections have terminators? Right now, the end of a section is simply the start of another. (Update: the version of this concept in actual use is terminator based, i.e., [<] or [<id/subsection...]) Implementation
Below in listing 3 is a very simple implementation in the Groovy language to show how easy this data file is to use. Note this is just a proof of concept and has not been thoroughly tested. I don’t think the use of mark and reset in the file reading is robust; how do you determine the correct read ahead buffer? To make it easier to parse I think the format will need to have section terminators as does HEREDOCS in Linux.

Source code available as a gist.

[expand title=”Listing 3, Groovy implementation”]

// File Vsdf.groovy
// Author: Josef Betancourt
//

package com.octodecillion.vsdf
import groovy.transform.TypeChecked;
import java.text.BreakDictionary;
import java.util.regex.Matcher
import org.codehaus.groovy.control.io.ReaderSource;

/**
 * @author Josef Betancourt
 *
 */
class Vsdf {
	String currentFolder
	String iniFilePath
	Reader reader
	int lineNum
	int sectionNum
	VsdfEvent data
	int READAHEADSIZE = 8*1024
	def LINESEP = System.properties.get("line.separator")

	enum State {
		INIT, ACCEPT, SHIFT, END
	}
	
	public enum EventType {
		COMMENT, SECTION, END
	}
	

	def state = State.INIT

	public Vsdf(){
		currentFolder = new File(".").getAbsolutePath()
	}
	
	/**
	 * Value object for parsed sections.
	 *  
	 */
	class VsdfEvent {
		EventType event
		String dataType
		String namespace
		String id
		String text
		int lineNum
		int sectionNum
		String sectionString
	}
	
	VsdfEvent getEvent(){
		return data		
	}

	/**
	 * 
	 * @return
	 */
	@TypeChecked
	EventType next(){
		String line = reader.readLine();
		lineNum++
		
		if(line == null){
			return EventType.END
		}
		
		String type =''
		String namespace = ''
		String id = ''
		data = new VsdfEvent()
		EventType eventType
		
		def isBlank = !line.trim() 
		
		// skip blank lines	
		if( isBlank){
			while(true){
				line = reader.readLine()
				lineNum++
				if(line  || line == null){
					break;
				}
			}
		}	
		
		def isComment = line =~ /^\s*#/
		if(isComment){
			data.text = line
			data.event = EventType.COMMENT
			data.lineNum = lineNum
			eventType = EventType.COMMENT			
		}
		
		if( line.trim() =~ /^\[>.*\]/){ // section?
			eventType = EventType.SECTION
			data.sectionString = line
			sectionNum++
			processSection(line, sectionNum, data)			
		} // end if section head

		return eventType 
	}
	
	/**
	 * 
	 * @param line
	 * @return
	 */
	def processSection(String line, int sectionNum, VsdfEvent data){
		data.event = EventType.SECTION
		data.sectionNum = sectionNum
		
		Matcher m = (line.trim() =~ /^\[>(.*)\]/)
		String mString = m[0][1]
		def current = mString.trim()
		if(!current){
			def msg = "section $sectionNum is blank"
			throw new IllegalArgumentException(msg)
		}
		
		def parts = (current =~ /^(.*):(.*)\/(.*)$/)
		if(!parts){
			data.id = current
			data.dataType='list'
		}else{
			long size = ((String[])parts[0]).length
			data.dataType = size > 0? parts[0][1] : ''
			data.id = size > 1  ? parts[0][2] : ''
			data.namespace = size > 2 ? parts[0][3] : ''
		}	
		
		String readData = readSectionData()
		data.text = readData	
	}
	
	/**
	 * 
	 * @return
	 */
	String readSectionData(){
		StringBuilder buffer = new StringBuilder(READAHEADSIZE)		

		while(true){
			reader.mark(READAHEADSIZE)
			String line = reader.readLine();
			lineNum++
			
			if(line == null){
				reader.reset()
				break
			}
			
			if( line.trim() =~ /^\[>.*\]/){ // section?				
				reader.reset()	
				break
			}else{
				buffer.append(line + LINESEP)	
			}
		}	
		
		return buffer.toString()	
	}
}

[/expand]

Further Reading

  1. INI file
  2. Data File Metaformats
  3. Here document
  4. JSON configuration file format
  5. Groovy Object Notation (GrON) for Data
    Interchange
  6. Cloanto Implementation of INI File Format
  7. http://groovy.codehaus.org/Tutorial+5+-+Capturing+regex+groups
  8. URI
  9. Designing a simple file format
  10. The Universal Design Pattern