## Linear list sorting using weighted digraph possible?

In the shower this morning revisited the sorting subject. The prior post presented a non-efficient, though interesting (to me)) approach using a topological ordering graph algorithm. Is it possible to make it efficient?

Linear graph creation
So I went back to my initial idea of creating the graph with three types of edges: less than (LT), greater than (GT), and equal (EQ). Thus, it would be a linear process to iterate the list of objects and create the graph based on the relationship between neighbors. That is, compare the first with the second, the second with the third and so forth. With the original list [9, 2, 12, 40, 12, 3, 1]

The resulting graph is:

Embedded sub graphs
Is there any way to take that and break and merge it so that the algorithmic efficiency is high? With the labeling of edges we created a weighted digraph.
What we have is potentially three digraphs. Listed by taking the inverse of each edge (sink,source):
The less than graph LT: (2,9),(12,40),(3,12),(1,3)

The greater than graph GT: (2,12),(12x,40)

The equals graph EQ: in this example empty, but the 12 -> 12x equality is in the original data.

Sub digraph sorting and merge
One approach to sorting the complete list is taking each subgraph (LT, GT, EQ), sorting each and then merging the results.

The LT graph sorted is:

The GT graph sorted is:

Then we just apply a merge algorthm. The gain with this approach is that we broke the graph into three or two subgraphs. We can reapply this recursively, but this is just a rediscovery of existing sorts like Merge Sort?

Continued graph augmentation?
Combined into one graph we get:

Note that if we take the sinks end of each edge we have:
less direction: 2, 12, 3, 1
greater direction: 12x, 2

Compare this to the original graph from previous blog post where we we compared every node to each other:

Just noticed that in the presorted digraph above there is a link from vertex 1 to 2 then to 3, but there is also a link from 1 to 3. If we remove the shortest link and keep the longest multiple links (the reverse of some graph algorithm goals) we are left with the correct sequence 1, 2, 3. Hmmm.

No sort, just links between subgraphs?
Could be possible to do a merge of the later separated graphs? Or can the original graph be augmented to capture extra information? Lets see, if I walk GT, and compare each source with LT, and the first greater than or equal to source vertex create a link to it and to the sink of that node? If the vertexes already appear skip? Something like:

But this one shows an ambiguity ‘2’ points to both ‘9’ and ’12’.

1. Weighted graphs and networks
2. M. Jessup’s sample code on StackOverflow
3. Topological.java
4. topological_sort
5. 4.2 Directed Graphs. Text book, part of course content at Princeton
6. Directed graph
7. “JUnit Tests as Inner Classes”
8. An informal introduction to O(N) notation
9. Bogosort
10. Algorithm of the Week: Dijkstra Shortest Path in a Graph
11. Design Distributed Digraph Algorithms using MapReduce; Journal of Computational Information Systems 7: 7 (2011) 2267-2276. http://www.jofcis.com/publishedpapers/2011_7_7_2267_2276.pdf.

## Simple Java Data File

In a prior post “A very simple data file metaformat” I presented an idea for an INI type of file and used an example implementation in the Groovy language.

I rewrote it in Java. One change was the use of section terminators. I’m still not sure about specifying of the data type in the section header.

Note the only testing of this is the embedded JUnit tests as shown in the source.
Update: A version of this is being successfully used to store test data for a suite of unit tests. Each file holds several sections with hundreds of lines each of XML and other types.

API

• T loadAs(final Class type, final String id, final String subsection): load a section as the given class type
• next(): Pull event style reading.
final Callable callable)
• void parse(final String filePath, final Runnable runnable): Push event style reading.

Update 1 20121208T1809-5
Java Jar manifest files have a very useful property. Attributes which are in the ‘main’ section are inherited by individual entries unless redefined in the entry. For the application I envisioned this metaformat, that would be ideal. One way of doing this is that any data contained in a section ‘x’ is also applicable or in scope for any section x/y. And, so forth, x/y/z would have the data in x/y available.

Update 2 20130107T2000-5
Added method Map load() throws IOException. This is more easy to use method of accessing inix file section data. This is only in the source Gist repo.

Update 3 20130113T1300-5

Update 4 20131028T0939-5
Doh, forgot about mentioning Apache Commons Configuration. It does support hierarchical INI files. However, the sections in that format only support property mappings.

The test data file is:

```# Example very simple data file
! Another comment

#[>map:/junk]
#[<]

[>list:credit/tests]
one
two
three
[<]

[>credit/report]
one,two,three
[<]

[>properties:credit/config]
one=alpha
two=beta
three=charlie
[<]
[>xml:credit/beans]
<description>
<item>one</item>
<item>two</item>
<item>three</item>
</description>
[<]

[>notype/sub]
[<]

["one","two","three"]
[<]
[>credit]
one
two
three
[<]

[>credit/coverages]
["one","two","three"]
[<]

```

[expand title=”Click to expand implementation source”]

[/expand]

## A very simple data file metaformat

What is the simplest data file metaformat you can create and yet be able to handle future complexity? I started puzzling about this yet again.

Also see follow up post: Simple Java Data File
An example application is given here: Java testing using XStream, AspectJ, Vsdf

Scenario
I had some maintenance to do. So, to reduce the big ball of mud I decided to use external data files. This is where the complexity came in. If I have d data files for each component, and c “components” then the total number of data files is d*c. Future maintenance of so many files is not optimal.

One thing about the required data files in this scenario, some would contain lists, others would be key, value pairs, and so forth. Could these be combined into one data file? I looked at JSON, YAML, XML, and even GRON. Though good they seemed excessive. What if, for example, I needed a simple list? In a simple text file this could be stored with an item per line, or using simple separators. In the aforementioned metaformats not so.

Solution
I revisited the Windows INI file format and just added metadata to a section. A section, indicated with a header “[…]”, also indicates what its data format is. Also, we allow subsections: [type:identifier/section]. This is similar to a URI. The subsection, which can be a hierarchical ‘path’, is optional. The type is optional, default being list (update: text). If the file has no sections, it is just a line oriented file of data in a list. (Update: line oriented string data).

In the original ini file format, the section data were key=value pairs. Here we follow the freedom of a HEREDOC.

The data type indication is practical when standard collections are being created such as lists, map, arrays, and so forth. We can use a generic “text’ type for a non-typed string payload. Since a host application will know what data it is extracting from a data file, the higher level types such as XML, JSon, and others are of limited value.

The use of subsections in the section name allows scoping, but this was also possible in the original INI file format, just not “formalized”. True subsections should probably be nested sections, i.e. hierarchy. But, then we are now losing the simplicity.

Subsections (though not nested) allow the use of cascading data. Data in a section is automatically reused or available in matching subsections.

Example

```# Example very simple data file
#
[>list:credit/tests]
one
two
three
[>csv:credit/report]
one,two,three
[>properties:credit/config]
one=alpha
two=beta
three=charlie
[>xml:credit/beans]
<description>
<item>one</item>
<item>two</item>
<item>three</item>
</description>
["one","two","three"]
[>credit]
one
two
three
[>gron:credit/coverages]
["one","two","three"]
```

Section Production Rule
**** Note: this is an incorrect production *****
file ::= section* ;
section ::= ‘[>’ (type:)? identifier (‘/’ subsection)? ‘]’ (data)+ sectionTerminator;
data ::= (line lineEnd)*;
identifier ::= name
subsection ::= name [/name]* ;
sectionTerminator ::= ‘[<' identifier ('/' subsection)? name ::= [a-zA-Z0-9-_]+

What we have now is a line oriented data file that can contain other data formats, and with no sections the file is just a line oriented list. Listing three is a demo in the form of a Groovy language JUnit 4 test.

[expand title=”Listing 2, Groovy language JUnit test as a demo”]

```import com.octodecillion.vsdf.*
import static com.octodecillion.vsdf.Vsdf.EventType.*
import org.junit.Test
import static org.junit.Assert.assertEquals

/** Test Vsdf */
class VsdfTest /*extends GroovyTestCase*/ {
def LINESEP =  System.properties.get("line.separator")

@Test
void testshouldGetListData(){

while(theEvent != Vsdf.EventType.END){

if(isSectionCreditWithList(event)){
def data = event.text.split(LINESEP)
assert data.size() == 3
assert ( (data[0] == 'one') &&
(data[1] == 'two') &&
(data[2] == 'three') )
}

}

}

/** */
def isSectionCreditWithList(evt){
return evt.id == 'credit' && evt.dataType == 'list'
}

}
```

[/expand]

Sample run:

```groovy -cp . VsdfTest.groovy
JUnit 4 Runner, Tests: 1, Failures: 0, Time: 281
```

Limitations
Not quite correct yet. One issue is that file encoding format. If we want to include other formats they have their own requirements, Java properties, JSON, XML, and so forth. For example, JSON is Unicode. I don’t think this is a major issue, this solution is meant for config data, so ASCII files are adequate.

Also, should the sections have terminators? Right now, the end of a section is simply the start of another. (Update: the version of this concept in actual use is terminator based, i.e., [<] or [<id/subsection...]) Implementation
Below in listing 3 is a very simple implementation in the Groovy language to show how easy this data file is to use. Note this is just a proof of concept and has not been thoroughly tested. I don’t think the use of mark and reset in the file reading is robust; how do you determine the correct read ahead buffer? To make it easier to parse I think the format will need to have section terminators as does HEREDOCS in Linux.

Source code available as a gist.

[expand title=”Listing 3, Groovy implementation”]

```// File Vsdf.groovy
// Author: Josef Betancourt
//

package com.octodecillion.vsdf
import groovy.transform.TypeChecked;
import java.text.BreakDictionary;
import java.util.regex.Matcher

/**
* @author Josef Betancourt
*
*/
class Vsdf {
String currentFolder
String iniFilePath
int lineNum
int sectionNum
VsdfEvent data
def LINESEP = System.properties.get("line.separator")

enum State {
INIT, ACCEPT, SHIFT, END
}

public enum EventType {
COMMENT, SECTION, END
}

def state = State.INIT

public Vsdf(){
currentFolder = new File(".").getAbsolutePath()
}

/**
* Value object for parsed sections.
*
*/
class VsdfEvent {
EventType event
String dataType
String namespace
String id
String text
int lineNum
int sectionNum
String sectionString
}

VsdfEvent getEvent(){
return data
}

/**
*
* @return
*/
@TypeChecked
EventType next(){
lineNum++

if(line == null){
return EventType.END
}

String type =''
String namespace = ''
String id = ''
data = new VsdfEvent()
EventType eventType

def isBlank = !line.trim()

// skip blank lines
if( isBlank){
while(true){
lineNum++
if(line  || line == null){
break;
}
}
}

def isComment = line =~ /^\s*#/
if(isComment){
data.text = line
data.event = EventType.COMMENT
data.lineNum = lineNum
eventType = EventType.COMMENT
}

if( line.trim() =~ /^\[>.*\]/){ // section?
eventType = EventType.SECTION
data.sectionString = line
sectionNum++
processSection(line, sectionNum, data)
} // end if section head

return eventType
}

/**
*
* @param line
* @return
*/
def processSection(String line, int sectionNum, VsdfEvent data){
data.event = EventType.SECTION
data.sectionNum = sectionNum

Matcher m = (line.trim() =~ /^\[>(.*)\]/)
String mString = m[0][1]
def current = mString.trim()
if(!current){
def msg = "section \$sectionNum is blank"
throw new IllegalArgumentException(msg)
}

def parts = (current =~ /^(.*):(.*)\/(.*)\$/)
if(!parts){
data.id = current
data.dataType='list'
}else{
long size = ((String[])parts[0]).length
data.dataType = size > 0? parts[0][1] : ''
data.id = size > 1  ? parts[0][2] : ''
data.namespace = size > 2 ? parts[0][3] : ''
}

}

/**
*
* @return
*/

while(true){
lineNum++

if(line == null){
break
}

if( line.trim() =~ /^\[>.*\]/){ // section?
break
}else{
buffer.append(line + LINESEP)
}
}

return buffer.toString()
}
}
```

[/expand]