I’ve come across a lot of text-based processing and analytics software written in Python, but not so much in Ruby, so I’ve decided to share an example of how to algorithmically parse lots of text with plain Ruby code – no dependencies required!

The problem statement is simple.

Given a newline-delimited text file, with each line’s fields separated by commas, determine the $m$ most common strings of length $n$, accounting for ties among the most common strings.

Let’s get hacking 🤓

Start off by defining a Ruby class and some empty functions.

require 'csv'

class Pathfinder
  def initialize
  end

  def delimited_data
  end

  def find_freq_path
  end

  def limiter
  end
end


We’ll be using Ruby’s built-in CSV library to do the file parsing, and defining a few functions:

• The initialize function will set a few attributes: :input_filename, :target_length ($n$), :col_delimiter, and :limit_results ($m$).
• The delimited_data function will return an object that Ruby has read into memory so that we can iterate over the rows. Let’s assume for the purposes of this post that the file fits into memory.
• find_freq_path will use a rolling window to build up a frequency counter of the strings of length $n$.
• limiter will limit our results to the top $m$ results, the top $m$ being the strings that appear most frequently.
require 'csv'

class Pathfinder
  attr_accessor :input_filename, :target_length, :col_delimiter, :limit_results

  def initialize(input_filename: 'data.txt',
                 col_delimiter: ',',
                 target_length: 5,
                 limit_results: 1)
    @input_filename = input_filename
    @col_delimiter = col_delimiter
    @target_length = target_length
    @limit_results = limit_results
  end

  # ...
end


Basically this is telling us that we’ll be parsing data.txt (rows delimited by newlines and columns delimited by commas) for the string of length 5 that appears most often.

Additionally, these fields are parameterized in case our delimiters change, we want to use another file, we change our target length, or we change the number of results we want.

def delimited_data
  # Read the whole file into memory as an array of row arrays.
  CSV.read(@input_filename, col_sep: @col_delimiter)
end
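To get a feel for what the CSV library hands back here, a standalone sketch (the sample file contents below are made up for illustration):

```ruby
require 'csv'
require 'tempfile'

# Hypothetical two-row, comma-delimited file, shaped like data.txt.
file = Tempfile.new(['sample', '.txt'])
file.write("a,b,c\nd,e,f\n")
file.close

# CSV.read pulls the whole file into memory as an array of row arrays.
rows = CSV.read(file.path, col_sep: ',')
rows  # => [["a", "b", "c"], ["d", "e", "f"]]
```

Each row comes back as an array of string columns, which is exactly the shape the iteration below expects.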


And run that through our find_freq_path function

def find_freq_path
  result = {}

  delimited_data.each do |row|
    keys = []
    row.each do |col|
      keys.push col
      if keys.length == @target_length
        increment_hash_for_key result, arr_to_str(keys)
        keys.shift
      end
    end
  end

  limiter result
end
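The rolling-window bookkeeping inside find_freq_path can be seen in isolation on made-up data: push columns into a window, and once it holds $n$ entries, count the joined string and drop the oldest column.

```ruby
# Count every window of 3 consecutive columns in a single sample row.
row = %w[a b c a b c a b]
target_length = 3

counts = Hash.new(0)  # default each count to 0
window = []
row.each do |col|
  window.push(col)
  if window.length == target_length
    counts[window.join] += 1  # the joined window is the hash key
    window.shift              # slide the window forward by one column
  end
end

counts  # => {"abc"=>2, "bca"=>2, "cab"=>2}
```

Eight columns yield six windows, and each distinct 3-column string here happens to appear twice.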


And now let’s write our limiter, increment, and hash-key (arr_to_str) functions.

def limiter(hash)
  arr = []

  @limit_results.times do
    max = hash.max_by { |_key, value| value }
    break unless max # stop early if we run out of distinct counts

    max_v = max[1]
    arr.push hash.select { |_key, value| value == max_v }.to_a
    hash.delete_if { |_key, value| value == max_v }
  end

  arr
end

def increment_hash_for_key(hash, key)
  hash[key] = hash[key] ? hash[key] + 1 : 1
end

def arr_to_str(arr)
  arr.map(&:to_s).join
end
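To see the tie handling at work, here is equivalent selection logic inlined against a hypothetical frequency hash: with a limit of 1, both strings tied at the top count come back together.

```ruby
# Made-up frequency hash with a two-way tie for first place.
hash = { 'abc' => 3, 'xyz' => 3, 'def' => 1 }
limit = 1

arr = []
limit.times do
  max = hash.max_by { |_key, value| value }
  break unless max
  max_v = max[1]
  # Keep every entry that ties with the current maximum count.
  arr.push hash.select { |_key, value| value == max_v }.to_a
  hash.delete_if { |_key, value| value == max_v }
end

arr  # => [[["abc", 3], ["xyz", 3]]]
```

Each pass through the loop collects one “tier” of tied strings and deletes it from the hash, so the next pass finds the next-highest count.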


When you’re all done it should look like this.

require 'csv'

class Pathfinder
  attr_accessor :input_filename, :target_length, :col_delimiter, :limit_results

  def initialize(input_filename: 'data.txt',
                 col_delimiter: ',',
                 target_length: 5,
                 limit_results: 1)
    @input_filename = input_filename
    @col_delimiter = col_delimiter
    @target_length = target_length
    @limit_results = limit_results
  end

  def delimited_data
    # Read the whole file into memory as an array of row arrays.
    CSV.read(@input_filename, col_sep: @col_delimiter)
  end

  def find_freq_path
    result = {}

    delimited_data.each do |row|
      keys = []
      row.each do |col|
        keys.push col
        if keys.length == @target_length
          increment_hash_for_key result, arr_to_str(keys)
          keys.shift
        end
      end
    end

    limiter result
  end

  def limiter(hash)
    arr = []

    @limit_results.times do
      max = hash.max_by { |_key, value| value }
      break unless max # stop early if we run out of distinct counts

      max_v = max[1]
      arr.push hash.select { |_key, value| value == max_v }.to_a
      hash.delete_if { |_key, value| value == max_v }
    end

    arr
  end

  def increment_hash_for_key(hash, key)
    hash[key] = hash[key] ? hash[key] + 1 : 1
  end

  def arr_to_str(arr)
    arr.map(&:to_s).join
  end
end



We can write a test for this class too. In a new file in the same directory as your Pathfinder class, write the following.

require_relative 'pathfinder'

pathfinder = Pathfinder.new
pathfinder.find_freq_path


My data.txt is about 200KB and 10K rows, and running that script with the target_length and limit_results as mentioned above, I got the following result.

$ time ruby pathfinder_test.rb
ruby pathfinder_test.rb  0.20s user 0.05s system 96% cpu 0.258 total


Fairly quick. Let’s try with different parameters.

pathfinder = Pathfinder.new(target_length: 5, limit_results: 5)
pathfinder.find_freq_path


$ time ruby pathfinder_test.rb
ruby pathfinder_test.rb  0.23s user 0.05s system 97% cpu 0.291 total


Negligible time difference here, which is great. We could use limit_results to cache the current top results into a frequently read datastore, and re-run our text-processing script whenever we’ve accumulated more rows in our file.

I hope this example of writing a sample text-processing class in Ruby has shown that you can write performant statistical analytics tools with Ruby – so if you’re writing a lot of API code in Rails, you don’t always have to switch to Python-based libraries for data analytics.