Chapter 3: CSV Preprocessing - Building a Reliable In-Memory Dataset


Why We Preprocess the CSV

The CSV file is an external input.
External inputs are among the least reliable parts of a system, because we do not control how they are produced or edited.

Before we introduce Flow, we want to answer three questions:

  1. Is the file readable?

  2. Does it match the expected schema?

  3. Can every row be safely converted into a Kotlin object?

If the answer to any of these is “no”, the program should fail immediately, not halfway through a Flow pipeline.

CSV Preprocessing Strategy

We handle CSV preprocessing as a one-time operation.

We deliberately delegate CSV parsing to a library (OpenCSV) so that we can study Kotlin Flow without distractions.

During preprocessing, we:

  • Verify that the file is not empty.

  • Validate the header exactly.

  • Ensure every row has the expected number of columns.

  • Parse numeric fields eagerly.

  • Fail fast if anything is malformed.
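The "parse eagerly, fail fast" steps above can be sketched with a small helper that attaches the row number and column name to a conversion failure. This is only an illustration: the helper name parseField and its parameters are hypothetical and are not part of the loader shown later in this chapter.

```kotlin
// Hypothetical helper: converts one CSV cell and reports where it failed.
// `convert` returns null when the text cannot be parsed.
fun <T : Any> parseField(
    raw: String,
    rowNumber: Int,
    column: String,
    convert: (String) -> T?
): T = requireNotNull(convert(raw.trim())) {
    "Row $rowNumber: cannot parse '$raw' as $column"
}

fun main() {
    // Well-formed values are converted immediately, at load time.
    val rank = parseField("1", rowNumber = 2, column = "Rank") { it.toIntOrNull() }
    val price = parseField("215.4", rowNumber = 2, column = "price (USD)") { it.toDoubleOrNull() }
    println("rank=$rank, price=$price")

    // A malformed value fails right away, and the message says where.
    val failure = runCatching {
        parseField("n/a", rowNumber = 3, column = "marketcap") { it.toLongOrNull() }
    }
    println(failure.exceptionOrNull()?.message)
}
```

Because the conversion happens while loading, a bad cell stops the program before any Flow is created, and the message points at the offending row.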

Once preprocessing succeeds, we convert the CSV into a List<Stock>.

Every Flow example in this series can safely assume that the data is correct.

From this point onward:

  • The data is trusted.

  • The data is immutable: every Stock property is a val, and the list is exposed through the read-only List interface.

  • The data is safe to use across threads.

  • No Flow example needs to repeat validation.

This List<Stock> becomes our in-memory database snapshot.
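To make the thread-safety claim concrete, here is a minimal sketch in which several threads read the same snapshot without any locking. The function totalsSeenByWorkers is a hypothetical demo, and the Stock class is repeated only so that the sketch compiles on its own; it mirrors the model defined in this chapter.

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue
import kotlin.concurrent.thread

// Mirrors the Stock model from this chapter so the sketch is self-contained.
data class Stock(
    val rank: Int,
    val name: String,
    val symbol: String,
    val marketCap: Long,
    val priceUsd: Double,
    val country: String
)

// Hypothetical demo: each worker thread sums marketCap over the same list
// and records the total it observed.
fun totalsSeenByWorkers(snapshot: List<Stock>, workerCount: Int): Set<Long> {
    val totals = ConcurrentLinkedQueue<Long>()
    val workers = List(workerCount) {
        thread { totals.add(snapshot.sumOf { stock -> stock.marketCap }) }
    }
    workers.forEach { it.join() }
    return totals.toSet()
}

fun main() {
    val snapshot = listOf(
        Stock(1, "Aurora Systems", "AURS", 1_800_000_000_000, 215.4, "United States"),
        Stock(20, "Polar Agriculture", "PLAG", 225_300_000_000, 29.86, "New Zealand")
    )
    // Every thread reads the same values, so the set holds a single total.
    println(totalsSeenByWorkers(snapshot, workerCount = 4))
}
```

Note that Kotlin's List is a read-only interface rather than a deeply immutable collection, so this guarantee relies on no code casting the list back to a mutable type.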

Why This Matters for Learning Flow

By separating preprocessing from Flow:

  • Every Flow example starts from the same clean state.

  • We avoid duplicating validation logic.

  • We can focus entirely on Flow behaviour.

  • Tests become deterministic and fast.

Most importantly, we do not confuse data preparation problems with reactive stream problems.

The Stock Model

We start with a simple immutable data class.

package org.kotlinflowlearner.stockflow.model

/**
 * Stock represents a single record loaded from the CSV file.
 *
 * Each Stock instance models one row of data and is immutable by design.
 * Immutability ensures thread safety and predictable behavior when the
 * data is later used inside Kotlin Flows.
 */
data class Stock(
    val rank: Int,
    val name: String,
    val symbol: String,
    val marketCap: Long,
    val priceUsd: Double,
    val country: String
)
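As a short illustration of what the data class gives us, the sketch below shows the generated toString(), structural equality, and copy(). The class is repeated so that the example runs on its own, and the values are taken from the first record of the dataset.

```kotlin
// Same shape as the Stock model above, repeated so the sketch is self-contained.
data class Stock(
    val rank: Int,
    val name: String,
    val symbol: String,
    val marketCap: Long,
    val priceUsd: Double,
    val country: String
)

fun main() {
    val aurora = Stock(1, "Aurora Systems", "AURS", 1_800_000_000_000, 215.4, "United States")

    // toString() is generated from the constructor properties.
    println(aurora)

    // Equality is structural: two instances with the same values are equal.
    val same = Stock(1, "Aurora Systems", "AURS", 1_800_000_000_000, 215.4, "United States")
    println(aurora == same)     // true

    // copy() creates a modified instance and leaves the original untouched.
    val repriced = aurora.copy(priceUsd = 220.0)
    println(aurora.priceUsd)    // 215.4
    println(repriced.priceUsd)  // 220.0
}
```

Since every property is a val, the only way to "change" a record is to create a new one, which is what keeps shared instances predictable.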

Reading and Validating the CSV

The CSV loader is responsible for all parsing and validation logic.

package org.kotlinflowlearner.stockflow.csv

import com.opencsv.CSVReaderBuilder
import org.kotlinflowlearner.stockflow.model.Stock
import java.io.InputStreamReader
import java.nio.file.Files
import java.nio.file.Path

/**
 * CsvStockLoader loads and validates stock data from a CSV file.
 *
 * This implementation delegates CSV parsing to the OpenCSV library
 * so that Flow examples can focus purely on stream semantics.
 */
object CsvStockLoader {

    private val expectedHeader = listOf(
        "Rank", "Name", "Symbol", "marketcap", "price (USD)", "country"
    )

    fun load(path: Path): List<Stock> {
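        Files.newInputStream(path).use { inputStream ->
            // Decode explicitly as UTF-8 (assumed file encoding) instead of
            // relying on the platform default charset.
            CSVReaderBuilder(InputStreamReader(inputStream, Charsets.UTF_8))
                .build()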
                .use { reader ->

                    val header = reader.readNext()
                        ?: error("CSV file is empty")

                    require(header.toList() == expectedHeader) {
                        "CSV header does not match expected format"
                    }

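                    val stocks = mutableListOf<Stock>()

                    while (true) {
                        val row = reader.readNext() ?: break
                        // The header is row 1, so the first data row is row 2.
                        val rowNumber = stocks.size + 2

                        require(row.size == expectedHeader.size) {
                            "Invalid column count at row $rowNumber: " +
                                "expected ${expectedHeader.size}, found ${row.size}"
                        }

                        // Parse numeric fields eagerly and report the row on failure.
                        stocks += try {
                            Stock(
                                rank = row[0].toInt(),
                                name = row[1],
                                symbol = row[2],
                                marketCap = row[3].toLong(),
                                priceUsd = row[4].toDouble(),
                                country = row[5]
                            )
                        } catch (e: NumberFormatException) {
                            throw IllegalArgumentException("Invalid number at row $rowNumber", e)
                        }
                    }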

                    return stocks.toList()
                }
        }
    }
}
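It helps to know which exceptions these checks raise. In the Kotlin standard library, error(...) throws IllegalStateException, a failed require(...) throws IllegalArgumentException, and toInt(), toLong(), and toDouble() throw NumberFormatException, which is a subclass of IllegalArgumentException. The sketch below reproduces the header check against a plain array so that it needs no CSV library; the function name checkHeader is a hypothetical stand-in, not part of the loader.

```kotlin
// Same header as the loader expects, repeated so the sketch is self-contained.
val expectedHeader = listOf("Rank", "Name", "Symbol", "marketcap", "price (USD)", "country")

// Hypothetical stand-in for the header check inside CsvStockLoader.load.
fun checkHeader(header: Array<String>?) {
    val cells = header ?: error("CSV file is empty")   // IllegalStateException
    require(cells.toList() == expectedHeader) {         // IllegalArgumentException
        "CSV header does not match expected format"
    }
}

fun main() {
    val emptyFile = runCatching { checkHeader(null) }.exceptionOrNull()
    println(emptyFile is IllegalStateException)       // true

    val wrongHeader = runCatching { checkHeader(arrayOf("Rank", "Name")) }.exceptionOrNull()
    println(wrongHeader is IllegalArgumentException)  // true

    val badNumber = runCatching { "12x".toInt() }.exceptionOrNull()
    println(badNumber is IllegalArgumentException)    // true: NumberFormatException extends it
}
```

A caller that wants to report all loading problems uniformly can therefore catch IllegalArgumentException and IllegalStateException around the load call.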

Running the Program to Verify the CSV

Copy the 20 CSV records from Chapter 2 into a file named stocks.csv in the resources folder, src/main/resources.

package org.kotlinflowlearner.stockflow.bootstrap

import org.kotlinflowlearner.stockflow.csv.CsvStockLoader
import java.nio.file.Path

/**
 * Main is a temporary verification entry point.
 *
 * It exists only to confirm that the CSV file can be read,
 * validated, and converted into an in-memory dataset.
 * Once this works, Flow examples can safely depend on the data.
 */
fun main() {
    val resource = requireNotNull(
        object {}.javaClass.classLoader.getResource("stocks.csv")
    ) {
        "stocks.csv not found on classpath"
    }

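    // Path.of(URI) works when the resource is a regular file on disk, as it is
    // when running from the IDE. A resource packaged inside a JAR would need
    // a different approach.
    val path: Path = Path.of(resource.toURI())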

    val stocks = CsvStockLoader.load(path)

    println("Loaded ${stocks.size} stock records.")
    println("First record: ${stocks.first()}")
    println("Last record: ${stocks.last()}")
}

Run Main.kt in IntelliJ IDEA to see the result.

Loaded 20 stock records.
First record: Stock(rank=1, name=Aurora Systems, symbol=AURS, marketCap=1800000000000, priceUsd=215.4, country=United States)
Last record: Stock(rank=20, name=Polar Agriculture, symbol=PLAG, marketCap=225300000000, priceUsd=29.86, country=New Zealand)

At this point, we have a reliable in-memory dataset represented as a List<Stock>. This list acts as a snapshot of data that is safe to share across threads and collectors.

All CSV-related concerns are now isolated. No Flow use case will need to worry about file I/O, parsing, or validation.

What We Have Done So Far

So far, we have:

  • A clean Kotlin project.

  • A realistic but safe dataset.

  • A clear folder structure.

  • A reliable in-memory data source.

We are now ready to study Flow without distractions.

In the next chapter, we will finally write our first Kotlin Flow.

We will answer questions like:

  • What does “cold” actually mean?

  • When does a Flow start executing?

  • Why does nothing happen until we collect?

We will do this slowly, deliberately, and with full visibility into what is happening.

Chapter 4 introduces our first use case.