|
|||||||||||||||||||||
|
|||||||||||||||||||||
|
Announcements
Chapters
Services
Feature Zones
|
IntroductionOne would imagine that parsing CSV files is a straightforward and boring task. I was thinking that too, until I had to parse several CSV files of a couple GB each. After trying to use the OleDB JET driver and various regular expressions, I still ran into serious performance problems. At this point, I decided I would try the custom class option. I scoured the net for existing code, but finding a correct, fast, and efficient CSV parser and reader is not so simple, whatever platform/language you fancy. I say correct in the sense that many implementations merely use some splitting method like And so, this led to the CSV reader class I present in this article. Its design is based on the Benchmarking it against both OLEDB and regex methods, it performs about 15 times faster, and yet its memory usage is very low. To give more down to earth numbers, with a 45 MB CSV file containing 145 fields and 50,000 records, the reader was processing about 30 MB/sec. So all in all, it took 1.5 seconds! The machine specs were P4 3.0 GHz, 1024 MB. Supported FeaturesThis reader supports fields spanning multiple lines. The only restriction is that they must be quoted, otherwise it would not be possible to distinguish between malformed data and multi-line values. Basic data-binding is possible via the You can specify custom values for these parameters:
If the CSV contains field headers, they can be used to access a specific field. When the CSV data appears to be malformed, the reader will fail fast and throw a meaningful exception stating where the error occurred and providing the current content of the buffer. A cache of the field values is kept for the current record only, but if you need dynamic access, I also included a cached version of the reader, Latest updates (3.6.1 Release)Fixes
Benchmark and ProfilingYou can find the code for these benchmarks in the demo project. I tried to be fair and follow the same pattern for each parsing method. The regex used comes from Jeffrey Friedl's book, and can be found at page 271. It doesn't handle trimming and multi-line fields. The test file contains 145 fields, and is about 45 MB (included in the demo project as a RAR archive). I also included the raw data from the benchmark program and from the CLR Profiler for .NET 2.0.
Using the CodeThe class design follows Basic Usage Scenariousing System.IO;
using LumenWorks.Framework.IO.Csv;
void ReadCsv()
{
// open the file "data.csv" which is a CSV file with headers
using (CsvReader csv =
new CsvReader(new StreamReader("data.csv"), true))
{
int fieldCount = csv.FieldCount;
string[] headers = csv.GetFieldHeaders();
while (csv.ReadNextRecord())
{
for (int i = 0; i < fieldCount; i++)
Console.Write(string.Format("{0} = {1};",
headers[i], csv[i]));
Console.WriteLine();
}
}
}
Simple Data-Binding Scenario (ASP.NET)using System.IO;
using LumenWorks.Framework.IO.Csv;
void ReadCsv()
{
// open the file "data.csv" which is a CSV file with headers
using (CsvReader csv = new CsvReader(
new StreamReader("data.csv"), true))
{
myDataRepeater.DataSource = csv;
myDataRepeater.DataBind();
}
}
Complex Data-Binding Scenario (ASP.NET)Due to the way both the Anyway, even if it was possible, using the simple data-binding method is much more efficient. For the curious amongst you, the bug comes from the fact that the two grid controls completely ignore the property descriptors returned by the Complex Data-Binding Scenario (Windows Forms)using System.IO;
using LumenWorks.Framework.IO.Csv;
void ReadCsv()
{
// open the file "data.csv" which is a CSV file with headers
using (CachedCsvReader csv = new
CachedCsvReader(new StreamReader("data.csv"), true))
{
// Field headers will automatically be used as column names
myDataGrid.DataSource = csv;
}
}
Custom Error Handling Scenariousing System.IO;
using LumenWorks.Framework.IO.Csv;
void ReadCsv()
{
// open the file "data.csv" which is a CSV file with headers
using (CsvReader csv = new CsvReader(
new StreamReader("data.csv"), true))
{
// missing fields will not throw an exception,
// but will instead be treated as if there was a null value
csv.MissingFieldAction = MissingFieldAction.ReplaceByNull;
// to replace by "" instead, then use the following action:
//csv.MissingFieldAction = MissingFieldAction.ReplaceByEmpty;
int fieldCount = csv.FieldCount;
string[] headers = csv.GetFieldHeaders();
while (csv.ReadNextRecord())
{
for (int i = 0; i < fieldCount; i++)
Console.Write(string.Format("{0} = {1};",
headers[i],
csv[i] == null ? "MISSING" : csv[i]));
Console.WriteLine();
}
}
}
Custom Error Handling Using Events Scenariousing System.IO;
using LumenWorks.Framework.IO.Csv;
void ReadCsv()
{
// open the file "data.csv" which is a CSV file with headers
using (CsvReader csv = new CsvReader(
new StreamReader("data.csv"), true))
{
// missing fields will not throw an exception,
// but will instead be treated as if there was a null value
csv.DefaultParseErrorAction = ParseErrorAction.RaiseEvent;
csv.ParseError += new ParseErrorEventHandler(csv_ParseError);
int fieldCount = csv.FieldCount;
string[] headers = csv.GetFieldHeaders();
while (csv.ReadNextRecord())
{
for (int i = 0; i < fieldCount; i++)
Console.Write(string.Format("{0} = {1};",
headers[i], csv[i]));
Console.WriteLine();
}
}
}
void csv_ParseError(object sender, ParseErrorEventArgs e)
{
// if the error is that a field is missing, then skip to next line
if (e.Error is MissingFieldException)
{
Console.Write("--MISSING FIELD ERROR OCCURRED");
e.Action = eErrorAction.AdvanceToNextLine;
}
}
HistoryVersion 3.6.2 (2008-10-09)
Version 3.6.1 (2008-07-16)
Version 3.6 (2008-07-09)
Version 3.5 (2007-11-28)
Version 3.4 (2007-10-23)
Version 3.3 (2007-01-14)
Version 3.2 (2006-12-11)
Version 3.1.2 (2006-08-06)
Version 3.1.1 (2006-07-25)
Version 3.0 (2006-05-15)
Version 2.0 (2005-08-10)
Version 1.1 (2005-01-15)
Version 1.0 (2005-01-09)
| ||||||||||||||||||||