Asked  7 Months ago    Answers:  5   Viewed   39 times

I need to process a large file, around 400K lines and 200 M. But sometimes I have to process from bottom up. How can I use iterator (yield return) here? Basically I don't like to load everything in memory. I know it is more efficient to use iterator in .NET.

 Answers

39

Reading text files backwards is really tricky unless you're using a fixed-size encoding (e.g. ASCII). When you've got variable-size encoding (such as UTF-8) you will keep having to check whether you're in the middle of a character or not when you fetch data.

There's nothing built into the framework, and I suspect you'd have to do separate hard coding for each variable-width encoding.

EDIT: This has been somewhat tested - but that's not to say it doesn't still have some subtle bugs around. It uses StreamUtil from MiscUtil, but I've included just the necessary (new) method from there at the bottom. Oh, and it needs refactoring - there's one pretty hefty method, as you'll see:

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace MiscUtil.IO
{
    /// <summary>
    /// Takes an encoding (defaulting to UTF-8) and a function which produces a seekable stream
    /// (or a filename for convenience) and yields lines from the end of the stream backwards.
    /// Only single byte encodings, and UTF-8 and Unicode, are supported. The stream
    /// returned by the function must be seekable.
    /// </summary>
    public sealed class ReverseLineReader : IEnumerable<string>
    {
        /// <summary>
        /// Buffer size to use by default. Classes with internal access can specify
        /// a different buffer size - this is useful for testing.
        /// </summary>
        private const int DefaultBufferSize = 4096;

        /// <summary>
        /// Means of creating a Stream to read from.
        /// </summary>
        private readonly Func<Stream> streamSource;

        /// <summary>
        /// Encoding to use when converting bytes to text
        /// </summary>
        private readonly Encoding encoding;

        /// <summary>
        /// Size of buffer (in bytes) to read each time we read from the
        /// stream. This must be at least as big as the maximum number of
        /// bytes for a single character.
        /// </summary>
        private readonly int bufferSize;

        /// <summary>
        /// Function which, when given a position within a file and a byte, states whether
        /// or not the byte represents the start of a character.
        /// </summary>
        private Func<long,byte,bool> characterStartDetector;

        /// <summary>
        /// Creates a LineReader from a stream source. The delegate is only
        /// called when the enumerator is fetched. UTF-8 is used to decode
        /// the stream into text.
        /// </summary>
        /// <param name="streamSource">Data source</param>
        public ReverseLineReader(Func<Stream> streamSource)
            : this(streamSource, Encoding.UTF8)
        {
        }

        /// <summary>
        /// Creates a LineReader from a filename. The file is only opened
        /// (or even checked for existence) when the enumerator is fetched.
        /// UTF8 is used to decode the file into text.
        /// </summary>
        /// <param name="filename">File to read from</param>
        public ReverseLineReader(string filename)
            : this(filename, Encoding.UTF8)
        {
        }

        /// <summary>
        /// Creates a LineReader from a filename. The file is only opened
        /// (or even checked for existence) when the enumerator is fetched.
        /// </summary>
        /// <param name="filename">File to read from</param>
        /// <param name="encoding">Encoding to use to decode the file into text</param>
        public ReverseLineReader(string filename, Encoding encoding)
            : this(() => File.OpenRead(filename), encoding)
        {
        }

        /// <summary>
        /// Creates a LineReader from a stream source. The delegate is only
        /// called when the enumerator is fetched.
        /// </summary>
        /// <param name="streamSource">Data source</param>
        /// <param name="encoding">Encoding to use to decode the stream into text</param>
        public ReverseLineReader(Func<Stream> streamSource, Encoding encoding)
            : this(streamSource, encoding, DefaultBufferSize)
        {
        }

        internal ReverseLineReader(Func<Stream> streamSource, Encoding encoding, int bufferSize)
        {
            this.streamSource = streamSource;
            this.encoding = encoding;
            this.bufferSize = bufferSize;
            if (encoding.IsSingleByte)
            {
                // For a single byte encoding, every byte is the start (and end) of a character
                characterStartDetector = (pos, data) => true;
            }
            else if (encoding is UnicodeEncoding)
            {
                // For UTF-16, even-numbered positions are the start of a character.
                // TODO: This assumes no surrogate pairs. More work required
                // to handle that.
                characterStartDetector = (pos, data) => (pos & 1) == 0;
            }
            else if (encoding is UTF8Encoding)
            {
                // For UTF-8, bytes with the top bit clear or the second bit set are the start of a character
                // See http://www.cl.cam.ac.uk/~mgk25/unicode.html
                characterStartDetector = (pos, data) => (data & 0x80) == 0 || (data & 0x40) != 0;
            }
            else
            {
                throw new ArgumentException("Only single byte, UTF-8 and Unicode encodings are permitted");
            }
        }

        /// <summary>
        /// Returns the enumerator reading strings backwards. If this method discovers that
        /// the returned stream is either unreadable or unseekable, a NotSupportedException is thrown.
        /// </summary>
        public IEnumerator<string> GetEnumerator()
        {
            Stream stream = streamSource();
            if (!stream.CanSeek)
            {
                stream.Dispose();
                throw new NotSupportedException("Unable to seek within stream");
            }
            if (!stream.CanRead)
            {
                stream.Dispose();
                throw new NotSupportedException("Unable to read within stream");
            }
            return GetEnumeratorImpl(stream);
        }

        private IEnumerator<string> GetEnumeratorImpl(Stream stream)
        {
            try
            {
                long position = stream.Length;

                if (encoding is UnicodeEncoding && (position & 1) != 0)
                {
                    throw new InvalidDataException("UTF-16 encoding provided, but stream has odd length.");
                }

                // Allow up to two bytes for data from the start of the previous
                // read which didn't quite make it as full characters
                byte[] buffer = new byte[bufferSize + 2];
                char[] charBuffer = new char[encoding.GetMaxCharCount(buffer.Length)];
                int leftOverData = 0;
                String previousEnd = null;
                // TextReader doesn't return an empty string if there's line break at the end
                // of the data. Therefore we don't return an empty string if it's our *first*
                // return.
                bool firstYield = true;

                // A line-feed at the start of the previous buffer means we need to swallow
                // the carriage-return at the end of this buffer - hence this needs declaring
                // way up here!
                bool swallowCarriageReturn = false;

                while (position > 0)
                {
                    int bytesToRead = Math.Min(position > int.MaxValue ? bufferSize : (int)position, bufferSize);

                    position -= bytesToRead;
                    stream.Position = position;
                    StreamUtil.ReadExactly(stream, buffer, bytesToRead);
                    // If we haven't read a full buffer, but we had bytes left
                    // over from before, copy them to the end of the buffer
                    if (leftOverData > 0 && bytesToRead != bufferSize)
                    {
                        // Buffer.BlockCopy doesn't document its behaviour with respect
                        // to overlapping data: we *might* just have read 7 bytes instead of
                        // 8, and have two bytes to copy...
                        Array.Copy(buffer, bufferSize, buffer, bytesToRead, leftOverData);
                    }
                    // We've now *effectively* read this much data.
                    bytesToRead += leftOverData;

                    int firstCharPosition = 0;
                    while (!characterStartDetector(position + firstCharPosition, buffer[firstCharPosition]))
                    {
                        firstCharPosition++;
                        // Bad UTF-8 sequences could trigger this. For UTF-8 we should always
                        // see a valid character start in every 3 bytes, and if this is the start of the file
                        // so we've done a short read, we should have the character start
                        // somewhere in the usable buffer.
                        if (firstCharPosition == 3 || firstCharPosition == bytesToRead)
                        {
                            throw new InvalidDataException("Invalid UTF-8 data");
                        }
                    }
                    leftOverData = firstCharPosition;

                    int charsRead = encoding.GetChars(buffer, firstCharPosition, bytesToRead - firstCharPosition, charBuffer, 0);
                    int endExclusive = charsRead;

                    for (int i = charsRead - 1; i >= 0; i--)
                    {
                        char lookingAt = charBuffer[i];
                        if (swallowCarriageReturn)
                        {
                            swallowCarriageReturn = false;
                            if (lookingAt == 'r')
                            {
                                endExclusive--;
                                continue;
                            }
                        }
                        // Anything non-line-breaking, just keep looking backwards
                        if (lookingAt != 'n' && lookingAt != 'r')
                        {
                            continue;
                        }
                        // End of CRLF? Swallow the preceding CR
                        if (lookingAt == 'n')
                        {
                            swallowCarriageReturn = true;
                        }
                        int start = i + 1;
                        string bufferContents = new string(charBuffer, start, endExclusive - start);
                        endExclusive = i;
                        string stringToYield = previousEnd == null ? bufferContents : bufferContents + previousEnd;
                        if (!firstYield || stringToYield.Length != 0)
                        {
                            yield return stringToYield;
                        }
                        firstYield = false;
                        previousEnd = null;
                    }

                    previousEnd = endExclusive == 0 ? null : (new string(charBuffer, 0, endExclusive) + previousEnd);

                    // If we didn't decode the start of the array, put it at the end for next time
                    if (leftOverData != 0)
                    {
                        Buffer.BlockCopy(buffer, 0, buffer, bufferSize, leftOverData);
                    }
                }
                if (leftOverData != 0)
                {
                    // At the start of the final buffer, we had the end of another character.
                    throw new InvalidDataException("Invalid UTF-8 data at start of stream");
                }
                if (firstYield && string.IsNullOrEmpty(previousEnd))
                {
                    yield break;
                }
                yield return previousEnd ?? "";
            }
            finally
            {
                stream.Dispose();
            }
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }
}


// StreamUtil.cs:
public static class StreamUtil
{
    public static void ReadExactly(Stream input, byte[] buffer, int bytesToRead)
    {
        int index = 0;
        while (index < bytesToRead)
        {
            int read = input.Read(buffer, index, bytesToRead - index);
            if (read == 0)
            {
                throw new EndOfStreamException
                    (String.Format("End of stream reached with {0} byte{1} left to read.",
                                   bytesToRead - index,
                                   bytesToRead - index == 1 ? "s" : ""));
            }
            index += read;
        }
    }
}

Feedback very welcome. This was fun :)

Tuesday, June 1, 2021
 
SkyNet
answered 7 Months ago
78

Use an URL instead of File for any access that is not on your local computer.

URL url = new URL("http://www.puzzlers.org/pub/wordlists/pocket.txt");
Scanner s = new Scanner(url.openStream());

Actually, URL is even more generally useful, also for local access (use a file: URL), jar files, and about everything that one can retrieve somehow.

The way above interprets the file in your platforms default encoding. If you want to use the encoding indicated by the server instead, you have to use a URLConnection and parse it's content type, like indicated in the answers to this question.


About your Error, make sure your file compiles without any errors - you need to handle the exceptions. Click the red messages given by your IDE, it should show you a recommendation how to fix it. Do not start a program which does not compile (even if the IDE allows this).

Here with some sample exception-handling:

try {
   URL url = new URL("http://www.puzzlers.org/pub/wordlists/pocket.txt");
   Scanner s = new Scanner(url.openStream());
   // read from your scanner
}
catch(IOException ex) {
   // there was some connection problem, or the file did not exist on the server,
   // or your URL was not in the right format.
   // think about what to do now, and put it here.
   ex.printStackTrace(); // for now, simply output it.
}
Thursday, June 3, 2021
 
shin
answered 6 Months ago
59

Are you doing this at work? Is a corporate firewall or anti-virus program running? Check the windows event logs and any applications in your task notification icons area on your start bar for hints as to who is blocking this.

Wednesday, August 25, 2021
 
panthro
answered 3 Months ago
59

If WCF meets your needs, it's worth looking at. ZeroC and other alternative higher level libraries exist. Otherwise there are several different ways to work closer to the socket level if that's what you need.

TcpClient/UdpClient

These provide a relatively thin wrapper around the underlying sockets. It essentially provides a Stream over the socket. You can use the async methods on the NetworkStream (BeginRead, etc.). I don't like this one as the wrapper doesn't provide that much and it tends to be a little more awkward than using the socket directly.

Socket - Select

This provides the classic Select technique for multiplexing multiple socket IO onto a single thread. Not recommended any longer.

Socket - APM Style

The Asynchronous Programming Model (AKA IAsyncResult, Begin/End Style) for sockets is the primary technique for using sockets asynchronously. And there are several variants. Essentially, you call an async method (e.g., BeginReceive) and do one of the following:

  1. Poll for completion on the returned IAsyncResult (hardly used).
  2. Use the WaitHandle from the IAsyncResult to wait for the method to complete.
  3. Pass the BeginXXX method a callback method that will be executed when the method completes.

The best way is #3 as it is the usually the most convenient. When in doubt, use this method.

Some links:

  • MSDN Magazine Article on Sockets
  • A Jeffery Richter Article on the Asynchronous Programming Model

.NET 3.5 High Performance Sockets

.NET 3.5 introduced a new model for async sockets that uses events. It uses the "simplified" async model (e.g., Socket.SendAsync). Instead of giving a callback, you subscribe to an event for completion and instead of an IAsyncResult, you get SocketAsyncEventArgs. The idea is that you can reuse the SocketAsyncEventArgs and pre-allocate memory for socket IO. In high performance scenarios this can be much more efficient that using the APM style. In addition, if you do pre-allocate the memory, you get a stable memory footprint, reduced garbage collection, memory holes from pinning etc. Note that worrying about this should only be a consideration in the most high performance scenarios.

  • MSDN Magazine: Get Connected With The .NET Framework 3.5
  • MSDN Information with technique for pre-allocating memory

Summary

For most cases use the callback method of the APM style unless you prefer the style of the SocketAsyncEventArgs / Async method. If you've used CompletionPorts in WinSock, you should know that both of these methods use CompletionPorts under the hood.

Friday, September 10, 2021
 
Rüdiger Schulz
answered 3 Months ago
19

The following reads through a file, 1 line per clock cycle: expected data format is one decimal number per line.

integer               data_file    ; // file handler
integer               scan_file    ; // file handler
logic   signed [21:0] captured_data;
`define NULL 0    

initial begin
  data_file = $fopen("data_file.dat", "r");
  if (data_file == `NULL) begin
    $display("data_file handle was NULL");
    $finish;
  end
end

always @(posedge clk) begin
  scan_file = $fscanf(data_file, "%dn", captured_data); 
  if (!$feof(data_file)) begin
    //use captured_data as you would any other wire or reg value;
  end
end
Friday, September 17, 2021
 
Sheakspear Zitouni
answered 3 Months ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :
 
Share