Asked  7 Months ago    Answers:  5   Viewed   69 times

I am trying to load and parse a JSON file in Python. But I'm stuck trying to load the file:

import json
json_data = open('file')
data = json.load(json_data)

Yields:

ValueError: Extra data: line 2 column 1 - line 225116 column 1 (char 232 - 160128774)

I looked at 18.2. json — JSON encoder and decoder in the Python documentation, but it's pretty discouraging to read through this horrible-looking documentation.

First few lines (anonymized with randomized entries):

{"votes": {"funny": 2, "useful": 5, "cool": 1}, "user_id": "harveydennis", "name": "Jasmine Graham", "url": "http://example.org/user_details?userid=harveydennis", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 2, "cool": 4}, "user_id": "njohnson", "name": "Zachary Ballard", "url": "https://www.example.com/user_details?userid=njohnson", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 0, "cool": 4}, "user_id": "david06", "name": "Jonathan George", "url": "https://example.com/user_details?userid=david06", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 6, "useful": 5, "cool": 0}, "user_id": "santiagoerika", "name": "Amanda Taylor", "url": "https://www.example.com/user_details?userid=santiagoerika", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 8, "cool": 2}, "user_id": "rodriguezdennis", "name": "Jennifer Roach", "url": "http://www.example.com/user_details?userid=rodriguezdennis", "average_stars": 3.5, "review_count": 12, "type": "user"}

 Answers

17

You have a JSON Lines format text file. You need to parse your file line by line:

import json

data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))

Each line contains valid JSON, but as a whole, it is not a valid JSON value as there is no top-level list or object definition.

Note that because the file contains JSON per line, you are saved the headaches of trying to parse it all in one go or to figure out a streaming JSON parser. You can now opt to process each line separately before moving on to the next, saving memory in the process. You probably don't want to append each result to one list and then process everything if your file is really big.

If you have a file containing individual JSON objects with delimiters in-between, use How do I use the 'json' module to read in one JSON object at a time? to parse out individual objects using a buffered method.

Tuesday, June 1, 2021
 
njai
answered 7 Months ago
25

Update 2019-10-13: Rewritten the Utf8JsonStreamReader to use ReadOnlySequences internally, added wrapper for JsonSerializer.Deserialize method.


I have created a wrapper around Utf8JsonReader for exactly this purpose:

public ref struct Utf8JsonStreamReader
{
    private readonly Stream _stream;
    private readonly int _bufferSize;

    private SequenceSegment? _firstSegment;
    private int _firstSegmentStartIndex;
    private SequenceSegment? _lastSegment;
    private int _lastSegmentEndIndex;

    private Utf8JsonReader _jsonReader;
    private bool _keepBuffers;
    private bool _isFinalBlock;

    public Utf8JsonStreamReader(Stream stream, int bufferSize)
    {
        _stream = stream;
        _bufferSize = bufferSize;

        _firstSegment = null;
        _firstSegmentStartIndex = 0;
        _lastSegment = null;
        _lastSegmentEndIndex = -1;

        _jsonReader = default;
        _keepBuffers = false;
        _isFinalBlock = false;
    }

    public bool Read()
    {
        // read could be unsuccessful due to insufficient bufer size, retrying in loop with additional buffer segments
        while (!_jsonReader.Read())
        {
            if (_isFinalBlock)
                return false;

            MoveNext();
        }

        return true;
    }

    private void MoveNext()
    {
        var firstSegment = _firstSegment;
        _firstSegmentStartIndex += (int)_jsonReader.BytesConsumed;

        // release previous segments if possible
        if (!_keepBuffers)
        {
            while (firstSegment?.Memory.Length <= _firstSegmentStartIndex)
            {
                _firstSegmentStartIndex -= firstSegment.Memory.Length;
                firstSegment.Dispose();
                firstSegment = (SequenceSegment?)firstSegment.Next;
            }
        }

        // create new segment
        var newSegment = new SequenceSegment(_bufferSize, _lastSegment);

        if (firstSegment != null)
        {
            _firstSegment = firstSegment;
            newSegment.Previous = _lastSegment;
            _lastSegment?.SetNext(newSegment);
            _lastSegment = newSegment;
        }
        else
        {
            _firstSegment = _lastSegment = newSegment;
            _firstSegmentStartIndex = 0;
        }

        // read data from stream
        _lastSegmentEndIndex = _stream.Read(newSegment.Buffer.Memory.Span);
        _isFinalBlock = _lastSegmentEndIndex < newSegment.Buffer.Memory.Length;
        _jsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(_firstSegment, _firstSegmentStartIndex, _lastSegment, _lastSegmentEndIndex), _isFinalBlock, _jsonReader.CurrentState);
    }

    public T Deserialize<T>(JsonSerializerOptions? options = null)
    {
        // JsonSerializer.Deserialize can read only a single object. We have to extract
        // object to be deserialized into separate Utf8JsonReader. This incures one additional
        // pass through data (but data is only passed, not parsed).
        var tokenStartIndex = _jsonReader.TokenStartIndex;
        var firstSegment = _firstSegment;
        var firstSegmentStartIndex = _firstSegmentStartIndex;

        // loop through data until end of object is found
        _keepBuffers = true;
        int depth = 0;

        if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
            depth++;

        while (depth > 0 && Read())
        {
            if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
                depth++;
            else if (TokenType == JsonTokenType.EndObject || TokenType == JsonTokenType.EndArray)
                depth--;
        }

        _keepBuffers = false;

        // end of object found, extract json reader for deserializer
        var newJsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(firstSegment!, firstSegmentStartIndex, _lastSegment!, _lastSegmentEndIndex).Slice(tokenStartIndex, _jsonReader.Position), true, default);

        // deserialize value
        var result = JsonSerializer.Deserialize<T>(ref newJsonReader, options);

        // release memory if possible
        firstSegmentStartIndex = _firstSegmentStartIndex + (int)_jsonReader.BytesConsumed;

        while (firstSegment?.Memory.Length < firstSegmentStartIndex)
        {
            firstSegmentStartIndex -= firstSegment.Memory.Length;
            firstSegment.Dispose();
            firstSegment = (SequenceSegment?)firstSegment.Next;
        }

        if (firstSegment != _firstSegment)
        {
            _firstSegment = firstSegment;
            _firstSegmentStartIndex = firstSegmentStartIndex;
            _jsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(_firstSegment!, _firstSegmentStartIndex, _lastSegment!, _lastSegmentEndIndex), _isFinalBlock, _jsonReader.CurrentState);
        }

        return result;
    }

    public void Dispose() =>_lastSegment?.Dispose();

    public int CurrentDepth => _jsonReader.CurrentDepth;
    public bool HasValueSequence => _jsonReader.HasValueSequence;
    public long TokenStartIndex => _jsonReader.TokenStartIndex;
    public JsonTokenType TokenType => _jsonReader.TokenType;
    public ReadOnlySequence<byte> ValueSequence => _jsonReader.ValueSequence;
    public ReadOnlySpan<byte> ValueSpan => _jsonReader.ValueSpan;

    public bool GetBoolean() => _jsonReader.GetBoolean();
    public byte GetByte() => _jsonReader.GetByte();
    public byte[] GetBytesFromBase64() => _jsonReader.GetBytesFromBase64();
    public string GetComment() => _jsonReader.GetComment();
    public DateTime GetDateTime() => _jsonReader.GetDateTime();
    public DateTimeOffset GetDateTimeOffset() => _jsonReader.GetDateTimeOffset();
    public decimal GetDecimal() => _jsonReader.GetDecimal();
    public double GetDouble() => _jsonReader.GetDouble();
    public Guid GetGuid() => _jsonReader.GetGuid();
    public short GetInt16() => _jsonReader.GetInt16();
    public int GetInt32() => _jsonReader.GetInt32();
    public long GetInt64() => _jsonReader.GetInt64();
    public sbyte GetSByte() => _jsonReader.GetSByte();
    public float GetSingle() => _jsonReader.GetSingle();
    public string GetString() => _jsonReader.GetString();
    public uint GetUInt32() => _jsonReader.GetUInt32();
    public ulong GetUInt64() => _jsonReader.GetUInt64();
    public bool TryGetDecimal(out byte value) => _jsonReader.TryGetByte(out value);
    public bool TryGetBytesFromBase64(out byte[] value) => _jsonReader.TryGetBytesFromBase64(out value);
    public bool TryGetDateTime(out DateTime value) => _jsonReader.TryGetDateTime(out value);
    public bool TryGetDateTimeOffset(out DateTimeOffset value) => _jsonReader.TryGetDateTimeOffset(out value);
    public bool TryGetDecimal(out decimal value) => _jsonReader.TryGetDecimal(out value);
    public bool TryGetDouble(out double value) => _jsonReader.TryGetDouble(out value);
    public bool TryGetGuid(out Guid value) => _jsonReader.TryGetGuid(out value);
    public bool TryGetInt16(out short value) => _jsonReader.TryGetInt16(out value);
    public bool TryGetInt32(out int value) => _jsonReader.TryGetInt32(out value);
    public bool TryGetInt64(out long value) => _jsonReader.TryGetInt64(out value);
    public bool TryGetSByte(out sbyte value) => _jsonReader.TryGetSByte(out value);
    public bool TryGetSingle(out float value) => _jsonReader.TryGetSingle(out value);
    public bool TryGetUInt16(out ushort value) => _jsonReader.TryGetUInt16(out value);
    public bool TryGetUInt32(out uint value) => _jsonReader.TryGetUInt32(out value);
    public bool TryGetUInt64(out ulong value) => _jsonReader.TryGetUInt64(out value);

    private sealed class SequenceSegment : ReadOnlySequenceSegment<byte>, IDisposable
    {
        internal IMemoryOwner<byte> Buffer { get; }
        internal SequenceSegment? Previous { get; set; }
        private bool _disposed;

        public SequenceSegment(int size, SequenceSegment? previous)
        {
            Buffer = MemoryPool<byte>.Shared.Rent(size);
            Previous = previous;

            Memory = Buffer.Memory;
            RunningIndex = previous?.RunningIndex + previous?.Memory.Length ?? 0;
        }

        public void SetNext(SequenceSegment next) => Next = next;

        public void Dispose()
        {
            if (!_disposed)
            {
                _disposed = true;
                Buffer.Dispose();
                Previous?.Dispose();
            }
        }
    }
}

You can use it as replacement for Utf8JsonReader, or for deserializing json into typed objects (as wrapper around System.Text.Json.JsonSerializer.Deserialize).

Example of usage for deserializing objects from huge JSON array:

using var stream = new FileStream("LargeData.json", FileMode.Open, FileAccess.Read);
using var jsonStreamReader = new Utf8JsonStreamReader(stream, 32 * 1024);

jsonStreamReader.Read(); // move to array start
jsonStreamReader.Read(); // move to start of the object

while (jsonStreamReader.TokenType != JsonTokenType.EndArray)
{
    // deserialize object
    var obj = jsonStreamReader.Deserialize<TestData>();

    // JsonSerializer.Deserialize ends on last token of the object parsed,
    // move to the first token of next object
    jsonStreamReader.Read();
}

Deserialize method reads data from stream until it finds end of the current object. Then it constructs a new Utf8JsonReader with data read and calls JsonSerializer.Deserialize.

Other methods are passed through to Utf8JsonReader.

And, as always, don't forget to dispose your objects at the end.

Wednesday, June 9, 2021
 
steros
answered 6 Months ago
22
NSDictionary *dataDictionary = [NSJSONSerialization JSONObjectWithData:jsonData options:0 error:&error];
NSLog(@"%@", dataDictionary);
//NSArray *ar = [NSArray arrayWithObject:dataDictionary];//REMOVED
for (NSString *key in [dataDictionary allKeys]) {
    NSDictionary *dic=[dataDictionary objectForKey:key];//ADDED
    for (NSString *dickey in [dic allKeys]) { //MODIFIED
        NSDictionary *dict=[dic objectForKey:dicKey];//ADDED
        TickerData *t=[[TickerData alloc] init];//ALLOC INIT ?
        t.currency = key;//EDITED
        t.symbol = [dict objectForKey:@"symbol"];
        t.last = [dict objectForKey:@"last"];
        [_tickerArray addObject:t];
    }

}

Your data doesn't contain any array, its all dictionaries, try the above code see comments too..

Hope it works..

Edited:

Yes you have initialize the object too, as suggested above in other answers..

Saturday, August 7, 2021
 
bayman
answered 4 Months ago
14

The generator seems completely superfluous.

with open(root_dir + 'filename.json') as old, open(root_dir + 'output.csv', 'w') as csvfile:
    new = csv.writer(csvfile)
    for x in old:
        row = json.loads(x)
        new.writerow(row)

If one line of JSON does not simply produce an array of strings and numbers, you still need to figure out how to convert it from whatever structure is inside the JSON to something which can usefully be serialized as a one-dimensional list of strings and numbers of a fixed length.

If your JSON can be expected to reliably contain a single dictionary with a fixed set of keyword-value pairs, maybe try

from csv import DictWriter
import json

with open(jsonfile, 'r') as inp, open(csvfile, 'w') as outp:
    writer = DictWriter(outp, fieldnames=[
            'url', 'date', 'content', 'renderedContent', 'id', 'username',
            # 'user',  # see below
            'outlinks', 'outlinksss', 'tcooutlinks', 'tcooutlinksss',
            'replyCount', 'retweetCount', 'likeCount', 'quoteCount',
            'conversationId', 'lang', 'source', 'media', 'retweetedTweet',
            'quotedTweet', 'mentionedUsers'])
    for line in inp:
        row = json.loads(line)
        writer.writerow(row)

I omitted the user field because in your example, this key contains a nested structure which cannot easily be transformed into CSV without further mangling. Perhaps you would like to extract just user.id into a new field user_id; or perhaps you would like to lift the entire user structure into a flattened list of additional columns in the main record like user_username, user_displayname, user_id, etc?

In some more detail, CSV is basically a two-dimensional matrix where every row is a one-dimensional collection of columns corresponding to one record in the data, where each column can contain one string or one number. Every row needs to have exactly the same number of columns, though you can leave some of them empty.

JSON which can trivially be transformed into CSV would look like

["Adolf", "1945", 10000000]
["Joseph", "1956", 25000000]
["Donald", null, 1000000]

JSON which can be transformed to CSV by some transformation (which you'd have to specify separately, like for example with the dictionary key ordering specified above) might look like

{"name": "Adolf", "dod": "1945", "death toll": 10000000}
{"dod": "1956", "name": "Joseph", "death toll": 25000000}
{"death toll": 1000000, "name": "Donald"}

(Just to make it more interesting, one field is missing, and the dictionary order varies from one record to the next. This is not typical, but definitely within the realm of valid corner cases that Python could not possibly guess on its own how to handle.)

Most real-world JSON is significantly more complex than either of these simple examples, to the point where we can say that the problem is not possible to solve in the general case.

Saturday, August 21, 2021
 
Sam Adamsh
answered 4 Months ago
66

First read the data from the file.

with open('data2.txt') as data_file:    
    old_data = json.load(data_file)

Then append your data to the old data

data = old_data + data

Then rewrite the whole file.

with open('data2.txt', 'w') as outfile:
    json.dump(data, outfile)
Monday, October 4, 2021
 
user123
answered 2 Months ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :  
Share