
How do I read the following two-column data (from a .dat file) with Pandas?

TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17

The column separator is (at least) 2 spaces.

I tried

df = pd.read_table("test.dat", sep=r"\s+", usecols=['TIME', 'XGSM'])
print df

But it prints

   TIME  XGSM
   2004     6
   2004     6
   2004     6
   2004     6
   2004     6

Answers

You can use the usecols parameter with the positions of the columns:

import pandas as pd
from io import StringIO

temp=u"""TIME             XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""
#after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp), 
                 sep=r"\s+", 
                 skiprows=1, 
                 usecols=[0, 7], 
                 names=['TIME', 'XGSM'])

print (df)
   TIME  XGSM
0  2004     1
1  2004     5
2  2004     8
3  2004    11
4  2004    17

Edit:

You can use a regex separator matching 2 or more spaces, and then add engine='python' to avoid this warning:

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.

import pandas as pd
from io import StringIO

temp=u"""TIME              XGSM
2004 006 01 00 01 37 600   1
2004 006 01 00 02 32 800   5
2004 006 01 00 03 28 000   8
2004 006 01 00 04 23 200   11
2004 006 01 00 05 18 400   17"""
#after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp), sep=r'\s{2,}', engine='python')

print (df)
                       TIME  XGSM
0  2004 006 01 00 01 37 600     1
1  2004 006 01 00 02 32 800     5
2  2004 006 01 00 03 28 000     8
3  2004 006 01 00 04 23 200    11
4  2004 006 01 00 05 18 400    17
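
If you also need TIME as a real datetime, one possible follow-up is to parse it with to_datetime. This is a sketch that assumes the fields are year, day-of-year, hour, minute, second and a fractional second split by a space (an interpretation that gives the sample a constant ~0.952 s cadence):

#assumed layout: year day-of-year hour minute second + fraction split by a space
t = df['TIME'].str.replace(r'(\d+) (\d+)$', r'\1\2', regex=True)
#%j is the day of year, %f is the fractional second
df['TIME'] = pd.to_datetime(t, format='%Y %j %H %M %S %f')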
answered by dotoree

Sample:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Age': [34, 18, 44, 27, 30]})

#print (df1)
df3 = df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F']})
#print (df2)

Use map with a Series created by set_index:

df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
print (df1)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

Alternative solution: merge with a left join:

df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
print (df)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

If you need to map by multiple columns (e.g. Year and Code), you need merge with a left join:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Year':[2000,2003,2003,2004,2007],
                    'Code':[1,2,3,4,4],
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Year  Code  Age
0    Tom  2000     1   34
1   Sara  2003     2   18
2    Eva  2003     3   44
3   Jack  2004     4   27
4  Laura  2007     4   30

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F'],
                    'Year':[2001,2003,2003,2004,2007],
                    'Code':[1,2,3,5,3],
                    'Val':[21,34,23,44,67]})
print (df2)
       Name Sex  Year  Code  Val
0       Tom   M  2001     1   21
1      Paul   M  2003     2   34
2       Eva   F  2003     3   23
3      Jack   M  2004     5   44
4  Michelle   F  2007     3   67
#merge on Year and Code, keeping all columns of df2
df = df1.merge(df2, on=['Year','Code'], how='left')
print (df)
  Name_x  Year  Code  Age Name_y  Sex   Val
0    Tom  2000     1   34    NaN  NaN   NaN
1   Sara  2003     2   18   Paul    M  34.0
2    Eva  2003     3   44    Eva    F  23.0
3   Jack  2004     4   27    NaN  NaN   NaN
4  Laura  2007     4   30    NaN  NaN   NaN

#select only the join columns (Year, Code), which are always needed, plus the columns to append (Val)
df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
print (df)
    Name  Year  Code  Age   Val
0    Tom  2000     1   34   NaN
1   Sara  2003     2   18  34.0
2    Eva  2003     3   44  23.0
3   Jack  2004     4   27   NaN
4  Laura  2007     4   30   NaN

If you get an error with map, it means there are duplicates in the join column, here Name:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Age
0    Tom   34
1   Sara   18
2    Eva   44
3   Jack   27
4  Laura   30

df3, df4 = df1.copy(), df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 
                    'Val': [1,2,3,4,5]})
print (df2)
       Name  Val
0       Tom    1 <-duplicated name Tom
1       Tom    2 <-duplicated name Tom
2       Eva    3
3      Jack    4
4  Michelle    5

s = df2.set_index('Name')['Val']
df1['New'] = df1['Name'].map(s)
print (df1)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
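
To see which keys cause this before mapping, a quick check of the join column for duplicates, e.g.:

#show all rows whose Name occurs more than once
print (df2[df2['Name'].duplicated(keep=False)])
  Name  Val
0  Tom    1
1  Tom    2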

Solutions: remove the duplicates with DataFrame.drop_duplicates, or map by a dict, which keeps the last duplicate match:

#by default the first value is kept
s = df2.drop_duplicates('Name').set_index('Name')['Val']
print (s)
Name
Tom         1
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df1['New'] = df1['Name'].map(s)
print (df1)
    Name  Age  New
0    Tom   34  1.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
#pass keep='last' to keep the last value 
s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
print (s)
Name
Tom         2
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df3['New'] = df3['Name'].map(s)
print (df3)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
#map by dictionary
d = dict(zip(df2['Name'], df2['Val']))
print (d)
{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}

df4['New'] = df4['Name'].map(d)
print (df4)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
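
If the duplicated values should be combined instead of dropped, another option (a sketch, e.g. taking the mean of Val per Name) is to aggregate before mapping, since groupby output has a unique index:

s = df2.groupby('Name')['Val'].mean()
df1['New'] = df1['Name'].map(s)
print (df1)
    Name  Age  New
0    Tom   34  1.5
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN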
answered by PeterTheLobster

I think in this case concat is what you want:

In [12]:

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want; NaN values are then produced where columns are absent from their respective dfs.
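
If you would rather keep only the columns present in both dfs instead of filling the rest with NaN, concat also accepts join='inner':

#keep only the intersection of the columns
pd.concat([df, df1], axis=0, ignore_index=True, join='inner')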

answered by Whakkee

The biggest clue is that all the rows are being returned on one line. This indicates the line terminators are being ignored or are not present.

You can specify the line terminator for read_csv. If you are on a Mac, the lines created will end with \r rather than the Linux standard \n, or better still the belt-and-suspenders approach of Windows with \r\n.

pandas.read_csv(filename, sep='\t', lineterminator='\r')

You could also open all your data using the codecs package. This may increase robustness at the expense of document loading speed.

import codecs

doc = codecs.open('document', 'rU', 'UTF-16')  # open for reading with "universal" newline mode

df = pandas.read_csv(doc, sep='\t')
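
If you are unsure which terminator the file actually uses, a quick way to check is to read a few raw bytes first ('document' is a placeholder filename here):

with open('document', 'rb') as f:
    print (f.read(200))  # look for b'\r\n' (Windows), b'\n' (Linux) or b'\r' (old Mac)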
answered by cusejuice

The main problem lies in the way Microsoft Excel actually saves the csv file. When the same csv file is opened in Notepad, it has extra quotation marks in the lines which contain quotes:

1) It adds a quote at the start and end of the line.

2) It escapes the existing quotes with one more quote. Hence, when we import the csv file into pandas, it takes the whole line as one string and it all ends up in the first column.

To tackle this:

I imported the csv file, corrected it by applying regex substitutions, and saved it as a text file. Then I imported this text file as a pandas dataframe. Problem solved.

import re

with open('csvdata.csv', 'r') as csv_file, open('new_csv.txt', 'w') as corrected_csv:
    for line in csv_file:
        # removing starting and ending quotes of a line
        line = re.sub(r'^"|"$', '', line)
        # substituting each escaped (doubled) quote with a single quote
        line = re.sub(r'""', '"', line)
        corrected_csv.write(line)
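
Once the corrected text file is written, it should load normally (assuming a standard comma separator):

df = pd.read_csv('new_csv.txt')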
answered by codingJoe